Friday, May 2, 2008

Top-n Query ----Using Rownum and Rank()

Top-n query tuning in huge OLTP applications

The task of retrieving the top or bottom N rows from a database table is referred to as a "top-N query". The most straightforward, but inefficient, way of accomplishing such a query is by retrieving all rows from the database table(s), sorting them by specified criteria, scanning from the top, and then selecting the top N rows.

But there are two efficient ways to perform a top-N query. You can do so either by using the ROWNUM pseudocolumn available in several versions of Oracle or by utilizing new analytic functions: RANK() and DENSE_RANK().

1. Rank(): RANK calculates the rank of a value in a group of values. The return type is NUMBER.

Rows with equal values for the ranking criteria receive the same rank. Oracle Database then adds the number of tied rows to the tied rank to calculate the next rank. Therefore, the ranks may not be consecutive numbers. This function is useful for top-N and bottom-N reporting.

· As an aggregate function, RANK calculates the rank of a hypothetical row identified by the arguments of the function with respect to a given sort specification. The arguments of the function must all evaluate to constant expressions within each aggregate group, because they identify a single row within each group. The constant argument expressions and the expressions in the ORDER BY clause of the aggregate match by position. Therefore, the number of arguments must be the same and their types must be compatible.

· As an analytic function, RANK computes the rank of each row returned from a query with respect to the other rows returned by the query, based on the values of the value_exprs in the order_by_clause.

Aggregate Syntax

rank_aggregate::=

RANK(expr [, expr ]...) WITHIN GROUP
   (ORDER BY
    expr [ DESC | ASC ]
         [ NULLS { FIRST | LAST } ]
    [, expr [ DESC | ASC ]
            [ NULLS { FIRST | LAST } ]
    ]...
   )
 
eg. 

SELECT RANK(1982) WITHIN GROUP

(ORDER BY sal DESC) "Rank of 1982"

FROM emp;

 
Rank of 1982
--------------
             7

Analytic Syntax

rank_analytic::=

 RANK( )
   OVER ([ query_partition_clause ] order_by_clause)

e.g

SELECT deptno, ename, sal, comm, RANK() OVER (PARTITION BY deptno ORDER BY sal DESC, comm) "Rank" FROM emp WHERE deptno in (10,20);

DEPTNO

ENAME

SAL

COMM

Rank

10

KING

5,000.00


1

10

CLARK

2,450.00


2

10

MILLER

1,300.00


3

20

SCOTT

3,000.00


1

20

FORD

3,000.00


1

20

JONES

2,975.00


3

20

ADAMS

1,100.00


4

20

SMITH

800.00


5

2. dense_rank(): DENSE_RANK computes the rank of a row in an ordered group of rows and returns the rank as a NUMBER. The ranks are consecutive integers beginning with 1. The largest rank value is the number of unique values returned by the query. Rank values are not skipped in the event of ties. Rows with equal values for the ranking criteria receive the same rank. This function is useful for top-N and bottom-N reporting.

This function accepts as arguments any numeric datatype and returns NUMBER.

· As an aggregate function, DENSE_RANK calculates the dense rank of a hypothetical row identified by the arguments of the function with respect to a given sort specification. The arguments of the function must all evaluate to constant expressions within each aggregate group, because they identify a single row within each group. The constant argument expressions and the expressions in the order_by_clause of the aggregate match by position. Therefore, the number of arguments must be the same and types must be compatible.

· As an analytic function, DENSE_RANK computes the rank of each row returned from a query with respect to the other rows, based on the values of the value_exprs in the order_by_clause.

Aggregate Syntax

dense_rank_aggregate::= Description of dense_rank_aggregate.gif follows





DENSE_RANK(expr [, expr ]...) WITHIN GROUP
  (ORDER BY expr [ DESC | ASC ]
                 [ NULLS { FIRST | LAST } ]
            [,expr [ DESC | ASC ]
                   [ NULLS { FIRST | LAST } ]
            ]...
  )

e.g

SELECT DENSE_RANK (1982, .05) WITHIN GROUP (ORDER BY sal DESC, comm) "Dense Rank"

  FROM emp;
 
         Dense Rank
-------------------
                  6


Analytic Syntax

dense_rank_analytic::=

Description of dense_rank_analytic.gif follows
DENSE_RANK( )

   OVER([ query_partition_clause ] order_by_clause)
 

SELECT d.deptno, e.ename, e.sal, DENSE_RANK () OVER (PARTITION BY e.deptno ORDER BY e.sal) AS "Dense Rank" FROM emp e, dept d WHERE e.deptno = d.deptno AND d.deptno IN (10, 30);

DEPTNO

ENAME

SAL

Dense Rank

10

MILLER

1,300.00

10

10

CLARK

2,450.00

2

10

KING

5,000.00

3

30

JAMES

950.00

1

30

MARTIN

1,250.00

2

30

WARD

1,250.00

2

30

TURNER

1,500.00

3

30

ALLEN

1,600.00

4

30

BLAKE

2,850.00

5

Using the ROWNUM Pseudocolumn

For each row returned by a query, the ROWNUM pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1, the second has 2, and so on.You can use ROWNUM to limit the number of rows returned by a query, as in this example:

            SELECT * FROM employees WHERE ROWNUM <>

If a SELECT statement includes an ORDER BY clause, ROWNUMs are assigned to the retrieved rows before the sort is done; use a subselect to get the first n sorted rows. The value of ROWNUM increases only when a row is retrieved, so the only meaningful uses of ROWNUM in a WHERE clause are:



... WHERE ROWNUM <>

... WHERE ROWNUM <= constant;


Top-n Query is one of the more advanced SQL problems. How does one retrieve N first (or least) rows from a record set? For example, how does one find the top five highest-paid employees in a given department?

People with no experience in such situation usually try to solve the problem with a query like this:

SELECT e.*

FROM emp e

WHERE ROWNUM < 6

ORDER BY sal DESC

Of course, this is the incorrect solution. The 'where rownum<6' condition is executed BEFORE all the rows are sorted. This means that instead of the top five highest-paid employees, you get a random set of five employees sorted by their salaries. Below is the correct solution:

SELECT *

FROM (SELECT ROWNUM,

e.*

FROM emp e

ORDER BY sal DESC)

WHERE ROWNUM <>

Similar queries are often used in various reports and sometimes have much more complicated structure - for instance:

SELECT ename,

sal,

dname,

manager

FROM (SELECT ROWNUM,

e.ename,

e.sal,

d.dname,

e2.ename AS manager

FROM emp e,

dept d,

emp e2

WHERE e.deptno = d.deptno

AND e.mgrno = e2.empno

AND e.job = 'SALESMAN'

ORDER BY e.sal DESC)

WHERE ROWNUM <>

In this case, the Oracle optimizer has to create a whole Cartesian product described by the inline view, then sort it out and finally retrieve the first five rows. This query is executed very quickly on SCOTT's tables, but the problem becomes complicated if the table EMP consists of one milion rows. The problem also becomes even more complicated if (for example) besides just the name of employee and his/her manager, you also need to get the employee's address, phone number, shop's address, etc. All these data must be read from another tables and joined with the inline view before sorting. It may cause the query to execute very slowly.
Is there any other way to do it?

Yes, there is. Just modify the starting query:

SELECT   e.ename,

e.sal,

d.dname,

e2.ename

FROM (SELECT *

FROM (SELECT empno,

ename,

sal,

mgrno,

deptno

FROM emp

WHERE job = 'SALESMAN'

ORDER BY emp.sal DESC)

WHERE ROWNUM < 6) e,

dept d,

emp e2

WHERE e.deptno = d.deptno

AND e.mgrno = e2.empno

ORDER BY sal DESC

At first glance it looks like an unnecessary complication. But in fact the structure of the query forces the optimizer to find the top five highest-paid employees, retrieve their data, and just after that join them with other tables. The benefit is obvious: the join operation is executed for only five rows, instead of hundreds of thousands. The proper use of top-n queries may also increase efficiency when the application retrieves many rows for the given conditions, but the user is actually interested only in a few of them. For example, the application user might need to look through the invoices (with all the details, products, prices, etc.) paid by a selected customer. The user is usually more interested in the most recent invoices, so the data must be sorted in descending order by invoice date - for example:

SELECT   c.custno,

custname,

i.invno,

invdate,

detno,

p.prodno,

proddesc,

detamount,

prodprice

FROM customers c,

invoices i,

details d,

products p

WHERE c.custno = i.custno

AND i.invno = d.invno

AND d.prodno = p.prodno

AND c.custno = 123

ORDER BY i.invdate DESC;

The real application using a query like this worked very fast at first, but after several months it slowed down, and users began to complain. Searching for the data of 'heavy users' with thousands of invoices took a lot of time and became very irritating. After a few meetings with the application users it was found that, in 95% of the cases, information about the last 20 invoices was sufficient. This data is easy to retrieve by the query:

SELECT c.custno,

custname,

i.invno,

invdate,

detno,

p.prodno,

proddesc,

detamount,

prodprice

FROM (SELECT *

FROM (SELECT custno,

invno,

invdate

FROM invoices

WHERE custno = 123

ORDER BY invdate DESC)

WHERE ROWNUM < 21) i,

details d,

products p,

customers c

WHERE c.custno = i.custno

AND i.invno = d.invno

AND p.prodno = d.prodno;

This query sorts data only in the INVOICES table, retrieves just 20 rows, and joins them with other tables. Thanks to this, the query is always executed in less than 0,1 seconds. Of course, the application user has also the opportunity to search more than the last 20
invoices. In such a case, the user has to check the proper check box on the form. This causes the query to execute in the original, full (and slow) version - but at least the user is aware of this.

The examples described above may seem very trivial to many users, but I could give many similar cases: sorting customers by the amount of bought goods, by the number of executed money transfers, by the value
of paid bills; printing 20 operations per bank statement, and so on. In all these situations, moving the conditions and "rownum" clause to the inner query may increase efficiency significantly. For these who are not yet convinced, here are two simple scripts that illustrate the described problems. Both scripts were tested on Oracle9i Enterprise Edition Release 9.2.0.6.0 on HP-UX.

Example 1: Setup

CREATE TABLE emp (

empno NUMBER(10),

ename VARCHAR2(100),

deptno NUMBER(10),

job VARCHAR2(20),

sal NUMBER(10,2),

mgrno NUMBER(10));



CREATE TABLE dept (

deptno NUMBER(10),

dname VARCHAR2(100));



CREATE SEQUENCE emp_seq START WITH 1;

CREATE SEQUENCE dept_seq START WITH 1;



CREATE INDEX emp_idx1 ON emp (

empno);



CREATE INDEX emp_idx2 ON emp (

job,

sal);



CREATE INDEX dept_idx1 ON dept (

deptno);



CREATE OR REPLACE PROCEDURE test1

AS

salary NUMBER(10,2);

manager NUMBER(10,2);

BEGIN

dbms_random.Initialize(1000);

-- departments



FOR i IN 1.. 100 LOOP

INSERT INTO dept (deptno,dname )

VALUES (dept_seq.nextval,

'DEPT' || To_char(dept_seq.currval));

END LOOP;

-- employees in every department



FOR j IN 1.. 100 LOOP

FOR i IN 1.. 1000 LOOP

salary := Abs(Mod(dbms_random.random,



100000));



manager := Abs(Mod(dbms_random.random,

100000));



INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )

VALUES (emp_seq.nextval,

'SCOTT' || To_char(emp_seq.nextval),

j,

'CLERK',

salary,

manager);



INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )

VALUES (emp_seq.nextval,

'SCOTT' || To_char(emp_seq.nextval),

j,

'ANALYST',

salary,

manager);



INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )

VALUES (emp_seq.nextval,

'SCOTT' || To_char(emp_seq.nextval),

j,

'CLERK',

salary,

manager);



INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )

VALUES (emp_seq.nextval,

'SCOTT' || To_char(emp_seq.nextval),

j,

'SALESMAN',

salary,

manager);



INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )



VALUES (emp_seq.nextval,

'SCOTT' || To_char(emp_seq.nextval),

j,

'MANAGER',

salary,

manager);

END LOOP;



COMMIT;

END LOOP;

END;

/



EXEC test1;

ANALYZE TABLE dept COMPUTE STATISTICS;

ANALYZE TABLE emp COMPUTE STATISTICS;

Query 1: Slow

SELECT ename,

sal,

dname,

manager

FROM (SELECT ROWNUM,

e.ename,

e.sal,

d.dname,

e2.ename AS manager

FROM emp e,

dept d,

emp e2

WHERE e.deptno = d.deptno

AND e.mgrno = e2.empno

AND e.job = 'SALESMAN'

ORDER BY e.sal DESC)

WHERE ROWNUM < 6;



Time: 6,125 s

Query 2: Fast

SELECT   e.ename,

e.sal,

d.dname,

e2.ename

FROM (SELECT *

FROM (SELECT empno,

ename,

sal,

mgrno,

deptno

FROM emp

WHERE job = 'SALESMAN'

ORDER BY emp.sal DESC)

WHERE ROWNUM < 6) e,

dept d,

emp e2

WHERE e.deptno = d.deptno

AND e.mgrno = e2.empno

ORDER BY sal DESC



Time: 0,65 s

Example 2: Setup

CREATE TABLE customers (

custno NUMBER(10),

custname VARCHAR2(20));



CREATE TABLE invoices (

invno NUMBER(10),

custno NUMBER(10),

invdate DATE);



CREATE TABLE details (

detno NUMBER(10),

invno NUMBER(10),

prodno NUMBER(10),

detamount NUMBER(10));



CREATE TABLE products (

prodno NUMBER(10),

proddesc VARCHAR2(20),

prodprice NUMBER(10,2));



CREATE SEQUENCE cust_seq START WITH 1;

CREATE SEQUENCE inv_seq START WITH 1;

CREATE SEQUENCE det_seq START WITH 1;

CREATE SEQUENCE prod_seq START WITH 1;



CREATE INDEX cust_idx1 ON customers (

custno);



CREATE INDEX inv_idx1 ON invoices (

invno);



CREATE INDEX inv_idx2 ON invoices (

custno,

invdate);



CREATE INDEX det_idx1 ON details (

detno);



CREATE INDEX det_idx2 ON details (

invno);



CREATE INDEX prod_idx1 ON products (

prodno);



CREATE OR REPLACE PROCEDURE test2

AS

customer NUMBER(10);

product NUMBER(10);

BEGIN

dbms_random.Initialize(1000);

-- customers



FOR i IN 1.. 1000 LOOP

INSERT INTO customers

VALUES (cust_seq.nextval,

'CUSTOMER ' || To_char(cust_seq.currval));

END LOOP;

-- products



FOR i IN 1.. 1000 LOOP

INSERT INTO products

VALUES (prod_seq.nextval,

'PRODUCT' || To_char(prod_seq.currval),

prod_seq.currval);

END LOOP;

-- invoices



FOR i IN 1.. 100000 LOOP

customer := Abs(Mod(dbms_random.random,

1000));





INSERT INTO invoices

VALUES (inv_seq.nextval,

customer,

SYSDATE + Abs(Mod(inv_seq.currval,

100)));

-- details



FOR j IN 1.. 10 LOOP

product := Abs(Mod(dbms_random.random,

1000));



INSERT INTO details

VALUES (det_seq.nextval,

inv_seq.currval,

product,

Mod(j,

5));

END LOOP;



COMMIT;

END LOOP;

END;

/



EXEC test2;



ANALYZE TABLE customers COMPUTE STATISTICS;

ANALYZE TABLE products COMPUTE STATISTICS;

ANALYZE TABLE invoices COMPUTE STATISTICS;

ANALYZE TABLE details COMPUTE STATISTICS;

Query 1: Slow

SELECT   c.custno,

custname,

i.invno,

invdate,

detno,

p.prodno,

proddesc,

detamount,

prodprice

FROM customers c,

invoices i,

details d,

products p

WHERE c.custno = i.custno

AND i.invno = d.invno

AND d.prodno = p.prodno

AND c.custno = 123

ORDER BY i.invdate DESC;



Time: 0,578 s

Query 2: Fast

SELECT c.custno,

custname,

i.invno,

invdate,

detno,

p.prodno,

proddesc,

detamount,

prodprice

FROM (SELECT *

FROM (SELECT custno,

invno,

invdate

FROM invoices

WHERE custno = 123

ORDER BY invdate DESC)

WHERE ROWNUM < 21) i,

details d,

products p,

customers c

WHERE c.custno = i.custno

AND i.invno = d.invno

AND p.prodno = d.prodno;



Time: 0,093 s
 
 

Using RANK() to Obtain a Top-N Query

Now suppose the two employ has the same salary. So the value obtained from the above sql query will return the values where ROWNUM is like the condition. But the above result is false.

Another way to perform a top-N query uses the new Oracle feature called "analytic functions." The SQL language, while extremely capable in many areas, has never provided strong support for analytic tasks, such as computing rankings, cumulative and moving averages, lead/lag comparisons, and reporting. These tasks require extensive PL/SQL programming, often with performance issues. Oracle now provides a new wide set of analytic functions that address this need.

For a top-N query you can use two ranking functions: RANK and DENSE_RANK. Both allow you to rank items in a group—for example, finding the top-five employees by salary, which is exactly what we need to achieve.

The difference between RANK() and DENSE_RANK() is that RANK() leaves gaps in the ranking sequence when there are ties. In our case, Scott and Ford tie for second place with a $3,000 salary; Jones' $2,975 salary brings him in third place using DENSE_RANK() but only fourth place using RANK():

SELECT Empno, Ename, Job, Mgr, Sal,
   RANK() OVER
      (ORDER BY SAL Desc NULLS LAST) AS Rank,
   DENSE_RANK() OVER
      (ORDER BY SAL Desc NULLS LAST) AS Drank
   FROM Emp
   ORDER BY SAL Desc NULLS LAST;
 

Output of the above code.

EMPNO ENAME      JOB              MGR        SAL       RANK      DRANK
------ ---------- --------- ---------- ---------- ---------- ----------
  7839 KING       PRESIDENT                  5000          1          1
  7788 SCOTT      ANALYST         7566       3000          2          2
  7902 FORD       ANALYST         7566       3000          2          2
  7566 JONES      MANAGER         7839       2975          4          3
  7698 BLAKE      MANAGER         7839       2850          5          4
  7782 CLARK      MANAGER         7839       2450          6          5
  7499 ALLEN      SALESMAN        7698       1600          7          6
  7844 TURNER     SALESMAN        7698       1500          8          7
  7934 MILLER     CLERK           7782       1300          9          8
  7521 WARD       SALESMAN        7698       1250         10          9
  7654 MARTIN     SALESMAN        7698       1250         10          9
  7876 ADAMS      CLERK           7788       1100         12         10
  7369 SMITH      CLERK           7902        800         13         11
  7900 JAMES      CLERK           7698                    14   

The NULLS FIRST | NULLS LAST clause determines the position of rows with NULL values in the ordered query.

If the sequence is in descending order, then NULLS LAST implies that NULL values are smaller than non-NULL ones and rows with NULLs will appear at the bottom of the list. If the NULLS FIRST | NULLS LAST clause is omitted, then NULL values are considered larger than any other values and their ordering position depends on the ASC | DESC arguments.

If the ordering sequence is ascending (ASC), then rows with NULLs will appear last; if the sequence is descending (DESC), then rows with NULLs will appear first. NULLs are considered equal to other NULLs and, therefore, the order in which rows with NULLs are presented is nondeterministic.

To obtain a top-N query, use RANK() in a subquery and then apply a filter condition outside the subquery:

SELECT Empno, Ename, Job, Mgr, Hiredate, Sal
   FROM
   (SELECT Empno, Ename, Job, Mgr, Hiredate, Sal,
      RANK() OVER
         (ORDER BY SAL Desc NULLS LAST) AS Emp_Rank
      FROM Emp
      ORDER BY SAL Desc NULLS LAST)
   WHERE Emp_Rank <>
 

The output of the above code.

EMPNO  ENAME      JOB              MGR HIREDATE         SAL
------ ---------- --------- ---------- --------- ----------
  7839 KING       PRESIDENT            17-NOV-81       5000
  7788 SCOTT      ANALYST         7566 19-APR-87       3000
  7902 FORD       ANALYST         7566 03-DEC-81       3000
  7566 JONES      MANAGER         7839 02-APR-81       2975
  7698 BLAKE      MANAGER         7839 01-MAY-81       2850

Using the same technique, you can retrieve the bottom-five employees by salary:

SELECT Empno, Ename, Job, Mgr, Hiredate, Sal
   FROM
   (SELECT Empno, Ename, Job, Mgr, Hiredate, Sal,
      RANK() OVER
         (ORDER BY SAL ASC NULLS FIRST) AS Emp_Rank
      FROM Emp
      ORDER BY SAL ASC NULLS FIRST)
   WHERE Emp_Rank <>
 

Output of the above code.

            EMPNO  ENAME      JOB            MGR HIREDATE         SAL
 ------ ---------- --------- ---------- ---------   ----------
  7839 KING       PRESIDENT            17-NOV-81       5000
  7788 SCOTT      ANALYST         7566 19-APR-87       3000
  7902 FORD       ANALYST         7566 03-DEC-81       3000
  7566 JONES      MANAGER         7839 02-APR-81       2975
  7698 BLAKE      MANAGER         7839 01-MAY-81       2850

Ranking functions can be used to operate within groups, too—that is, the rank value gets reset whenever the group changes. This is achieved with a PARTION BY subclause. Here is the syntax to retrieve the top employee by salary per manager group:

SELECT Empno, Ename, Job, Mgr, Hiredate, Sal
   FROM
   (SELECT Empno, Ename, Job, Mgr, Hiredate, Sal,
      RANK() OVER
         (PARTITION BY MGR ORDER BY MGR, SAL DESC NULLS LAST) AS Emp_Rank
      FROM Emp
      ORDER BY MGR, SAL DESC NULLS LAST)
   WHERE Emp_Rank = 1;
            EMPNO  ENAME      JOB              MGR HIREDATE         SAL
------ ---------- --------- ---------- --------- ----------
  7788 SCOTT      ANALYST         7566 19-APR-87       3000
  7902 FORD       ANALYST         7566 03-DEC-81       3000
  7499 ALLEN      SALESMAN        7698 20-FEB-81       1600
  7934 MILLER     CLERK           7782 23-JAN-82       1300
  7876 ADAMS      CLERK           7788 23-MAY-87       1100
  7566 JONES      MANAGER         7839 02-APR-81       2975
  7369 SMITH      CLERK           7902 17-DEC-80        800
  7839 KING       PRESIDENT            17-NOV-81       5000
 

The FIRST and LAST functions can be used to return the first or last value from an ordered sequence. Say we want to display the salary of each employee, along with the lowest and highest within their department we may use something like:

SELECT deptno,
       ename,
       sal,
       MIN(sal) KEEP (DENSE_RANK FIRST ORDER BY sal) OVER (PARTITION BY deptno) "Lowest",
       MAX(sal) KEEP (DENSE_RANK LAST ORDER BY sal) OVER (PARTITION BY deptno) "Highest"
FROM   emp
ORDER BY deptno, sal;
 
    DEPTNO ENAME             SAL     Lowest    Highest
---------- ---------- ---------- ---------- ----------
        10 MILLER           1300       1300       5000
        10 CLARK            2450       1300       5000
        10 KING             5000       1300       5000
        20 SMITH             800        800       3000
        20 ADAMS            1100        800       3000
        20 JONES            2975        800       3000
        20 SCOTT            3000        800       3000
        20 FORD             3000        800       3000
        30 JAMES             950        950       2850
        30 WARD             1250        950       2850
        30 MARTIN           1250        950       2850
 
    DEPTNO ENAME             SAL     Lowest    Highest
---------- ---------- ---------- ---------- ----------
        30 TURNER           1500        950       2850
        30 ALLEN            1600        950       2850
        30 BLAKE            2850        950       2850

No comments: