Top-n query tuning in huge OLTP applications
The task of retrieving the top or bottom N rows from a database table is referred to as a "top-N query". The most straightforward, but inefficient, way of accomplishing such a query is by retrieving all rows from the database table(s), sorting them by specified criteria, scanning from the top, and then selecting the top N rows.
But there are two efficient ways to perform a top-N query. You can do so either by using the ROWNUM pseudocolumn available in several versions of Oracle or by utilizing new analytic functions: RANK() and DENSE_RANK().
1. Rank(): RANK
calculates the rank of a value in a group of values. The return type is NUMBER
.
Rows with equal values for the ranking criteria receive the same rank. Oracle Database then adds the number of tied rows to the tied rank to calculate the next rank. Therefore, the ranks may not be consecutive numbers. This function is useful for top-N and bottom-N reporting.
· As an aggregate function, RANK
calculates the rank of a hypothetical row identified by the arguments of the function with respect to a given sort specification. The arguments of the function must all evaluate to constant expressions within each aggregate group, because they identify a single row within each group. The constant argument expressions and the expressions in the ORDER
BY
clause of the aggregate match by position. Therefore, the number of arguments must be the same and their types must be compatible.
· As an analytic function, RANK
computes the rank of each row returned from a query with respect to the other rows returned by the query, based on the values of the value_exprs in the order_by_clause.
Aggregate Syntax
rank_aggregate::=
RANK(expr [, expr ]...) WITHIN GROUP
(ORDER BY
expr [ DESC | ASC ]
[ NULLS { FIRST | LAST } ]
[, expr [ DESC | ASC ]
[ NULLS { FIRST | LAST } ]
]...
)
eg.
SELECT RANK(1982) WITHIN GROUP
(ORDER BY sal DESC) "Rank of 1982"
FROM emp;
Rank of 1982
--------------
7
Analytic Syntax
rank_analytic::=
RANK( )
OVER ([ query_partition_clause ] order_by_clause)
e.g
SELECT deptno, ename, sal, comm, RANK() OVER (PARTITION BY deptno ORDER BY sal DESC, comm) "Rank" FROM emp WHERE deptno in (10,20);
DEPTNO | ENAME | SAL | COMM | Rank |
10 | KING | 5,000.00 | | 1 |
10 | CLARK | 2,450.00 | | 2 |
10 | MILLER | 1,300.00 | | 3 |
20 | SCOTT | 3,000.00 | | 1 |
20 | FORD | 3,000.00 | | 1 |
20 | JONES | 2,975.00 | | 3 |
20 | ADAMS | 1,100.00 | | 4 |
20 | SMITH | 800.00 | | 5 |
2. dense_rank(): DENSE_RANK
computes the rank of a row in an ordered group of rows and returns the rank as a NUMBER
. The ranks are consecutive integers beginning with 1. The largest rank value is the number of unique values returned by the query. Rank values are not skipped in the event of ties. Rows with equal values for the ranking criteria receive the same rank. This function is useful for top-N and bottom-N reporting.
This function accepts as arguments any numeric datatype and returns NUMBER
.
· As an aggregate function, DENSE_RANK
calculates the dense rank of a hypothetical row identified by the arguments of the function with respect to a given sort specification. The arguments of the function must all evaluate to constant expressions within each aggregate group, because they identify a single row within each group. The constant argument expressions and the expressions in the order_by_clause of the aggregate match by position. Therefore, the number of arguments must be the same and types must be compatible.
· As an analytic function, DENSE_RANK
computes the rank of each row returned from a query with respect to the other rows, based on the values of the value_exprs in the order_by_clause.
Aggregate Syntax
dense_rank_aggregate::=
DENSE_RANK(expr [, expr ]...) WITHIN GROUP
(ORDER BY expr [ DESC | ASC ]
[ NULLS { FIRST | LAST } ]
[,expr [ DESC | ASC ]
[ NULLS { FIRST | LAST } ]
]...
)
e.g
SELECT DENSE_RANK (1982, .05) WITHIN GROUP (ORDER BY sal DESC, comm) "Dense Rank"
FROM emp;
Dense Rank
-------------------
6
Analytic Syntax
dense_rank_analytic::=
DENSE_RANK( )
OVER([ query_partition_clause ] order_by_clause)
SELECT d.deptno, e.ename, e.sal, DENSE_RANK () OVER (PARTITION BY e.deptno ORDER BY e.sal) AS "Dense Rank" FROM emp e, dept d WHERE e.deptno = d.deptno AND d.deptno IN (10, 30);
DEPTNO | ENAME | SAL | Dense Rank |
10 | MILLER | 1,300.00 | 10 |
10 | CLARK | 2,450.00 | 2 |
10 | KING | 5,000.00 | 3 |
30 | JAMES | 950.00 | 1 |
30 | MARTIN | 1,250.00 | 2 |
30 | WARD | 1,250.00 | 2 |
30 | TURNER | 1,500.00 | 3 |
30 | ALLEN | 1,600.00 | 4 |
30 | BLAKE | 2,850.00 | 5 |
Using the ROWNUM Pseudocolumn
For each row returned by a query, the ROWNUM
pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM
of 1, the second has 2, and so on.You can use ROWNUM
to limit the number of rows returned by a query, as in this example:
SELECT * FROM employees WHERE ROWNUM <>
If a SELECT
statement includes an ORDER
BY
clause, ROWNUM
s are assigned to the retrieved rows before the sort is done; use a subselect to get the first n sorted rows. The value of ROWNUM
increases only when a row is retrieved, so the only meaningful uses of ROWNUM
in a WHERE
clause are:
... WHERE ROWNUM <>
... WHERE ROWNUM <= constant;
Courtesy:- Marcin Miller
Top-n Query is one of the more advanced SQL problems. How does one retrieve N first (or least) rows from a record set? For example, how does one find the top five highest-paid employees in a given department?
People with no experience in such situation usually try to solve the problem with a query like this:
SELECT e.*
FROM emp e
WHERE ROWNUM < 6
ORDER BY sal DESC
Of course, this is the incorrect solution. The 'where rownum<6' condition is executed BEFORE all the rows are sorted. This means that instead of the top five highest-paid employees, you get a random set of five employees sorted by their salaries. Below is the correct solution:
SELECT *
FROM (SELECT ROWNUM,
e.*
FROM emp e
ORDER BY sal DESC)
WHERE ROWNUM <>
Similar queries are often used in various reports and sometimes have much more complicated structure - for instance:
SELECT ename,
sal,
dname,
manager
FROM (SELECT ROWNUM,
e.ename,
e.sal,
d.dname,
e2.ename AS manager
FROM emp e,
dept d,
emp e2
WHERE e.deptno = d.deptno
AND e.mgrno = e2.empno
AND e.job = 'SALESMAN'
ORDER BY e.sal DESC)
WHERE ROWNUM <>
In this case, the Oracle optimizer has to create a whole Cartesian product described by the inline view, then sort it out and finally retrieve the first five rows. This query is executed very quickly on SCOTT's tables, but the problem becomes complicated if the table EMP consists of one milion rows. The problem also becomes even more complicated if (for example) besides just the name of employee and his/her manager, you also need to get the employee's address, phone number, shop's address, etc. All these data must be read from another tables and joined with the inline view before sorting. It may cause the query to execute very slowly.
Is there any other way to do it?
Yes, there is. Just modify the starting query:
SELECT e.ename,
e.sal,
d.dname,
e2.ename
FROM (SELECT *
FROM (SELECT empno,
ename,
sal,
mgrno,
deptno
FROM emp
WHERE job = 'SALESMAN'
ORDER BY emp.sal DESC)
WHERE ROWNUM < 6) e,
dept d,
emp e2
WHERE e.deptno = d.deptno
AND e.mgrno = e2.empno
ORDER BY sal DESC
At first glance it looks like an unnecessary complication. But in fact the structure of the query forces the optimizer to find the top five highest-paid employees, retrieve their data, and just after that join them with other tables. The benefit is obvious: the join operation is executed for only five rows, instead of hundreds of thousands. The proper use of top-n queries may also increase efficiency when the application retrieves many rows for the given conditions, but the user is actually interested only in a few of them. For example, the application user might need to look through the invoices (with all the details, products, prices, etc.) paid by a selected customer. The user is usually more interested in the most recent invoices, so the data must be sorted in descending order by invoice date - for example:
SELECT c.custno,
custname,
i.invno,
invdate,
detno,
p.prodno,
proddesc,
detamount,
prodprice
FROM customers c,
invoices i,
details d,
products p
WHERE c.custno = i.custno
AND i.invno = d.invno
AND d.prodno = p.prodno
AND c.custno = 123
ORDER BY i.invdate DESC;
The real application using a query like this worked very fast at first, but after several months it slowed down, and users began to complain. Searching for the data of 'heavy users' with thousands of invoices took a lot of time and became very irritating. After a few meetings with the application users it was found that, in 95% of the cases, information about the last 20 invoices was sufficient. This data is easy to retrieve by the query:
SELECT c.custno,
custname,
i.invno,
invdate,
detno,
p.prodno,
proddesc,
detamount,
prodprice
FROM (SELECT *
FROM (SELECT custno,
invno,
invdate
FROM invoices
WHERE custno = 123
ORDER BY invdate DESC)
WHERE ROWNUM < 21) i,
details d,
products p,
customers c
WHERE c.custno = i.custno
AND i.invno = d.invno
AND p.prodno = d.prodno;
This query sorts data only in the INVOICES table, retrieves just 20 rows, and joins them with other tables. Thanks to this, the query is always executed in less than 0,1 seconds. Of course, the application user has also the opportunity to search more than the last 20
invoices. In such a case, the user has to check the proper check box on the form. This causes the query to execute in the original, full (and slow) version - but at least the user is aware of this.
The examples described above may seem very trivial to many users, but I could give many similar cases: sorting customers by the amount of bought goods, by the number of executed money transfers, by the value
of paid bills; printing 20 operations per bank statement, and so on. In all these situations, moving the conditions and "rownum" clause to the inner query may increase efficiency significantly. For these who are not yet convinced, here are two simple scripts that illustrate the described problems. Both scripts were tested on Oracle9i Enterprise Edition Release 9.2.0.6.0 on HP-UX.
Example 1: Setup
CREATE TABLE emp (
empno NUMBER(10),
ename VARCHAR2(100),
deptno NUMBER(10),
job VARCHAR2(20),
sal NUMBER(10,2),
mgrno NUMBER(10));
CREATE TABLE dept (
deptno NUMBER(10),
dname VARCHAR2(100));
CREATE SEQUENCE emp_seq START WITH 1;
CREATE SEQUENCE dept_seq START WITH 1;
CREATE INDEX emp_idx1 ON emp (
empno);
CREATE INDEX emp_idx2 ON emp (
job,
sal);
CREATE INDEX dept_idx1 ON dept (
deptno);
CREATE OR REPLACE PROCEDURE test1
AS
salary NUMBER(10,2);
manager NUMBER(10,2);
BEGIN
dbms_random.Initialize(1000);
-- departments
FOR i IN 1.. 100 LOOP
INSERT INTO dept (deptno,dname )
VALUES (dept_seq.nextval,
'DEPT' || To_char(dept_seq.currval));
END LOOP;
-- employees in every department
FOR j IN 1.. 100 LOOP
FOR i IN 1.. 1000 LOOP
salary := Abs(Mod(dbms_random.random,
100000));
manager := Abs(Mod(dbms_random.random,
100000));
INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
VALUES (emp_seq.nextval,
'SCOTT' || To_char(emp_seq.nextval),
j,
'CLERK',
salary,
manager);
INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
VALUES (emp_seq.nextval,
'SCOTT' || To_char(emp_seq.nextval),
j,
'ANALYST',
salary,
manager);
INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
VALUES (emp_seq.nextval,
'SCOTT' || To_char(emp_seq.nextval),
j,
'CLERK',
salary,
manager);
INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
VALUES (emp_seq.nextval,
'SCOTT' || To_char(emp_seq.nextval),
j,
'SALESMAN',
salary,
manager);
INSERT INTO emp (empno,ename,deptno,job,sal,mgrno )
VALUES (emp_seq.nextval,
'SCOTT' || To_char(emp_seq.nextval),
j,
'MANAGER',
salary,
manager);
END LOOP;
COMMIT;
END LOOP;
END;
/
EXEC test1;
ANALYZE TABLE dept COMPUTE STATISTICS;
ANALYZE TABLE emp COMPUTE STATISTICS;
Query 1: Slow
SELECT ename,
sal,
dname,
manager
FROM (SELECT ROWNUM,
e.ename,
e.sal,
d.dname,
e2.ename AS manager
FROM emp e,
dept d,
emp e2
WHERE e.deptno = d.deptno
AND e.mgrno = e2.empno
AND e.job = 'SALESMAN'
ORDER BY e.sal DESC)
WHERE ROWNUM < 6;
Time: 6,125 s
Query 2: Fast
SELECT e.ename,
e.sal,
d.dname,
e2.ename
FROM (SELECT *
FROM (SELECT empno,
ename,
sal,
mgrno,
deptno
FROM emp
WHERE job = 'SALESMAN'
ORDER BY emp.sal DESC)
WHERE ROWNUM < 6) e,
dept d,
emp e2
WHERE e.deptno = d.deptno
AND e.mgrno = e2.empno
ORDER BY sal DESC
Time: 0,65 s
Example 2: Setup
CREATE TABLE customers (
custno NUMBER(10),
custname VARCHAR2(20));
CREATE TABLE invoices (
invno NUMBER(10),
custno NUMBER(10),
invdate DATE);
CREATE TABLE details (
detno NUMBER(10),
invno NUMBER(10),
prodno NUMBER(10),
detamount NUMBER(10));
CREATE TABLE products (
prodno NUMBER(10),
proddesc VARCHAR2(20),
prodprice NUMBER(10,2));
CREATE SEQUENCE cust_seq START WITH 1;
CREATE SEQUENCE inv_seq START WITH 1;
CREATE SEQUENCE det_seq START WITH 1;
CREATE SEQUENCE prod_seq START WITH 1;
CREATE INDEX cust_idx1 ON customers (
custno);
CREATE INDEX inv_idx1 ON invoices (
invno);
CREATE INDEX inv_idx2 ON invoices (
custno,
invdate);
CREATE INDEX det_idx1 ON details (
detno);
CREATE INDEX det_idx2 ON details (
invno);
CREATE INDEX prod_idx1 ON products (
prodno);
CREATE OR REPLACE PROCEDURE test2
AS
customer NUMBER(10);
product NUMBER(10);
BEGIN
dbms_random.Initialize(1000);
-- customers
FOR i IN 1.. 1000 LOOP
INSERT INTO customers
VALUES (cust_seq.nextval,
'CUSTOMER ' || To_char(cust_seq.currval));
END LOOP;
-- products
FOR i IN 1.. 1000 LOOP
INSERT INTO products
VALUES (prod_seq.nextval,
'PRODUCT' || To_char(prod_seq.currval),
prod_seq.currval);
END LOOP;
-- invoices
FOR i IN 1.. 100000 LOOP
customer := Abs(Mod(dbms_random.random,
1000));
INSERT INTO invoices
VALUES (inv_seq.nextval,
customer,
SYSDATE + Abs(Mod(inv_seq.currval,
100)));
-- details
FOR j IN 1.. 10 LOOP
product := Abs(Mod(dbms_random.random,
1000));
INSERT INTO details
VALUES (det_seq.nextval,
inv_seq.currval,
product,
Mod(j,
5));
END LOOP;
COMMIT;
END LOOP;
END;
/
EXEC test2;
ANALYZE TABLE customers COMPUTE STATISTICS;
ANALYZE TABLE products COMPUTE STATISTICS;
ANALYZE TABLE invoices COMPUTE STATISTICS;
ANALYZE TABLE details COMPUTE STATISTICS;
Query 1: Slow
SELECT c.custno,
custname,
i.invno,
invdate,
detno,
p.prodno,
proddesc,
detamount,
prodprice
FROM customers c,
invoices i,
details d,
products p
WHERE c.custno = i.custno
AND i.invno = d.invno
AND d.prodno = p.prodno
AND c.custno = 123
ORDER BY i.invdate DESC;
Time: 0,578 s
Query 2: Fast
SELECT c.custno,
custname,
i.invno,
invdate,
detno,
p.prodno,
proddesc,
detamount,
prodprice
FROM (SELECT *
FROM (SELECT custno,
invno,
invdate
FROM invoices
WHERE custno = 123
ORDER BY invdate DESC)
WHERE ROWNUM < 21) i,
details d,
products p,
customers c
WHERE c.custno = i.custno
AND i.invno = d.invno
AND p.prodno = d.prodno;
Time: 0,093 s
Using RANK() to Obtain a Top-N Query
Now suppose the two employ has the same salary. So the value obtained from the above sql query will return the values where ROWNUM is like the condition. But the above result is false.
Another way to perform a top-N query uses the new Oracle feature called "analytic functions." The SQL language, while extremely capable in many areas, has never provided strong support for analytic tasks, such as computing rankings, cumulative and moving averages, lead/lag comparisons, and reporting. These tasks require extensive PL/SQL programming, often with performance issues. Oracle now provides a new wide set of analytic functions that address this need.
For a top-N query you can use two ranking functions: RANK and DENSE_RANK. Both allow you to rank items in a group—for example, finding the top-five employees by salary, which is exactly what we need to achieve.
The difference between RANK() and DENSE_RANK() is that RANK() leaves gaps in the ranking sequence when there are ties. In our case, Scott and Ford tie for second place with a $3,000 salary; Jones' $2,975 salary brings him in third place using DENSE_RANK() but only fourth place using RANK():
SELECT Empno, Ename, Job, Mgr, Sal,
RANK() OVER
(ORDER BY SAL Desc NULLS LAST) AS Rank,
DENSE_RANK() OVER
(ORDER BY SAL Desc NULLS LAST) AS Drank
FROM Emp
ORDER BY SAL Desc NULLS LAST;
Output of the above code.
EMPNO ENAME JOB MGR SAL RANK DRANK
------ ---------- --------- ---------- ---------- ---------- ----------
7839 KING PRESIDENT 5000 1 1
7788 SCOTT ANALYST 7566 3000 2 2
7902 FORD ANALYST 7566 3000 2 2
7566 JONES MANAGER 7839 2975 4 3
7698 BLAKE MANAGER 7839 2850 5 4
7782 CLARK MANAGER 7839 2450 6 5
7499 ALLEN SALESMAN 7698 1600 7 6
7844 TURNER SALESMAN 7698 1500 8 7
7934 MILLER CLERK 7782 1300 9 8
7521 WARD SALESMAN 7698 1250 10 9
7654 MARTIN SALESMAN 7698 1250 10 9
7876 ADAMS CLERK 7788 1100 12 10
7369 SMITH CLERK 7902 800 13 11
7900 JAMES CLERK 7698 14
The NULLS FIRST | NULLS LAST clause determines the position of rows with NULL values in the ordered query.
If the sequence is in descending order, then NULLS LAST implies that NULL values are smaller than non-NULL ones and rows with NULLs will appear at the bottom of the list. If the NULLS FIRST | NULLS LAST clause is omitted, then NULL values are considered larger than any other values and their ordering position depends on the ASC | DESC arguments.
If the ordering sequence is ascending (ASC), then rows with NULLs will appear last; if the sequence is descending (DESC), then rows with NULLs will appear first. NULLs are considered equal to other NULLs and, therefore, the order in which rows with NULLs are presented is nondeterministic.
To obtain a top-N query, use RANK() in a subquery and then apply a filter condition outside the subquery:
SELECT Empno, Ename, Job, Mgr, Hiredate, Sal
FROM
(SELECT Empno, Ename, Job, Mgr, Hiredate, Sal,
RANK() OVER
(ORDER BY SAL Desc NULLS LAST) AS Emp_Rank
FROM Emp
ORDER BY SAL Desc NULLS LAST)
WHERE Emp_Rank <>
The output of the above code.
EMPNO ENAME JOB MGR HIREDATE SAL
------ ---------- --------- ---------- --------- ----------
7839 KING PRESIDENT 17-NOV-81 5000
7788 SCOTT ANALYST 7566 19-APR-87 3000
7902 FORD ANALYST 7566 03-DEC-81 3000
7566 JONES MANAGER 7839 02-APR-81 2975
7698 BLAKE MANAGER 7839 01-MAY-81 2850
Using the same technique, you can retrieve the bottom-five employees by salary:
SELECT Empno, Ename, Job, Mgr, Hiredate, Sal
FROM
(SELECT Empno, Ename, Job, Mgr, Hiredate, Sal,
RANK() OVER
(ORDER BY SAL ASC NULLS FIRST) AS Emp_Rank
FROM Emp
ORDER BY SAL ASC NULLS FIRST)
WHERE Emp_Rank <>
Output of the above code.
EMPNO ENAME JOB MGR HIREDATE SAL
------ ---------- --------- ---------- --------- ----------
7839 KING PRESIDENT 17-NOV-81 5000
7788 SCOTT ANALYST 7566 19-APR-87 3000
7902 FORD ANALYST 7566 03-DEC-81 3000
7566 JONES MANAGER 7839 02-APR-81 2975
7698 BLAKE MANAGER 7839 01-MAY-81 2850
Ranking functions can be used to operate within groups, too—that is, the rank value gets reset whenever the group changes. This is achieved with a PARTION BY subclause. Here is the syntax to retrieve the top employee by salary per manager group:
SELECT Empno, Ename, Job, Mgr, Hiredate, Sal
FROM
(SELECT Empno, Ename, Job, Mgr, Hiredate, Sal,
RANK() OVER
(PARTITION BY MGR ORDER BY MGR, SAL DESC NULLS LAST) AS Emp_Rank
FROM Emp
ORDER BY MGR, SAL DESC NULLS LAST)
WHERE Emp_Rank = 1;
EMPNO ENAME JOB MGR HIREDATE SAL
------ ---------- --------- ---------- --------- ----------
7788 SCOTT ANALYST 7566 19-APR-87 3000
7902 FORD ANALYST 7566 03-DEC-81 3000
7499 ALLEN SALESMAN 7698 20-FEB-81 1600
7934 MILLER CLERK 7782 23-JAN-82 1300
7876 ADAMS CLERK 7788 23-MAY-87 1100
7566 JONES MANAGER 7839 02-APR-81 2975
7369 SMITH CLERK 7902 17-DEC-80 800
7839 KING PRESIDENT 17-NOV-81 5000
The FIRST
and LAST
functions can be used to return the first or last value from an ordered sequence. Say we want to display the salary of each employee, along with the lowest and highest within their department we may use something like:
SELECT deptno,
ename,
sal,
MIN(sal) KEEP (DENSE_RANK FIRST ORDER BY sal) OVER (PARTITION BY deptno) "Lowest",
MAX(sal) KEEP (DENSE_RANK LAST ORDER BY sal) OVER (PARTITION BY deptno) "Highest"
FROM emp
ORDER BY deptno, sal;
DEPTNO ENAME SAL Lowest Highest
---------- ---------- ---------- ---------- ----------
10 MILLER 1300 1300 5000
10 CLARK 2450 1300 5000
10 KING 5000 1300 5000
20 SMITH 800 800 3000
20 ADAMS 1100 800 3000
20 JONES 2975 800 3000
20 SCOTT 3000 800 3000
20 FORD 3000 800 3000
30 JAMES 950 950 2850
30 WARD 1250 950 2850
30 MARTIN 1250 950 2850
DEPTNO ENAME SAL Lowest Highest
---------- ---------- ---------- ---------- ----------
30 TURNER 1500 950 2850
30 ALLEN 1600 950 2850
30 BLAKE 2850 950 2850
No comments:
Post a Comment