SQL Decorrelation and Window Functions - in Data Engineering

The document discusses correlated subqueries and how to decorrelate them. [1] Correlated subqueries are subqueries that depend on values from the outer query, slowing down execution time. [2] Decorrelating a query involves converting the correlated subquery into an uncorrelated one by moving conditions to the FROM clause and joining tables. [3] Two examples show how to decorrelate queries by moving the correlated condition to a subquery in the FROM clause and joining on common columns.

Uploaded by

Mo

CORRELATED SUBQUERIES AND DECORRELATION
Contents
1. Introduction to subqueries, and decorrelation of correlated subqueries
2. Examples of decorrelation of correlated queries

1. Introduction to subqueries, and decorrelation of correlated subqueries
First we define what a subquery is. A subquery (sometimes called an “inner query”) is a query inside another query, and the output of the subquery is used in the outer query (the query which contains the subquery).

A subquery can appear in almost any part of an SQL query, for example in the SELECT clause, in the FROM clause, and in the WHERE clause. The following query contains a subquery in each of these three clauses:

SELECT n_name, (
        -- a subquery in the SELECT clause
        SELECT count(*)
        FROM region
    )
FROM nation,
    (
        -- a subquery in the FROM clause
        SELECT *
        FROM region
        WHERE r_name = 'EUROPE'
    ) AS region
WHERE n_regionkey = r_regionkey AND
    EXISTS (
        -- a subquery in the WHERE clause
        SELECT 1
        FROM customer
        WHERE n_nationkey = c_nationkey
    );

A so-called scalar subquery is a subquery which produces just a single value and can be used wherever a single value is expected (in the example above, this is the subquery in the SELECT clause).

Another type of subquery is the so-called set-valued subquery, which produces a set of values (not a single value). An example is the subquery in the FROM clause of the example above.

By the SQL standard, if you use a subquery which produces a new relation (i.e. a table, or set of values) in the FROM clause, you must give it a name (as was done for the FROM-clause subquery with the construction “AS region”).

Now we define what correlated subqueries are and how they affect the execution time of a query. A correlated subquery is a subquery which depends on something (some column or table) that is not defined in that subquery, but rather in the outer query. Writing such queries may be convenient for the people who write them, but correlated subqueries are one of the main reasons why queries do not finish at all (or at least not on time). The reason is that most systems execute the correlated subquery once for every row of the outer query, so the execution time grows quadratically. This problem can be fixed by converting the correlated subquery into a subquery which is not correlated; this process is called decorrelation. Some systems perform decorrelation by themselves, but others do not, and even the ones that do cannot decorrelate every correlated subquery. Because decorrelation brings such a large improvement in execution time, and because we can perform it by hand, we should always do it.

An example of a correlated subquery is given below:

SELECT avg(l1.l_extendedprice)
FROM lineitem l1
WHERE l1.l_extendedprice = (
SELECT min(l2.l_extendedprice)
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey
);

2. Examples of decorrelation of correlated queries
Example 1:

### correlated:

SELECT
sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM
lineitem l1,
part p
WHERE
p.p_partkey = l1.l_partkey AND
p.p_brand = 'Brand#23' AND
p.p_container = 'MED BOX' AND
l1.l_quantity < (
SELECT 0.2*avg(l2.l_quantity)
FROM lineitem l2
WHERE l2.l_partkey = p.p_partkey
);

### uncorrelated (way 1):

To decorrelate a query whose outer “WHERE” clause contains a correlated subquery, where that subquery has one condition in its own “WHERE” clause that references a column defined in the outer query, follow these steps:

1. Make a subquery in the “FROM” clause of the outer query
2. In the newly made subquery, in the “FROM” clause put the same tables as the ones specified in the “FROM” clause of the correlated subquery
3. In the newly made subquery, in the “SELECT” clause put everything from the “SELECT” clause of the correlated subquery, and also the column from the “WHERE” condition of the correlated subquery which does not refer to anything defined outside the correlated subquery
4. In the newly made subquery make a “GROUP BY” clause, and put in it the column added in step 3 (the aggregate expression itself is not listed in “GROUP BY”)
5. By the SQL standard you are obliged to assign this newly made subquery a name, so add it by using the “AS name” syntax after the closing bracket of the newly made subquery
6. The “SELECT” of the outer query should stay as it is
7. Delete the correlated subquery from the “WHERE” clause of the outer query, as well as the condition in which the correlated subquery participated
8. Add a condition in the “WHERE” clause of the outer query connecting the newly made subquery (from the “FROM” clause) and the outer query, by requiring equality between the columns for which the correlation happened in the first place
9. Add a condition in the “WHERE” clause of the outer query which reproduces the comparison for which the correlated subquery existed in the first place, now against the aggregate column of the newly made subquery

SELECT
sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM
lineitem l1,
part p,
(
SELECT
0.2*avg(l2.l_quantity) AS yearavg,
l2.l_partkey
FROM
lineitem l2
GROUP BY
l2.l_partkey
) AS uncor
WHERE
p.p_partkey = l1.l_partkey AND
p.p_brand = 'Brand#23' AND
p.p_container = 'MED BOX' AND
uncor.l_partkey = l1.l_partkey AND
l1.l_quantity < uncor.yearavg;
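
The rewrite can be checked on a tiny data set. The following is only a sketch: it uses Python's built-in sqlite3 module instead of the lecture's database system, and the miniature part and lineitem tables (with just the columns the queries touch) and their rows are made up for illustration. It runs the correlated query and the way 1 query from above and compares the two results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature tables: only the columns used by the two queries.
conn.executescript("""
CREATE TABLE part (p_partkey INTEGER, p_brand TEXT, p_container TEXT);
CREATE TABLE lineitem (l_partkey INTEGER, l_quantity INTEGER, l_extendedprice REAL);
INSERT INTO part VALUES
    (1, 'Brand#23', 'MED BOX'),
    (2, 'Brand#23', 'MED BOX'),
    (3, 'Brand#11', 'MED BOX');
INSERT INTO lineitem VALUES
    (1, 1, 70.0), (1, 10, 100.0), (1, 19, 200.0),
    (2, 5, 50.0), (2, 5, 60.0),
    (3, 1, 700.0), (3, 99, 10.0);
""")

correlated = """
SELECT sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM lineitem l1, part p
WHERE p.p_partkey = l1.l_partkey AND
      p.p_brand = 'Brand#23' AND
      p.p_container = 'MED BOX' AND
      l1.l_quantity < (
          SELECT 0.2*avg(l2.l_quantity)
          FROM lineitem l2
          WHERE l2.l_partkey = p.p_partkey)
"""

decorrelated = """
SELECT sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM lineitem l1, part p,
     (SELECT 0.2*avg(l2.l_quantity) AS yearavg, l2.l_partkey
      FROM lineitem l2
      GROUP BY l2.l_partkey) AS uncor
WHERE p.p_partkey = l1.l_partkey AND
      p.p_brand = 'Brand#23' AND
      p.p_container = 'MED BOX' AND
      uncor.l_partkey = l1.l_partkey AND
      l1.l_quantity < uncor.yearavg
"""

correlated_result = conn.execute(correlated).fetchone()[0]
decorrelated_result = conn.execute(decorrelated).fetchone()[0]
print(correlated_result, decorrelated_result)
```

On this sample data only one lineitem row qualifies, so both queries return the same value. Such a check does not prove that the two queries are equivalent in general, but it catches mistakes in steps 8 and 9 quickly.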

### uncorrelated (way 2, slower than way 1):

SELECT
sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM
lineitem l1,
part p,
(
SELECT
0.2*avg(l2.l_quantity) AS yearavg,
domain_set.l_partkey
FROM
lineitem l2,
(
SELECT DISTINCT
l3.l_partkey
FROM
lineitem l3
) AS domain_set
WHERE
domain_set.l_partkey = l2.l_partkey
GROUP BY
domain_set.l_partkey
) AS uncor
WHERE
p.p_partkey = l1.l_partkey AND
p.p_brand = 'Brand#23' AND
p.p_container = 'MED BOX' AND
l1.l_partkey = uncor.l_partkey AND
l1.l_quantity < uncor.yearavg;

Example 2:

### correlated:

SELECT sum(l1.l_extendedprice)
FROM lineitem l1
WHERE l1.l_extendedprice > (
SELECT avg(l2.l_extendedprice)
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey);

_________________________________________________________________________________________________

### uncorrelated (way 1):

SELECT sum(l1.l_extendedprice)
FROM
lineitem l1,
(
SELECT
avg(l2.l_extendedprice) AS avgex,
l2.l_orderkey
FROM
lineitem l2
GROUP BY
l2.l_orderkey
) AS uncor
WHERE
l1.l_orderkey = uncor.l_orderkey AND
l1.l_extendedprice > uncor.avgex;

### uncorrelated (way 2, slower than way 1):

SELECT sum(l1.l_extendedprice)
FROM
lineitem l1,
(
SELECT
avg(l2.l_extendedprice) AS avgprc,
domain_set.l_orderkey
FROM
lineitem l2,
(
SELECT DISTINCT
l3.l_orderkey
FROM
lineitem l3
) AS domain_set
WHERE
domain_set.l_orderkey = l2.l_orderkey
GROUP BY
domain_set.l_orderkey
) AS decor
WHERE
l1.l_orderkey = decor.l_orderkey AND
l1.l_extendedprice > decor.avgprc;

Example 3:

### correlated:

SELECT o1.o_orderkey
FROM orders o1
WHERE o1.o_totalprice < (
SELECT avg(o2.o_totalprice )
FROM orders o2
WHERE o2.o_shippriority = o1.o_shippriority OR
o2.o_orderstatus = o1.o_orderstatus
);

### uncorrelated:

SELECT o1.o_orderkey
FROM
orders o1,
(
SELECT
avg(o2.o_totalprice) AS avgprc,
domain_set.o_shippriority,
domain_set.o_orderstatus
FROM
orders o2,
(
SELECT DISTINCT
o3.o_shippriority,
o3.o_orderstatus
FROM
orders o3
) AS domain_set
WHERE
domain_set.o_shippriority = o2.o_shippriority OR
domain_set.o_orderstatus = o2.o_orderstatus
GROUP BY
domain_set.o_shippriority,
domain_set.o_orderstatus
) AS uncor
WHERE
o1.o_shippriority = uncor.o_shippriority AND
o1.o_orderstatus = uncor.o_orderstatus AND
o1.o_totalprice < uncor.avgprc;
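
Because the correlation in Example 3 uses “OR”, a plain “GROUP BY” on one column is not enough, which is why the domain set of (o_shippriority, o_orderstatus) pairs is needed. As a sanity check, the sketch below runs both versions with Python's built-in sqlite3 module on a made-up orders table with five rows; the table contents are an assumption for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the orders table.
conn.execute(
    "CREATE TABLE orders (o_orderkey INTEGER, o_totalprice REAL, "
    "o_shippriority INTEGER, o_orderstatus TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 100.0, 0, "F"), (2, 200.0, 0, "O"), (3, 300.0, 1, "O"),
     (4, 50.0, 1, "F"), (5, 500.0, 2, "P")],
)

correlated = """
SELECT o1.o_orderkey
FROM orders o1
WHERE o1.o_totalprice < (
    SELECT avg(o2.o_totalprice)
    FROM orders o2
    WHERE o2.o_shippriority = o1.o_shippriority OR
          o2.o_orderstatus = o1.o_orderstatus)
"""

decorrelated = """
SELECT o1.o_orderkey
FROM orders o1,
     (SELECT avg(o2.o_totalprice) AS avgprc,
             domain_set.o_shippriority,
             domain_set.o_orderstatus
      FROM orders o2,
           (SELECT DISTINCT o3.o_shippriority, o3.o_orderstatus
            FROM orders o3) AS domain_set
      WHERE domain_set.o_shippriority = o2.o_shippriority OR
            domain_set.o_orderstatus = o2.o_orderstatus
      GROUP BY domain_set.o_shippriority, domain_set.o_orderstatus) AS uncor
WHERE o1.o_shippriority = uncor.o_shippriority AND
      o1.o_orderstatus = uncor.o_orderstatus AND
      o1.o_totalprice < uncor.avgprc
"""

# Sort in Python because neither query specifies an output order.
correlated_keys = sorted(row[0] for row in conn.execute(correlated))
decorrelated_keys = sorted(row[0] for row in conn.execute(decorrelated))
print(correlated_keys, decorrelated_keys)
```

Note that the inner join uses “OR” (to collect every order that matches either attribute of a pair), while the outer join uses “AND” (to pick exactly the pair that belongs to the current outer row).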

WITH RECURSIVE statement (recursive common table expression, CTE)
Contents
1. Introduction to WITH RECURSIVE statement
1.2 Examples of use of WITH RECURSIVE statement
2. WITH RECURSIVE statement with UNION statement
2.1 Examples of use of WITH RECURSIVE statement with UNION

1. Introduction to WITH RECURSIVE statement

Using the WITH RECURSIVE statement is not really using recursion, but rather iteration (in the professor's opinion, and he is right).

First the structure of a WITH RECURSIVE statement is introduced:

WITH RECURSIVE name (list_of_input_parameters) AS (
A NON-RECURSIVE TERM
UNION ALL
A RECURSIVE TERM)
SELECT list_of_columns
FROM table
WHERE specify_which_entries_from_the_statement_output;

We first discuss the difference between the NON-RECURSIVE TERM and the RECURSIVE TERM, which are also called the first and second query of the “WITH RECURSIVE” statement. The NON-RECURSIVE TERM is a query which is evaluated once and produces the starting rows. The RECURSIVE TERM is a query which refers to the same “WITH RECURSIVE” statement in which it is located; in each iteration it is evaluated on the rows produced by the previous iteration. It usually contains the terminating condition in its “WHERE” clause, and if that condition is not specified carefully, the statement can end up in an infinite loop.

If, on the other hand, we think of a “WITH RECURSIVE” statement as a loop, it works as follows. When a recursive CTE query runs, the first query (the NON-RECURSIVE TERM) generates one or more beginning rows which are added to the result set. Then the second query is run and its resulting rows are added to the result set. This continues so that the second query is always run against the rows from the last iteration, and the new resulting rows are added to the result set. The procedure ends when no more rows are returned by the second query.

The algorithm looks as follows:

working_table = evaluate_non_recursive_term()
output(working_table)

while(working_table != empty)
{
    working_table = evaluate_recursive_term(working_table)
    output(working_table)
}
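
The loop above can be written down directly. The following Python sketch is only a model of the evaluation order, not how a database implements it; the function name run_union_all and the representation of both terms as callables returning lists of tuples are assumptions made for illustration.

```python
def run_union_all(non_recursive_term, recursive_term):
    """Model of a recursive CTE with UNION ALL semantics (duplicates are kept)."""
    working_table = list(non_recursive_term())
    result = list(working_table)              # output(working_table)
    while working_table:                      # stop once an iteration yields no rows
        # The recursive term only sees the rows of the previous iteration.
        working_table = list(recursive_term(working_table))
        result.extend(working_table)          # output(working_table)
    return result


# Mirrors: WITH RECURSIVE r(i) AS (SELECT 1 UNION ALL SELECT i+1 FROM r WHERE i < 5)
counter_rows = run_union_all(
    lambda: [(1,)],
    lambda working: [(i + 1,) for (i,) in working if i < 5],
)
print(counter_rows)
```

If the condition `i < 5` were dropped, the working table would never become empty and the loop would not terminate, which is the infinite loop the text warns about.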

1.2 Examples of use of WITH RECURSIVE statement
Example 1.1:

WITH RECURSIVE r(i) AS (
-- A NON-RECURSIVE TERM: you can look at it as the input values the loop will start with.
SELECT 1 AS i
UNION ALL
-- A RECURSIVE TERM: it contains the name of the WITH RECURSIVE statement in the „FROM“ clause,
-- and a terminating condition in the „WHERE“ clause.
SELECT i+1
FROM r
WHERE r.i < 5
)
-- Notice that there is no comma, nor semicolon at the end of the WITH RECURSIVE statement.
SELECT *
FROM r;

_________________________________________________________________________________________________
Example 1.2:

WITH RECURSIVE r(i) AS (
SELECT 1 AS i
UNION ALL
SELECT i+1
FROM r
WHERE r.i < 10)
SELECT *
FROM r;

Example 2:

Using the “WITH RECURSIVE” statement for traversing a tree-like structure differs a little from the standard use of “WITH RECURSIVE”, so you should look at these two ‘types’ of use differently.

With the following code we just present the content of the table „animals“:

WITH animals(id, name, parent) AS (
VALUES (1, 'animal', null),
(2, 'mammal', 1),
(3, 'giraffe', 2),
(4, 'tiger', 2),
(5, 'reptile', 1),
(6, 'snake', 5),
(7, 'turtle', 5),
(8, 'green sea turtle', 7))
SELECT *
FROM animals;

The following picture is from the FDE lecture “Using a Database”, slide number 20, and shows in the form of a tree how the entries of the above table are connected with one another.

Also, the code on the screenshot above shows that it is unnecessary to write an additional “d” in the “FROM” clause of the second query of the “WITH RECURSIVE” statement, as the professor did in the following examples.

Example 2.1:

In this query (which has the WITH RECURSIVE statement) we just copied the table “animals” we
outputted above (in “Example 2”).
In this example we are outputting all the animals from the tree which are in the hierarchy below a
“turtle”.

WITH RECURSIVE animals(id, name, parent) AS (
VALUES (1, 'animal', null),
(2, 'mammal', 1),
(3, 'giraffe', 2),
(4, 'tiger', 2),
(5, 'reptile', 1),
(6, 'snake', 5),
(7, 'turtle', 5),
(8, 'green sea turtle', 7)),
d AS (
SELECT a1.*
FROM animals a1
WHERE a1.name='turtle'
UNION ALL
SELECT a2.*
FROM animals a2, d
WHERE a2.parent=d.id)
SELECT *
FROM d;

Example 2.2:

Same as example 2.1, but instead of looking at all the animals which are in the hierarchy below a „turtle“ in the tree, we are looking at all animals which are in the hierarchy below a „reptile“.

WITH RECURSIVE animals(id, name, parent) AS (
VALUES (1, 'animal', null),
(2, 'mammal', 1),
(3, 'giraffe', 2),
(4, 'tiger', 2),
(5, 'reptile', 1),
(6, 'snake', 5),
(7, 'turtle', 5),
(8, 'green sea turtle', 7)),
d AS (
SELECT a1.*
FROM animals a1
WHERE a1.name='reptile'
UNION ALL
SELECT a2.*
FROM animals a2, d
WHERE a2.parent=d.id)
SELECT *
FROM d;

“SELECT a2.*” specifies that we select all the columns of table “a2”, which is equivalent to writing “SELECT a2.id, a2.name, a2.parent”.

Example 2.3:

Same as “Example 2.1”, but instead of looking at all the animals which are in the hierarchy below a „turtle“ in the tree, we are looking at all animals which are in the hierarchy above a „turtle“.

WITH RECURSIVE animals(id, name, parent) AS (
VALUES (1, 'animal', null),
(2, 'mammal', 1),
(3, 'giraffe', 2),
(4, 'tiger', 2),
(5, 'reptile', 1),
(6, 'snake', 5),
(7, 'turtle', 5),
(8, 'green sea turtle', 7)),
d AS (
SELECT a1.*
FROM animals a1
WHERE a1.name='turtle'
UNION ALL
SELECT a2.*
FROM animals a2, d
WHERE a2.id=d.parent)
SELECT *
FROM d;
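
The only difference between walking down the tree (Example 2.1) and walking up the tree (Example 2.3) is the join condition of the recursive term. The sketch below makes this explicit. It is an illustration under assumptions: it uses Python's built-in sqlite3 module, stores the animals rows in a regular table instead of a VALUES list, and the helper name walk is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (id INTEGER, name TEXT, parent INTEGER)")
conn.executemany(
    "INSERT INTO animals VALUES (?, ?, ?)",
    [(1, "animal", None), (2, "mammal", 1), (3, "giraffe", 2), (4, "tiger", 2),
     (5, "reptile", 1), (6, "snake", 5), (7, "turtle", 5), (8, "green sea turtle", 7)],
)


def walk(start_name, join_condition):
    """Start at start_name and repeatedly follow join_condition between animals a2 and d."""
    query = f"""
    WITH RECURSIVE d(id, name, parent) AS (
        SELECT a1.id, a1.name, a1.parent
        FROM animals a1
        WHERE a1.name = ?
        UNION ALL
        SELECT a2.id, a2.name, a2.parent
        FROM animals a2, d
        WHERE {join_condition}
    )
    SELECT name FROM d
    """
    return [row[0] for row in conn.execute(query, (start_name,))]


below_turtle = walk("turtle", "a2.parent = d.id")   # children of rows already found
above_turtle = walk("turtle", "a2.id = d.parent")   # parents of rows already found
print(below_turtle, above_turtle)
```

Both walks terminate without an explicit stop condition because a tree has no cycles: going down eventually reaches leaves, and going up eventually reaches the root, whose parent is null.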

Example 2.4:

This example shows that you could use more than one constraint in the “WHERE” clause of the
second query (of the two queries a “WITH RECURSIVE” statement consists of).

WITH RECURSIVE animals(id, name, parent) AS (
VALUES (1, 'animal', null),
(2, 'mammal', 1),
(3, 'giraffe', 2),
(4, 'tiger', 2),
(5, 'reptile', 1),
(6, 'snake', 5),
(7, 'turtle', 5),
(8, 'green sea turtle', 7)),
d AS (
SELECT a1.*
FROM animals a1
WHERE a1.name='turtle'
UNION ALL
SELECT a2.*
FROM animals a2, d
WHERE a2.parent=d.id AND a2.name LIKE '%t%')
SELECT *
FROM d;

Example 3.1:
[Exercise from lecture “Using a Database”, slide 21]

Compute 10!, using WITH RECURSIVE statement (a recursive CTE).

WITH RECURSIVE factorial(n, value) AS (
SELECT 1 AS n, 1 AS value
UNION ALL
SELECT n+1, (n+1)*value
FROM factorial
WHERE n<10)
SELECT value
FROM factorial
WHERE n=10;

___________________________________________________________________________________________________

Example 3.2:

WITH RECURSIVE factorial (n, value) AS (
SELECT 1 AS n, 1::bigint AS value
UNION ALL
SELECT n+1, (n+1)*value
FROM factorial
WHERE n<15)
SELECT value
FROM factorial
WHERE n=15;

Because the factorial function grows quickly, in this example (example 3.2) we had to write “1::bigint AS value” (changing the data type of the column), as the value would otherwise quickly become too big to be stored in an integer type.

Example 3.3:

WITH RECURSIVE factorial (n, value) AS (
SELECT 1 AS n, 1::numeric AS value
UNION ALL
SELECT n+1, (n+1)*value
FROM factorial
WHERE n<30)
SELECT value
FROM factorial
WHERE n=30;

By picking the appropriate data type you can compute pretty large factorials. In this example we compute 30! and use type “numeric” to be able to represent this big number.

Example 4.1:
[Exercise from lecture “Using a Database”, slide 21]

Compute the first 20 Fibonacci numbers (F_1 = 1, F_2 = 1, F_n = F_{n-1} + F_{n-2}):

WITH RECURSIVE fibonacci(n, value, previous_value) AS (
SELECT 1 AS n, 1 AS value, 0 AS previous_value
UNION ALL
SELECT n+1, value+previous_value, value
FROM fibonacci
WHERE n<20)
SELECT *
FROM fibonacci;

Example 4.2:

WITH RECURSIVE fibonacci(n, value, previous_value) AS (
(VALUES(1, 1, 0))
UNION ALL
SELECT n+1, value+previous_value, value
FROM fibonacci
WHERE n<20)
SELECT n, value
FROM fibonacci;

As seen in this example, we can explicitly construct a single row (or “tuple”) with the help of the “VALUES” statement. Using “(VALUES(1, 1, 0))” is basically the same as constructing a single row using “SELECT 1 AS n, 1 AS value, 0 AS previous_value”, where we specify 1 for the first parameter of the “WITH RECURSIVE” statement (which is “n”), 1 for the second parameter (which is “value”), and 0 for the third parameter (which is “previous_value”).

Example 4.3:

WITH RECURSIVE fibonacci(n, value, previous_value) AS (
(VALUES(1, 1, 0), (2, 1, 1))
UNION ALL
SELECT n+1, value+previous_value, value
FROM fibonacci
WHERE n<20)
SELECT n, value
FROM fibonacci;

We can see that if we construct the “WITH RECURSIVE” statement like above, the output will contain duplicates. This is because in each iteration we output the working table, and it now contains two rows in each iteration: the last row output in one iteration and the first row output in the next iteration are the same.

In general it is a little bit silly to start with two rows, because this may make the problem even more difficult, but if you have some recursive formula which does require multiple starting values, you can do it like in this example.

Example 4.4:

WITH RECURSIVE fibonacci(n, value, previous_value) AS (
(VALUES(1, 1, 0), (2, 1, 1))
UNION ALL
SELECT n+1, value+previous_value, value
FROM fibonacci
WHERE n BETWEEN 2 AND 19)
SELECT n, value
FROM fibonacci;

In the step of the algorithm where we expand our table, we ignore the first element. Because we
explicitly gave the first two values, we no longer need to expand the first value, because it has
already been expanded explicitly, so it can be ignored.

2. WITH RECURSIVE statement with UNION statement
Instead of using “UNION ALL” to connect the two queries in the “WITH RECURSIVE” statement, we are now going to use “UNION”. Although using “UNION” here is not standardised, in contrast to “UNION ALL”, PostgreSQL supports it. It is very useful because it eliminates duplicates while the query runs, meaning that we never output an element twice.

A great advantage of “UNION” over “UNION ALL” is that the statement cannot get stuck in an infinite loop caused by a graph which contains a cycle. If we allow duplicate outputs on such a graph, i.e. run the “WITH RECURSIVE” statement with “UNION ALL” between its two queries, we end up in an infinite loop (or at least in a loop that runs until the system recognises that the execution time is too long and terminates the query itself).

The structure of a “WITH RECURSIVE” statement with “UNION” is the same as with “UNION ALL”, except that the two queries are connected by “UNION”, that is:

WITH RECURSIVE name(list_of_input_parameters) AS (
A NON-RECURSIVE TERM
UNION
A RECURSIVE TERM)
SELECT list_of_columns
FROM table
WHERE specify_which_entries_from_the_statement_output;

The algorithm looks as follows (“\” denotes set difference, i.e. rows which are already in the result are dropped from the working table):

working_table = unique( evaluate_non_recursive_term() )
result = working_table

while(working_table != empty)
{
    working_table = unique( evaluate_recursive_term(working_table) ) \ result
    result = result UNION working_table
}

output(result)
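
To see why this variant terminates on cyclic data, the loop can be modelled with Python sets. This is a sketch of the evaluation order only; the function name run_union, the callables for the two terms, and the three-node cycle used as input are assumptions for illustration.

```python
def run_union(non_recursive_term, recursive_term):
    """Model of a recursive CTE with UNION semantics (duplicates are eliminated)."""
    result = set(non_recursive_term())
    working_table = set(result)
    while working_table:
        # Keep only rows that have never been produced before.
        working_table = set(recursive_term(working_table)) - result
        result |= working_table
    return result


# A directed cycle a -> b -> c -> a. With UNION ALL semantics the walk would never stop.
edges = [("a", "b"), ("b", "c"), ("c", "a")]

reachable = run_union(
    lambda: [("a",)],
    lambda working: [(dst,) for (src,) in working for (s, dst) in edges if s == src],
)
print(sorted(reachable))
```

After the third iteration the recursive term produces only the row for "a", which is already in the result, so the working table becomes empty and the loop ends.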

2.1 Examples of use of WITH RECURSIVE statement with UNION
Example 1:

WITH RECURSIVE
friends(a, b) AS (VALUES ('Alice', 'Bob'),
('Alice', 'Carol'),
('Carol', 'Grace'),
('Carol', 'Chuck'),
('Carol', 'Grace'),
('Chuck', 'Anne'),
('Bob', 'Dan'),
('Dan', 'Anne'),
('Eve', 'Adam')),
friendship(name, friend) AS (
SELECT a, b
FROM friends
UNION ALL
SELECT b, a
FROM friends)

We defined two tables, “friends” and “friendship”.


___________________________________________________________________________________________________

Example 1.1:

SELECT *
FROM friends;

We firstly observe the content of the table “friends”.

We secondly observe the content of the table “friendship”.

SELECT *
FROM friendship;

We see that table “friendship” is the same as table “friends”, except that every connection between two entries (people) appears in both directions. For example, table “friends” has one row (an edge, if you think of the data as a graph) with “Alice” in the first column and “Bob” in the second column, while table “friendship” has two rows which contain the same two people in swapped columns, i.e. “Alice | Bob” in one row and “Bob | Alice” in the other (these three rows from the two tables are circled in orange in the two screenshots above). In other words, table “friendship” is symmetric.
_____________________________________________________________________________________________

The following picture is from the FDE lecture “Using a Database”, slide number 23, and shows in the form of a graph how the entries of table “friendship” are connected with one another.

Example 1.2:

Output all people who are friends with “Alice”, where friends of “Alice” are all people connected to “Alice” or to some of her friends.

WITH RECURSIVE
friends(a, b) AS (VALUES ('Alice', 'Bob'),
('Alice', 'Carol'),
('Carol', 'Grace'),
('Carol', 'Chuck'),
('Carol', 'Grace'),
('Chuck', 'Anne'),
('Bob', 'Dan'),
('Dan', 'Anne'),
('Eve', 'Adam')),
friendship(name, friend) AS (
SELECT a, b
FROM friends
UNION ALL
SELECT b, a
FROM friends),
friendsofalice AS (
SELECT 'Alice' AS name
UNION
SELECT friend
FROM friendship, friendsofalice
WHERE friendship.name = friendsofalice.name)
SELECT *
FROM friendsofalice;

Example 1.3:
Output all people who are friends with “Dan”, where friends of “Dan” are all people connected to “Dan” or to some of his friends.

WITH RECURSIVE
friends(a, b) AS (VALUES ('Alice', 'Bob'),
('Alice', 'Carol'),
('Carol', 'Grace'),
('Carol', 'Chuck'),
('Carol', 'Grace'),
('Chuck', 'Anne'),
('Bob', 'Dan'),
('Dan', 'Anne'),
('Eve', 'Adam')),
friendship(name, friend) AS (
SELECT a, b
FROM friends
UNION ALL
SELECT b, a
FROM friends),
friendsofalice AS (
SELECT 'Dan' AS name
UNION
SELECT friend
FROM friendship, friendsofalice
WHERE friendship.name = friendsofalice.name)
SELECT *
FROM friendsofalice;

In the query above, in contrast to the query from example 1.2, we only changed the person in the NON-RECURSIVE TERM (the first query of the WITH RECURSIVE statement) from “Alice” to “Dan”. We did not change the table name “friendsofalice”; it may look inappropriate for looking at the friends of someone else, but conceptually it is the same (the table name is not important).
We get the same result, only in a different order (in this example we start with “Dan”, as he is specified in the query as the starting point, so he is output first). That is because all the people who are connected to “Alice” are connected to “Dan” as well; no matter from whom you start, you always get everybody who is in the same connected part of the graph.
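
That the starting person does not matter can be checked mechanically. The sketch below is an illustration under assumptions: it uses Python's built-in sqlite3 module, keeps the friends rows in a regular table, passes the starting person as a query parameter, and uses the made-up names reachable and connected_people.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO friends VALUES (?, ?)",
    [("Alice", "Bob"), ("Alice", "Carol"), ("Carol", "Grace"), ("Carol", "Chuck"),
     ("Carol", "Grace"), ("Chuck", "Anne"), ("Bob", "Dan"), ("Dan", "Anne"),
     ("Eve", "Adam")],
)

QUERY = """
WITH RECURSIVE
friendship(name, friend) AS (
    SELECT a, b FROM friends
    UNION ALL
    SELECT b, a FROM friends),
reachable(name) AS (
    SELECT ?
    UNION
    SELECT friendship.friend
    FROM friendship, reachable
    WHERE friendship.name = reachable.name)
SELECT name FROM reachable
"""


def connected_people(start):
    """Everybody reachable from start over the symmetric friendship edges."""
    return {row[0] for row in conn.execute(QUERY, (start,))}


from_alice = connected_people("Alice")
from_dan = connected_people("Dan")
from_eve = connected_people("Eve")
print(sorted(from_alice), sorted(from_eve))
```

Starting from “Eve” gives a different, smaller set, because “Eve” and “Adam” are not connected to the other seven people. The “UNION” between the two terms is what keeps the query from looping forever on the symmetric edges.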

Example 2:

[Exercise sheet 7, section “Formulating the Query”, exercise 2]

“Formulate a query with a recursive view, which finds the number of actors that have a Bacon
Number <= c where c is a given constant.”

WITH RECURSIVE baconnr(id, nr) AS (
SELECT 'Bacon, Kevin', 0
UNION
SELECT p2.actor_name, baconnr.nr+1
FROM baconnr, playedin_text p1, playedin_text p2
WHERE baconnr.id=p1.actor_name AND
p1.movie_name=p2.movie_name AND
baconnr.nr<1)            -- here c = 1
SELECT count(DISTINCT id)  -- an actor can be reached with several nr values, so count each actor once
FROM baconnr;

Example 3:

[Exercise sheet 7, section “Tweaking the Query”, exercise 2]

“Is there any mean to reduce the size of intermediate results and thus speed up the query
evaluation? (Hint: Try to remove duplicates after every join. With this technique, the query should
finish within reasonable time for c = 2.)”

WITH RECURSIVE baconnr(id, nr) AS (
SELECT 'Bacon, Kevin', 0
UNION ALL
SELECT DISTINCT p2.actor_name, movies.nr+1
FROM (
SELECT DISTINCT p1.movie_name, baconnr.nr
FROM baconnr, playedin_text p1
WHERE baconnr.id=p1.actor_name
) AS movies,
playedin_text p2
WHERE movies.movie_name=p2.movie_name AND movies.nr<5
)
SELECT count(DISTINCT id)
FROM baconnr;

WINDOW FUNCTIONS, AND ORDERED-SET FUNCTIONS
Contents
1. WINDOW FUNCTIONS
1.1 Introduction to window functions
1.2 lecture 11 (FDE), examples, window functions
1.3 lecture 12 (FDE), examples, window functions
1.3.1 WITH statement
2. ORDERED-SET FUNCTIONS
2.1 Introduction to ordered-set functions
2.2 lecture 12 (FDE), examples, ordered-set functions
3. GROUPING SETS, ROLLUP AND CUBE
3.1 Introduction to Grouping sets, Rollup and Cube
3.1.1 GROUP BY GROUPING SETS
3.1.2 GROUP BY ROLLUP
3.1.3 GROUP BY CUBE
3.2 lecture 12 (FDE), examples, Grouping sets, Rollup and Cube

1. WINDOW FUNCTIONS
1.1 Introduction to window functions
Syntax for window functions:

SELECT columns_which_will_be_outputted, window_function() OVER (
PARTITION BY some_column
ORDER BY some_column
RANGE/ROWS BETWEEN value1 preceding AND value2 following)
FROM table;

First, in place of columns_which_will_be_outputted we write some column names from the table. Second, in place of window_function() we write one window function, some of which are:

1) sum(column)
2) max(column)
3) min(column)
4) avg(column)
5) count(column)
6) rank()
7) dense_rank()
8) row_number()
9) ntile(integer_number)
10) percent_rank()
11) cume_dist()
12) lag(column, offset, default_value)
13) lead(column, offset, default_value)

For lag and lead, offset is an integer number of rows, and default_value is the value returned when no such row exists in the partition.

The line “PARTITION BY some_column” is the so-called “PARTITION BY” clause, which determines how the rows of the table are divided into partitions: rows with the same value in column some_column end up in the same partition. If no “PARTITION BY” clause is specified inside a window function, all the rows belong to one big partition.

The line “ORDER BY some_column” is the so-called “ORDER BY” clause, which determines how the rows inside one partition are ordered, namely by the values in column some_column.

The line “RANGE/ROWS BETWEEN value1 preceding AND value2 following” is the so-called “FRAMING” clause, and it specifies which rows of the same partition influence the value in the current row. It can start with either “ROWS” or “RANGE”. With “ROWS” we literally count rows: the frame contains the value1 rows which come before the current row and the value2 rows which come after the current row (inside the same partition). With “RANGE” the frame is defined by values instead: it contains the rows of the same partition whose value in the “ORDER BY” column is at most value1 below and at most value2 above the value of the current row, so rows with the same “ORDER BY” value as the current row always belong to the frame. Additionally, there are predefined bounds which can be used instead of numbers: unbounded preceding covers all the rows which come before the current row in the same partition, and unbounded following covers all the rows which come after it. If we specify the “FRAMING” clause as “RANGE BETWEEN unbounded preceding AND unbounded following”, then every row of a partition is influenced by all rows of that partition.
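
The difference between “ROWS” and “RANGE” becomes visible as soon as several rows share the same “ORDER BY” value. With “ROWS”, a frame ending at the current row stops exactly at that row; with “RANGE”, it also includes every row with the same “ORDER BY” value. The following sketch shows this with Python's built-in sqlite3 module (assuming an SQLite version with window function support, 3.25 or newer); the table t and its four rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: two rows share the ORDER BY value k = 2.
conn.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20), (2, 20), (3, 40)])

rows = conn.execute("""
SELECT k,
       sum(v) OVER (ORDER BY k ROWS  BETWEEN unbounded preceding AND current row) AS rows_sum,
       sum(v) OVER (ORDER BY k RANGE BETWEEN unbounded preceding AND current row) AS range_sum
FROM t
ORDER BY k
""").fetchall()

# Sorted so the comparison does not depend on how the two tied rows are ordered.
rows_sums = sorted(r[1] for r in rows)
range_sums = sorted(r[2] for r in rows)
print(rows_sums, range_sums)
```

With “ROWS” the two tied rows get different running sums, because one of them is processed before the other. With “RANGE” both tied rows get the same sum, which already includes both of them.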

A window function is evaluated in three steps:

1. The data is divided into partitions by the “PARTITION BY” clause
2. The rows inside each partition are sorted according to the “ORDER BY” clause
3. The window function is executed for each row of the partition, and its result is added to the output as one more column.

1.2 lecture 11 (FDE), examples, window functions

SELECT o_custkey, o_orderdate, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN unbounded preceding AND current row)
FROM orders;

When specifying the range as:
„RANGE BETWEEN unbounded preceding AND current row“
this will mean that a row inside a partition is influenced by itself and by all the rows which
come before it in the partition, so the output is a running sum per partition.
So for example the first row in the first partition would not be influenced by any other row, as
there are no rows which come before it. On the other hand, the second row in the partition would
be influenced by itself and by the rows which come before it, and there is only one row which
comes before it, so only the first row would influence it.

In the output (where partitions 1, 2 and 3 are marked), the values of partition 1 are built up as
follows:
second row = 74602.81 + current_price
third row = 197679.65 + current_price = 74602.81 + price_row_2 + current_price

SELECT o_custkey , o_orderdate, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN current row AND unbounded following)
FROM orders;

When specifying the range as:
„RANGE BETWEEN current row AND unbounded following“
this will mean that a row inside a partition is influenced by itself and by all the rows which
come after it in the partition.
So for example the value for the first row in the first partition would be the sum of all values
in column „o_totalprice“ of the whole partition (the row itself and all the rows which come
after it). On the other hand, the second row in the partition would be influenced by itself and
by the rows which come after it inside the partition, which are all rows except the first row of
the partition.

SELECT o_custkey, o_orderdate, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN unbounded preceding AND unbounded following)
FROM orders;

If you formulate the window function as above, all rows which are in the same partition get the
same output value. More specifically, if you specify the range as:
„RANGE BETWEEN unbounded preceding AND unbounded following“
this will mean that every row inside a partition is influenced by all the rows of the same
partition (including itself).
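The three frames discussed so far can be compared side by side. This is a hedged sketch rather than the lecture's own code: SQLite (via Python's sqlite3) is assumed as a substitute for PostgreSQL, the orders table holds five invented rows, and the helper windowed_sums is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(o_custkey INTEGER, o_orderdate TEXT, o_totalprice INTEGER)")
# Made-up data: customer 1 has three orders, customer 2 has two.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "1992-01-10", 100), (1, "1992-03-05", 50), (1, "1993-07-01", 25),
    (2, "1992-02-02", 40), (2, "1994-09-09", 60),
])

def windowed_sums(frame):
    """Return the windowed sum for every row, using the given RANGE frame."""
    query = f"""
        SELECT sum(o_totalprice) OVER (
                   PARTITION BY o_custkey
                   ORDER BY o_orderdate
                   RANGE BETWEEN {frame})
        FROM orders
        ORDER BY o_custkey, o_orderdate
    """
    return [total for (total,) in conn.execute(query)]

prefix = windowed_sums("unbounded preceding AND current row")
suffix = windowed_sums("current row AND unbounded following")
whole = windowed_sums("unbounded preceding AND unbounded following")
print(prefix, suffix, whole, sep="\n")
```

The first list is a running sum per customer, the second one sums from the current row to the end of the partition, and the third repeats the partition total on every row.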

SELECT o_custkey, o_orderdate, max(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN unbounded preceding AND current row)
FROM orders;

In this example we use the window function “max(column)”, and with help of the “FRAMING” clause
we find the biggest value inside a partition up until some point. The value outputted in the
column „max“ represents the maximum value inside the partition which was read up until that row
(a running maximum).

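SELECT o_custkey, o_orderdate, rank() OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;

Notice that there is no “FRAMING” clause.

By using the window function “rank()”, we number the rows inside a partition in the order given
by the “ORDER BY” clause: the rank of a row is 1 plus the number of rows which are ordered before
it, so as long as there are no ties each row gets a number which is bigger by 1 than the
previously assigned number.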

SELECT n_name, n_regionkey, rank() OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;

When using „rank()“ there is a possibility of getting ties among multiple rows inside a
partition. Whether this happens depends on what is written inside the „ORDER BY“ clause.
In this example the ties which are circled in red happened because the ordering of the rows
inside a partition (which is specified by the “ORDER BY” clause) was done based on the content of
the column “n_regionkey”.
For example, if we look at the rows which contain the values “INDIA” and “INDONESIA”, we can see
that they both have value “2” in column “n_regionkey”, which is the reason why these two rows
(which belong to the same partition, as partitioning was done based on the first letter of the
value from column “n_name”) have the same value outputted in column “rank”. It could also happen
that more than 2 rows have the same rank. The next row which belongs to the same partition will
be assigned value 3, as there are two rows before it in the partition, no matter that these two
ended up having a tie. If, for example, we had 6 rows in the same partition which had the same
value (i.e. there was a tie among them), they would all be assigned value 1, while the following
row would be assigned number 7.
_______________________________________________________________________________________________

SELECT n_name, n_regionkey, dense_rank() OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;

In this example the window function „dense_rank()“ is used. It does the same thing as the
„rank()“ window function, except for how it reacts when there is a tie in a partition. In the
example above we saw that after a tie the number which was assigned to the next row in the
partition was not bigger by 1 than the value assigned to the tied rows. With „dense_rank()“, on
the other hand, it is always bigger by exactly 1.
So, if we have six rows in a partition which are tied according to the criteria specified by the
“ORDER BY” clause, then all of the six rows will be assigned number 1, while the row which comes
after these six rows in the same partition will be assigned number 2 (unlike what would be the
case if window function “rank()” was used instead of “dense_rank()”).

SELECT n_name, n_regionkey, row_number() OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;

In this example the window function „row_number()“ is used. Unlike both window functions
“rank()” and “dense_rank()”, the “row_number()” window function assigns a unique number to each
row inside a partition, even if there is a tie among some rows. Because it assigns different
numbers to rows which are tied (based on the condition specified in the “ORDER BY” clause), the
result of “row_number()” is not uniquely determined: which of the tied rows gets which number is
up to the system. This window function can be very useful, because something that is
surprisingly difficult in SQL is to distinguish two rows which are conceptually different but
have the same value. It is not easy to separate them unless you use the window function
“row_number()”.
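The behaviour of the three numbering functions on ties can be checked on the partition of nations starting with “I”. This is a sketch under assumptions: SQLite (via Python's sqlite3) stands in for PostgreSQL, and the four rows are typed in by hand to mirror the INDIA/INDONESIA tie described above (the region keys of IRAN and IRAQ are assumed to be 4).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nation (n_name TEXT, n_regionkey INTEGER)")
# Hand-typed sample: two ties on n_regionkey inside the partition "I".
conn.executemany("INSERT INTO nation VALUES (?, ?)", [
    ("INDIA", 2), ("INDONESIA", 2), ("IRAN", 4), ("IRAQ", 4),
])

# The same window definition is reused for all three functions.
over = "OVER (PARTITION BY substr(n_name, 1, 1) ORDER BY n_regionkey)"
rows = conn.execute(f"""
    SELECT n_name, rank() {over}, dense_rank() {over}, row_number() {over}
    FROM nation
    ORDER BY n_name
""").fetchall()

# rank() and dense_rank() are fully determined by the data ...
ranks = {name: (rank, dense) for name, rank, dense, _ in rows}
# ... while row_number() only guarantees that 1..4 are each used once.
row_numbers = sorted(number for *_, number in rows)
print(ranks, row_numbers)
```

Only the set of row numbers is printed, because which of two tied rows gets the smaller number is not guaranteed.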

SELECT n_name, n_regionkey, ntile(5) OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;

In this example the window function „ntile(integer_number)“ is used. This window function
divides the rows of a partition, in the order given by the “ORDER BY” clause, into the specified
number of groups of (almost) equal size, and assigns each row the number of its group, i.e. a
number between 1 and the value which is specified in the window function.
So, in this example we used “ntile(5)”, which means that each row inside a single partition is
going to be assigned a value between 1 and 5.
Unfortunately, this example does not have a partition which is big enough to use all five
values.

_____________________________________________________________________________________________

SELECT n_name, n_regionkey, ntile(10) OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;

Same as the example above, with the only difference that we used „ntile(10)“ instead of
„ntile(5)“.

SELECT n_name, n_regionkey, ntile(10) OVER (
ORDER BY n_regionkey)
FROM nation;

There is no „PARTITION BY“ clause, meaning that the program will look at all the rows as if they
belonged to one big partition.

In this example, as in the previous two examples, we use the window function „ntile(number)“,
except that we now have a partition which is bigger than the number specified in the window
function.
Here we have 25 rows, and with “ntile(10)” we specified that all rows inside the same partition
are going to be assigned values between 1 and 10, so we first need to see if the values can be
distributed across the rows evenly. This is the case if dividing the number of rows by the value
specified in the window function leaves remainder zero. If that is not the case, like in this
example, then some rows will be grouped in bigger groups, while some will be grouped in smaller
groups. In this example the number of rows, 25, is not divisible by 10 (the value specified in
the window function), so the system divides the rows into 5 groups of 3, followed by 5 groups
of 2.
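The split into 5 groups of 3 and 5 groups of 2 can be reproduced with a small experiment. The sketch below assumes SQLite (via Python's sqlite3) instead of PostgreSQL and uses a stand-in nation table that contains nothing but 25 keys.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nation (n_nationkey INTEGER)")
# Stand-in for the 25-row nation table; only the row count matters here.
conn.executemany("INSERT INTO nation VALUES (?)", [(k,) for k in range(25)])

tiles = [tile for (tile,) in conn.execute(
    "SELECT ntile(10) OVER (ORDER BY n_nationkey) FROM nation")]

# Number of rows that received each of the values 1..10.
counts = Counter(tiles)
group_sizes = [counts[i] for i in range(1, 11)]
print(group_sizes)
```

The bigger groups receive the smaller numbers, so the remainder of the division (25 mod 10 = 5) decides how many groups get one extra row.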

SELECT n_name, n_regionkey, percent_rank() OVER (
ORDER BY n_regionkey)
FROM nation;

Again, as there is no „PARTITION BY“ clause, all the rows belong to one big partition.
In this example we use window function “percent_rank()”. This window function outputs a value in
the “percent_rank” column according to the following formula:

$$\text{percent\_rank} = \frac{\text{rank} - 1}{\text{total\_number\_of\_rows} - 1}$$

where “rank” is the value which would be assigned to a row according to the “rank()” window
function. For a row with rank 6 in this table of 25 rows, for example:

$$\text{percent\_rank} = \frac{6 - 1}{25 - 1} = \frac{5}{24}$$

In this example some rows have the same value in the “percent_rank” column because the column we
are ordering on, column “n_regionkey”, has values which are not unique. This means that there
would be ties if we used window function “rank()”, which leads to the same outputs in the
“percent_rank” column, as the values in this column depend on what window function “rank()”
would output.
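The worked value 5/24 can be verified numerically. This is only a sketch: it assumes SQLite (via Python's sqlite3) in place of PostgreSQL and a synthetic nation table with 25 rows, 5 per region key, which is an assumption about the shape of the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nation (n_nationkey INTEGER, n_regionkey INTEGER)")
# Synthetic data: 25 nations, region keys 0..4, five nations per region.
conn.executemany("INSERT INTO nation VALUES (?, ?)",
                 [(k, k % 5) for k in range(25)])

# All rows of a region tie, so one (rank, percent_rank) pair per region.
by_region = {
    region: (rank, pct)
    for region, rank, pct in conn.execute("""
        SELECT DISTINCT n_regionkey,
               rank() OVER (ORDER BY n_regionkey),
               percent_rank() OVER (ORDER BY n_regionkey)
        FROM nation
    """)
}
print(by_region[1])  # rank 6, percent_rank (6 - 1) / (25 - 1)
```

The rows of the second region all get rank 6, because five rows are ordered before them, and therefore all share the same percent_rank.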
_____________________________________________________________________________________________

SELECT n_name, n_regionkey, percent_rank() OVER (
ORDER BY n_nationkey)
FROM nation;

In this example we are using the same window function as in the example above, the
„percent_rank()“ window function, except that we are using a different column according to which
we order the rows inside a partition. This column is „n_nationkey“, and it differs from the
previously used column „n_regionkey“ in the sense that it contains only unique values. So, by
having unique values, this column leads to a different rank for every row, and consequently to a
different value in the column „percent_rank“ for each row.
The outputted columns in this example may be confusing, as the professor outputted column
„n_regionkey“ instead of column „n_nationkey“, according to which the ordering was performed. So
don't conclude from the output that the ordering was done according to the same column as in the
previous example, but rather look at the written code.

SELECT n_name, n_regionkey, cume_dist() OVER (

ORDER BY n_regionkey)

FROM nation;

Again, as there is no „PARTITION BY“ clause, all the rows belong to one big partition.
In this example we use window function “cume_dist()”. This window function outputs a value in
the “cume_dist” column according to the following formula:

$$\text{cume\_dist} = \frac{\text{number\_of\_rows\_preceding\_or\_tied\_with\_the\_current\_row}}{\text{total\_number\_of\_rows}}$$

where the numerator counts the current row itself, all the rows ordered before it, and all the
rows which are tied with it. As with window function “percent_rank()”, ties therefore play a role
when we use window function “cume_dist()”, although the formula does not explicitly use the rank
assigned to a row. This is the case in this example, where the ordering of rows inside a
partition is done based on a column which has non-unique values, which leads to multiple rows
being assigned the same rank.
So, for example, for the first row we would not calculate the value in the “cume_dist” column as
1/25, but rather as 5/25, as there are 5 rows which would be assigned the same rank.
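The 5/25 from the text can be reproduced in the same way. As before, this is a sketch that assumes SQLite (via Python's sqlite3) as a substitute for PostgreSQL and a synthetic 25-row nation table with five rows per region key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nation (n_nationkey INTEGER, n_regionkey INTEGER)")
# Synthetic data: region keys 0..4, five nations per region.
conn.executemany("INSERT INTO nation VALUES (?, ?)",
                 [(k, k % 5) for k in range(25)])

# Tied rows share one cume_dist value, so DISTINCT leaves one per region.
cume = dict(conn.execute("""
    SELECT DISTINCT n_regionkey, cume_dist() OVER (ORDER BY n_regionkey)
    FROM nation
"""))
print(cume[0])  # 5 tied rows out of 25
```

Each region adds five more rows to the numerator, so the values step through 5/25, 10/25 and so on up to 1.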

SELECT o_custkey, o_orderdate, o_totalprice - lag(o_totalprice, 1) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;

As there is no row preceding the first row of a partition, window function „lag()“ will try to
address a row before it which doesn't exist, which is why we will get “NULL” as output (if we use
the “lag()” window function to look just one row behind).

Window function “lag(column, integer_number, double_number)” is used to address values from
previous rows of the column specified in the function. How far back it looks (meaning how far
before the current row the row lies which influences the current row, inside the same partition)
depends on the second argument of the function. The third argument of the window function is
optional, and it specifies the default value which is going to be used if there are not enough
preceding rows in a partition to determine the output of the window function “lag”. For example,
if we used zero as the third argument of the “lag” window function, then in the first row of each
partition we would not have “NULL“ as output, but rather the same value as the row has in the
“o_totalprice” column (as o_totalprice - 0 = o_totalprice).
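The effect of the optional third argument of “lag” can be seen by running both variants next to each other. A minimal sketch, assuming SQLite (via Python's sqlite3) instead of PostgreSQL and three invented orders of a single customer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(o_custkey INTEGER, o_orderdate TEXT, o_totalprice REAL)")
# Made-up orders of one customer.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "1992-01-10", 100.0), (1, "1992-03-05", 150.0), (1, "1993-07-01", 120.0),
])

rows = conn.execute("""
    SELECT o_totalprice - lag(o_totalprice, 1) OVER (
               PARTITION BY o_custkey ORDER BY o_orderdate),
           o_totalprice - lag(o_totalprice, 1, 0.0) OVER (
               PARTITION BY o_custkey ORDER BY o_orderdate)
    FROM orders
    ORDER BY o_custkey, o_orderdate
""").fetchall()

without_default = [a for a, _ in rows]  # first row has no predecessor
with_default = [b for _, b in rows]     # missing predecessor counts as 0.0
print(without_default, with_default)
```

Only the first row of the partition differs between the two columns: NULL (None in Python) without a default, the row's own price with the default 0.0.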

SELECT o_custkey, o_orderdate, o_totalprice, o_totalprice - lag(o_totalprice, 1, 0.0) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;

The error you get if you try to put „0“ instead of „0.0“ as the third parameter of the „lag“ function:

1.3 lecture 12 (FDE), examples, window functions

SELECT o_custkey, o_orderdate, avg(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN 1 preceding AND 1 following)
FROM orders;

Computing the average among:

1. value from the previous row,
2. value from the current row,
3. value from the next row

SELECT o_custkey, o_orderdate, o_totalprice, avg(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN 1 preceding AND 1 following)
FROM orders;

Same as the previous example, except that column „o_totalprice“ was added to the output so that
the computation of the average is easier to follow.

The result in the first row of each partition in the „avg“ column is not the average of three
values but the average of two values, as there is no preceding value in this case
((74602.81 + 123076.84)/2 = 98839.825 for the first row of the first partition). Likewise, the
last value of a partition in the „avg“ column is the average of two values, the current value and
the preceding one, as there is no following value ((95911.01 + 54048.26)/2 = 74979.635 for the
last row in the first partition). All the values in between are averages of three values.
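The shrinking frame at the borders of a partition can be reproduced on a tiny example. The sketch assumes SQLite (via Python's sqlite3) in place of PostgreSQL; the four prices are made up so that the averages are easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(o_custkey INTEGER, o_orderdate TEXT, o_totalprice REAL)")
# Made-up prices 10, 20, 60, 40 for a single customer.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "1992-01-01", 10.0), (1, "1992-02-01", 20.0),
    (1, "1992-03-01", 60.0), (1, "1992-04-01", 40.0),
])

moving_avg = [avg for (avg,) in conn.execute("""
    SELECT avg(o_totalprice) OVER (
               PARTITION BY o_custkey
               ORDER BY o_orderdate
               ROWS BETWEEN 1 preceding AND 1 following)
    FROM orders
    ORDER BY o_custkey, o_orderdate
""")]
print(moving_avg)
```

The first and last entries are averages of two values, (10 + 20)/2 and (60 + 40)/2, while the two middle entries average three values each.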

SELECT o_custkey, o_orderdate, avg(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN 2 preceding AND 2 following)
FROM orders;

Notice the use of phrase „ROWS“ (not „RANGE“) in the framing clause of the window function, as these two have a
different meaning, which can be seen from the photo:

SELECT o_custkey, o_orderdate, o_totalprice, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;

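No „framing“ clause. Because there is an „ORDER BY“ clause, the default frame „RANGE BETWEEN
unbounded preceding AND current row“ is used, so the output is a running sum.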

SELECT o_custkey, o_orderdate, o_totalprice, sum(o_totalprice) OVER (
PARTITION BY o_custkey)
FROM orders;

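No „ORDER BY“ clause, and no „FRAMING“ clause („ROWS“ or „RANGE“). In this case the frame is the
whole partition, so every row of a partition gets the sum over the whole partition.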

SELECT o_custkey, o_orderdate, o_totalprice, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ROWS BETWEEN 1 preceding AND 1 following)
FROM orders;

No „ORDER BY“ clause, but there is a „FRAMING“ clause (here „ROWS“).

This doesn't make sense, and the professor says that in his opinion this should display an error
message, as it is not clear what the output of such a window function should be: there is no
specified order (the rows within a partition are not sorted) according to which the aggregate
function of the window function should be applied, so it is totally unclear which rows count as
preceding and following. PostgreSQL by default does not output an error (as you can see from the
example above), but avoid such constructions of window functions.

SELECT o_custkey, o_orderdate, o_totalprice, o_orderstatus, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN
CASE WHEN o_orderstatus='F'
THEN 3
ELSE 1
END
preceding AND
CASE WHEN o_orderstatus='F'
THEN 3
ELSE 1
END
following)
FROM orders;

______________________________________________________________________________________________

(works on HyPer, see footnote 1)

PostgreSQL refuses to do complex frames, meaning „framing“ clauses in which the bounds are not
fixed numbers but expressions which depend on the values from the table, like in this example, as
a naive execution of such queries may lead to O(n²) complexity. On the other hand, Oracle
executes such queries, but with O(n²) complexity, which is unacceptable, as we work with large
databases. There is a way to execute such queries efficiently, and that is by using segment
trees. These are trees which are constructed so that the aggregate function of a window function
can be evaluated on an arbitrary range in O(log(n)), so that all n rows can be processed in
O(n log(n)) even if each output row has a different range.
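To illustrate the idea, here is a small segment tree for sums. It is a generic textbook sketch and not HyPer's actual implementation; the class name SegmentTree, the sample prices and the status flags are all invented, and the frame rule (3 rows to each side for status 'F', otherwise 1) only mimics the query above.

```python
class SegmentTree:
    """Bottom-up segment tree answering range-sum queries in O(log n)."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at positions n .. 2n-1, inner nodes at 1 .. n-1.
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def query(self, lo, hi):
        """Return the sum of values[lo:hi] (hi is exclusive)."""
        total = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:            # lo is a right child: take it, move right
                total += self.tree[lo]
                lo += 1
            if hi & 1:            # hi is a right child: take its left sibling
                hi -= 1
                total += self.tree[hi]
            lo //= 2
            hi //= 2
        return total


# One partition, already sorted by order date (made-up values).
prices = [10, 20, 30, 40, 50]
status = ["F", "O", "O", "F", "O"]

tree = SegmentTree(prices)
sums = []
for i, flag in enumerate(status):
    k = 3 if flag == "F" else 1   # per-row frame size, as in the CASE above
    sums.append(tree.query(max(0, i - k), min(len(prices), i + k + 1)))
print(sums)
```

Every row asks for a different range, yet each query touches only O(log n) nodes of the tree instead of re-adding the whole frame.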

1
http://www.hyper-db.de/interface.html#

SELECT o_custkey, o_orderdate, sum(o_totalprice) OVER (
ORDER BY o_orderdate)
FROM orders;

SELECT o_orderdate, sum(o_totalprice) OVER (
ORDER BY o_orderdate)
FROM (SELECT o_orderdate, sum(o_totalprice) AS o_totalprice
FROM orders
GROUP BY o_orderdate) AS s

Exercise from slide 31, FDE, section „Using a Database“:

- For each customer from GERMANY compute the cumulative spending ( “sum(o_totalprice)” ) by year (
“extract(year from o_orderdate)” )?

SELECT o_custkey, year, sum(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders) AS s
GROUP BY o_custkey, year) AS s

1.3.1 WITH statement
In this example we will introduce the “WITH” clause. It is a clause which has the following syntax:

WITH q1 AS ( SELECT ...
FROM ...),
q2 AS ( SELECT ...
FROM q1, ....)
SELECT ...
FROM ...

From this we can see that a single “WITH” clause can introduce multiple queries by separating them
with a comma (i.e. the word “WITH” is not repeated). There is no comma, nor any other character,
after the last query of a “WITH” clause; in particular, when there is only one query in a “WITH”
clause, nothing follows it except the main query.

Also, we can see from the example above that when we specify multiple queries inside a single
“WITH” statement, inside one of the queries we are writing we can refer to any query which is
defined above the current one (like in the example above, where query “q2” refers to the query
defined above it in the same “WITH” clause, i.e. query “q1”).

Also, a query from “WITH” can be referred to multiple times in one query, and in that case it is
executed only once and its result is reused (i.e. it will not be executed multiple times).

The purpose of the “WITH” clause is to break a complex query into smaller parts.
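A concrete instance of the “q1, q2” pattern might look as follows. This is a hedged sketch: SQLite (via Python's sqlite3) is assumed in place of PostgreSQL, the four orders are invented, and the names per_year and best are hypothetical; substr is used to cut the year out of the date text because the sample dates are stored as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(o_custkey INTEGER, o_orderdate TEXT, o_totalprice INTEGER)")
# Made-up orders.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "1992-01-10", 100), (1, "1992-03-05", 50),
    (2, "1992-02-02", 120), (2, "1993-09-09", 10),
])

# per_year plays the role of q1; best (q2) refers to it, and the main
# query refers to both of them.
top = conn.execute("""
    WITH per_year AS ( SELECT o_custkey,
                              substr(o_orderdate, 1, 4) AS year,
                              sum(o_totalprice) AS revenue
                       FROM orders
                       GROUP BY o_custkey, year),
         best AS ( SELECT max(revenue) AS top_revenue
                   FROM per_year)
    SELECT o_custkey, year, revenue
    FROM per_year, best
    WHERE revenue = top_revenue
""").fetchall()
print(top)
```

Without “WITH”, the per-year aggregation would have to be written out twice, once for the maximum and once for the final selection.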

WITH years AS ( SELECT DISTINCT extract(year FROM o_orderdate) AS year
FROM orders)
SELECT o_custkey, year, sum(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders
UNION ALL
SELECT c_custkey, year, 0
FROM customer, years) AS s
GROUP BY o_custkey, year) AS s

You get an entry for each customer in each year, even if they did not make any orders in that
year (with the help of dummy rows which add an order of price 0 for each customer in each year).

When we use „sum“ as the aggregate function, it doesn't make a difference for the existing rows
whether we add dummy entries or not. But when we use „avg“ as the aggregate function and refer to
rows which precede and follow the current row, it is important that we have these dummy rows, as
they influence the result. An example of this is given on the next two pages.

What would happen if we did not use the version with dummy rows, and instead of calculating the sum like in the
previous example we calculated the average, using the „avg“ function:

SELECT o_custkey, year, avg(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year
ROWS BETWEEN 1 preceding AND 1 following)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders) AS s
GROUP BY o_custkey, year) AS s

In this case, for the row of the first partition with „1992“ in the column „year“, the value in
the „avg“ column is computed as the average of the values for 1992 (current row) and 1996
(following row, at least that is how the program sees it). This should not be the case, as the
value following 1992 should be the one for 1993, but the customer to which the first partition
refers did not make any purchases in 1993, 1994 and 1995. So dummy rows should be added for all
years; the missing years will then be displayed with revenue zero, while the dummy rows have no
effect on the revenue of years in which some purchases were made.

WITH years AS ( SELECT DISTINCT extract(year FROM o_orderdate) AS year
FROM orders)
SELECT o_custkey, year, avg(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year
ROWS BETWEEN 1 preceding AND 1 following)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders
UNION ALL
SELECT c_custkey, year, 0
FROM customer, years) AS s
GROUP BY o_custkey, year) AS s

In contrast to the previous example, in this one we added dummy rows and got a different output
(most likely the output that is actually wanted, instead of the one from the previous example).
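The influence of the dummy rows on the moving average can be demonstrated with two customers. This is a sketch under assumptions: SQLite (via Python's sqlite3) replaces PostgreSQL, the year is stored directly in a made-up column o_year to avoid date functions, the distinct years are taken from an inline subquery instead of a “WITH” clause, and the names WINDOWED, PLAIN, DUMMY and yearly_avg are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_custkey INTEGER);
    CREATE TABLE orders (o_custkey INTEGER, o_year INTEGER, o_totalprice INTEGER);
    INSERT INTO customer VALUES (1), (2);
    -- Customer 1 ordered only in 1992 and 1996 (made-up data).
    INSERT INTO orders VALUES (1, 1992, 100), (1, 1996, 300),
                              (2, 1993, 10), (2, 1994, 20), (2, 1995, 30);
""")

WINDOWED = """
    SELECT o_custkey, year,
           avg(revenue) OVER (PARTITION BY o_custkey ORDER BY year
                              ROWS BETWEEN 1 preceding AND 1 following)
    FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
           FROM ({source}) AS s
           GROUP BY o_custkey, year) AS t
"""
PLAIN = "SELECT o_custkey, o_year AS year, o_totalprice FROM orders"
# One zero-priced dummy order per customer and year that occurs in orders.
DUMMY = PLAIN + """
    UNION ALL
    SELECT c_custkey, o_year, 0
    FROM customer, (SELECT DISTINCT o_year FROM orders) AS years
"""

def yearly_avg(source):
    """Map (customer, year) to the moving average of the yearly revenue."""
    return {(c, y): a for c, y, a in conn.execute(WINDOWED.format(source=source))}

without_dummy = yearly_avg(PLAIN)
with_dummy = yearly_avg(DUMMY)
print(without_dummy[(1, 1992)], with_dummy[(1, 1992)])
```

Without dummy rows the 1992 value of customer 1 is averaged with 1996; with them it is averaged with the zero revenue of 1993.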

2. ORDERED-SET FUNCTIONS
2.1 Introduction to ordered-set functions
Ordered-set functions are functions which require that the rows on which they are applied are
sorted.

Syntax for ordered-set functions:

SELECT columns_which_will_be_outputted, ordered_set_function() WITHIN GROUP (
ORDER BY column)
FROM table
GROUP BY column;

Firstly, instead of columns_which_will_be_outputted we write some column names from the table.
Secondly, instead of ordered_set_function() we write one ordered-set function, some of which are:

1) mode()
2) percentile_disc(p)
3) percentile_cont(p)

In both ordered-set functions “percentile_disc(p)” and “percentile_cont(p)”, the p is short for
“probability”, and it can only take a value which belongs to the range [0, 1].

2.2 lecture 12 (FDE), examples, ordered-set functions

We see that there is an even number of rows.


_________________________________________________________________________________________

If we use the ordered-set function „percentile_cont(p)“ and we have an even number of rows, it is
not really clear which value should be taken as the median value (middle value), so in the case of
this ordered-set function interpolation is used to determine the result:

row[15000/2] value is = 144409.02


row[(15000/2) +1] value is = 144409.06
Interpolation:
(144409.02 + 144409.06) / 2 = 144409.039999998
_________________________________________________________________________________

If we use the ordered-set function „percentile_disc(p)“ instead of „percentile_cont(p)“, like
above, we get the first of the two middle values if there is an even number of rows.
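The two functions can be mimicked in a few lines of Python. This sketch re-implements the usual definitions (percentile_disc returns the first value whose position in the ordering reaches p, percentile_cont interpolates linearly at position p · (n − 1)); it is an illustration and not the database's code, and the sample values are made up.

```python
import math

def percentile_disc(values, p):
    """First value in sorted order whose position i/n reaches p."""
    ordered = sorted(values)
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]

def percentile_cont(values, p):
    """Linear interpolation between the two rows around position p*(n-1)."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

sample = [40, 10, 30, 20]                 # even number of rows
disc_median = percentile_disc(sample, 0.5)
cont_median = percentile_cont(sample, 0.5)
print(disc_median, cont_median)

# The two middle values from the example above.
middle_rows = [144409.02, 144409.06]
print(percentile_disc(middle_rows, 0.5), percentile_cont(middle_rows, 0.5))
```

For the four sample values the discrete median is the lower middle value 20, while the continuous median lies halfway between 20 and 30.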

__________________________________________________________________________________

To demonstrate what is done by ordered-set functions, we will use window functions to display the
“15000/2”-th row and the “(15000/2) + 1”-th row:

This also shows how much more difficult it would be to calculate the median using window
functions than using ordered-set functions; in addition, we would have to know in advance how
many rows there are.

With ordered-set functions:

With window-functions:

SELECT mode()
WITHIN GROUP (ORDER BY o_totalprice)
FROM orders;

SELECT o_totalprice, count(*) AS c
FROM orders
GROUP BY o_totalprice
ORDER BY c DESC
LIMIT 2;

Ordered-set function „mode()“ gives the most frequent value.

We see from the second query, which counts the occurrences of each value with „GROUP BY“, that
there was more than one value which occurred the same (highest) number of times, so the most
frequent value is not unique here.

_____________________________________________________________________________________

SELECT mode()
WITHIN GROUP (ORDER BY o_orderstatus)
FROM orders;

In this example the element which occurred most times is unique.

SELECT o_custkey, percentile_cont(0.5)
WITHIN GROUP ( ORDER BY o_totalprice)
FROM orders
GROUP BY o_custkey

It is also possible to combine ordered-set functions with the „GROUP BY“ clause, like in this
example, where we calculate the median spending of each customer.

The ordered-set function „percentile_cont(p)“ performs interpolation not only when calculating the
median value, but for any value of „p“ for which the requested position falls between two rows.

_____________________________________________________________________________________________

Ordered-set function „percentile_cont(p)“ works only on numeric values, as it performs
interpolation, so when you give it a non-numeric value it gives an error as output. On the other
hand, function „percentile_disc(p)“ also works on non-numeric values, as it only depends on the
positions of the rows in the sorted order.

3. GROUPING SETS, ROLLUP AND CUBE
3.1 Introduction to Grouping sets, Rollup and Cube

There are three SQL features which can be used as part of the “GROUP BY” clause to perform
aggregation over multiple dimensions, and these are:

1. GROUPING SETS(sets_of_columns)
2. ROLLUP(list_of_columns)
3. CUBE(list_of_columns)

3.1.1 GROUP BY GROUPING SETS


When using “GROUP BY GROUPING SETS(sets_of_columns)” we can explicitly specify multiple sets of
columns according to which grouping is going to be performed. So, for example we could write

GROUP BY GROUPING SETS( (col1, col2), (col1), () )

This would be equivalent to the SQL code:

SELECT col1, col2, sum(col3)
FROM r                          -- query 1
GROUP BY col1, col2
UNION ALL
SELECT col1, NULL, sum(col3)
FROM r                          -- query 2
GROUP BY col1
UNION ALL
SELECT NULL, NULL, sum(col3)
FROM r;                         -- query 3

You could have specified different grouping sets like:

GROUP BY GROUPING SETS( (col1, col2), (col1), (col2) )

or

GROUP BY GROUPING SETS( (col1), (col2) )

or any other combination of the groupings between columns.
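The equivalence with “UNION ALL” can be run directly. The sketch below uses SQLite through Python's sqlite3, which (as far as this sketch assumes) has no “GROUPING SETS” syntax, so only the spelled-out form of “GROUPING SETS( (col1, col2), (col1), () )” is executed; the table r and its three rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (col1 TEXT, col2 TEXT, col3 INTEGER)")
# Made-up rows.
conn.executemany("INSERT INTO r VALUES (?, ?, ?)", [
    ("a", "x", 1), ("a", "y", 2), ("b", "x", 4),
])

# One SELECT per grouping set; columns which are not part of the
# grouping set are filled with NULL.
result = set(conn.execute("""
    SELECT col1, col2, sum(col3) FROM r GROUP BY col1, col2
    UNION ALL
    SELECT col1, NULL, sum(col3) FROM r GROUP BY col1
    UNION ALL
    SELECT NULL, NULL, sum(col3) FROM r
"""))
for row in sorted(result, key=repr):
    print(row)
```

The result contains three detail rows, one subtotal per value of col1, and one grand total.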

3.1.2 GROUP BY ROLLUP
On the other hand, you could also use “ROLLUP(list_of_columns)” in the “GROUP BY” clause, and it
is more popular to use this SQL feature than “GROUPING SETS(sets_of_columns)”.

You could use this SQL feature as:

GROUP BY ROLLUP(col1, col2)

which is equivalent to “GROUP BY GROUPING SETS( (col1, col2), (col1), () )”.

It is also possible to specify more than two columns in the list_of_columns of the “GROUP BY
ROLLUP(list_of_columns)”, like for an example: “GROUP BY ROLLUP(col1, col2, col3)”.

3.1.3 GROUP BY CUBE


When using “GROUP BY CUBE(list_of_columns)”, we actually perform groupings for all possible
combinations (subsets) of the columns specified in list_of_columns. So, if we wrote:

GROUP BY CUBE(col1, col2)

it would be equivalent to writing:

GROUP BY GROUPING SETS( (col1, col2), (col1), (col2), () )
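Which grouping sets “ROLLUP” and “CUBE” stand for can be written down mechanically. The following Python sketch only enumerates the sets (it does not run any SQL); the function names rollup_sets and cube_sets are hypothetical.

```python
from itertools import combinations

def rollup_sets(columns):
    """ROLLUP(c1, ..., cn): every prefix of the list, longest first."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]

def cube_sets(columns):
    """CUBE(c1, ..., cn): every subset of the columns, 2**n in total."""
    return [subset
            for k in range(len(columns), -1, -1)
            for subset in combinations(columns, k)]

print(rollup_sets(["col1", "col2"]))
print(cube_sets(["col1", "col2"]))
print(len(rollup_sets(["a", "b", "c"])), len(cube_sets(["a", "b", "c"])))
```

For n columns “ROLLUP” produces n + 1 grouping sets, while “CUBE” produces 2^n of them.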

3.2 lecture 12 (FDE), examples, Grouping sets, Rollup and Cube

SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY ROLLUP(year, month)
ORDER BY year, month;

...
Another row is added to the output (for each year) where there is no value (NULL, technically
speaking) in the column „month“, but there is a value in the column „sum“, which is the sum over
all the months of that year. Also, the last row of the output gives the sum of all values in all
years. This is called „level of detail aggregation“: we aggregate on a month level, on a year
level, and over everything.

As mentioned earlier, the reason why we get such an output is that “GROUP BY
ROLLUP(list_of_columns)” can be written as a union of three queries. To show the reader which
rows are the result of which of these three queries, three colors are used beside the rows in the
photos of the output above, the same colors used to denote each of the three queries in section
3.1.1 of this document.

SELECT year, quarter, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (quarter FROM o_orderdate) AS quarter, o_totalprice
FROM orders) AS s
GROUP BY ROLLUP(year, quarter)
ORDER BY year, quarter;

(This query is slightly changed in comparison to the original one written by the professor: the
column is named „quarter“ instead of „month“, as the professor forgot to rename it, but the
output is the same.)

...

The same as previous example, just a year is divided into quarters instead of months, so that all the rows of the
output can fit into the screen.

SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY GROUPING SETS( (year, month), (year), (month) )
ORDER BY year, month;

...

SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY GROUPING SETS( (year, month), (year), (month), () )
ORDER BY year, month;

...

Adding the value „()“ to the „GROUP BY GROUPING SETS“ clause adds the empty grouping set, which
refers to all the rows at once (in this case, one additional output row with the sum of all
values).

SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY CUBE( year, month )
ORDER BY year, month;

...

This query, written using the „GROUP BY CUBE“ clause, computes all 2^n combinations of groupings
(n being the number of columns in the clause), so it could be written with the „GROUP BY GROUPING
SETS“ clause like:

SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY GROUPING SETS( (year, month), (year), (month), () )
ORDER BY year, month;

Using „CUBE“ may be very useful sometimes, but it should be used only if a few attributes are included in the clause, as
otherwise the output would be huge.

SELECT year, quarter, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (quarter FROM o_orderdate) AS quarter, extract
(month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY ROLLUP(year, quarter, month )
ORDER BY year, month;

...

If for the same problem as the previous one we used „CUBE“ instead of „ROLLUP“, we would get the following:

SELECT year, quarter, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (quarter FROM o_orderdate) AS quarter, extract
(month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY CUBE( year, quarter, month )
ORDER BY year, month;

...

This example is used to show how usage of „CUBE“ may drastically increase the number of output rows. In the
above example there were 115 outputted rows (using „ROLLUP“), while using „CUBE“ for the 'same' query we get
223 rows.
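The row counts 115 and 223 can be derived without a database by counting the distinct groups of every grouping set. The sketch rests on one loudly stated assumption that is not spelled out in the text: the order dates cover every month from January 1992 through August 1998. The helper output_rows is a hypothetical name.

```python
from itertools import combinations

# Assumption: orders exist in every month from 1992-01 through 1998-08.
months = [(y, m) for y in range(1992, 1999) for m in range(1, 13)
          if (y, m) <= (1998, 8)]
rows = [{"year": y, "quarter": (m - 1) // 3 + 1, "month": m}
        for y, m in months]

def output_rows(grouping_sets):
    """Total number of groups: one output row per distinct key per set."""
    return sum(len({tuple(row[col] for col in gs) for row in rows})
               for gs in grouping_sets)

cols = ["year", "quarter", "month"]
rollup = [tuple(cols[:k]) for k in range(len(cols), -1, -1)]
cube = [gs for k in range(len(cols), -1, -1) for gs in combinations(cols, k)]

rollup_count = output_rows(rollup)
cube_count = output_rows(cube)
print(rollup_count, cube_count)
```

Under this assumption “ROLLUP” yields 80 + 27 + 7 + 1 rows, and “CUBE” adds the four grouping sets (year, month), (quarter, month), (quarter) and (month) with 80 + 12 + 4 + 12 further rows.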

Exercise from slide 36, FDE, section „Using a Database“:

- Aggregate revenue ( “sum(o_totalprice)” ): total, by region (r_name), and by nation (n_name); example output:

SELECT r_name, n_name, sum(o_totalprice)
FROM ( SELECT r_name, n_name, o_totalprice
FROM orders, customer, nation, region
WHERE c_custkey=o_custkey AND c_nationkey=n_nationkey AND r_regionkey=n_regionkey) AS s
GROUP BY ROLLUP(r_name, n_name);

or equivalently:

SELECT r_name, n_name, sum(o_totalprice)
FROM ( SELECT r_name, n_name, o_totalprice
FROM orders, customer, nation, region
WHERE c_custkey=o_custkey AND c_nationkey=n_nationkey AND r_regionkey=n_regionkey) AS s
GROUP BY GROUPING SETS( (r_name, n_name), (r_name), () );
