SQL Decorrelation and Window Functions - in Data Engineering
DECORRELATION
Contents
1. Introduction to subqueries, and decorrelation of correlated subqueries
2. Examples of decorrelation of correlated queries
1. Introduction to subqueries, and decorrelation of correlated subqueries
First we define what a subquery is. A subquery (sometimes called an “inner query”) is a query
nested inside another query; its output is used by the outer query, i.e. the query that contains it.
A subquery can appear in almost any part of an SQL query, for example in the SELECT clause, in the
FROM clause, and in the WHERE clause. The following query contains a subquery in each of these
three clauses:
SELECT n_name, (
SELECT count(*) -- a subquery in the SELECT clause
FROM region
)
FROM nation,
(
SELECT * -- a subquery in the FROM clause
FROM region
WHERE r_name = 'EUROPE'
) AS region
WHERE n_regionkey = r_regionkey AND
EXISTS (
SELECT 1 -- a subquery in the WHERE clause
FROM customer
WHERE n_nationkey = c_nationkey
);
A so-called scalar subquery is a subquery that produces exactly one value and can therefore be used
wherever a single value is expected (in the example above, the subquery in the SELECT clause).
Another type is the so-called set-valued subquery, which produces a set of values (a relation)
rather than a single value. The subquery in the FROM clause of the example above is of this kind.
By the SQL standard, a subquery in the FROM clause, which produces a new relation (i.e. a table, or
set of values), must be given a name, as was done for the FROM-clause subquery above with the
construction “AS region”.
Now we define what correlated subqueries are and how they affect the execution time of a query.
A correlated subquery is a subquery that depends on something (some column or table) that is
defined not in the subquery itself but in the outer query. Writing such queries may be convenient
for the person who writes them, but correlated subqueries are one of the main reasons why queries
do not finish at all (or at least not on time). The reason is that most systems execute the
correlated subquery once for every row of the outer query, so the execution time grows
quadratically. This problem can be fixed by converting the correlated subquery into a subquery that
is not correlated; this process is called decorrelation. Some systems perform decorrelation by
themselves, but not all of them do, and even the ones that do cannot decorrelate every correlated
subquery. Because decorrelation improves the execution time so much, and because we can perform it
by hand, we should always do it.
An example of a correlated subquery (the inner query refers to l1, which is defined in the outer
query):
SELECT avg(l1.l_extendedprice)
FROM lineitem l1
WHERE l1.l_extendedprice = (
SELECT min(l2.l_extendedprice)
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey
);
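The correlated query above can be compared with a hand-decorrelated version on a tiny data set. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in database; the five lineitem rows and the names "m" and "minprice" are made up for illustration.

```python
import sqlite3

# In-memory SQLite database with a tiny, made-up lineitem table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL);
    INSERT INTO lineitem VALUES (1, 10.0), (1, 30.0), (2, 20.0), (2, 40.0), (3, 50.0);
""")

# Correlated form: the inner query refers to l1 from the outer query.
correlated = """
    SELECT avg(l1.l_extendedprice)
    FROM lineitem l1
    WHERE l1.l_extendedprice = (
        SELECT min(l2.l_extendedprice)
        FROM lineitem l2
        WHERE l1.l_orderkey = l2.l_orderkey)
"""

# Decorrelated form: compute the minimum once per order, then join.
decorrelated = """
    SELECT avg(l1.l_extendedprice)
    FROM lineitem l1,
         (SELECT l2.l_orderkey, min(l2.l_extendedprice) AS minprice
          FROM lineitem l2
          GROUP BY l2.l_orderkey) AS m
    WHERE l1.l_orderkey = m.l_orderkey
      AND l1.l_extendedprice = m.minprice
"""

avg_correlated = conn.execute(correlated).fetchone()[0]
avg_decorrelated = conn.execute(decorrelated).fetchone()[0]
print(avg_correlated, avg_decorrelated)
```

Both queries average the cheapest line item of each order, so they should print the same number; only the evaluation strategy differs.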
2. Examples of decorrelation of correlated queries
Example 1:
### correlated:
SELECT
sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM
lineitem l1,
part p
WHERE
p.p_partkey = l1.l_partkey AND
p.p_brand = 'Brand#23' AND
p.p_container = 'MED BOX' AND
l1.l_quantity < (
SELECT 0.2*avg(l2.l_quantity)
FROM lineitem l2
WHERE l2.l_partkey = p.p_partkey
);
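### uncorrelated (way 1):
To decorrelate a query whose outer “WHERE” clause contains a correlated subquery, where the
subquery's own “WHERE” clause has one condition that references a column of the outer query, we
follow these steps: move the subquery into the “FROM” clause of the outer query and give it a name;
remove the correlated condition from the subquery and instead output the column it was compared
with and group by that column; finally, join the new relation with the outer query on that column
and compare against the aggregated value: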
SELECT
sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM
lineitem l1,
part p,
(
SELECT
0.2*avg(l2.l_quantity) AS yearavg,
l2.l_partkey
FROM
lineitem l2
GROUP BY
l2.l_partkey
) AS uncor
WHERE
p.p_partkey = l1.l_partkey AND
p.p_brand = 'Brand#23' AND
p.p_container = 'MED BOX' AND
uncor.l_partkey = l1.l_partkey AND
l1.l_quantity < uncor.yearavg;
### uncorrelated (way 2, slower than way 1):
SELECT
sum(l1.l_extendedprice)/7.0 AS avg_yearly
FROM
lineitem l1,
part p,
(
SELECT
0.2*avg(l2.l_quantity) AS yearavg,
domain_set.l_partkey
FROM
lineitem l2,
(
SELECT DISTINCT
l3.l_partkey
FROM
lineitem l3
) AS domain_set
WHERE
domain_set.l_partkey = l2.l_partkey
GROUP BY
domain_set.l_partkey
) AS uncor
WHERE
p.p_partkey = l1.l_partkey AND
p.p_brand = 'Brand#23' AND
p.p_container = 'MED BOX' AND
l1.l_partkey = uncor.l_partkey AND
l1.l_quantity < uncor.yearavg;
Example 2:
### correlated:
SELECT sum(l1.l_extendedprice)
FROM lineitem l1
WHERE l1.l_extendedprice > (
SELECT avg(l2.l_extendedprice)
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey);
### uncorrelated (way 1):
SELECT sum(l1.l_extendedprice)
FROM
lineitem l1,
(
SELECT
avg(l2.l_extendedprice) AS avgex,
l2.l_orderkey
FROM
lineitem l2
GROUP BY
l2.l_orderkey
) AS uncor
WHERE
l1.l_orderkey = uncor.l_orderkey AND
l1.l_extendedprice > uncor.avgex;
### uncorrelated (way 2, slower than way 1):
SELECT sum(l1.l_extendedprice)
FROM
lineitem l1,
(
SELECT
avg(l2.l_extendedprice) AS avgprc,
domain_set.l_orderkey
FROM
lineitem l2,
(
SELECT DISTINCT
l3.l_orderkey
FROM
lineitem l3
) AS domain_set
WHERE
domain_set.l_orderkey = l2.l_orderkey
GROUP BY
domain_set.l_orderkey
) AS decor
WHERE
l1.l_orderkey = decor.l_orderkey AND
l1.l_extendedprice > decor.avgprc;
Example 3:
### correlated:
SELECT o1.o_orderkey
FROM orders o1
WHERE o1.o_totalprice < (
SELECT avg(o2.o_totalprice )
FROM orders o2
WHERE o2.o_shippriority = o1.o_shippriority OR
o2.o_orderstatus = o1.o_orderstatus
);
### uncorrelated:
SELECT o1.o_orderkey
FROM
orders o1,
(
SELECT
avg(o2.o_totalprice) AS avgprc,
domain_set.o_shippriority,
domain_set.o_orderstatus
FROM
orders o2,
(
SELECT DISTINCT
o3.o_shippriority,
o3.o_orderstatus
FROM
orders o3
) AS domain_set
WHERE
domain_set.o_shippriority = o2.o_shippriority OR
domain_set.o_orderstatus = o2.o_orderstatus
GROUP BY
domain_set.o_shippriority,
domain_set.o_orderstatus
) AS uncor
WHERE
o1.o_shippriority = uncor.o_shippriority AND
o1.o_orderstatus = uncor.o_orderstatus AND
o1.o_totalprice < uncor.avgprc;
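To check that the decorrelated form of Example 3 returns the same rows as the correlated form even with the “OR” condition, here is a minimal sketch using Python's built-in sqlite3 module. The four orders rows are made up, and the alias "d" stands in for "domain_set".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_totalprice REAL,
                         o_shippriority INTEGER, o_orderstatus TEXT);
    INSERT INTO orders VALUES
        (1, 100.0, 0, 'F'), (2, 200.0, 0, 'O'),
        (3, 300.0, 1, 'F'), (4, 400.0, 1, 'O');
""")

correlated = """
    SELECT o1.o_orderkey
    FROM orders o1
    WHERE o1.o_totalprice < (
        SELECT avg(o2.o_totalprice)
        FROM orders o2
        WHERE o2.o_shippriority = o1.o_shippriority OR
              o2.o_orderstatus = o1.o_orderstatus)
    ORDER BY o1.o_orderkey
"""

# The domain set d lists every (shippriority, orderstatus) pair of the outer
# query, so the average is computed once per pair instead of once per row.
decorrelated = """
    SELECT o1.o_orderkey
    FROM orders o1,
         (SELECT avg(o2.o_totalprice) AS avgprc,
                 d.o_shippriority, d.o_orderstatus
          FROM orders o2,
               (SELECT DISTINCT o3.o_shippriority, o3.o_orderstatus
                FROM orders o3) AS d
          WHERE d.o_shippriority = o2.o_shippriority OR
                d.o_orderstatus = o2.o_orderstatus
          GROUP BY d.o_shippriority, d.o_orderstatus) AS uncor
    WHERE o1.o_shippriority = uncor.o_shippriority AND
          o1.o_orderstatus = uncor.o_orderstatus AND
          o1.o_totalprice < uncor.avgprc
    ORDER BY o1.o_orderkey
"""

keys_correlated = [row[0] for row in conn.execute(correlated)]
keys_decorrelated = [row[0] for row in conn.execute(decorrelated)]
print(keys_correlated, keys_decorrelated)
```

Because the correlation uses “OR”, a plain GROUP BY on one column is not enough; the domain set makes the pairs explicit, which is why this example needs the approach of way 2.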
WITH RECURSIVE statement (recursive common table expression, CTE)
Contents
1. Introduction to WITH RECURSIVE statement
1.2 Examples of use of WITH RECURSIVE statement
2. WITH RECURSIVE statement with UNION statement
2.1 Examples of use of WITH RECURSIVE statement with UNION
1. Introduction to WITH RECURSIVE statement
Using the WITH RECURSIVE statement is not really recursion but rather iteration (as the professor
points out).
First we discuss the difference between the NON-RECURSIVE TERM and the RECURSIVE TERM, which are
also called the first and the second query of the “WITH RECURSIVE” statement. The NON-RECURSIVE
TERM is a query that does not refer to the “WITH RECURSIVE” statement it belongs to; it is
evaluated once and produces the starting rows. The RECURSIVE TERM is a query that refers to the
same “WITH RECURSIVE” statement in which it is located; in each iteration it is evaluated against
the rows produced by the previous iteration. It usually contains the terminating condition in its
“WHERE” clause, which, if not specified carefully, can lead to an infinite loop.
If we think of a “WITH RECURSIVE” statement as a loop, it works as follows. When a recursive CTE
query runs, the first query (the NON-RECURSIVE TERM) generates one or more starting rows, which are
added to the result set. Then the second query (the RECURSIVE TERM) is run against these rows, and
the rows it produces are added to the result set. This is repeated: the second query is always run
against the rows produced by the last iteration, and the new rows are added to the result set.
The procedure ends when the second query returns no more rows.
working_table = evaluate_non_recursive_term()
output(working_table)
while(working_table != empty)
{
working_table = evaluate_recursive_term(working_table)
output(working_table)
}
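The loop above can be written out directly and compared with a real recursive CTE. This is a minimal sketch, assuming Python's built-in sqlite3 module; the helper "run_loop" and the counting query are illustrative and not taken from the lecture.

```python
import sqlite3

def run_loop(non_recursive_term, recursive_term):
    """Mimic the evaluation loop of WITH RECURSIVE ... UNION ALL."""
    working_table = non_recursive_term()
    output = list(working_table)
    while working_table:                       # stop once an iteration adds nothing
        working_table = recursive_term(working_table)
        output.extend(working_table)
    return output

# Count from 1 to 5: the recursive term only sees the rows of the last iteration.
simulated = run_loop(lambda: [1],
                     lambda rows: [i + 1 for i in rows if i < 5])

conn = sqlite3.connect(":memory:")
from_sql = [row[0] for row in conn.execute("""
    WITH RECURSIVE r(i) AS (
        SELECT 1                               -- NON-RECURSIVE TERM
        UNION ALL
        SELECT i + 1 FROM r WHERE i < 5)       -- RECURSIVE TERM with stop condition
    SELECT i FROM r
""")]
print(simulated, from_sql)
```

Dropping the condition "i < 5" would make both the loop and the query run forever, which is the infinite loop the text warns about.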
1.2 Examples of use of WITH RECURSIVE statement
Example 1.1: A NON-RECURSIVE TERM
_________________________________________________________________________________________________
Example 1.2:
Example 2:
Using the “WITH RECURSIVE” statement for traversing a tree-like structure differs a little from its
standard use, so you should look at these two ‘types’ of uses of “WITH RECURSIVE” differently.
The following code simply presents the content of the table „animals“:
The following picture is from the FDE lecture “Using a Database”, slide number 20; it shows in the
form of a tree how the entries of the above table are connected with one another.
Also, from the code in the screenshot above you can see that it is unnecessary to write an
additional “d” in the “FROM“ clause of the second query of the “WITH RECURSIVE” statement, as the
professor did in the following examples.
Example 2.1:
In this query (which contains the WITH RECURSIVE statement) we simply copied the table “animals”
shown above (in “Example 2”).
In this example we output all the animals that are below “turtle” in the tree hierarchy.
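A tree traversal of this kind can be sketched as follows, using Python's built-in sqlite3 module. The columns id, name and parent follow the text; the eight rows of "animals" are an assumption, since the original table content is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animals (id INTEGER, name TEXT, parent INTEGER);
    -- Hypothetical tree: parent holds the id of the node one level up.
    INSERT INTO animals VALUES
        (1, 'animal', NULL), (2, 'mammal', 1), (3, 'giraffe', 2), (4, 'tiger', 2),
        (5, 'reptile', 1), (6, 'snake', 5), (7, 'turtle', 5),
        (8, 'green sea turtle', 7);
""")

def below(name):
    """Return the named node and everything below it in the hierarchy."""
    rows = conn.execute("""
        WITH RECURSIVE r(id, name, parent) AS (
            SELECT * FROM animals WHERE name = ?      -- start at the given node
            UNION ALL
            SELECT a2.* FROM animals a2, r            -- add children of the last rows
            WHERE a2.parent = r.id)
        SELECT name FROM r
    """, (name,))
    return {row[0] for row in rows}

below_turtle = below("turtle")
below_reptile = below("reptile")
print(below_turtle, below_reptile)
```

Swapping the join condition to "a2.id = r.parent" walks up the tree instead of down, which corresponds to looking at the hierarchy above a node.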
Example 2.2:
Same as Example 2.1, but instead of looking at all the animals that are below „turtle“ in the tree
hierarchy, we look at all the animals that are below „reptile“.
“SELECT a2.*” specifies that we select all the columns of table “a2”, which is equivalent to writing
“SELECT a2.id, a2.name, a2.parent”.
Example 2.3:
Same as “Example 2.1”, but instead of looking at all the animals that are below „turtle“ in the
tree hierarchy, we look at all the animals that are above „turtle“.
Example 2.4:
This example shows that you could use more than one constraint in the “WHERE” clause of the
second query (of the two queries a “WITH RECURSIVE” statement consists of).
Example 3.1:
[Exercise from lecture “Using a Database”, slide 21]
___________________________________________________________________________________________________
Example 3.2:
Because the factorial function grows so quickly, in this example (Example 3.2) we had to write
“1::bigint AS value” (changing the data type of the column), as the value would otherwise quickly
become too big to be stored in an integer type.
Example 3.3:
By picking an appropriate data type you can compute fairly large factorials. In this example we
compute 30! and use the type “numeric” to be able to represent such a big number.
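A factorial computation with a recursive CTE can be sketched as below, using Python's built-in sqlite3 module. The CTE name "fac" is made up, and the type handling differs from PostgreSQL: SQLite integers are 64-bit, so no cast is needed up to 20!, while larger values such as 30! would need a type like PostgreSQL's “numeric”.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each iteration takes the single row (n, n!) and produces (n + 1, (n + 1)!).
rows = conn.execute("""
    WITH RECURSIVE fac(n, value) AS (
        SELECT 1, 1
        UNION ALL
        SELECT n + 1, value * (n + 1)
        FROM fac
        WHERE n < 20)               -- 20! is the largest factorial below 2^63
    SELECT n, value FROM fac
""").fetchall()

factorials = dict(rows)
print(factorials[5], factorials[20])
```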
Example 4.1:
[Exercise from lecture “Using a Database”, slide 21]
Example 4.2:
As seen in this example, we can explicitly construct a single row (or “tuple”) with the help of the
“VALUES” statement. Using “(VALUES(1, 1, 0))” is basically the same as constructing a single row
using “SELECT 1 AS n, 1 AS value, 0 AS previous_value”, where we specify 1 for the first column of
the “WITH RECURSIVE” statement (which is “n”), 1 for the second column (which is “value”), and 0 for
the third column (which is “previous_value”).
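A recursive CTE seeded with “VALUES (1, 1, 0)” can be sketched as follows, using Python's built-in sqlite3 module. The column names n, value and previous_value come from the text; the CTE name "fib", the Fibonacci-style recurrence and the limit of 10 rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The starting row is built with VALUES; every further row is derived from
# the previous one by shifting value into previous_value.
rows = conn.execute("""
    WITH RECURSIVE fib(n, value, previous_value) AS (
        VALUES (1, 1, 0)
        UNION ALL
        SELECT n + 1, value + previous_value, value
        FROM fib
        WHERE n < 10)
    SELECT n, value FROM fib
""").fetchall()

values = [value for _, value in rows]
print(values)
```

Carrying "previous_value" along in the row is what lets a formula that depends on two earlier values be computed from a single row per iteration.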
Example 4.3:
We can see that if we construct the “WITH RECURSIVE” statement as above, the output will contain
duplicates. This is because in each iteration we output the “workingTable”, and it now contains
two rows in each iteration, which leads to a situation where the last row output in one iteration
and the first row output in the next iteration are the same.
In general it is a little bit silly to start with two rows, because this may make the problem even
more difficult, but if you have some recursive formula which does require multiple starting values,
you can do it as in this example.
Example 4.4:
In the step of the algorithm where we expand our table, we ignore the first element. Because we
explicitly gave the first two values, we no longer need to expand the first value, because it has
already been expanded explicitly, so it can be ignored.
2. WITH RECURSIVE statement with UNION statement
Instead of using “UNION ALL” to connect the two queries in the “WITH RECURSIVE” clause, we are now
going to use “UNION”. According to the lecture, “UNION” is not standardised for this purpose, in
contrast to “UNION ALL”, but PostgreSQL supports it. It is very useful because it eliminates
duplicates while the query runs, meaning that we never output an element twice.
A great advantage of “UNION” over “UNION ALL” is that the query cannot get stuck in an infinite loop
caused by a cycle in the graph: once every reachable row has been output, an iteration produces no
new rows and the loop stops (as long as the recursive term only produces rows from a finite set).
If we allow duplicate outputs on a graph that contains a cycle and run a “WITH RECURSIVE” statement
with “UNION ALL” between its two queries, we end up in an infinite loop (or at least in a loop that
runs until the system recognises that the execution time is too long and terminates the query
itself).
The structure of a “WITH RECURSIVE” statement with “UNION” is the same as that of the recursive CTE
with “UNION ALL”, except that there is a “UNION” between the two queries. It is evaluated as:
working_table = unique( evaluate_non_recursive_term() )
result = working_table
while(working_table != empty)
{
working_table = unique( evaluate_recursive_term(working_table) ) \ result
result = result ∪ working_table
}
output(result)
2.1 Examples of use of WITH RECURSIVE statement with UNION
Example 1:
WITH RECURSIVE
friends(a, b) AS (VALUES ('Alice', 'Bob'),
('Alice', 'Carol'),
('Carol', 'Grace'),
('Carol', 'Chuck'),
('Carol', 'Grace'),
('Chuck', 'Anne'),
('Bob', 'Dan'),
('Dan', 'Anne'),
('Eve', 'Adam')),
friendship(name, friend) AS (
SELECT a, b
FROM friends
UNION ALL
SELECT b, a
FROM friends)
Example 1.1:
SELECT *
FROM friends;
Secondly, we look at the content of the table “friendship”.
SELECT *
FROM friendship;
We see that the table “friendship” is the same as the table “friends”, except that it contains the
connection between two entries (people) in both directions. For example, the table “friends” has
one row (or edge/connection, if you think of the problem as a graph) with “Alice” in the first
column and “Bob” in the second column, while the table “friendship” has two rows that contain the
same two people in swapped columns, i.e. “Alice | Bob” in one row and “Bob | Alice” in the other.
In other words, the table “friendship” is symmetric.
_____________________________________________________________________________________________
The following picture is from the FDE lecture “Using a Database”, slide number 23; it shows in the
form of a graph how the entries of the table “friendship” are connected with one another.
Example 1.2:
Output all people who are friends with “Alice”, where the friends of “Alice” are all people
connected to “Alice” directly or through some of her friends.
WITH RECURSIVE
friends(a, b) AS (VALUES ('Alice', 'Bob'),
('Alice', 'Carol'),
('Carol', 'Grace'),
('Carol', 'Chuck'),
('Carol', 'Grace'),
('Chuck', 'Anne'),
('Bob', 'Dan'),
('Dan', 'Anne'),
('Eve', 'Adam')),
friendship(name, friend) AS (
SELECT a, b
FROM friends
UNION ALL
SELECT b, a
FROM friends),
friendsofalice AS (
SELECT 'Alice' AS name
UNION
SELECT friend
FROM friendship, friendsofalice
WHERE friendship.name = friendsofalice.name)
SELECT *
FROM friendsofalice;
Example 1.3:
Output all people who are friends with “Dan”, where the friends of “Dan” are all people connected
to “Dan” directly or through some of his friends.
WITH RECURSIVE
friends(a, b) AS (VALUES ('Alice', 'Bob'),
('Alice', 'Carol'),
('Carol', 'Grace'),
('Carol', 'Chuck'),
('Carol', 'Grace'),
('Chuck', 'Anne'),
('Bob', 'Dan'),
('Dan', 'Anne'),
('Eve', 'Adam')),
friendship(name, friend) AS (
SELECT a, b
FROM friends
UNION ALL
SELECT b, a
FROM friends),
friendsofalice AS (
SELECT 'Dan' AS name
UNION
SELECT friend
FROM friendship, friendsofalice
WHERE friendship.name = friendsofalice.name)
SELECT *
FROM friendsofalice;
In the query above, in contrast to the query from Example 1.2, we only changed the person in the
NON-RECURSIVE TERM (the first query of the WITH RECURSIVE statement) from “Alice” to “Dan”.
We did not change the table name “friendsofalice”; it may look inappropriate for finding the
friends of someone else, but conceptually it is the same (the table name is not important).
We get the same result, only in a different order (for example, we start with “Dan” here, as he is
specified in the query as the starting point and is output first). This is because all the people
who are connected to “Alice” are connected to “Dan” as well: no matter from whom you start, you
always get everybody who is in the same connected component of the graph.
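That both starting points reach the same group of people, and that “UNION” terminates despite the cycles in the symmetric “friendship” table, can be checked with a small sketch. It assumes Python's built-in sqlite3 module; the helper "connected_to" and the "?" placeholder for the starting person are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

QUERY = """
    WITH RECURSIVE
    friends(a, b) AS (VALUES ('Alice', 'Bob'), ('Alice', 'Carol'),
                             ('Carol', 'Grace'), ('Carol', 'Chuck'),
                             ('Carol', 'Grace'), ('Chuck', 'Anne'),
                             ('Bob', 'Dan'), ('Dan', 'Anne'),
                             ('Eve', 'Adam')),
    friendship(name, friend) AS (
        SELECT a, b FROM friends
        UNION ALL
        SELECT b, a FROM friends),
    friendsofalice(name) AS (
        SELECT ?                                   -- starting person
        UNION                                      -- UNION drops rows already seen
        SELECT friend
        FROM friendship, friendsofalice
        WHERE friendship.name = friendsofalice.name)
    SELECT name FROM friendsofalice
"""

def connected_to(person):
    """Return everybody reachable from the given person, in output order."""
    return [row[0] for row in conn.execute(QUERY, (person,))]

from_alice = connected_to("Alice")
from_dan = connected_to("Dan")
print(from_alice, from_dan)
```

With “UNION ALL” in the last CTE the same query would not stop, because “Alice | Bob” and “Bob | Alice” already form a cycle.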
Example 2:
“Formulate a query with a recursive view, which finds the number of actors that have a Bacon
Number <= c where c is a given constant.”
Example 3:
“Is there any mean to reduce the size of intermediate results and thus speed up the query
evaluation? (Hint: Try to remove duplicates after every join. With this technique, the query should
finish within reasonable time for c = 2.)”
WINDOW FUNCTIONS, AND ORDERED-SET FUNCTIONS
Contents
1. WINDOW FUNCTIONS
1.1 Introduction to window functions
1.2 lecture 11 (FDE), examples, window functions
1.3 lecture 12 (FDE), examples, window functions
1.3.1 WITH statement
2. ORDERED-SET FUNCTIONS
2.1 Introduction to ordered-set functions
2.2 lecture 12 (FDE), examples, ordered-set functions
3. GROUPING SETS, ROLLUP AND CUBE
3.1 Introduction to Grouping sets, Rollup and Cube
3.1.1 GROUP BY GROUPING SETS
3.1.2 GROUP BY ROLLUP
3.1.3 GROUP BY CUBE
3.2 lecture 12 (FDE), examples, Grouping sets, Rollup and Cube
1. WINDOW FUNCTIONS
1.1 Introduction to window functions
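Syntax for window functions:
SELECT columns_which_will_be_outputted, window_function() OVER (
PARTITION BY some_column
ORDER BY some_column
RANGE/ROWS BETWEEN value1 preceding AND value2 following)
FROM some_table;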
Firstly, instead of columns_which_will_be_outputted we write some column names from the table.
Secondly, instead of window_function() we write one window function, some of which are:
1) sum(column)
2) max(column)
3) min(column)
4) avg(column)
5) count(column)
6) rank()
7) dense_rank()
8) row_number()
9) ntile(integer_number)
10) percent_rank()
11) cume_dist()
12) lag(column, offset, default_value)
13) lead(column, offset, default_value)
The row “PARTITION BY some_column” specifies the so-called “PARTITION BY” clause, which determines
how the rows of the table are divided into partitions: rows with the same value in the column
some_column end up in the same partition. If no “PARTITION BY” clause is specified inside a window
function, all the rows belong to one big partition.
The row “ORDER BY some_column” specifies the so-called “ORDER BY” clause, which determines how the
rows inside one partition are ordered: the ordering depends on the values in the column
some_column.
The row “RANGE/ROWS BETWEEN value1 preceding AND value2 following” specifies the so-called
“FRAMING” clause, which determines which rows of the same partition influence the value computed
for the current row. It can start with either “ROWS” or “RANGE”. With “ROWS” we literally take the
value1 rows that come before the current row and the value2 rows that come after it (within the
same partition). With “RANGE”, value1 and value2 are not row counts but offsets on the value of the
“ORDER BY” column: the frame contains all rows of the partition whose ordering value lies between
the current row's value minus value1 and the current row's value plus value2, so rows with equal
ordering values always fall into the frame together. Additionally, there are predefined bounds that
can be used instead of numbers: unbounded preceding covers all rows of the partition that come
before the current row, unbounded following covers all rows that come after it, and current row
ends (or starts) the frame at the current row. If we specify the “FRAMING” clause as “RANGE BETWEEN
unbounded preceding AND unbounded following”, the value for each row of a partition is influenced
by all rows of that partition.
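The difference between “ROWS” and “RANGE” shows up as soon as the ordering column has gaps. The following is a minimal sketch using Python's built-in sqlite3 module (offsets in “RANGE” frames need SQLite 3.28 or newer); the table "t" with columns x and v is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (x INTEGER, v INTEGER);
    -- Note the gap in x between 2 and 4.
    INSERT INTO t VALUES (1, 10), (2, 20), (4, 30), (5, 40);
""")

rows = conn.execute("""
    SELECT x,
           -- ROWS: one physical row before and one after the current row
           sum(v) OVER (ORDER BY x ROWS  BETWEEN 1 PRECEDING AND 1 FOLLOWING),
           -- RANGE: all rows whose x lies within [x - 1, x + 1]
           sum(v) OVER (ORDER BY x RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
    FROM t
    ORDER BY x
""").fetchall()

for x, rows_sum, range_sum in rows:
    print(x, rows_sum, range_sum)
```

For x = 2 the “ROWS” frame includes the row with x = 4 because it is the next physical row, while the “RANGE” frame excludes it because 4 is more than 1 away from 2.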
1.2 lecture 11 (FDE), examples, window functions
[Screenshot annotations for partition 1: the running sum in row 2 is 74602.81 + current_price, and the running sum in row 3 is 197679.65 + current_price = 74602.81 + price_row_2 + current_price.]
SELECT o_custkey , o_orderdate, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN current row AND unbounded following)
FROM orders;
SELECT o_custkey, o_orderdate, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN unbounded preceding AND unbounded following)
FROM orders;
With the window function formulated as above, all rows of the same partition show the same value.
More specifically, specifying the range as „RANGE BETWEEN unbounded preceding AND unbounded
following“ means that each row of a partition is influenced by all rows of that partition.
SELECT o_custkey, o_orderdate, max(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
RANGE BETWEEN unbounded preceding AND current row)
FROM orders;
SELECT o_custkey, o_orderdate, rank() OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;
Notice that there is no “FRAMING” clause.
SELECT n_name, n_regionkey, rank() OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;
SELECT n_name, n_regionkey, row_number() OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;
SELECT n_name, n_regionkey, ntile(5) OVER (
PARTITION BY substr(n_name, 1, 1)
ORDER BY n_regionkey)
FROM nation;
_____________________________________________________________________________________________
Same as the example above, the only difference being that we used „ntile(10)“ instead of
„ntile(5)“.
SELECT n_name, n_regionkey, ntile(10) OVER (
ORDER BY n_regionkey)
FROM nation;
SELECT n_name, n_regionkey, percent_rank() OVER (
ORDER BY n_regionkey)
FROM nation;
$$\text{percent\_rank} = \frac{\text{rank} - 1}{\text{total\_number\_of\_rows} - 1}$$
SELECT n_name, n_regionkey, cume_dist() OVER (
ORDER BY n_regionkey)
FROM nation;
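Again, as there is no „PARTITION BY“ clause, all the rows belong to one big partition.
In this example we use the window function “cume_dist()”. It outputs a value in the “cume_dist”
column according to the following formula, where the numerator counts the rows that precede the
current row in the ordering or have the same “ORDER BY” value as the current row (the current row
included):
$$\text{cume\_dist} = \frac{\text{number\_of\_preceding\_or\_equal\_rows\_in\_a\_partition}}{\text{total\_number\_of\_rows}}$$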
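Both formulas can be checked against the built-in window functions on a small table. This is a minimal sketch using Python's built-in sqlite3 module; the five nation rows are made up, and the window name "w" is only a shorthand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nation (n_name TEXT, n_regionkey INTEGER);
    INSERT INTO nation VALUES ('A', 0), ('B', 0), ('C', 1), ('D', 2), ('E', 2);
""")

rows = conn.execute("""
    SELECT n_name, n_regionkey,
           rank() OVER w, percent_rank() OVER w, cume_dist() OVER w
    FROM nation
    WINDOW w AS (ORDER BY n_regionkey)     -- no PARTITION BY: one big partition
    ORDER BY n_name
""").fetchall()

total = len(rows)
keys = [key for _, key, _, _, _ in rows]
for name, key, rank, percent_rank, cume_dist in rows:
    # percent_rank = (rank - 1) / (total - 1)
    by_hand_percent_rank = (rank - 1) / (total - 1)
    # cume_dist = rows with an ordering value <= the current one / total
    by_hand_cume_dist = sum(1 for k in keys if k <= key) / total
    print(name, rank, percent_rank, by_hand_percent_rank,
          cume_dist, by_hand_cume_dist)
```

Rows with the same "n_regionkey" are peers: they share a rank and also share the same cume_dist value, because the count in the numerator includes all peers.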
SELECT o_custkey, o_orderdate, o_totalprice - lag(o_totalprice, 1) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;
SELECT o_custkey, o_orderdate, o_totalprice, o_totalprice - lag(o_totalprice, 1, 0.0) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;
If you put „0“ instead of „0.0“ as the third parameter of the „lag“ function, you get a type
error, because the default value (an integer) does not match the type of the column
„o_totalprice“.
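The effect of the third parameter of „lag“ can be seen in a small sketch using Python's built-in sqlite3 module. The three orders rows are made up, and SQLite is more lenient about types than the system used in the lecture, so it does not reproduce the type error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_custkey INTEGER, o_orderdate TEXT, o_totalprice REAL);
    INSERT INTO orders VALUES
        (1, '2020-01-01', 100.0), (1, '2020-02-01', 150.0), (2, '2020-01-15', 80.0);
""")

rows = conn.execute("""
    SELECT o_custkey, o_orderdate,
           o_totalprice - lag(o_totalprice, 1) OVER w,        -- no default: NULL
           o_totalprice - lag(o_totalprice, 1, 0.0) OVER w    -- default 0.0
    FROM orders
    WINDOW w AS (PARTITION BY o_custkey ORDER BY o_orderdate)
    ORDER BY o_custkey, o_orderdate
""").fetchall()

for row in rows:
    print(row)
```

Without a default, the first order of each customer has no previous row, so the difference is NULL (None in Python); with the default 0.0 the difference is simply the order's own price.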
1.3 lecture 12 (FDE), examples, window functions
SELECT o_custkey, o_orderdate, o_totalprice, avg(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN 1 preceding AND 1 following)
FROM orders;
Same as the previous example, except that the column „o_totalprice“ was added to the output to make
the computation of the average easier to follow.
The value in the „avg“ column for the first row of each partition is not the average of three
values but of two, as there is no preceding row in this case ((74602.81+123076.84)/2 = 98839.825
for the first row of the first partition). Likewise, the last value in a partition is the average
of two values, the current and the preceding one, as there is no following row
((95911.01+54048.26)/2 = 74979.635 for the last row of the first partition). All the values in
between are averages of three values.
SELECT o_custkey, o_orderdate, avg(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN 2 preceding AND 2 following)
FROM orders;
Notice the use of the keyword „ROWS“ (not „RANGE“) in the framing clause of the window function:
the two have different meanings. „ROWS“ counts physical rows before and after the current row,
while „RANGE“ selects rows by the value of the „ORDER BY“ column.
SELECT o_custkey, o_orderdate, o_totalprice, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate)
FROM orders;
No „framing“ clause
SELECT o_custkey, o_orderdate, o_totalprice, sum(o_totalprice) OVER (
PARTITION BY o_custkey)
FROM orders;
SELECT o_custkey, o_orderdate, o_totalprice, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ROWS BETWEEN 1 preceding AND 1 following)
FROM orders;
This does not make sense, and the professor says that in his opinion it should produce an error
message: without an „ORDER BY“ clause the rows within a partition are not sorted, so it is unclear
which rows are the preceding and following ones that the aggregate function should be applied to,
and therefore unclear what the output should be. PostgreSQL by default does not report an error
(as you can see from the example above), but such window function constructions should be avoided.
SELECT o_custkey, o_orderdate, o_totalprice, o_orderstatus, sum(o_totalprice) OVER (
PARTITION BY o_custkey
ORDER BY o_orderdate
ROWS BETWEEN
CASE WHEN o_orderstatus='F'
THEN 3
ELSE 1
END
preceding AND
CASE WHEN o_orderstatus='F'
THEN 3
ELSE 1
END
following)
FROM orders;
______________________________________________________________________________________________
(works on HyPer, http://www.hyper-db.de/interface.html#)
SELECT o_custkey, o_orderdate, sum(o_totalprice) OVER (
ORDER BY o_orderdate)
FROM orders;
SELECT o_orderdate, sum(o_totalprice) OVER (
ORDER BY o_orderdate)
FROM (SELECT o_orderdate, sum(o_totalprice) AS o_totalprice
FROM orders
GROUP BY o_orderdate) AS s;
Exercise from slide 31, FDE, section „Using a Database“:
- For each customer from GERMANY compute the cumulative spending ( “sum(o_totalprice)” ) by year (
“extract(year from o_orderdate)” )?
1.3.1 WITH statement
In this example we introduce the “WITH” clause. It has the following syntax:
WITH q1 AS (first_query),
q2 AS (second_query_which_may_refer_to_q1)
SELECT ... FROM q2;
From this we can see that a single “WITH” clause can introduce multiple queries by separating them
with a comma (i.e. the word “WITH” is not repeated). When there is only one query in a “WITH”
clause, there is no comma or any other character after it; likewise, there is no comma or any other
character after the last query when more than one query is specified.
We can also see that when we specify multiple queries inside a single “WITH” clause, each query can
refer to any query defined above it in the same clause (here, query “q2” refers to query “q1”).
Also, a query from a “WITH” clause can be referred to multiple times in one query; in that case it
is typically executed only once and its result is reused (i.e. it is not executed multiple times).
The purpose of the “WITH” clause is to break a complex query into smaller parts.
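A “WITH” clause with two named queries, where the second refers to the first, might look like the following sketch. It uses Python's built-in sqlite3 module; the names q1 and q2 follow the text, while the orders rows and the question being asked (customers with above-average revenue) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_custkey INTEGER, o_totalprice REAL);
    INSERT INTO orders VALUES (1, 100.0), (1, 200.0), (2, 50.0), (3, 400.0);
""")

rows = conn.execute("""
    WITH q1 AS (                      -- revenue per customer
             SELECT o_custkey, sum(o_totalprice) AS revenue
             FROM orders
             GROUP BY o_custkey),
         q2 AS (                      -- refers to q1, which is defined above it
             SELECT avg(revenue) AS avg_revenue
             FROM q1)
    SELECT q1.o_custkey               -- q1 is referenced a second time here
    FROM q1, q2
    WHERE q1.revenue > q2.avg_revenue
    ORDER BY q1.o_custkey
""").fetchall()

above_average = [row[0] for row in rows]
print(above_average)
```

Note the single “WITH”, the comma between q1 and q2, and the absence of a comma after q2.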
WITH years AS ( SELECT DISTINCT extract(year FROM o_orderdate) AS year
FROM orders)
SELECT o_custkey, year, sum(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders
UNION ALL
SELECT c_custkey, year, 0
FROM customer, years) AS s
GROUP BY o_custkey, year) AS s;
You get an entry for each customer in each year, even if they did not place any orders in that year
(with the help of the dummy rows, which add an order of price 0 for each customer in each year).
When „sum“ is the aggregate function, it makes no difference for the computed values whether we add
dummy entries or not. But when we use „avg“ and refer to the rows preceding and following the
current row, it is important that we have these dummy rows, as they influence the result. An
example of this is given below.
What would happen if we did not use the version with dummy rows and, instead of calculating the sum
as in the previous example, calculated the average using the „avg“ function:
SELECT o_custkey, year, avg(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year
ROWS BETWEEN 1 preceding AND 1 following)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders) AS s
GROUP BY o_custkey, year) AS s
In this case, for the row of the first partition with „1992“ in the column „year“, the value in the „avg“ column
is computed as the average of the values for 1992 (the current row) and 1996 (the following row, at least as far
as the database sees it). This is not what we want: the row following 1992 should be the one for 1993, but the
customer of the first partition made no purchases in 1993, 1994, or 1995. Dummy rows should therefore be added
for all years. The missing years then appear with revenue zero, while the dummy rows do not change the revenue
of years in which purchases were made.
WITH years AS ( SELECT DISTINCT extract(year FROM o_orderdate) AS year
FROM orders)
SELECT o_custkey, year, avg(revenue) OVER (
PARTITION BY o_custkey
ORDER BY year
ROWS BETWEEN 1 preceding AND 1 following)
FROM ( SELECT o_custkey, year, sum(o_totalprice) AS revenue
FROM ( SELECT o_custkey, extract(year FROM o_orderdate) AS year, o_totalprice
FROM orders
UNION ALL
SELECT c_custkey, year, 0
FROM customer, years) AS s
GROUP BY o_custkey, year) AS s
In contrast to the previous example, here we added the dummy rows and got a different output, which is most
likely the output that is actually wanted.
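The effect of the dummy rows can be reproduced on a tiny data set. The following is a minimal sketch in Python using the built-in `sqlite3` module (SQLite 3.25 or newer is assumed for window functions). The three-row `orders` table, the precomputed `year` column (SQLite has no `extract`) and the helper name `WINDOW` are assumptions made for this illustration and are not part of the lecture data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER);
    CREATE TABLE orders (o_custkey INTEGER, year INTEGER, o_totalprice REAL);
    INSERT INTO customer VALUES (1), (2);
    -- customer 1 has no orders in 1993; customer 2 ordered only in 1993
    INSERT INTO orders VALUES (1, 1992, 100.0), (1, 1994, 300.0), (2, 1993, 60.0);
""")

# Same frame as in the text: previous row, current row, next row.
WINDOW = """avg(revenue) OVER (PARTITION BY o_custkey ORDER BY year
                               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)"""

# Without dummy rows: the row "following" 1992 is 1994 for customer 1.
sparse_rows = con.execute(f"""
    SELECT o_custkey, year, {WINDOW}
    FROM (SELECT o_custkey, year, sum(o_totalprice) AS revenue
          FROM orders
          GROUP BY o_custkey, year) AS s
    ORDER BY o_custkey, year
""").fetchall()

# With dummy rows: every customer gets a price-0 order in every year.
dense_rows = con.execute(f"""
    WITH years AS (SELECT DISTINCT year FROM orders)
    SELECT o_custkey, year, {WINDOW}
    FROM (SELECT o_custkey, year, sum(o_totalprice) AS revenue
          FROM (SELECT o_custkey, year, o_totalprice FROM orders
                UNION ALL
                SELECT c_custkey, year, 0 FROM customer, years) AS t
          GROUP BY o_custkey, year) AS s
    ORDER BY o_custkey, year
""").fetchall()

print(sparse_rows[0])  # (1, 1992, 200.0): average of 1992 and 1994
print(dense_rows[0])   # (1, 1992, 50.0): average of 1992 and the empty 1993
```

In the sparse version customer 1 has two rows, in the dense version three, which is why the same frame definition yields different averages.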
2. ORDERED-SET FUNCTIONS
2.1 Introduction to ordered-set functions
Ordered-set functions are functions which require that the rows to which they are applied are
sorted.
Firstly, instead of columns_which_will_be_outputted we write some column names from the table.
Secondly, instead of ordered_set_function() we write one ordered-set function, some of which are:
1) mode()
2) percentile_disc(p)
3) percentile_cont(p)
2.2 lecture 12 (FDE), examples, ordered-set functions
If we use the ordered-set function „percentile_cont(p)“ on an even number of rows, it is not really clear which
value should be taken as the median (middle value), so this ordered-set function uses interpolation between the
two middle values to determine the result.
If we use the ordered-set function „percentile_disc(p)“ instead of „percentile_cont(p)“, we get the first of the
two middle values when there is an even number of rows.
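The difference between the two functions can be sketched in plain Python. This is only an illustration of the behaviour described above, written under the assumption that „percentile_cont(p)“ interpolates linearly at position p * (N - 1) of the sorted values and that „percentile_disc(p)“ returns the first sorted value whose cumulative share of rows reaches p; it is not the database's implementation, and the sample prices are made up.

```python
import math

def percentile_disc(values, p):
    """Return an existing value: the first one whose cumulative share reaches p."""
    ordered = sorted(values)
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]

def percentile_cont(values, p):
    """Interpolate linearly between the two rows surrounding position p * (N - 1)."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

prices = [40, 10, 30, 20]            # even number of rows
print(percentile_cont(prices, 0.5))  # 25.0, halfway between 20 and 30
print(percentile_disc(prices, 0.5))  # 20, the first of the two middle values
print(percentile_cont(prices, 0.25)) # 17.5, interpolation also for other p
```

With an odd number of rows both functions return the same middle value for p = 0.5, because the position falls exactly on one row.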
__________________________________________________________________________________
To demonstrate what the ordered-set functions do, we will use window functions to display the “15000/2”-th
row and the “(15000/2) + 1”-th row:
This also shows how much more difficult it would be to calculate the median using window functions than using
ordered-set functions; in addition, we would have to know in advance how many rows there are.
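One possible way to pick the two middle rows with a window function is sketched below in Python with the built-in `sqlite3` module (SQLite 3.25 or newer assumed). This is not the query from the lecture: the six-row `orders` table stands in for the 15000-row table, and the row count `n` has to be determined first, which is exactly the extra knowledge mentioned above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (o_totalprice REAL)")
con.executemany("INSERT INTO orders VALUES (?)",
                [(p,) for p in (50.0, 10.0, 40.0, 20.0, 60.0, 30.0)])

# The number of rows must be known before the middle rows can be addressed.
n = con.execute("SELECT count(*) FROM orders").fetchone()[0]

# Number the rows in price order and keep the n/2-th and (n/2 + 1)-th row.
middle = con.execute("""
    SELECT o_totalprice
    FROM (SELECT o_totalprice,
                 row_number() OVER (ORDER BY o_totalprice) AS rn
          FROM orders) AS s
    WHERE rn IN (?, ?)
    ORDER BY rn
""", (n // 2, n // 2 + 1)).fetchall()

middle_values = [row[0] for row in middle]
median = sum(middle_values) / 2  # what interpolation at p = 0.5 yields here

print(middle_values)  # [30.0, 40.0]
print(median)         # 35.0
```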
With an ordered-set function:
SELECT mode()
WITHIN GROUP (ORDER BY o_totalprice)
FROM orders;
Without an ordered-set function, using „GROUP BY“:
SELECT o_totalprice, count(*) AS c
FROM orders
GROUP BY o_totalprice
ORDER BY c DESC
LIMIT 2;
We see from the result of the „GROUP BY“ version of this problem that more than one value occurred the same
(highest) number of times.
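A tie of this kind is easy to reproduce. The sketch below runs the „GROUP BY“ query on a made-up five-row table in Python with the built-in `sqlite3` module; the data and the extra sort key `o_totalprice` (added only to make the output order deterministic) are assumptions for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (o_totalprice REAL)")
con.executemany("INSERT INTO orders VALUES (?)",
                [(p,) for p in (10.0, 20.0, 10.0, 30.0, 20.0)])

# The two most frequent prices together with their counts.
top_two = con.execute("""
    SELECT o_totalprice, count(*) AS c
    FROM orders
    GROUP BY o_totalprice
    ORDER BY c DESC, o_totalprice
    LIMIT 2
""").fetchall()

# Equal counts in the first two rows mean the mode is not unique.
is_tie = top_two[0][1] == top_two[1][1]

print(top_two)  # [(10.0, 2), (20.0, 2)]
print(is_tie)   # True
```

An ordered-set „mode()“ returns only a single value in such a case, so the tie is visible only in the „GROUP BY“ version.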
_____________________________________________________________________________________
SELECT mode()
WITHIN GROUP (ORDER BY o_orderstatus)
FROM orders;
SELECT o_custkey, percentile_cont(0.5)
WITHIN GROUP ( ORDER BY o_totalprice)
FROM orders
GROUP BY o_custkey
It is also possible to combine ordered-set functions with the „GROUP BY“ clause, as in this example, where we
calculate the median spending of each customer.
The ordered-set function „percentile_cont(p)“ performs interpolation not only when calculating the median value,
but for any fraction „p“ that you specify in the function „percentile_cont(p)“.
_____________________________________________________________________________________________
The ordered-set function „percentile_cont(p)“ works only on numeric values, because it performs interpolation;
when you give it a non-numeric value, the interpolation cannot be computed and you get an error. The function
„percentile_disc(p)“, on the other hand, also works on non-numeric values, as it only picks one of the existing
values according to its position in the sort order.
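The same distinction shows up in a few lines of Python. This is a sketch of the idea only, not of the database internals: picking a row by position works for any sortable values, while interpolation needs arithmetic that strings do not support. The list of order statuses is made up for the illustration.

```python
import math

statuses = sorted(["O", "F", "P", "F", "O", "F"])  # F, F, F, O, O, P

# Discrete percentile: pick an existing row by its position in the sort order.
index = max(math.ceil(0.5 * len(statuses)) - 1, 0)
median_disc = statuses[index]

# Continuous percentile: needs subtraction and multiplication on the values.
lower, upper = statuses[2], statuses[3]
try:
    median_cont = lower + (upper - lower) * 0.5
except TypeError as exc:
    median_cont = None
    interpolation_error = str(exc)

print(median_disc)  # F
print(median_cont)  # None, because strings cannot be interpolated
```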
3. GROUPING SETS, ROLLUP AND CUBE
3.1 Introduction to Grouping sets, Rollup and Cube
There are three SQL features that can be used as part of the “GROUP BY” clause to perform
aggregation over multiple dimensions:
1. GROUPING SETS(sets_of_columns)
2. ROLLUP(list_of_columns)
3. CUBE(list_of_columns)
or
3.1.2 GROUP BY ROLLUP
Alternatively, you can use “ROLLUP(list_of_columns)” in the “GROUP BY” clause; this SQL feature is
more commonly used than “GROUPING SETS(sets_of_columns)”.
It is also possible to specify more than two columns in the list_of_columns of “GROUP BY
ROLLUP(list_of_columns)”, for example: “GROUP BY ROLLUP(col1, col2, col3)”.
3.2 lecture 12 (FDE), examples, Grouping sets, Rollup and Cube
...
For each year, another row is added to the output in which the column „month“ has no value (NULL, technically
speaking) but the column „sum“ does; it holds the sum over all the months of that year. In addition, the last row
of the output gives the sum of all values in all years. This is called „level of detail aggregation“: we aggregate
at the month level, at the year level, and over everything.
As mentioned earlier, we get such an output because “GROUP BY ROLLUP(list_of_columns)” can be written as a
union of three queries. To show the reader which rows are the result of which of these three queries, each row in
the photos of the output above is marked with one of three colors, the same colors used to denote the three
queries in section 3.1.1 of this document.
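The union of three queries can be tried out directly. The sketch below uses Python's built-in `sqlite3` module, which has no “ROLLUP”, so the three grouping levels are written out by hand; the three-row `orders` table with precomputed `year` and `month` columns is an assumption made for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (year INTEGER, month INTEGER, o_totalprice REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1992, 1, 10.0), (1992, 2, 20.0), (1993, 1, 5.0)])

# Hand-written equivalent of GROUP BY ROLLUP(year, month):
# month level, then year level (month is NULL), then the grand total.
rollup_rows = con.execute("""
    SELECT year, month, sum(o_totalprice) FROM orders GROUP BY year, month
    UNION ALL
    SELECT year, NULL, sum(o_totalprice) FROM orders GROUP BY year
    UNION ALL
    SELECT NULL, NULL, sum(o_totalprice) FROM orders
""").fetchall()

for row in rollup_rows:
    print(row)  # NULL is shown as None, e.g. (1992, None, 30.0)
```

The output has six rows: three at the month level, two yearly subtotals, and one grand total of 35.0.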
SELECT year, quarter, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (quarter FROM o_orderdate) AS quarter, o_totalprice
FROM orders) AS s
GROUP BY ROLLUP(year, quarter)
ORDER BY year, quarter;
(This query is slightly changed compared to the original one written by the professor: I named the column
„quarter“ instead of „month“, as the professor forgot to do this, but the output is the same.)
...
The same as the previous example, except that a year is divided into quarters instead of months, so that all the
rows of the output fit on the screen.
SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY GROUPING SETS( (year, month), (year), (month) )
ORDER BY year, month;
...
SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY GROUPING SETS( (year, month), (year), (month), () )
ORDER BY year, month;
...
Adding the value „()“ to the „GROUP BY GROUPING SETS“ clause adds the empty grouping set, which refers to all the
rows together (here, all values are summed up into one total).
SELECT year, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY CUBE( year, month )
ORDER BY year, month;
...
This query, written using the „GROUP BY CUBE“ clause, groups by all 2^n combinations of the n listed columns, so
it could be written with the „GROUP BY GROUPING SETS“ clause as in the previous example:
GROUP BY GROUPING SETS( (year, month), (year), (month), () )
Using „CUBE“ can be very useful, but it should be used only if a few attributes are included in the clause, as
otherwise the output would be huge.
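How quickly the number of groupings grows can be seen by enumerating them. The following Python sketch lists the grouping sets that “ROLLUP” and “CUBE” stand for; the helper names `rollup_sets` and `cube_sets` are made up for this illustration and are not SQL or library functions.

```python
from itertools import combinations

def rollup_sets(columns):
    """ROLLUP(c1, ..., cn): every prefix of the column list, n + 1 sets."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]

def cube_sets(columns):
    """CUBE(c1, ..., cn): every subset of the column list, 2^n sets."""
    return [subset
            for k in range(len(columns), -1, -1)
            for subset in combinations(columns, k)]

print(cube_sets(("year", "month")))
# [('year', 'month'), ('year',), ('month',), ()]

three = ("year", "quarter", "month")
print(len(rollup_sets(three)))  # 4
print(len(cube_sets(three)))    # 8
```

Each grouping set contributes its own block of output rows, so with every additional column “ROLLUP” adds one more level of subtotals, while “CUBE” doubles the number of grouping sets.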
SELECT year, quarter, month, sum(o_totalprice)
FROM ( SELECT extract(year FROM o_orderdate) AS year, extract (quarter FROM o_orderdate) AS quarter, extract
(month FROM o_orderdate) AS month, o_totalprice
FROM orders) AS s
GROUP BY ROLLUP(year, quarter, month )
ORDER BY year, month;
...
If we used „CUBE“ instead of „ROLLUP“ for the same problem as the previous one, we would get the following:
...
This example shows how the use of „CUBE“ may drastically increase the number of output rows. The example above
(using „ROLLUP“) produced 115 output rows, while using „CUBE“ for the 'same' query produces 223 rows.
Exercise from slide 36, FDE, section „Using a Database“:
- Aggregate revenue ( “sum(o_totalprice)” ): total, by region (r_name) by name (n_name), example output:
or equivalently: