0% found this document useful (0 votes)

7 views

Explaining The Unexplainable

Uploaded by

Clailson

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views

Explaining The Unexplainable

Uploaded by

Clailson

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

Explaining the unexplainable .......................................................................................................................................................................................................................................

1
15 thoughts on “Explaining the unexplainable” .......................................................................................................................................................................................................... 5
Explaining the unexplainable – part 2 ......................................................................................................................................................................................................................... 7
16 thoughts on “Explaining the unexplainable – part 2”........................................................................................................................................................................................... 10
Explaining the unexplainable – part 3 ....................................................................................................................................................................................................................... 12
Function Scan ............................................................................................................................................................................................................................................................. 12
Sort............................................................................................................................................................................................................................................................................. 12
Limit ........................................................................................................................................................................................................................................................................... 13
HashAggregate ........................................................................................................................................................................................................................................................... 13
Hash Join / Hash......................................................................................................................................................................................................................................................... 14
Nested Loop ............................................................................................................................................................................................................................................................... 14
Merge Join ................................................................................................................................................................................................................................................................. 15
Hash Join / Nested Loop / Merge Join modifiers ....................................................................................................................................................................................................... 15
Materialize ................................................................................................................................................................................................................................................................. 16
5 thoughts on “Explaining the unexplainable – part 3” ............................................................................................................................................................................................. 17
Explaining the unexplainable – part 4 ....................................................................................................................................................................................................................... 18
Unique........................................................................................................................................................................................................................................................................ 18
Append....................................................................................................................................................................................................................................................................... 19
Result ......................................................................................................................................................................................................................................................................... 19
Values Scan ................................................................................................................................................................................................................................................................ 19
GroupAggregate......................................................................................................................................................................................................................................................... 20
HashSetOp ................................................................................................................................................................................................................................................................. 20
CTE Scan ..................................................................................................................................................................................................................................................................... 20
InitPlan ....................................................................................................................................................................................................................................................................... 21
SubPlan ...................................................................................................................................................................................................................................................................... 21
Other ? ....................................................................................................................................................................................................................................................................... 21
4 thoughts on “Explaining the unexplainable – part 4” ............................................................................................................................................................................................. 22
Explaining the unexplainable – part 5 ....................................................................................................................................................................................................................... 22
9 thoughts on “Explaining the unexplainable – part 5” ............................................................................................................................................................................................. 27
Explaining the unexplainable – part 6: buffers......................................................................................................................................................................................................... 29
One thought on “Explaining the unexplainable – part 6: buffers” ............................................................................................................................................................................ 31

Explaining the unexplainable

One of the first things new DBA hears is “Use the EXPLAIN". And upon first try he/she is greeted with incomprehensible:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=146.63..148.65 rows=808 width=138) (actual time=55.009..55.012 rows=71 loops=1)
Sort Key: n.nspname, p.proname, (pg_get_function_arguments(p.oid))
Sort Method: quicksort Memory: 43kB
-> Hash Join (cost=1.14..107.61 rows=808 width=138) (actual time=42.495..54.854 rows=71 loops=1)
Hash Cond: (p.pronamespace = n.oid)
-> Seq Scan on pg_proc p (cost=0.00..89.30 rows=808 width=78) (actual time=0.052..53.465 rows=2402 loops=1)
Filter: pg_function_is_visible(oid)
-> Hash (cost=1.09..1.09 rows=4 width=68) (actual time=0.011..0.011 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on pg_namespace n (cost=0.00..1.09 rows=4 width=68) (actual time=0.005..0.007 rows=4 loops=1)
Filter: ((nspname <> 'pg_catalog'::name) AND (nspname <> 'information_schema'::name))

What does it even mean?

Of course trying to understand the explain above, as a start point, is pretty futile. Let's start with something simpler. But even before that I want you to understand
one very important thing:

PostgreSQL knows

That means that PostgreSQL keeps some meta-information (information about information). Rowcounts, number of distinct values, most common values, and so
on. For large tables these are based on random sample, but in general, it (Pg) is pretty good at knowing stuff about our data.

So, now. Let's see simple query, and it's explain:

$ explain select * from test where i = 1;

QUERY PLAN
------------------------------------------------------
Seq Scan on test (cost=0.00..40.00 rows=12 width=4)
Filter: (i = 1)
(2 rows)
Query is pretty simple, and I take it that it doesn't require any special comments.

In explain – first line, and all lines that start with “->" are operations. Other lines as additional info for operation above.

In our case, we have just one operation: sequential scanning of table test.

There is additional information – about Filter.

Sequential scan means that PostgreSQL will “open" the table data, and read it all, potentially filtering (removing) rows, but generally ready to read and return
whole table.

Why ready? I'll talk about it in a minute.

So, the Seq scan line, informs us that we are scanning table in sequential mode. And that the table is named “test" (though in here lies one of the biggest problems
with explain, as it's not showing schema, which did bite my ass more than couple of times).

And what about the numbers afterwards?

Let me ask you a question: You have table like this:

Given this table definition, and query:

SELECT * FROM t where some_column = 123;

What do you think would be the best way to run the query? Sequentially scan the table, or use index?

If you say: use index of course, there is index on this column, so it will make it faster – I'll ask: what about case where the table has one row, and that it has
some_column = 123?

To do seq scan, I just need to read one page (8192 bytes) from table, and I get the row. To use index, I have to read page from index, check it to find if we have
row matching condition in the table and then read page from table.

In total – more than twice the work!

You could say – sure, but that's for very small tables, so the speed doesn't matter. OK. So let's imagine a table that has 10 billion rows, and all of them have
some_column = 123. Index doesn't help at all, and in reality it makes the situation much, much worse.

Of course – if you have million rows, and just one has some_column = 123 – index scan will be clearly better.

So – it is impossible to tell whether given query will use index, or even if it should use index (I'm talking about general case) – you need to know more. And this
leads us to simple thing: depending on situation one way of getting data will be better or worse then other.

PostgreSQL (up to a point) examines all possible plans. It knows how many rows you have, it knows how many rows will (likely) match the criteria, so it can
make pretty smart decisions.

But how are the decisions made? That's exactly what the first set of numbers in explain shows. It's the cost.

Some people think that cost is estimate shown in seconds. It's not. It's unit is “fetching single page in sequential manner". It is about time and resource usage.

In postgresql.conf, you might have noticed these params:

seq_page_cost = 1.0 # measured on an arbitrary scale

random_page_cost = 4.0 # same scale as above
cpu_tuple_cost = 0.01 # same scale as above
cpu_index_tuple_cost = 0.005 # same scale as above
cpu_operator_cost = 0.0025 # same scale as above

So, we can even change how much it costs to read sequential page. These parameters dictate costs that PostgreSQL assumes it would take to run various methods
of running the same query.

For example, let's make a simple 1000 row table, with some texts, and index:

create table test (id serial primary key, some_text text);

CREATE TABLE

insert into test (some_text) select 'whatever' from generate_series(1,1000);

INSERT 0 1000

Now, we can see that running explain with condition on id will show:

explain select * from test where id = 50;

QUERY PLAN
-----------------------------------------------------------------------
Index Scan using test_pkey on test (cost=0.28..8.29 rows=1 width=36)
Index Cond: (id = 50)
(2 rows)
And what if we'd tell pg that under no circumstances it can use index scan?

explain select * from test where id = 50;

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=4.28..8.30 rows=1 width=13)
Recheck Cond: (id = 50)
-> Bitmap Index Scan on test_pkey (cost=0.00..4.28 rows=1 width=0)
Index Cond: (id = 50)
(4 rows)

and let's disable this one too:

explain select * from test where id = 50;

QUERY PLAN
------------------------------------------------------
Seq Scan on test (cost=0.00..18.50 rows=1 width=13)
Filter: (id = 50)
(2 rows)

OK, let's show them close to each other:

Index Scan using test_pkey on test (cost=0.28..8.29 rows=1 width=36)

Bitmap Heap Scan on test (cost=4.28..8.30 rows=1 width=13)
Seq Scan on test (cost=0.00..18.50 rows=1 width=13)

By default, Pg used Index Scan. Why? It's simple – it's the cheapest in this case. Total cost of 8.29, while bitmap heap scan (whatever that would be) would be
8.30 and seq scan – 18.5.

OK, But cost shows two numbers: number..number. What is this about, and why am I talking about the second number only? If we'd get into consideration first
number then seq scan is winner, as it has there 0 (zero), while index scan is 0.28, and bitmap heap scan – 4.28.

So, the range (number .. number) is because it shows cost for starting the operation row and cost for getting all rows (By all, I mean all returned by this operation,
not all in table).

What is the starting cost? Well, for seq scan there is none – you just read page, and return rows. That's all. But for example, for sorting dataset – you have to read
all the data, and actually sort it, before you can consider returning even first of the rows. Which can be nicely seen in this explain:

QUERY PLAN
-------------------------------------------------------------------
Sort (cost=22.88..23.61 rows=292 width=202)
Sort Key: relfilenode
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=202)
(3 rows)

Please note that startup cost for Sort is 22.88, while total cost is just 23.61. So returning rows from Sort is trivial (in terms of cost), but sorting them – not so much.

Next information in explain is “rows". This is estimate, how many rows does PostgreSQL think that this operation is capable of returning (it might return less, for
example in case of LIMIT). This is also important for some operations – joins for example. As joining two tables that have 20 rows together can be done in many
ways, and it doesn't really matter how, but when you join 1 million row table with 1 billion row table, the way you do the join is very important (I'm not talking
about “inner join/left join/…" but rather about “hash join", “nested loop", “merge join" – if you don't understand, don't worry – will write about it later).

This number can be of course misestimated – for many reasons. Sometimes it doesn't matter, and sometimes it does. But we'll talk about misestimates later.

Final bit of information is width. This is PostgreSQL idea on how many bytes, on average, there are in single row returned from given operation. For example:

explain select * from pg_class;

QUERY PLAN
-------------------------------------------------------------
Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=202)
(1 row)

explain select relname, relkind from pg_class;

QUERY PLAN
------------------------------------------------------------
Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=65)
(1 row)

As you can see limiting number of fields modified width, and in turn, total amount of data that will need to be passed through execution of the query.

Next, it the single most important bit of information. Explains are trees. Upper node needs data from nodes below.

Let's consider this plan.

There are 5 operations there: sort, hash join, seq scan, hash, and seq scan. PostgreSQL executes the top one – sort. Which in turn executes the ones directly below
(Hash join) and gets data from them. Hash join, to return data to sort, has to run seq scan (on pg_proc) and hash (#4). And the final hash, to be able to return data,
has to run seq scan on pg_namespace.

It is critical to understand that some operations can return data immediately, and, what's even more important, gradually. For example – Seq Scan. And some
others cannot. For example – in here we see that Hash (#4) has the same cost for startup as it's “suboperation" seq scan – for “all rows". This means, that for
hashing operation to start (well, to be able to return even single row), it has to read in all the rows from suboperation(s).

The part about returning rows gradually becomes very important when you'll start writing functions. Let's consider such functions:

1. CREATE OR REPLACE FUNCTION public.test()

2. RETURNS SETOF integer
3. LANGUAGE plpgsql
4. AS $function$
5. declare
6. i int4;
7. begin
8. for i in 1..3 loop
9. return next i;
10. perform pg_sleep(1);
11. end loop;
12. return;
13. end;
14. $function$;

If you don't understand it, don't worry. The functions returns 3 rows, each contains single integer – 1, 2 and 3. The important bit, though, is that it sleeps for 1
second after returning each row.

This means that if I'll do:

select * from test();

I can expect to wait 3 seconds for results.

But how long will it take to return from this:

select * from test() limit 1;

Let's see:

\timing
Timing is on.

select * from test() limit 1;

test
------
1
(1 row)

Time: 3005.334 ms

The same 3 seconds. Why? Because PL/pgSQL (and most, if not all, other PL/* languages) cannot return partial results. It looks like it can – with “return next",
but all these are stored in a buffer and returned together when function execution ends.

On the other hand – “normal" operations, usually, can return partial data. It can be seen with something trivial like seq scan, on non-trivial table:

create table t as
select i as id,
repeat('depesz', 100)::text as payload
from generate_series(1,1000000) i;

With this table, we can see:

explain analyze select * from t;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..185834.82 rows=10250082 width=36) (actual time=0.015..232.380 rows=1000000 loops=1)
Total runtime: 269.666 ms
(2 rows)

explain analyze select * from t limit 1;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=1 loops=1)
-> Seq Scan on t (cost=0.00..185834.82 rows=10250082 width=36) (actual time=0.003..0.003 rows=1 loops=1)
Total runtime: 0.016 ms
(3 rows)

(please look only at “Total runtime: .." for now)

As it can be seen – seq scan ended very fast – as soon as it satisfied Limit's appetite for exactly 1 row.

Please also note that in here, even the costs (which are not the best thing for comparing queries) show that top node (seq scan in first, and limit in second query)
have very different values for returning all rows – 185834.82 vs. 0.02.

So, the first 4 numbers for any given operation (two for cost plus rows and width) are all estimates. They might be correct, but but they as well might not.

The other 4 numbers, which you get when you run “EXPLAIN ANALYZE query" or “EXPLAIN ( ANALYZE on ) query" show the reality.

Time is again a range, but this time is real. It is how much time PostgreSQL actually did spend working on given operation (on average, because it could have run
the same operation multiple times). And just as with cost – time is a range. Startup time, and time to return all data. Let's check this plan:

$ explain analyze select * from t limit 100;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..9.33 rows=100 width=608) (actual time=0.008..0.152 rows=100 loops=1)
-> Seq Scan on t (cost=0.00..93333.86 rows=999986 width=608) (actual time=0.007..0.133 rows=100 loops=1)
Total runtime: 0.181 ms
(3 rows)

As you can see – Limit has startup time of 0.008 (millisecond, that's the unit in here). This is because Seq Scan (which Limit called to get data) took 0.007ms to
return first row, and then there was 0.001ms of processing within limit itself.
Afterwards (after returning 1st row), limit kept getting data from Seq Scan, until it got 100 rows. Then it terminated Seq scan (which happened 0.133ms after start
of query), and it finished after another 0.019 ms.

Actual rows value, just as name suggests, shows how many rows (on average) this operation returned. And loops show how many times this operation was in total
ran.

In what case would an operation be called more than once? For example in some cases of joins, or subqueries. It looks like this plan.

Please note that loops in 3rd operation is 2. This means that this Seq Scan was ran twice, returning, on average 1 row, and it took, on average, 0.160ms to finish.
So the total time spent in this particular operation is 2 * 0.160ms = 0.32ms (that's what is in exclusive/inclusive columns on explain.depesz.com).

Very often poor performance of a query comes from the fact that it had to loop many times over something. As in here.

(of course it doesn't mean that it's Pg fault that it chose such plan – maybe there simply weren't other options, or other options were estimated as even more
expensive).

In above example, please note that while actual time for operation 3 is just 0.003ms, this operation was run over 26000 times, resulting in total time spent in here
of almost 79ms.

I think that wraps theoretical information required to read explains. You will probably still don't understand what the operations or other information mean, but at
the very least – you will know what the numbers mean, and what's the difference between explain (shows costs in abstract units, which are based on random-
sample estimates) and explain analyze (shows real life times, rowcounts and execution counts, in units that can be compared with different queries).

As always, I'm afraid that I skipped a lot of things that might be important but just escaped me, or (even worse) I assumed that these are “obvious". If you'll find
anything missing, please let me know, I'll fill in ASAP.

But, before, let me just say, that I plan to extend this blogpost in 2-3 another posts that would cover more about:

 what are various operations, how they work, and what to expect when you see them in explain output
 what are statistics, how Pg gets them, how to view them, and how to get the best out of it

Posted on 2013-04-16|Tags explain, postgresql, unexplainable|

15 thoughts on “Explaining the unexplainable”

1. Robins Tharakan says:

2013-04-17 at 18:17

Just wanted to help, thought you may want to correct:

1. examctly -> exactly

2. simple 100 row table -> simple 1000 row table
3. and actually sort if -> and actually sort it

2. depesz says:

2013-04-17 at 18:27

@Robins: thanks fixed.

3. lesovsky says:

2013-04-18 at 05:21

Excellent post, clear and understandable. Thanks!

4. kk says:

2013-05-09 at 10:49

Great.
Waiting for 2 next blogposts.

5. Tobu says:

2013-05-22 at 21:43

Yay I understood all of this.

I sort of expected an unit / decimal point mixup with 0.008 milliseconds and the timings after that, but the post is correct, that’s 8 microseconds. I do wish
PostgreSQL would print the units (or at least default to seconds).
6. depesz says:

2013-05-22 at 21:45

@Tobu: once you get used to it, it just makes sense. Adding “ms” to every value would take too much space. And defaulting to second would make the
times even worse. I see lots of queries that run in less than 1ms. and this would look absolutely awful: 0.000008

7. varg says:

2013-05-23 at 11:15

Thanks for writing this up, very informative.

8. boris says:

2013-07-02 at 13:32

>> So, we can even change how much it costs to read sequential page.
Are there any reasonable recommendations whether these settings worth changing ?

9. depesz says:

2013-07-02 at 15:20

I wouldn’t change seq scan cost, as this is basically an unit. If you have your database on SSD, it might be good to lower random_page_cost significantly.
As for the rest – test it. Play with it on your dataset and see what comes out of it.

10. manuscola says:

2013-07-03 at 08:39

In above example, please note that while actual time for operation 3 is just 0.003ms, this operation was run over 26000 times, resulting in total time spent
in here of almost 79ms.

not 79ms but 79s

You artical helps a lot

11. depesz says:

2013-07-03 at 10:42

@Manuscola: not sure where/how you got 79s, but it is 79ms.

12. bartek says:

2013-09-03 at 10:40

Fully qualified names in explain output can be reached by verbose option of explain (EXPLAIN VERBOSE). I had the same problem. Thanks to advice
from Pavel Stehule my problem has been resolved.

Keep writing posts like this one I learnt so much

13. Pavel Alexeev says:

2015-12-03 at 17:30

Greate post to understand basics. Thank you.

Could you please also in such detailed manner explain JOIN estimations?
F.e. I could not understand ( http://stackoverflow.com/questions/33654105/incorrect-rows-estimate-for-joins ) how Postgress use statistic to estimate
amount of rows in join operations. And how it may be fixed without external extensions.

14. Adam says:

2016-04-20 at 21:49
Hi, thanks for a nice write up.

I’m puzzled by

> And what if we’d tell pg that under no circumstances it can use index scan?

How did we actually tell that? It is unclear from the query snippet.

15. depesz says:

2016-04-21 at 14:06

@Adam:

oh, sorry. So you can do it by executing:

set enable_indexscan = false;

there is set of enable_* parameters which can be used to disable given type of scan/functionality.

Comments are closed.

Explaining the unexplainable – part 2

Last time I wrote about what explain output shows. Now I'd like to talk more about various types of “nodes" / operations that you might see in explain plans.

The most basic one is Seq Scan.

It looks like this:

explain analyze select * from pg_class;

QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=202) (actual time=0.009..0.049 rows=295 loops=1)
Total runtime: 0.249 ms
(2 rows)

This is the simplest possible operation – PostgreSQL opens table file, and reads rows, one by one, returning them to user or to upper node in explain tree, for
example to limit, as in:

explain analyze select * from pg_class limit 2;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.07 rows=2 width=202) (actual time=0.014..0.014 rows=2 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=202) (actual time=0.009..0.009 rows=2 loops=1)
Total runtime: 0.132 ms
(3 rows)

It is important to understand that the order of returned rows is not any specific. It's not “in order of insertion", or “last updated first" or anything like this.
Concurrent selects, updates, deletes, vacuums can modify the order of rows at any time.

Seq Scan can filter rows – that is reject some from being returned. This happens for example when you'll add “WHERE" clause:

explain analyze select * from pg_class where relname ~ 'a';

QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=0.00..11.65 rows=227 width=202) (actual time=0.030..0.294 rows=229 loops=1)
Filter: (relname ~ 'a'::text)
Rows Removed by Filter: 66
Total runtime: 0.379 ms
(4 rows)

As you can see now we have Filter: information. And because I'm on 9.2 or newer I got also “Rows removed by filter" line.

Next type of node is “Index Scan".

This type of scan seems to be very straight forward, and most people understand when it is used at least in one case:

explain analyze select * from pg_class where oid = 1247;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Index Scan using pg_class_oid_index on pg_class (cost=0.15..8.17 rows=1 width=202) (actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (oid = 1247::oid)
Total runtime: 0.077 ms
(3 rows)

That is – we have index that matches the condition, so PostgreSQL does:

 opens the index

 in the index if finds where (in table data) there might be rows that match given condition
 opens table
 fetches row(s) pointed to by index
 if the rows can be returned – i.e. they are visible to current session – they are returned
You can of course ask: how can a row be invisible? It might happen for rows deleted that are still in the table (haven't been vacuumed). Or have been updated. Or
were inserted, but after current transaction.

Index Scan is also used when you want some data ordered using order from index. As in here:

explain analyze select * from pg_class order by oid limit 10;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
-------
Limit (cost=0.15..1.67 rows=10 width=206) (actual time=0.017..0.029 rows=10 loops=1)
-> Index Scan using pg_class_oid_index on pg_class (cost=0.15..44.53 rows=292 width=206) (actual time=0.014..0.026 rows=10
loops=1)
Total runtime: 0.145 ms
(3 rows)

There is no condition here, but we can add condition easily, like this:

explain analyze select * from pg_class where oid > 1247 order by oid limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------
Limit (cost=0.15..4.03 rows=10 width=206) (actual time=0.021..0.035 rows=10 loops=1)
-> Index Scan using pg_class_oid_index on pg_class (cost=0.15..37.84 rows=97 width=206) (actual time=0.017..0.031 rows=10
loops=1)
Index Cond: (oid > 1247::oid)
Total runtime: 0.132 ms
(4 rows)

In these cases, Pg finds starting point in index (either first row that is > 1247, or simply smallest value in index, and then returns next rows/values until Limit will
be satisfied.

There is a version of Index Scan, called “Index Scan Backward" – which does the same thing, but is used for scanning in descending order:

explain analyze select * from pg_class where oid < 1247 order by oid desc limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
---------------
Limit (cost=0.15..4.03 rows=10 width=206) (actual time=0.012..0.026 rows=10 loops=1)
-> Index Scan Backward using pg_class_oid_index on pg_class (cost=0.15..37.84 rows=97 width=206) (actual time=0.009..0.022
rows=10 loops=1)
Index Cond: (oid < 1247::oid)
Total runtime: 0.119 ms
(4 rows)

This is the same kind of operation – open index, and for every row pointed to by index, fetch row from table, just it happens not “from small to big" but “from big
to small".

Another similar operation is “Index Only Scan".

Let's create simple table:

create table test (id serial primary key, i int4);

CREATE TABLE

insert into test (i) select random() * 1000000000 from generate_series(1,100000);

INSERT 0 100000

vacuum analyze test;

VACUUM

This gives me table like this one:

select * from test limit 10;

id | i
----+-----------
1 | 546119592
2 | 253476978
3 | 235791031
4 | 654694043
5 | 187647296
6 | 709050245
7 | 210316749
8 | 348927354
9 | 120463097
10 | 5611946
(10 rows)

In here, I have index on id:

So, if some conditions are met (more on it in a bit), I can get plan like this:

explain analyze select id from test order by id asc limit 10;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
--
Limit (cost=0.29..0.55 rows=10 width=4) (actual time=0.039..0.042 rows=10 loops=1)
-> Index Only Scan using test_pkey on test (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.036..0.038 rows=10
loops=1)
Heap Fetches: 0
Total runtime: 0.092 ms
(4 rows)

Please note the “Only" word in “Index Only Scan".

This means that Pg realized that I select only data (columns) that are in the index. And it is possible that it doesn't need to check anything in the table files. So that
it will return the data straight from index.

These scans were the big change in PostgreSQL 9.2, as they have the ability to work way faster than normal Index Scans, because they don't have to verify
anything in table data.

The problem is that, in order to make it work, Index has to contain information that given rows are in pages, that didn't have any changes “recently". This means
that in order to utilize Index Only Scans, you have to have your table well vacuumed. But with autovacuum running, it shouldn't be that big of a deal.

Final kind of table scans is so called Bitmap Index Scan. It looks like this:

explain analyze select * from test where i < 100000;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=4.37..39.99 rows=10 width=8) (actual time=0.025..0.110 rows=13 loops=1)
Recheck Cond: (i < 100000)
-> Bitmap Index Scan on i1 (cost=0.00..4.37 rows=10 width=0) (actual time=0.013..0.013 rows=13 loops=1)
Index Cond: (i < 100000)
Total runtime: 0.154 ms
(5 rows)

(if you're reading and paying attention, you'll notice that it's using index that I didn't talk about creating earlier, it's simple: create index i1 on test (i);).

Bitmap Scans are always in (at least) two nodes. First (lower level) there is Bitmap Index Scan, and then there is Bitmap Heap Scan.

How does it work?

Let's assume your table has 100000 pages (that would be ~ 780MB). Bitmap Index Scan would create a bitmap where there would be one bit for every page in
your table. So in this case, we'd get memory block of 100,000 bits ~ 12.5kB. All these bits would be set to 0. Then Bitmap Index Scan, would set some bits to 1,
depending on which page in table might contain row that should be returned.

This part doesn't touch table data at all. Just index. After it will be done – that is all pages that might contain row that should be returned will be “marked", this
bitmap is passed to upper node – Bitmap Heap Scan, which reads them in more sequential fashion.

What is the point of such operation? Well, Index Scans (normal) cause random IO – that is, pages from disk are loaded in random fashion. Which, at least on
spinning disks, is slow.

Sequential scan is faster for getting single page, but on the other hand – you not always need all the pages.

Bitmap Index Scans joins the two cases when you need many rows from the table, but not all, and when the rows that you'll be returning are not in single block
(which would be the case if I did “… where id < ..."). Bitmap scans have also one more interesting feature. That is - they can join two operations, two indexes,
together. Like in here:

explain analyze select * from test where i < 5000000 or i > 950000000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=107.36..630.60 rows=5323 width=8) (actual time=1.023..4.353 rows=5386 loops=1)
Recheck Cond: ((i < 5000000) OR (i > 950000000))
-> BitmapOr (cost=107.36..107.36 rows=5349 width=0) (actual time=0.922..0.922 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..12.25 rows=527 width=0) (actual time=0.120..0.120 rows=491 loops=1)
Index Cond: (i < 5000000)
-> Bitmap Index Scan on i1 (cost=0.00..92.46 rows=4822 width=0) (actual time=0.799..0.799 rows=4895 loops=1)
Index Cond: (i > 950000000)
Total runtime: 4.765 ms
(8 rows)

In here we see two Bitmap Index Scans (there can be more of them), which are then joined (not as SQL “JOIN"!) using BitmapOr.

As you remember – output of Bitmap Index Scan is a bitmap – that is memory block with some zeros and some ones. Having multiple such bitmaps means that
you can easily do logical operations on it: Or, And or Not.

In here we see that two such bitmaps were joined together using Or operator, and resulting bitmap was passed to Bitmap Heap Scan which loaded appropriate rows
from the table.

While in here both Index Scans use the same index, it's not always the case. For example, let's add quickly some more columns:

alter table test add column j int4 default random() * 1000000000;

ALTER TABLE
alter table test add column h int4 default random() * 1000000000;
ALTER TABLE
create index i2 on test (j);
CREATE INDEX
create index i3 on test (h);
CREATE INDEX

And now:
explain analyze select * from test where j < 50000000 and i < 50000000 and h > 950000000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=280.76..323.61 rows=12 width=16) (actual time=2.295..2.352 rows=11 loops=1)
Recheck Cond: ((h > 950000000) AND (j < 50000000) AND (i < 50000000))
-> BitmapAnd (cost=280.76..280.76 rows=12 width=0) (actual time=2.278..2.278 rows=0 loops=1)
-> Bitmap Index Scan on i3 (cost=0.00..92.53 rows=4832 width=0) (actual time=0.546..0.546 rows=4938 loops=1)
Index Cond: (h > 950000000)
-> Bitmap Index Scan on i2 (cost=0.00..93.76 rows=4996 width=0) (actual time=0.783..0.783 rows=5021 loops=1)
Index Cond: (j < 50000000)
-> Bitmap Index Scan on i1 (cost=0.00..93.96 rows=5022 width=0) (actual time=0.798..0.798 rows=4998 loops=1)
Index Cond: (i < 50000000)
Total runtime: 2.428 ms
(10 rows)

Three Bitmap Index Scans, each using different index, bitmaps joined using “and" bit operation, and result fed to Bitmap Heap Scan.

In case you wonder – why is the BitmapAnd showing “Actual rows = 0" – it's simple. This node doesn't deal with rows at all (just bitmap of disk pages). So it can't
return any rows.

Thats about it for now – these are your possible table scans – how you get data from disk. Next time I'll talk about joining multiple sources together, and other
types of plans.

Posted on 2013-04-27|Tags bitmap, explain, heap, index, only, postgresql, scan, seq, unexplainable|

16 thoughts on “Explaining the unexplainable – part 2”

1. Anonymous says:

2013-04-28 at 19:33

Thank you!

2. jcd says:

2013-04-29 at 09:06

I should have read these two posts 5 years ago and it would have made my life so much easier. Thank you and keep up the awesomeness.

3. far says:

2013-04-30 at 12:49

Very useful. Can’t help thinking something like these posts should be in the official docs somewhere. EXPLAIN is such a useful tool.

4. Carl Michael Skog says:

2013-05-02 at 13:38

It was a long time ago this blog became required reading for advanced PostgreSQL subjects.
And the good stuff just keep piling up with regularity !

5. Daniel Serodio says:

2013-05-08 at 18:38

Thanks for this helpful post. Understanding the output of EXPLAIN is hard, and you made it a bit easier.

6. kk says:

2013-05-09 at 11:06

Great post again.

Waiting for last one.

7. depesz says:

2013-05-09 at 11:31

@kk: i think there will be more than one. I still have some operations to cover, and statistics will definitely need a separate blogpost.
8. Drew Taylor says:

2013-05-12 at 08:52

I had always wondered what the BitMap* functions were. Thank you for the informative post!

9. manuscola says:

2013-07-10 at 16:07

really useful . I know better on bitmap heap scan

10. dezso says:

2015-10-15 at 12:34

FYI, I refer to this page in http://dba.stackexchange.com/a/118120/6219

11. Henry Dance says:

2015-12-23 at 12:09

Thank you so much for taking the time to explain such a complicated subject in such a simple way!

12. Srinivas says:

2016-06-30 at 10:54

Thank you so much for post. I have tested the Index only scan in my system. It is not working for me. I am using postgres 9.5 version

postgres=# explain analyze select id from test;

QUERY PLAN
————————————————————————————————————
Seq Scan on test (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.033..26.492 rows=100000 loops=1)
Planning time: 0.223 ms
Execution time: 31.822 ms
(3 rows)

Time: 33.355 ms
postgres=# vacuum full analyze test;
VACUUM
Time: 506.563 ms
postgres=# explain analyze select id from test;
QUERY PLAN
————————————————————————————————————
Seq Scan on test (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.018..25.843 rows=100000 loops=1)
Planning time: 0.993 ms
Execution time: 31.164 ms
(3 rows)

13. depesz says:

2016-06-30 at 12:13

@Srinivas:

index only scan will be used only in some cases, when exactly – it’s not entirely clear to me. But – getting 100k rows, is clearly not normal situation.

try select id from test order by id asc limit 10;

14. Arun Kumar says:

2017-01-31 at 13:35

HiDEPESZ,
Could u plz explain to me about the below Explain Analyze result.

Sort (cost=717.34..717.59 rows=101 width=488) (actual time=7.761..7.774 rows=100 loops=1)

Sort Key: t1.fivethous
Sort Method: quicksort Memory: 77kB
-> Hash Join (cost=230.47..713.98 rows=101 width=488) (actual time=0.711..7.427 rows=100 loops=1)
Hash Cond: (t2.unique2 = t1.unique2)
-> Seq Scan on tenk2 t2 (cost=0.00..445.00 rows=10000 width=244) (actual time=0.007..2.583 rows=10000 loops=1)
-> Hash (cost=229.20..229.20 rows=101 width=244) (actual time=0.659..0.659 rows=100 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 28kB
-> Bitmap Heap Scan on tenk1 t1 (cost=5.07..229.20 rows=101 width=244) (actual time=0.080..0.526 rows=100 loops=1)
Recheck Cond: (unique1 Bitmap Index Scan on tenk1_unique1 (cost=0.00..5.04 rows=101 width=0) (actual time=0.049..0.049 rows=100 loops=1)
Index Cond: (unique1 < 100)
Planning time: 0.194 ms
Execution time: 8.008 ms

15. Thanh Pham says:

2018-06-06 at 09:24

In Bitmap Index Scan, you explained that:

“Then Bitmap Index Scan, would set some bits to 1, depending on which page in table might contain row that should be returned.”
Could you tell me how does PostgreSQL know a page contains row or not?

16. depesz says:

2018-06-06 at 14:13

@Thanh:

it opens the page, and checks if the row is there, and is visible to current transaction.

Comments are closed.

Explaining the unexplainable – part 3

In previous post in the series I wrote about how to interpret single line in explain analyze output, it's structure, and later on described all basic data-getting
operations (nodes in explain tree).

Today, we'll move towards more complicated operations.

Function Scan
Example:

$ explain analyze select * from generate_Series(1,10) i;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i (cost=0.00..10.00 rows=1000 width=4) (actual time=0.012..0.013 rows=10 loops=1)
Total runtime: 0.034 ms
(2 rows)

Generally it's so simple, that I shouldn't need describing it, but since I will use it in next examples, I decided to write a thing about it.

Function Scan, is very simple node – it runs a function that returns recordset – that is, it will not run function like “lower()", but a function that returns (at least
potentially) multiple rows, or multiple columns. After the function will return rows, these are returned to whatever is above Function Scan in plan tree, or to client,
if Function Scan is the top node.

The only additional logic it might have, is ability to filter returned rows, like in here:

$ explain analyze select * from generate_Series(1,10) i where i < 3;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i (cost=0.00..12.50 rows=333 width=4) (actual time=0.012..0.014 rows=2 loops=1)
Filter: (i < 3)
Rows Removed by Filter: 8
Total runtime: 0.030 ms
(4 rows)

Sort
This seems to be easy to understand – sort gets given records and returns them sorted in some way.

Example:
$ explain analyze select * from pg_class order by relname;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Sort (cost=22.88..23.61 rows=292 width=203) (actual time=0.230..0.253 rows=295 loops=1)
Sort Key: relname
Sort Method: quicksort Memory: 103kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.048 rows=295 loops=1)
Total runtime: 0.326 ms
(5 rows)

While it is simple, it has some cool logic inside. For starters – if memory used for sorting would be more than work_mem, it will switch to using disk based
sorting:

$ explain analyze select random() as x from generate_series(1,14000) i order by x;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort (cost=62.33..64.83 rows=1000 width=0) (actual time=16.713..18.090 rows=14000 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 998kB
-> Function Scan on generate_series i (cost=0.00..12.50 rows=1000 width=0) (actual time=2.036..4.533 rows=14000 loops=1)
Total runtime: 18.942 ms
(5 rows)

$ explain analyze select random() as x from generate_series(1,15000) i order by x;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort (cost=62.33..64.83 rows=1000 width=0) (actual time=27.052..28.780 rows=15000 loops=1)
Sort Key: (random())
Sort Method: external merge Disk: 264kB
-> Function Scan on generate_series i (cost=0.00..12.50 rows=1000 width=0) (actual time=2.171..4.894 rows=15000 loops=1)
Total runtime: 29.767 ms
(5 rows)

Please note the Sort Method: change above.

To handle such cases, Pg will use temporary files stored in $PGDATA/base/pgsql_tmp/ directory. They will of course be removed as soon as they are not needed.

One additional feature is that Sort can change it's method of working if it's called by Limit operation, like here:

$ explain analyze select * from pg_class order by relfilenode limit 5;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Limit (cost=15.77..15.78 rows=5 width=203) (actual time=0.119..0.120 rows=5 loops=1)
-> Sort (cost=15.77..16.50 rows=292 width=203) (actual time=0.118..0.118 rows=5 loops=1)
Sort Key: relfilenode
Sort Method: top-N heapsort Memory: 26kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.005..0.047 rows=295 loops=1)
Total runtime: 0.161 ms
(6 rows)

Normally, to sort given dataset, you need to process it in whole. But Pg knows, that if you need only some small number of rows, it doesn't have to sort whole
dataset, and it's good enough to get just the first values.

In Big O notation, general sort has complexity of O(m * log(m)), but Top-N has complexity of O(m * log(n)) – where m is number of rows in table, and n is
number of returned rows. What's most important – this kind of sort also uses much less memory (after all, it doesn't have to construct whole dataset of sorted rows,
just couple of rows), so it's less likely to use slow disk for temporary files.

Limit
I used limit many times, because it's so simple, but let's describe it fully. Limit operation runs it's sub-operation, and returns just first N rows from what it returned.
Usually it also stops sub-operation afterwards, but in some cases (pl/PgSQL functions for example), the sub-operation is already finished when it returned first
row.

Simple example:

$ explain analyze select * from pg_class;

QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.047 rows=295 loops=1)
Total runtime: 0.096 ms
(2 rows)

$ explain analyze select * from pg_class limit 2;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.07 rows=2 width=203) (actual time=0.009..0.010 rows=2 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.009 rows=2 loops=1)
Total runtime: 0.045 ms
(3 rows)

As you can see using limit in the 2nd case caused underlying Seq Scan to finish it's work immediately after finding two rows.

HashAggregate
This operation is used basically whenever you are using GROUP BY and some aggregates, like sum(), avg(), min(), max() or others.

Example:

$ explain analyze select relkind, count(*) from pg_Class group by relkind;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12.38..12.42 rows=4 width=1) (actual time=0.223..0.224 rows=5 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1) (actual time=0.008..0.053 rows=295 loops=1)
Total runtime: 0.273 ms
(3 rows)

HashAggregate does something like this: for every row it gets, it finds GROUP BY “key" (in this case relkind). Then, in hash (associative array, dictionary), puts
given row into bucket designated by given key.

After all rows have been processed, it scans the hash, and returns single row per each key value, when necessary – doing appropriate calculations (sum, min, avg,
and so on).

It is important to understand that HashAggregate has to scan all rows before it can return even single row.

Now, if you understand it, you should see potential problem: well, what about case when there are millions of rows? The hash will be too big to fit in memory.
And here, again, we'll be using work_mem. If generated hash is too big, it will “spill" to disk (again in the $PGDATA/base/pgsql_tmp).

This means that if we have plan that has both HashAggregate and Sort – we can use up to 2 * work_mem. And such plan is simple to get:

$ explain analyze select relkind, count(*) from pg_Class group by relkind order by relkind;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Sort (cost=12.46..12.47 rows=4 width=1) (actual time=0.260..0.261 rows=5 loops=1)
Sort Key: relkind
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=12.38..12.42 rows=4 width=1) (actual time=0.221..0.222 rows=5 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1) (actual time=0.006..0.044 rows=295 loops=1)
Total runtime: 0.312 ms
(6 rows)

In reality – single query can use many times work_mem, as work_mem is a limit per operation. So, if your query uses 1000 of HashAggregates and Sorts (and
other work_mem using operations) total memory usage can get pretty high.

Hash Join / Hash

Since we were discussing HashAggregate, it seemed natural to move to Hash Join.

This operation, unlike all other that we previously discussed, has two sub operations. One of them is always “Hash", and the other is something else.

Hash Join is used, as name suggests to join two recordsets. For example like here:

$ explain analyze select * from pg_class c join pg_namespace n on c.relnamespace = n.oid;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1.14..16.07 rows=292 width=316) (actual time=0.036..0.343 rows=295 loops=1)
Hash Cond: (c.relnamespace = n.oid)
-> Seq Scan on pg_class c (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.044 rows=295 loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=117) (actual time=0.012..0.012 rows=6 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on pg_namespace n (cost=0.00..1.06 rows=6 width=117) (actual time=0.004..0.005 rows=6 loops=1)
Total runtime: 0.462 ms
(7 rows)

It works like this – first Hash Join calls “Hash", which in turns calls something else (Seq Scan on pg_namespace in our case). Then, Hash makes a memory (or
disk, depending on size) hash/associative-array/dictionary with rows from the source, hashed using whatever is used to join the data (in our case, it's OID column
in pg_namespace).

Of course – you can have many rows for given join key (well, not in this case, as I'm joining using primary key, but generally, it's perfectly possible to have
multiple rows for single hash key.

So, using Perl notation, output of Hash is something like:

{
'123' => [ { data for row with OID = 123 }, ],
'256' => [ { data for row with OID = 256 }, ],
...
}

Then, Hash Join runs the second suboperation (Seq Scan on pg_class in our case), and for each row from it, it does:

1. check if join key (pg_class.relnamespace in our case) is in hash returned by Hash operation
2. if it is not – given row from suboperation is ignored (will not be returned)
3. if it exists – Hash Join fetches rows from hash, and based on row from one side, and all rows from hash, it generates output rows

It is important to note that both sides are run only once ( in our case, these both are seq scans), but first (the one called by Hash) has to return all rows, which have
to be stored in hash, and the other is processed one row at a time, and some rows will get skipped if they don't exist in hash from the other side (hope the sentence
is clear, there are many “hash"es there).

Of course, since both subscans can be any type of operation, these can do filter or index scan or whatever you can imagine.

Final note for Hash Join/Hash is that the Hash operation, just like Sort and HashAggregate – will use up to work_mem of memory.

Nested Loop
Since we're at joins – we have to discuss Nested Loop. Example:

$ explain analyze select a.* from pg_class c join pg_attribute a on c.oid = a.attrelid where c.relname in ( 'pg_class',
'pg_namespace' );
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------------------------
Nested Loop (cost=0.28..52.32 rows=16 width=203) (actual time=0.057..0.134 rows=46 loops=1)
-> Seq Scan on pg_class c (cost=0.00..11.65 rows=2 width=4) (actual time=0.043..0.080 rows=2 loops=1)
Filter: (relname = ANY ('{pg_class,pg_namespace}'::name[]))
Rows Removed by Filter: 291
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..20.25 rows=8 width=203) (actual
time=0.007..0.015 rows=23 loops=2)
Index Cond: (attrelid = c.oid)
Total runtime: 0.182 ms

This is very interesting plan as it can run given operations multiple times.

Just as Hash Join, Nested Loop has two “children". First it runs “Seq Scan" (in our example, generally, first it runs the first node that is there, and then, for every
row it returns (2 rows in our example), it runs 2nd operation (Index Scan on pg_attribute in our case).

You might notices that Index Scan has “loops=2" in it's actual run metainfo. This means that this operation has been run twice, and the other values (rows, time)
are averages across all runs.

Let's check this plain from explain.depesz.com. Note that the actual times for the categories index scan are 0.002 to 0.003 ms. But total time on this node is
78.852ms, because this index scan has been ran over 26k times.

So, the processing looks like this:

1. Nested Loop runs one side of join, once. Let's name it “A".
2. For every row in “A", it runs second operation (let's name it “B")
3. if “B" didn't return any rows – data from “A" is ignored
4. if “B" did return rows, for every row it returned, new row is returned by Nested Loop, based on current row from A, and current row from B

Merge Join
Another method of joining data is called Merge Join. This is used, if joined datasets are (or can be cheaply) sorted using join key.

I don't have nice example of this, so I will force it by using subselects that sort data before joining:

$ explain analyze select * from

( select oid, * from pg_class order by oid) as c
join
( select * from pg_attribute a order by attrelid) as a
on c.oid = a.attrelid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------------------------------------
Merge Join (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
Merge Cond: (pg_class.oid = a.attrelid)
-> Sort (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
Sort Key: pg_class.oid
Sort Method: quicksort Memory: 102kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
-> Materialize (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..183.92 rows=2273 width=203) (actual
time=0.015..0.752 rows=2274 loops=1)
Total runtime: 4.009 ms
(9 rows)

Merge Join, as other joins, runs two sub operations (Sort and Materialize in this case). Because both of these return data sorted and the sort order is the same as
join operation, Pg can scan both returnsets from suboperations at the same time, and simply check whether ids match.

The procedure looks like this:

1. if join column on right side is the same as join column on left side:
o return new joined row, based on current rows on the right and left sides
o get next row from right side (or, if there are no more rows, on left side)
o go to step 1
2. if join column on right side is “smaller" than join column on left side:
o get next row from right side (if there are no more rows, finish processing)
o go to step 1
3. if join column on right side is “larger" than join column on left side:
o get next row from left side (if there are no more rows, finish processing)
o go to step 1

This is very cool way of joining datasets, but it works only for sorted sources. Based on current db of explain.depesz.com, there are:

 44,721 plans which contain “Nested Loop" operation

 34,305 plans with “Hash Join"
 only 8,889 that uses “Merge Join"

Hash Join / Nested Loop / Merge Join modifiers

In all examples above I showed that Join operation returns row only when it gets rows from both sides of join.
But this is not always the case. We can have LEFT/RIGHT/FULL outer joins. And there are so called anti-joins.

In case of left/right joins, the operation names get changed to:

 Hash Left Join

 Hash Right Join
 Merge Left Join
 Merge Right Join
 Nested Loop Left Join

There is no Nested Loop Right Join, because Nested Loop always starts with left side as basis to looping. So join that uses RIGHT JOIN, that would use Nested
Loop, will get internally transformed to LEFT JOIN so that Nested Loop can work.

In all those cases the logic is simple – we have two sides of join – left and right. And when side is mentioned in join, then join will return new row even if the
other side doesn't have matching rows.

This all happens with queries like:

select * from a left join b on ...

(or right join).

All other information for Hash Join/Merge Join or Nested Loop are the same, it's just a slight change in logic on when to generate output row.

There is also a version called Full Join, with operation names:

 Hash Full Join

 Merge Full Join

In which case join generates new output row regardless of whether data on either side is missing (as long as the data is there for one side). This happens in case of:

select * from a full join b ...

Of course all processing is the same as previously.

There are also so called Anti Joins. Their operation names look like:

 Hash Anti Join

 Merge Anti Join
 Nested Loop Anti Join

In these cases Join emits row only if the right side doesn't find any row. This is useful when you're doing things like “WHERE not exists ()" or “left join … where
right_table.column is null".

Like in here:

$ explain analyze select * from pg_class c where not exists (select * from pg_attribute a where a.attrelid = c.oid and a.attnum =
10);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------
Hash Anti Join (cost=62.27..78.66 rows=250 width=203) (actual time=0.145..0.448 rows=251 loops=1)
Hash Cond: (c.oid = a.attrelid)
-> Seq Scan on pg_class c (cost=0.00..10.92 rows=292 width=207) (actual time=0.009..0.195 rows=293 loops=1)
-> Hash (cost=61.75..61.75 rows=42 width=4) (actual time=0.123..0.123 rows=42 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..61.75 rows=42 width=4) (actual
time=0.021..0.109 rows=42 loops=1)
Index Cond: (attnum = 10)
Heap Fetches: 0
Total runtime: 0.521 ms
(9 rows)

In here, Pg ran the right side (Index Scan on pg_attribute), hashed it, and then ran left side (Seq Scan on pg_class), returning only rows where there was no item in
Hash for given pg_class.oid.

Materialize
This operation showed earlier in example for Merge Join, but it is also usable in another cases.

psql has many internal commands. One of them is \dTS – which lists all system datatypes. Internally \dTS runs this query:

SELECT n.nspname as "Schema",

pg_catalog.format_type(t.oid, NULL) AS "Name",
pg_catalog.obj_description(t.oid, 'pg_type') as "Description"
FROM pg_catalog.pg_type t
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
WHERE (t.typrelid = 0 OR (SELECT c.relkind = 'c' FROM pg_catalog.pg_class c WHERE c.oid = t.typrelid))
AND NOT EXISTS(SELECT 1 FROM pg_catalog.pg_type el WHERE el.oid = t.typelem AND el.typarray = t.oid)
AND pg_catalog.pg_type_is_visible(t.oid)
ORDER BY 1, 2;

It's plan is:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------------------------
Sort (cost=2783.00..2783.16 rows=65 width=68) (actual time=3.883..3.888 rows=87 loops=1)
Sort Key: n.nspname, (format_type(t.oid, NULL::integer))
Sort Method: quicksort Memory: 39kB
-> Nested Loop Left Join (cost=16.32..2781.04 rows=65 width=68) (actual time=0.601..3.657 rows=87 loops=1)
Join Filter: (n.oid = t.typnamespace)
Rows Removed by Join Filter: 435
-> Hash Anti Join (cost=16.32..2757.70 rows=65 width=8) (actual time=0.264..0.981 rows=87 loops=1)
Hash Cond: ((t.typelem = el.oid) AND (t.oid = el.typarray))
-> Seq Scan on pg_type t (cost=0.00..2740.26 rows=81 width=12) (actual time=0.012..0.662 rows=157 loops=1)
Filter: (pg_type_is_visible(oid) AND ((typrelid = 0::oid) OR (SubPlan 1)))
Rows Removed by Filter: 185
SubPlan 1
-> Index Scan using pg_class_oid_index on pg_class c (cost=0.15..8.17 rows=1 width=1) (actual
time=0.002..0.002 rows=1 loops=98)
Index Cond: (oid = t.typrelid)
-> Hash (cost=11.33..11.33 rows=333 width=8) (actual time=0.241..0.241 rows=342 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on pg_type el (cost=0.00..11.33 rows=333 width=8) (actual time=0.002..0.130 rows=342 loops=1)
-> Materialize (cost=0.00..1.09 rows=6 width=68) (actual time=0.000..0.001 rows=6 loops=87)
-> Seq Scan on pg_namespace n (cost=0.00..1.06 rows=6 width=68) (actual time=0.002..0.003 rows=6 loops=1)
Total runtime: 3.959 ms

For easier viewing, I also uploaded this plan to explain.depesz.com.

Please note that operation #9 there is Materialize. Why is that so?

Materialize is called by Nested Loop Left Join – operation #2. We know that Nested Loop causes given operation to be ran multiple times, in this case – 87 times.

Right side of the join is Seq Scan on pg_namespace. So Pg, theoretically, should run Sequential Scan on pg_namespace 87 times. Given that single Seq Scan of
this table takes 0.003ms, we could expect total time of ~ 0.25ms.

But Pg is smarter than that. It realized that it will be cheaper to scan the table just once, and build memory representation of all the rows in there. So, that next
time, it will not have to scan the table, check visibility information, parse data pages. It will just get the data from memory.

Thanks to this total time of: reading the table once, preparing memory representation of the data and scanning this representation 87 times was 0.087ms.

You might then ask, OK, but why did the merge join earlier on use materialize – it was just doing one scan? Let's remind the plan:

$ explain analyze select * from

Yes. It was run just once. The problem is, though, that source of data for Merge Join has to match several criteria. Some are obvious (data has to be sorted) and
some are not so obvious as are more technical (data has to be scrollable back and forth).

Because of this (these not so obvious criteria) sometimes Pg will have to Materialize the data coming from source (Index Scan in our case) so that it will have all
the necessary features when using it.

Long story short – Materialize gets data from underlying operation and stores it in memory (or partially in memory) so that it can be used faster, or with additional
features that underlying operation doesn't provide.

And that's it for today. I thought that I will be done, but there are still many operations that need to be described. So, we will have at least two more posts in the
series (rest of the operations, and statistics info).

Posted on 2013-05-09|Tags aggregate, explain, function, hash, join, limit, loop, materialize, merge, nested, postgresql, scan, sort, unexplainable|

5 thoughts on “Explaining the unexplainable – part 3”

1. David Johnston says:

2013-05-09 at 22:58

With respect to these numbers:

44,721 plans which contain “Nested Loop” operation

34,305 plans with “Hash Join”
only 8,889 that uses “Merge Join”
There is major selection bias going on here since most plans that are submitted to depesz are likely poorly performing. Since “Merge Join” is a fairly fast
operation for joining it is likely that few of these would end up being explained.

2. Victor says:

2013-05-27 at 00:20

Very informative series, thanks a lot!

As you speak of Anti joins, perhaps you should also mention Semi joins, the ones produced by “WHERE EXISTS()” or “… IN()” constructs?

Like the one produced by:

EXPLAIN SELECT * FROM pg_class WHERE relnamespace IN (
SELECT oid FROM pg_namespace WHERE nspname=’public’);

3. depesz says:

2013-05-27 at 15:16

@Victor:

yes, I forgot about these. Interestingly – the query that you showed generates (in my pg 9.3) Hash join – https://explain.depesz.com/s/IqG .

But, to the point.

Semi Join is basically a reverse of Anti Join. When doing semi join of table “a” and table “b” using some comparison, emitted are only rows from “a” that
there is row in b that matches the condition.

So, Semi Join, is similar to:

select a.* from a join b on (a.x=b.x);

but, in case b.x had duplicates, join above would duplicate rows from “a”. But semi join would not, as it only checks if a row is there on “b-side”, and if
yes, it emits row from a.

There can be all three variants of semi joins (Nested Loop Semi Join, Hash Semi Join and Merge Semi Join).

4. Victor says:

2013-05-28 at 13:13

Interesting, 9.2 gives: https://explain.depesz.com/s/mLx

Is this smth new in 9.3 maybe?

5. depesz says:

2013-05-28 at 14:49

@Victor: it could be related to statistics. This is what the next (and final) part of the series will be about, but I have yet to write it.

Comments are closed.

Explaining the unexplainable – part 4

In this, hopefully 2nd to last, post in the series, I will cover the rest of usually happening operations that you can see in your explain outputs.

Unique
Name seems to be clear about what's going on here – it removes duplicate data.

This might happen, for example, when you're doing:

select distinct field from table

But in newer Pgs this query will usually be done using HashAggregate.

The problem with Unique is that is requires data to be sorted. Not because it needs data in any particular order – but it needs it so that all rows with the same value
will be “together".

This makes it really cool (when possible to use) because it doesn't use virtually any memory. It just checks if value in previous row is the same as in current, and if
yes – discards it. That's all.
So, we can force usage of it, by pre-sorting data:

$ explain select distinct relkind from (select relkind from pg_class order by relkind) as x;
QUERY PLAN
-----------------------------------------------------------------------
Unique (cost=22.88..27.26 rows=4 width=1)
-> Sort (cost=22.88..23.61 rows=292 width=1)
Sort Key: pg_class.relkind
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1)
(4 rows)

Append
This plan simply runs multiple sub-operations, and returns all the rows that were returned as one resultset.

This is used by UNION/UNION ALL queries:

$ explain select oid from pg_class union all select oid from pg_proc union all select oid from pg_database;
QUERY PLAN
-----------------------------------------------------------------
Append (cost=0.00..104.43 rows=2943 width=4)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=4)
-> Seq Scan on pg_proc (cost=0.00..92.49 rows=2649 width=4)
-> Seq Scan on pg_database (cost=0.00..1.02 rows=2 width=4)
(4 rows)

In here you can see append running three scans on three tables and returning all the rows together.

Please note that I used UNION ALL. If I'd used UNION, we would get:

$ explain select oid from pg_class union select oid from pg_proc union select oid from pg_database;
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=141.22..170.65 rows=2943 width=4)
-> Append (cost=0.00..133.86 rows=2943 width=4)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=4)
-> Seq Scan on pg_proc (cost=0.00..92.49 rows=2649 width=4)
-> Seq Scan on pg_database (cost=0.00..1.02 rows=2 width=4)
(5 rows)

This is because UNION removes duplicate rows – which is, in this case, done using HashAggregate operation.

Result
This happens mostly in very simple test queries. This operation is used when your query selects some constant value (or values):

$ explain select 1, 2;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
(1 row)

Aside from test queries it can be sometimes seen in queries that do “insert, but don't if it would be duplicate" kind of thing:

$ explain insert into t (i) select 1 where not exists (select * from t where i = 1);
QUERY PLAN
---------------------------------------------------------------------
Insert on t (cost=3.33..3.35 rows=1 width=4)
-> Result (cost=3.33..3.34 rows=1 width=0)
One-Time Filter: (NOT $0)
InitPlan 1 (returns $0)
-> Seq Scan on t t_1 (cost=0.00..40.00 rows=12 width=0)
Filter: (i = 1)
(6 rows)

Values Scan
Just like Result above, Values Scan is for returning simple, entered in query, data, but this time – it can be whole recordset, based on VALUES() functionality.

In case you don't know, you can select multiple rows with multiple columns, without any table, just by using VALUES syntax, like here:

$ select * from ( values (1, 'hubert'), (2, 'depesz'), (3, 'lubaczewski') ) as t (a,b);
a | b
---+-------------
1 | hubert
2 | depesz
3 | lubaczewski
(3 rows)

Such query plan looks like:

QUERY PLAN
--------------------------------------------------------------
Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=36)
(1 row)

It is also most commonly used in INSERTs, but it has other uses too, like custom sorting.
GroupAggregate
This is similar to previously described HashAggregate.

The difference is that for GroupAggregate to work data has to be sorted using whatever column(s) you used for your GROUP BY clause.

Just like Unique – GroupAggregate uses very little memory, but forces ordering of data.

Example:

$ explain select relkind, count(*) from (select relkind from pg_class order by relkind) x group by relkind;
QUERY PLAN
-----------------------------------------------------------------------
GroupAggregate (cost=22.88..28.03 rows=4 width=1)
-> Sort (cost=22.88..23.61 rows=292 width=1)
Sort Key: pg_class.relkind
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1)
(4 rows)

HashSetOp
This operation is used by INTERSECT/EXCEPT operations (with optional “ALL" modifier).

It works by running sub-operation of Append for a pair of sub-queries, and then, based on result and optional ALL modifier, it figures which rows should be
returned. I haven't digged in the source code so I can't tell you exactly how it works, but given the name and the operation, it looks like a simple counter-based
solution.

In here we can see that unlike UNION, these operations work on two sources of data:

$ explain select * from (select oid from pg_Class order by oid) x intersect all select * from (select oid from pg_proc order by
oid) y;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashSetOp Intersect All (cost=0.15..170.72 rows=292 width=4)
-> Append (cost=0.15..163.36 rows=2941 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.15..18.37 rows=292 width=4)
-> Index Only Scan using pg_class_oid_index on pg_class (cost=0.15..12.53 rows=292 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.28..145.00 rows=2649 width=4)
-> Index Only Scan using pg_proc_oid_index on pg_proc (cost=0.28..92.02 rows=2649 width=4)
(6 rows)

but with three sources, we get more complex tree:

$ explain select * from (select oid from pg_Class order by oid) x intersect all select * from (select oid from pg_proc order by
oid) y intersect all select * from (Select oid from pg_database order by oid) as w;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
HashSetOp Intersect All (cost=1.03..172.53 rows=2 width=4)
-> Append (cost=1.03..171.79 rows=294 width=4)
-> Subquery Scan on "*SELECT* 3" (cost=1.03..1.07 rows=2 width=4)
-> Sort (cost=1.03..1.03 rows=2 width=4)
Sort Key: pg_database.oid
-> Seq Scan on pg_database (cost=0.00..1.02 rows=2 width=4)
-> Result (cost=0.15..170.72 rows=292 width=4)
-> HashSetOp Intersect All (cost=0.15..170.72 rows=292 width=4)
-> Append (cost=0.15..163.36 rows=2941 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.15..18.37 rows=292 width=4)
-> Index Only Scan using pg_class_oid_index on pg_class (cost=0.15..12.53 rows=292 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.28..145.00 rows=2649 width=4)
-> Index Only Scan using pg_proc_oid_index on pg_proc (cost=0.28..92.02 rows=2649 width=4)
(13 rows)

CTE Scan
This is similar to previously mentioned Materialized operation. It runs a part of a query, and stores output so that it can be used by other part (or parts) of the
query.

Example:

1. $ explain analyze with x as (select relname, relkind from pg_class) select relkind, count(*), (select count(*) from x) from
x group by relkind;
2. QUERY PLAN
3. -----------------------------------------------------------------------------------------------------------------
4. HashAggregate (cost=24.80..26.80 rows=200 width=1) (actual time=0.466..0.468 rows=6 loops=1)
5. CTE x
6. -> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=65) (actual time=0.009..0.127 rows=295 loops=1)
7. InitPlan 2 (returns $1)
8. -> Aggregate (cost=6.57..6.58 rows=1 width=0) (actual time=0.085..0.085 rows=1 loops=1)
9. -> CTE Scan on x x_1 (cost=0.00..5.84 rows=292 width=0) (actual time=0.000..0.055 rows=295 loops=1)
10. -> CTE Scan on x (cost=0.00..5.84 rows=292 width=1) (actual time=0.012..0.277 rows=295 loops=1)
11. Total runtime: 0.524 ms
12. (8 rows)

Please note that pg_class is scanned only once – line #6. But it's results are stored in “x", and then scanned twice – inside aggregate (line #9) and HashAggregate
(10).
How is it different from Materialize? To answer fully, one would need to jump into sources, but I would say that the difference stems from simple fact that CTE's
are user defined. While Materialize is helper operation that Pg chooses to use when (it thinks) it makes sense.

The very important thing is that CTEs are ran just as specified. So they can be used to circumvent some not-so-good optimizations that planner normally can do.

InitPlan
This plan happens whenever there is a part of your query that can (or have to) be calculated before anything else, and it doesn't depend on anything in the rest of
your query.

For example, let's assume you'd want such query:

$ explain select * from pg_class where relkind = (select relkind from pg_class order by random() limit 1);
QUERY PLAN
------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=13.11..24.76 rows=73 width=203)
Filter: (relkind = $0)
InitPlan 1 (returns $0)
-> Limit (cost=13.11..13.11 rows=1 width=1)
-> Sort (cost=13.11..13.84 rows=292 width=1)
Sort Key: (random())
-> Seq Scan on pg_class pg_class_1 (cost=0.00..11.65 rows=292 width=1)
(7 rows)

In this case – getting the limit/sort/seq-scan is needed to run before normal seq scan on pg_class – because Pg will have to compare relkind value with the value
returned by subquery.

On the other hand, if I'd write:

$ explain select *, (select length('depesz')) from pg_class;

QUERY PLAN
-------------------------------------------------------------
Seq Scan on pg_class (cost=0.01..10.93 rows=292 width=203)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0)
(3 rows)

Pg correctly sees that the subselect column does not depend on any data from pg_class table, so it can be run just once, and doesn't have to redo the length-
calculation for every row.

Of course you can have many init plans, like in here:

$ explain select *, (select length('depesz')) from pg_class where relkind = (select relkind from pg_class order by random() limit
1);
QUERY PLAN
------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=13.12..24.77 rows=73 width=203)
Filter: (relkind = $1)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0)
InitPlan 2 (returns $1)
-> Limit (cost=13.11..13.11 rows=1 width=1)
-> Sort (cost=13.11..13.84 rows=292 width=1)
Sort Key: (random())
-> Seq Scan on pg_class pg_class_1 (cost=0.00..11.65 rows=292 width=1)
(9 rows)

There is one important thing, though – numbering of init plans within single query is “global", and not “per operation".

SubPlan
SubPlans are a bit similar to NestedLoop. In this way that these can be called many times.

SubPlan is called to calculate data from a subquery, that actually does depends on current row.

For example:

$ explain analyze select c.relname, c.relkind, (Select count(*) from pg_Class x where c.relkind = x.relkind) from pg_Class c;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on pg_class c (cost=0.00..3468.93 rows=292 width=65) (actual time=0.135..26.717 rows=295 loops=1)
SubPlan 1
-> Aggregate (cost=11.83..11.84 rows=1 width=0) (actual time=0.090..0.090 rows=1 loops=295)
-> Seq Scan on pg_class x (cost=0.00..11.65 rows=73 width=0) (actual time=0.010..0.081 rows=93 loops=295)
Filter: (c.relkind = relkind)
Rows Removed by Filter: 202
Total runtime: 26.783 ms
(7 rows)

For every row that is returned by scan on “pg_class as c", Pg has to run SubPlan, which checks how many rows in pg_class have the same (as currently processed
row) value in relkind column.

Please note “loops=295" in the “Seq Scan on pg_class x" line, and matching “rows=295" in the earlier “Seq Scan on pg_class c" node.

Other ?
Yes. There are other operations. Some of them are too rare to care about (especially since you do have the great source of knowledge: sources), some are (I
suspect) older versions of newer nodes.

If you have a plan with operation I did not cover, and you don't understand it – please let me know in comments – link to explain output on explain.depesz.com,
operation name, and Pg version that you have it in. Will comment on such cases with whatever information I might find.

Posted on 2013-05-19|Tags append, cte, explain, groupaggregate, hashsetop, initplan, postgresql, result, setop, subplan, unexplainable, unique, values|

4 thoughts on “Explaining the unexplainable – part 4”

1. kim says:

2015-06-16 at 11:04

This makes it really cool (when possible to use) because it doesn’t use virtually any memory.

What if work_mem is enough to make quicksort run in memory?

2. depesz says:

2015-06-16 at 12:59

@kim:
sorry, but I don’t understand what you’re asking about. The blogpost is rather long, and described many different execution nodes, but none of these are
sorts, so I just don’t get the context you’re asking in.

3. Yang Liu says:

2016-06-16 at 00:18

I’m not sure if this is the right place to ask. It’s related to the source code.

I’m studying queries that involve initplan and subplan. Nodes such as hash joins, seq scan are executed by the function ExecProcNode(). But I can’t find
which function executes a subplan or initplan .

4. depesz says:

2016-06-16 at 11:31

@Yang Liu:

sorry, can’t answer. but for source-code level questions, I can suggest that you ask on pgsql-hackers mailing list.

Comments are closed.

Explaining the unexplainable – part 5

In previous posts in this series, I talked about how to read EXPLAIN output, and what each line (operation/node) means.

Now, in the final post, I will try to explain how it happens that Pg chooses “Operation X" over “Operation Y".

You could have heard that PostgreSQL's planner is choosing operations based on statistics. What statistics?

Let's imagine a simplest possible case:

SELECT * FROM table WHERE column = some_value;

If all rows in have the same some_value – then using (potentially existing) index on column doesn't make sense.

On the other hand – if column is unique (or almost unique) – usage of index is really good idea.

Let's see what's happening:

create table test ( all_the_same int4, almost_unique int4 );

CREATE TABLE

insert into test ( all_the_same, almost_unique )

select 123, random() * 1000000 from generate_series(1,100000);
INSERT 0 100000

So, now I have a 100,000 row table, where “all_the_same" column has always the same value (123), and almost_unique column is, well, almost unique:

select count(*), count(distinct almost_unique) from test;

count | count
--------+-------
100000 | 95142
(1 row)

Now, to make it all equal, I will create two simple indexes:

create index i1 on test (all_the_same);

CREATE INDEX

create index i2 on test (almost_unique);

CREATE INDEX

OK. Test setup is ready. And how are the plans:

explain select * from test where all_the_same = 123;

QUERY PLAN
------------------------------------------------------------
Seq Scan on test (cost=0.00..1693.00 rows=100000 width=8)
Filter: (all_the_same = 123)
(2 rows)

explain select * from test where almost_unique = 123;

QUERY PLAN
---------------------------------------------------------------
Index Scan using i2 on test (cost=0.29..8.31 rows=1 width=8)
Index Cond: (almost_unique = 123)
(2 rows)

As you can see Pg chose wisely. But the interesting thing is the “rows=" estimate. How does it know how many rows the query might return?

The answer lies in the ANALYZE command, or VACUUM ANALYZE.

When doing “ANALYZE" of the table, pg gets some “random sample" (more on it in a moment), and gets some stats. What are the stats, where are they, and can
we see them? Sure we can:

select * from pg_statistic where starelid = 'test'::regclass;

-[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
---------------
starelid | 16882
staattnum | 1
stainherit | f
stanullfrac | 0
stawidth | 4
stadistinct | 1
stakind1 | 1
stakind2 | 3
stakind3 | 0
stakind4 | 0
stakind5 | 0
staop1 | 96
staop2 | 97
staop3 | 0
staop4 | 0
staop5 | 0
stanumbers1 | {1}
stanumbers2 | {1}
stanumbers3 | [null]
stanumbers4 | [null]
stanumbers5 | [null]
stavalues1 | {123}
stavalues2 | [null]
stavalues3 | [null]
stavalues4 | [null]
stavalues5 | [null]
-[ RECORD 2 ]---------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
---------------
starelid | 16882
staattnum | 2
stainherit | f
stanullfrac | 0
stawidth | 4
stadistinct | -0.92146
stakind1 | 1
stakind2 | 2
stakind3 | 3
stakind4 | 0
stakind5 | 0
staop1 | 96
staop2 | 97
staop3 | 97
staop4 | 0
staop5 | 0
stanumbers1 | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
stanumbers2 | [null]
stanumbers3 | {-0.000468686}
stanumbers4 | [null]
stanumbers5 | [null]
stavalues1 |
{21606,27889,120502,289914,417495,951355,283,1812,3774,6028,6229,10372,12234,13291,18309,18443,21758,22565,26634,28392,28413,31208
,32890,36563,39277,40574,44527,49954,53344,53863,56492,56715,60856,62993,64294,65275,65355,68353,71194,74718,77205,82096,82783,847
64,85301,87498,90990,94043,97304,98779,101181,103700,103889,106288,108562,110796,113154,117850,121578,122643,123874,126299,129236,
129332,129512,134430,134980,136987,137368,138175,139001,141519,142934,143432,143707,144501,148633,152481,154327,157067,157799,1624
37,164072,164337,165942,167611,170319,171047,177383,184134,188702,189005,191786,192718,196330,197851,199457,202652,202689,205983}
stavalues2 |
{2,10560,20266,31061,40804,50080,59234,69240,79094,89371,99470,109557,119578,130454,140809,152052,162656,173855,183914,194263,2045
93,214876,224596,233758,243246,253552,264145,273855,283780,294475,303972,314544,324929,335008,346169,356505,367395,376639,387302,3
97004,407093,416615,426646,436146,445701,455588,466463,475910,485228,495434,505425,515853,525374,534824,545387,554794,563591,57372
1,584021,593368,602935,613238,623317,633947,643431,653397,664177,673976,684042,694791,703922,714113,724602,735848,745596,754477,76
4171,772535,781924,791652,801703,812487,822196,831618,841665,850722,861532,872067,881570,891654,901595,910975,921698,931785,940716
,950623,960551,970261,979855,989540,999993}
stavalues3 | [null]
stavalues4 | [null]
stavalues5 | [null]

This table (pg_statistic) is, of course, described the docs. But it is pretty cryptic. Of course, you can find very precise explanation in sources, but that's not (usually)
best solution.

Luckily, there is a view over this table, that contains the same data, in more readable way:

select * from pg_stats where tablename = 'test';

-[ RECORD 1 ]----------+----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
--------------------------
schemaname | public
tablename | test
attname | all_the_same
inherited | f
null_frac | 0
avg_width | 4
n_distinct | 1
most_common_vals | {123}
most_common_freqs | {1}
histogram_bounds | [null]
correlation | 1
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]
-[ RECORD 2 ]----------+----------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
--------------------------
schemaname | public
tablename | test
attname | almost_unique
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -0.92146
most_common_vals |
{21606,27889,120502,289914,417495,951355,283,1812,3774,6028,6229,10372,12234,13291,18309,18443,21758,22565,26634,28392,28413,31208
,32890,36563,39277,40574,44527,49954,53344,53863,56492,56715,60856,62993,64294,65275,65355,68353,71194,74718,77205,82096,82783,847
64,85301,87498,90990,94043,97304,98779,101181,103700,103889,106288,108562,110796,113154,117850,121578,122643,123874,126299,129236,
129332,129512,134430,134980,136987,137368,138175,139001,141519,142934,143432,143707,144501,148633,152481,154327,157067,157799,1624
37,164072,164337,165942,167611,170319,171047,177383,184134,188702,189005,191786,192718,196330,197851,199457,202652,202689,205983}
most_common_freqs | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-
05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
histogram_bounds |
{2,10560,20266,31061,40804,50080,59234,69240,79094,89371,99470,109557,119578,130454,140809,152052,162656,173855,183914,194263,2045
93,214876,224596,233758,243246,253552,264145,273855,283780,294475,303972,314544,324929,335008,346169,356505,367395,376639,387302,3
97004,407093,416615,426646,436146,445701,455588,466463,475910,485228,495434,505425,515853,525374,534824,545387,554794,563591,57372
1,584021,593368,602935,613238,623317,633947,643431,653397,664177,673976,684042,694791,703922,714113,724602,735848,745596,754477,76
4171,772535,781924,791652,801703,812487,822196,831618,841665,850722,861532,872067,881570,891654,901595,910975,921698,931785,940716
,950623,960551,970261,979855,989540,999993}
correlation | -0.000468686
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]

OK. So what can we learn from this?

schemaname, tablename and attname columns seem to be obvious. Inherited simply says whether values for this table contain values from any tables that did
inherit this column.

So, if I'd have:

create table z () inherits (test);

and then I'd put some data in table z, then statistics for table test would have “inherited = true".

The rest of columns means:

 null_frac – how many rows contains null value in given column. This is a fraction, so it's value from 0 to 1
 avg_width – average width of data in this column. In case of constant-width types (like int4 here) it is not really interesting, but in case of any datatype with
variable width (like text/varchar/numeric) – it is potentially interesting.
 n_distinct – very intersting value. If it is positive ( 1+ ) – it will be be simply estimated number (not fraction!) of distinct values – this we can see in case of
all_the_same column, there n_distinct is correctly 1. If it is negative, though it's meaning is different. It means then which fraction of rows has unique
value. So, in case of almost_unique, stats suggest that 92.146% of rows have unique value (which is a bit short of 95.142% which I showed earlier). The
values can be incorrect due to the “random sample" thing I mentioned earlier, and will explain more in a bit.
 most_common_vals – array of most common values in this table
 most_common_freqs – how frequent are the values from most_common_vals – again, it's fraction, so it can be at most 1 (though then we'd have only one
value in most_common_vals). In her, in almost_unique, we see that Pg “thinks" that values 21606, 27889, 120502, 289914, 417495, 951355 are the ones
that happen most often – which are not, but this is, again, caused by “random sample" effect
 histogram_bounds – array of values which divide (or should divide – again “random sample" thing) whole recordset into groups of the same number of
rows. That is – number of rows with almost_unique between 2 and 10560 is the same (more or less) as number of rows with almost_unique between
931785 and 940716
 correlation – this is interesting statistic – it shows whether there is correlation between physical row ordering on disk, and values. This can go from -1 to 1,
and generally the closer it is to -1/1 – the more correlation there is. For example – after doing “CLUSTER test using i2" – that is reordering table in
almost_unique order, I got correlation of -0.919358 – much better than shown above -0.000468686.
 most_common_elems, most_common_elem_freqs and elem_count_histogram are like most_common_vals, most_common_freqs and histogram_bounds
but for non-scalar datatypes (think: arrays, tsvectors and alike).

Based on this data, PostgreSQL can estimate how many rows will be returned by any given part of query, and based on this information it can decide whether it's
better to use seq scan or index scan or bitmap index scan. And when joining – which one should be faster – Hash Join, Merge Join or perhaps Nested Loop.

If you looked at the data above you could have asked yourself – it's pretty wide output – there are many values in the
most_common_vals/most_common_freqs/histogram_bounds arrays. Why are there so many?

Reason is simple – it's configuration. In postgresql.conf you can find default_statistics_target variable. This variable tells Pg how many values to keep in these
arrays. In my case (default) it's 100. But you can easily change it. Either by changing postgresql.conf, or even on a per-column basis, with:

alter table test alter column almost_unique set statistics 5;

After such ALTER (and ANALYZE), data in pg_stats is much shorter:

select * from pg_stats where tablename = 'test' and not inherited and attname = 'almost_unique';
-[ RECORD 1 ]----------+---------------------------------------------------------
schemaname | public
tablename | test
attname | almost_unique
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -0.92112
most_common_vals | {114832,3185,3774,6642,11984}
most_common_freqs | {0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
histogram_bounds | {2,199470,401018,596414,798994,999964}
correlation | 1
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]

Changing statistic target has also one more effect.

Let me show you. First I'll revert the change on statistic count I did with ALTER TABLE:

alter table test alter column almost_unique set statistics -1;

And now:

$ analyze verbose test;

INFO: analyzing "public.test"
INFO: "test": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, 100000 estimated total
rows
ANALYZE

$ alter table test alter column almost_unique set statistics 10;

ALTER TABLE
$ alter table test alter column all_the_same set statistics 10;
ALTER TABLE

$ analyze verbose test;

INFO: analyzing "public.test"
INFO: "test": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 3000 rows in sample, 100000 estimated total
rows
ANALYZE

Please note that the second analyze tested only 3000 rows – not 30000 as the first one.

This is the “random sample".

Analyzing all rows in a table would be prohibitively expensive in any medium or large table.

So, Pg does something rather smart.

First – it reads random part of pages in a table (reminder: each page is 8kB of data). How many – 300 * statistics_target.

Which means, in my case, with default_statistics_target = 100, it would read 30000 pages. (my table doesn't have that many, so instead it did read all of them).

From these pages, ANALYZE gets just information about live and dead rows. Afterwards, it gets data on random sample of rows – again 300 * statistics target,
and calculated the column statistics based on these data.

In my case – table had 100,000 rows, but with default_statistics_target = 100, only a third was analyzed. With statistics target the count of rows that is analyzed is
even lower – just 3000.

You could have said: OK, but then these statistics are not accurate. It could be that some super-common value just doesn't happen to be in the scanned rows. Sure.
You're right. It's possible. Though not really likely. I mean – you are getting random part of data. The chances that you'll get the x% of the table that just doesn't
happen to have any row with some value that exists in all other rows are small.

This also means that sometimes, it can happen, that running analyze will “break" your queries. For example – you'll get statistics on another pages, and it will
happen that some values will get skipped (or, on the contrary – you will get in most_common_vals things that aren't really all that common, just Pg happened to
pick right pages/rows to see it). And based on such stats, Pg will generate suboptimal plans.

If such case would hit you, the solution is rather simple – bump statistics target. This will make analyze work harder, and scan more rows, so the chances of this
happening again will get even smaller.

There is a drawback in setting large targets, though. ANALYZE has to work more, of course, but this is maintenance thing, so we don't really care about it
(usually). The problem is that having more data in pg_statistic means that more data has to be taken into consideration by Pg planner. So, while it might look
tempting to set default_statistics_target to it's max of 10,000, in reality I haven't seen database which would have it set that high.

Current default of 100 is there since 8.4. Before that it was set at 10, and it was pretty common to see on irc suggestions to increase it. Now, with default 100,
you're more or less set.

One final thing I have to talk about, though I really don't want, are settings that make Pg planner use different operations.

First – why I don't want to talk about it – I know for a fact that this can be easily abused. So please remember – these settings are for debugging problems, not for
solving problems. Application that would use them in normal operation mode is at least suspected if not outright broken. And yes, I know that sometimes you have
to. But the sometimes is very rare.

Now, with this behind me, let's see the options.

In postgresql.conf you have several settings like these:

enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_indexonlyscan = on
enable_material = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on

These settings, are for disabling given operations.

For example – setting enable_seqscan to false (which can be done with SET command in SQL session, you don't have to modify postgresql.conf) will cause
planner to use anything else it might, just to avoid seq scan.

Since sometimes it's not possible to avoid seq scan (there are no indexes on the table, for example) – these settings don't actually disable operations, just associate
huge cost with using them.

For example. With our test table, we know what searching with “all_the_same = 123" will use seq scan, as it's cheap:

explain select * from test where all_the_same = 123;

QUERY PLAN
------------------------------------------------------------
Seq Scan on test (cost=0.00..1693.00 rows=100000 width=8)
Filter: (all_the_same = 123)
(2 rows)

But if I'd disable seq scans:

set enable_seqscan = false;
SET
explain select * from test where all_the_same = 123;
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using i1 on test (cost=0.29..3300.29 rows=100000 width=8)
Index Cond: (all_the_same = 123)
(2 rows)

We see that estimated cost of getting the same data with index scan is ~ two times higher (3300.29 vs 1693).

If I'd drop i1 index:

drop index i1;

DROP INDEX
set enable_seqscan = false;
SET
explain select * from test where all_the_same = 123;
QUERY PLAN
-----------------------------------------------------------------------------
Seq Scan on test (cost=10000000000.00..10000001693.00 rows=100000 width=8)
Filter: (all_the_same = 123)
(2 rows)

And now we see that when there is no other option – just a seq scan (it's interesting it didn't choose to do index scan on i2, after all, it has pointers to all rows in the
table), the cost skyrocketed to 10,000,000,000 – which is exactly what enable_* = false does.

I think that's about it. If you read the whole series you should have enough knowledge to understand what's going on, and, more importantly, why.

Posted on 2013-05-30|Tags analyze, cost, explain, pg_statistic, pg_stats, postgresql, random, sample, statistics, unexplainable, vacuum|

9 thoughts on “Explaining the unexplainable – part 5”

1. Nate Teller says:

2013-05-30 at 16:15

great articles … enjoy following your knowledge.

how do you address ‘many_of_the_same’, for the same size table … when there only 10 or so different values.

we do max(many_of_the_same) selects and sometimes max(many_of_the_same) < arbitrary_value.

thank you.

2. depesz says:

2013-05-30 at 16:16

@Nate:
Sorry, not sure I understand your question. Can you show me (via a pastesite perhaps to avoid formatting issues) the problem/question?

3. Nate Teller says:

2013-05-30 at 16:48

create table test ( many_the_same int4, almost_unique int4 );

CREATE TABLE
insert into test ( many_the_same, almost_unique )
select random() * 10,
random() * 1000000
from generate_series(1,10000000);
INSERT 0 10000000
create index test_i1 on test (many_the_same);
CREATE INDEX
create index test_i2 on test (many_the_same, almost_unique);
CREATE INDEX
vacuum analyze verbose test;
psql:max_test.sql:13: INFO: vacuuming “ants.test”
psql:max_test.sql:13: INFO: index “test_i1” now contains 10000000 row versions in 27422 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.10s/0.03u sec elapsed 0.14 sec.
psql:max_test.sql:13: INFO: index “test_i2” now contains 10000000 row versions in 27422 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.10s/0.03u sec elapsed 0.14 sec.
psql:max_test.sql:13: INFO: “test”: found 0 removable, 10000000 nonremovable row versions in 44248 out of 44248 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.25s/0.64u sec elapsed 0.91 sec.
psql:max_test.sql:13: INFO: analyzing “ants.test”
psql:max_test.sql:13: INFO: “test”: scanned 44248 of 44248 pages, containing 10000000 live rows and 0 dead rows; 300000 rows in sample, 10000000
estimated total rows
VACUUM
explain select max(many_the_same) from test;
QUERY PLAN
——————————————————————————————————-
Result (cost=0.04..0.05 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.04 rows=1 width=4)
-> Index Scan Backward using test_i1 on test (cost=0.00..435477.47 rows=10000000 width=4)
Index Cond: (many_the_same IS NOT NULL)
(5 rows)

explain select max(many_the_same) from test where many_the_same Limit (cost=0.00..0.07 rows=1 width=4)
-> Index Scan Backward using test_i1 on test (cost=0.00..303585.45 rows=4496733 width=4)
Index Cond: ((many_the_same IS NOT NULL) AND (many_the_same Limit (cost=0.00..0.04 rows=1 width=4)
-> Index Scan Backward using test_i1 on test (cost=0.00..435477.47 rows=10000000 width=4)
Index Cond: (many_the_same IS NOT NULL)
(5 rows)

My apologies, this portion of a more complex query is performing well.

4. depesz says:

2013-05-30 at 16:50

well, no need to apologize, but the queries (i think) got mangled:

“explain select max(many_the_same) from test where many_the_same Limit (” looks like either bad copy/paste or something got broken somewhere else.

You should be able to use code html tags: < CODE >put your code in here < / CODE > to get better formatting. It should look like this:

whatever

5. boris says:

2013-07-02 at 13:35

In the 1st blogpost you wrote about settings:

seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 4.0 # same scale as above
cpu_tuple_cost = 0.01 # same scale as above
cpu_index_tuple_cost = 0.005 # same scale as above
cpu_operator_cost = 0.0025 # same scale as above

Here you introduced new portion

How do these settings treat to each other ? Seems that they have rather similar meaning, or not ?

6. depesz says:

2013-07-02 at 15:21

They have very different meaning. Not sure how you got into conclusion that they are similar. *_cost values are some values attached to specific low-level
operations. enable_* are basically a tools to skyrocket cost of some high-level operations, so that pg will not use it.

7. Éric says:
2014-10-16 at 23:45

Hi. Just want to thank and congratulate you for this series of articles. They are just invaluable!

8. depesz says:

2014-10-17 at 13:28

Thanks Eric, glad you liked it.

9. Santosh says:

2017-06-15 at 07:06

Just wanted to say Thank you! for writing such a detailed article.
I went through complete series and I feel now I’m better equipped to write optimal queries.

Comments are closed.

Explaining the unexplainable – part 6: buffers

Back in 2013 I wrote a series of 5 posts about how to read explain analyze output.

Figured that there is one big part missing – buffers info.

You don't see this part in normal explain analyze, you have to specifically enable it: explain (analyze on, buffers on).

Well, technically you don't need analyze part, but then you'll be shown explain like:

$ explain (buffers on) select count(*) from z;

QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────
Finalize Aggregate (cost=40496.38..40496.39 rows=1 width=8)
-> Gather (cost=40496.17..40496.38 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=39496.17..39496.18 rows=1 width=8)
-> Parallel Seq Scan on z (cost=0.00..35225.33 rows=1708333 width=0)
Planning:
Buffers: shared hit=3
(7 rows)

Which shows buffers info only for planning. Which is interesting information on its own (this means that to plan this query planner had to get 3 pages from
storage, but luckily all 3 were in cache (shared_buffers). Since (by default) each page is 8kB, we know that PostgreSQL would read 24kB of data just to plan this
query.

But if I'd run the same query with analyze more info will be shown:

$ explain (analyze on, buffers on) select count(*) from z;

QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Aggregate (cost=69392.00..69392.01 rows=1 width=8) (actual time=368.522..368.523 rows=1 loops=1)
Buffers: shared hit=480 read=17662
I/O Timings: read=24.848
-> Seq Scan on z (cost=0.00..59142.00 rows=4100000 width=0) (actual time=0.095..213.068 rows=4100000 loops=1)
Buffers: shared hit=480 read=17662
I/O Timings: read=24.848
Planning:
Buffers: shared hit=19
Planning Time: 0.448 ms
Execution Time: 368.742 ms
(10 rows)

Please note that we now see buffers info for both Seq Scan, and for top-level node: Aggregate.

There is also I/O Timings: information, which is there thanks to track_io_timing config parameter being set to on.

So, let's get down to details. First – is shared hit/read all that can be there, and what exactly is the meaning of the values?

Luckily, I have BIG database of plans, so I can easily find lots of information. For example, in this plan we can see:

Buffers: shared hit=13,538,373 read=85,237 dirtied=204 written=6,263, local hit=10,047,224 read=773,475 dirtied=271,617
written=271,617, temp read=620,674 written=604,150

This is max number of different fields that can be in buffers line:

 shared:
o hit
o read
o dirtied
o written
 local:
o hit
o read
o dirtied
o written
 temp:
o read
o written

All values are in blocks, where block is very usually 8192 bytes (you can check your block size by doing: show block_size;).

So, for example, the function scan did read:

 85,237 pages in shared

 773,475 pages in local
 629,674 pages in temp

This is equal to (assuming 8kB page size):

$ select pg_size_pretty( ( 85237 + 773475 + 629674 )::int8 * 8192 );

pg_size_pretty
────────────────
11 GB
(1 row)

So, in total it read ~ 11GB of data off disk.

But what is exact meaning of hit, read, dirtied, written, shared, local, and temp? and why temp has only read and written?

PostgreSQL has some part of memory set aside as cache. It's size is set in shared_buffers guc.

Whenever Pg has to read some data from disk it can check if it's in shared_buffers. If it is there – page can be returned from cache. If it's not there – it's being read
from disk (and stored in shared_buffers for reuse). This leads to first split in our parameters: hit number tells how many pages were already in cache. And read
tells us how many pages were not in cache and had to be read from disk.

Next thing we can see is written. This is rather simple – this many pages have been written to disk. It is possible that select will generate writes. For example, let's
assume we have shared_buffers with only 10 buffers size. And two of these pages are already used to cache some disk pages, but, they were modified in memory,
but not yet on disk – for example because writing transaction hasn't committed yet. This means that these pages, in memory, are dirty and we'd want to use the
place in memory that they are in, we have first to write them.

This is what written generally is – how much data was written to disk, because we needed to free some space in shared_buffers for other data.

And finally – dirtied. Sometimes, it just so happens, that there are new pages in a table/index, for example, due to index or update operations, and if we select at
just the right moment, PostgreSQL will do something called “updating hint bits". Finding what hint bits actually are proved to be rather tricky. As far as I was able
to figure it out this is just information whether transactions that inserted or deleted given row are already fully in the past. This information is generally updated on
vacuum, but sometimes it can happen on select time. If it will happen so (that is: select will update hint bits), page will be marked as dirty – to be written to disk
whenever it will be possible.

So, we know what is hit/read/written/dirtied. What are share/local/temp? That's WAY simpler. Pages that belong to normal objects, shared between many database
connections (tables, indexes, materialized views) are counted as shared. Pages that belong to temporary objects – that is objects that belong only to current session
– are counted as local. And temp is for temporary data access needed for sorting, hashing, or materializing.

With this information we can now read data from previously mentioned explain node with a bit more understanding. Refresh of what we see in explain:

-> Function Scan on ...

...
Buffers: shared hit=13538373 read=85237 dirtied=204 written=6263, local hit=10047224 read=773475 dirtied=271617
written=271617, temp read=620674 written=604150

This means that while processing this function, PostgreSQL did:

 Wanted to read 13,623,610 pages from normal db objects, found 13,538,373 of them in cache, and had to read 85,237 from disk.
 Dirtied (could have written new rows, update existing ones, delete or even just set hint bits, we don't know as it's function, so we don't know what it did)
204 pages in normal, shared, db objects.
 Wrote, to make space in memory, 6,263 buffers that belonged to normal, shared, db objects.
 Wanted to read 10,820,699 pages from temporary db objects, found 10,047,224 of them in cache, and had to read 773,475 from disk.
 Dirtied 271,617 pages in temporary db objects.
 Wrote, to make space in memory, 271,617 buffers that belonged to temporary db objects.
 Read 620,674 blocks from temporary files, for operations that use temp files when work_mem is not enough.
 Wrote 603,150 blocks to temporary files.

In total, this single function scan generated:

 85,237 + 773,475 + 620,674 = 1,479,386 blocks = 11.28GB reads

 6,263 + 271,617 + 604,150 = 882,020 blocks = 6.73GB writes

Next bit of information. Let's consider this plan:

$ explain (analyze on, buffers on) select count(distinct id) from x;

QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Aggregate (cost=73413.00..73413.01 rows=1 width=8) (actual time=829.034..829.034 rows=1 loops=1)
Buffers: shared hit=2144 read=20019, temp read=8586 written=8609
I/O Timings: read=27.700
-> Seq Scan on x (cost=0.00..63163.00 rows=4100000 width=4) (actual time=0.075..217.427 rows=4100000 loops=1)
Buffers: shared hit=2144 read=20019
I/O Timings: read=27.700
Planning:
Buffers: shared hit=6 dirtied=1
Planning Time: 0.172 ms
Execution Time: 829.078 ms
(10 rows)

We can see that Seq Scan node had only shared hit and read, but Aggregate node had more. The thing is that buffers info is summarized. Any node reports total of
what it used itself, plus all that its subnodes used.

In our case that means that Aggregate node used only what is reported in temp part, because shared hit/read numbers are exactly the same as for it's subnode: Seq
Scan.

This leaves two last bits of information: what about loops=, and what about parallel queries.

First, normal loops:

$ explain (analyze on, buffers on) select i, (select count(*) from z where id + 100 > i) from generate_series(1,3) i;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Function Scan on generate_series i (cost=0.00..6079.06 rows=3 width=12) (actual time=41.353..68.639 rows=3 loops=1)
Buffers: shared hit=1329
SubPlan 1
-> Aggregate (cost=2026.33..2026.34 rows=1 width=8) (actual time=22.868..22.869 rows=1 loops=3)
Buffers: shared hit=1329
-> Seq Scan on z (cost=0.00..1943.00 rows=33333 width=0) (actual time=0.007..15.753 rows=100000 loops=3)
Filter: ((id + 100) > i.i)
Buffers: shared hit=1329
Planning:
Buffers: shared hit=2
Planning Time: 0.189 ms
Execution Time: 68.695 ms
(12 rows)

With this query I forced Pg to run 3 separate Seq Scans of table Z. At the time of test z had 443 pages. Number shown, 1,329 shared/hit pages, is exactly 3 * 443.
Which shows that the number is total across all loops.

And parallel queries?

Made larger version of the table (44,248 pages), and ran test:

$ explain (analyze on, buffers on) select count(*) from z;

QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
──────
Finalize Aggregate (cost=97331.80..97331.81 rows=1 width=8) (actual time=269.856..271.856 rows=1 loops=1)
Buffers: shared hit=16111 read=28137
I/O Timings: read=36.838
-> Gather (cost=97331.58..97331.79 rows=2 width=8) (actual time=269.773..271.850 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16111 read=28137
I/O Timings: read=36.838
-> Partial Aggregate (cost=96331.58..96331.59 rows=1 width=8) (actual time=268.094..268.095 rows=1 loops=3)
Buffers: shared hit=16111 read=28137
I/O Timings: read=36.838
-> Parallel Seq Scan on z (cost=0.00..85914.87 rows=4166687 width=0) (actual time=0.015..153.453 rows=3333333
loops=3)
Buffers: shared hit=16111 read=28137
I/O Timings: read=36.838
Planning:
Buffers: shared hit=5 read=3
I/O Timings: read=0.011
Planning Time: 0.069 ms
Execution Time: 271.923 ms
(19 rows)

Now, in here we see that there were 3 concurrent workers (2 workers launched + main process). But buffers reported for Parallel Seq Scan are: hit=16,111 and
read=28,137, total: 44,284 – exactly full size of the table.

This blogpost is a foundation I needed to write before I will add buffers info parsing to Pg::Explain, and display to explain.depesz.com. So now, you know that
soon(ish) – there will be buffers info somewhere there.

I would like to thank to all people that asked me about it over the years – it took some time, but I'm finally getting to do it