Explaining The Unexplainable

Explaining the unexplainable
15 thoughts on “Explaining the unexplainable”
Explaining the unexplainable – part 2
16 thoughts on “Explaining the unexplainable – part 2”
Explaining the unexplainable – part 3
    Function Scan
    Sort
    Limit
    HashAggregate
    Hash Join / Hash
    Nested Loop
    Merge Join
    Hash Join / Nested Loop / Merge Join modifiers
    Materialize
5 thoughts on “Explaining the unexplainable – part 3”
Explaining the unexplainable – part 4
    Unique
    Append
    Result
    Values Scan
    GroupAggregate
    HashSetOp
    CTE Scan
    InitPlan
    SubPlan
    Other ?
4 thoughts on “Explaining the unexplainable – part 4”
Explaining the unexplainable – part 5
9 thoughts on “Explaining the unexplainable – part 5”
Explaining the unexplainable – part 6: buffers
One thought on “Explaining the unexplainable – part 6: buffers”
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=146.63..148.65 rows=808 width=138) (actual time=55.009..55.012 rows=71 loops=1)
Sort Key: n.nspname, p.proname, (pg_get_function_arguments(p.oid))
Sort Method: quicksort Memory: 43kB
-> Hash Join (cost=1.14..107.61 rows=808 width=138) (actual time=42.495..54.854 rows=71 loops=1)
Hash Cond: (p.pronamespace = n.oid)
-> Seq Scan on pg_proc p (cost=0.00..89.30 rows=808 width=78) (actual time=0.052..53.465 rows=2402 loops=1)
Filter: pg_function_is_visible(oid)
-> Hash (cost=1.09..1.09 rows=4 width=68) (actual time=0.011..0.011 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on pg_namespace n (cost=0.00..1.09 rows=4 width=68) (actual time=0.005..0.007 rows=4 loops=1)
Filter: ((nspname <> 'pg_catalog'::name) AND (nspname <> 'information_schema'::name))
Of course, trying to understand the explain above as a starting point is pretty futile. Let's start with something simpler. But even before that, I want you to understand
one very important thing:
PostgreSQL knows
That means that PostgreSQL keeps some meta-information (information about information): row counts, numbers of distinct values, most common values, and so
on. For large tables these are based on a random sample, but in general, it (Pg) is pretty good at knowing stuff about our data.
In explain output, the first line and all lines that start with "->" are operations. The other lines are additional information for the operation above them.
In our case, we have just one operation: sequential scanning of the table test.
Sequential scan means that PostgreSQL will "open" the table data and read it all, potentially filtering (removing) rows, but generally prepared to read and return the
whole table.
So, the Seq Scan line informs us that we are scanning the table in sequential mode, and that the table is named "test" (though herein lies one of the biggest problems
with explain: it doesn't show the schema, which did bite my ass more than a couple of times).
Table "public.t"
Column | Type | Modifiers
-------------+---------+------------------------------------------------
id | integer | not null default nextval('t_id_seq'::regclass)
some_column | integer |
something | text |
Indexes:
"t_pkey" PRIMARY KEY, btree (id)
"q" btree (some_column)
What do you think would be the best way to run the query? Sequentially scan the table, or use index?
If you say: use the index of course, there is an index on this column, so it will make it faster – I'll ask: what about the case where the table has just one row, and it has
some_column = 123?
To do a seq scan, I just need to read one page (8192 bytes) from the table, and I get the row. To use the index, I have to read a page from the index, check it to find whether there is a
row matching the condition in the table, and then read a page from the table.
You could say – sure, but that's for very small tables, so the speed doesn't matter. OK. So let's imagine a table that has 10 billion rows, and all of them have
some_column = 123. The index doesn't help at all, and in reality it makes the situation much, much worse.
Of course – if you have a million rows, and just one has some_column = 123 – an index scan will clearly be better.
So – it is impossible to say in general whether a given query will use an index, or even whether it should use an index – you need to know more. And this
leads us to a simple conclusion: depending on the situation, one way of getting data will be better or worse than another.
PostgreSQL (up to a point) examines all possible plans. It knows how many rows you have, it knows how many rows will (likely) match the criteria, so it can
make pretty smart decisions.
But how are the decisions made? That's exactly what the first set of numbers in explain shows. It's the cost.
Some people think that cost is an estimate expressed in seconds. It's not. Its unit is "fetching a single page in a sequential manner", and it accounts for both time and resource usage.
So, we can even change how much it costs to read a sequential page. These parameters dictate the costs that PostgreSQL assumes for the various methods
of running the same query.
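For example, these are regular configuration parameters (seq_page_cost, random_page_cost, cpu_tuple_cost and friends), so you can inspect and override them per session; a quick sketch:
$ show seq_page_cost;
$ show random_page_cost;
$ set random_page_cost = 2;   -- example value; affects only the current session
$ reset random_page_cost;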
For example, let's make a simple 1000-row table, with some texts, and an index:
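The original listing was not preserved in this copy; a sketch of a table matching the description (the name, column, and values are assumptions – the primary key provides the index on id used below):
$ create table test (id serial primary key, some_text text);
$ insert into test (some_text) select 'whatever' from generate_series(1,1000);
$ analyze test;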
Now, we can see that running explain with a condition on id will show:
By default, Pg used an Index Scan. Why? It's simple – it's the cheapest in this case: a total cost of 8.29, while a bitmap heap scan (whatever that would be) would be
8.30 and a seq scan – 18.5.
OK, but cost shows two numbers: number..number. What is this about, and why am I talking only about the second number? If we took the first
number into consideration, then seq scan would be the winner, as it has 0 (zero) there, while index scan has 0.28, and bitmap heap scan – 4.28.
So, the range (number .. number) shows the cost of starting the operation and the cost of getting all rows (by all, I mean all rows returned by this operation,
not all rows in the table).
What is the startup cost? Well, for a seq scan there is none – you just read a page and return the rows. That's all. But, for example, to sort a dataset you have to read
all the data and actually sort it before you can consider returning even the first of the rows. This can be nicely seen in this explain:
QUERY PLAN
-------------------------------------------------------------------
Sort (cost=22.88..23.61 rows=292 width=202)
Sort Key: relfilenode
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=202)
(3 rows)
Please note that the startup cost for Sort is 22.88, while the total cost is just 23.61. So returning rows from Sort is trivial (in terms of cost), but sorting them – not so much.
The next piece of information in explain is "rows". This is an estimate of how many rows PostgreSQL thinks this operation is capable of returning (it might return fewer, for
example in the case of a LIMIT). This is also important for some operations – joins, for example. Joining two tables that have 20 rows each can be done in many
ways, and it doesn't really matter how, but when you join a 1-million-row table with a 1-billion-row table, the way you do the join is very important (I'm not talking
about "inner join/left join/…" but rather about "hash join", "nested loop", "merge join" – if you don't understand these, don't worry, I'll write about them later).
This number can, of course, be misestimated – for many reasons. Sometimes it doesn't matter, and sometimes it does. But we'll talk about misestimates later.
The final bit of information is width. This is PostgreSQL's idea of how many bytes, on average, there are in a single row returned from the given operation. For example:
As you can see, limiting the number of fields modified the width and, in turn, the total amount of data that needs to be passed through the execution of the query.
Next is the single most important bit of information: explains are trees. An upper node needs data from the nodes below it.
There are 5 operations there: sort, hash join, seq scan, hash, and seq scan. PostgreSQL executes the top one – sort – which in turn executes the node directly below it
(hash join) and gets data from it. The hash join, to return data to sort, has to run the seq scan (on pg_proc) and the hash (#4). And the hash, to be able to return data,
has to run the seq scan on pg_namespace.
It is critical to understand that some operations can return data immediately and, what's even more important, gradually – for example, Seq Scan. And some
others cannot. For example, here we see that Hash (#4) has a startup cost equal to the "all rows" cost of its "suboperation", the seq scan. This means that for the
hash operation to start (well, to be able to return even a single row), it has to read in all the rows from its suboperation(s).
The part about returning rows gradually becomes very important when you start writing functions. Let's consider a function like this:
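The function listing did not survive in this copy; a minimal sketch matching the description below (the name test_srf is made up) would be:
$ create function test_srf() returns setof int4 language plpgsql as $$
  begin
      for i in 1..3 loop
          return next i;
          perform pg_sleep(1);  -- sleep 1 second after returning each row
      end loop;
      return;
  end;
  $$;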
If you don't understand it, don't worry. The function returns 3 rows, each containing a single integer – 1, 2 and 3. The important bit, though, is that it sleeps for 1
second after returning each row.
Let's see:
\timing
Timing is on.
Time: 3005.334 ms
The same 3 seconds. Why? Because PL/pgSQL (and most, if not all, other PL/* languages) cannot return partial results. It looks like it can – with "return next" –
but all those rows are stored in a buffer and returned together when the function execution ends.
On the other hand – "normal" operations can usually return partial data. This can be seen with something trivial like a seq scan on a non-trivial table:
create table t as
select i as id,
repeat('depesz', 100)::text as payload
from generate_series(1,1000000) i;
As can be seen – the seq scan ended very fast, as soon as it satisfied Limit's appetite for exactly 1 row.
Please also note that here even the costs (which are not the best thing for comparing queries) show that the top node (seq scan in the first, and limit in the second query)
has very different values for returning all rows – 185834.82 vs. 0.02.
So, the first 4 numbers for any given operation (two for cost, plus rows and width) are all estimates. They might be correct, but they might just as well not be.
The other 4 numbers, which you get when you run “EXPLAIN ANALYZE query" or “EXPLAIN ( ANALYZE on ) query" show the reality.
Time is again a range, but this time it is real: how much time PostgreSQL actually spent working on the given operation (on average, because the same operation could have been run
multiple times). And just as with cost – time is a range: startup time, and time to return all data. Let's check this plan:
As you can see – Limit has a startup time of 0.008 (milliseconds – that's the unit here). This is because the Seq Scan (which Limit called to get data) took 0.007 ms to
return the first row, and then there was 0.001 ms of processing within Limit itself.
Afterwards (after returning the 1st row), Limit kept getting data from the Seq Scan until it got 100 rows. Then it terminated the Seq Scan (which happened 0.133 ms after the start
of the query), and it finished after another 0.019 ms.
The actual rows value, just as the name suggests, shows how many rows (on average) this operation returned. And loops shows how many times this operation was
run in total.
In what case would an operation be called more than once? For example, in some cases of joins, or subqueries. It looks like this plan:
Please note that loops in the 3rd operation is 2. This means that this Seq Scan was run twice, returning, on average, 1 row, and it took, on average, 0.160 ms to finish.
So the total time spent in this particular operation is 2 * 0.160 ms = 0.32 ms (that's what the exclusive/inclusive columns on explain.depesz.com show).
Very often poor performance of a query comes from the fact that it had to loop many times over something. As in here.
(Of course that doesn't mean it's Pg's fault that it chose such a plan – maybe there simply weren't other options, or the other options were estimated as even more
expensive.)
In the above example, please note that while the actual time for operation 3 is just 0.003 ms, this operation was run over 26,000 times, resulting in a total time spent in it
of almost 79 ms.
I think that wraps up the theoretical information required to read explains. You will probably still not understand what the operations or other information mean, but at
the very least you will know what the numbers mean, and what the difference is between explain (shows costs in abstract units, which are based on random-
sample estimates) and explain analyze (shows real-life times, row counts, and execution counts, in units that can be compared across different queries).
As always, I'm afraid that I skipped a lot of things that might be important but just escaped me, or (even worse) that I assumed were "obvious". If you find
anything missing, please let me know and I'll fill it in ASAP.
But before that, let me just say that I plan to extend this blogpost with 2-3 more posts that will cover more about:
what the various operations are, how they work, and what to expect when you see them in explain output
what statistics are, how Pg gets them, how to view them, and how to get the best out of them
4. kk says:
2013-05-09 at 10:49
Great.
Waiting for 2 next blogposts.
5. Tobu says:
2013-05-22 at 21:43
I sort of expected a unit / decimal point mixup with 0.008 milliseconds and the timings after that, but the post is correct, that's 8 microseconds. I do wish
PostgreSQL would print the units (or at least default to seconds).
6. depesz says:
2013-05-22 at 21:45
@Tobu: once you get used to it, it just makes sense. Adding "ms" to every value would take too much space. And defaulting to seconds would make the
times look even worse. I see lots of queries that run in less than 1 ms, and this would look absolutely awful: 0.000008
8. boris says:
2013-07-02 at 13:32
>> So, we can even change how much it costs to read sequential page.
Are there any reasonable recommendations on whether these settings are worth changing?
9. depesz says:
2013-07-02 at 15:20
I wouldn’t change seq scan cost, as this is basically an unit. If you have your database on SSD, it might be good to lower random_page_cost significantly.
As for the rest – test it. Play with it on your dataset and see what comes out of it.
2013-09-03 at 10:40
Fully qualified names in explain output can be obtained with the verbose option of explain (EXPLAIN VERBOSE). I had the same problem. Thanks to advice
from Pavel Stehule my problem has been resolved.
2015-12-03 at 17:30
Could you please also explain JOIN estimations in such a detailed manner?
For example, I could not understand ( http://stackoverflow.com/questions/33654105/incorrect-rows-estimate-for-joins ) how Postgres uses statistics to estimate
the number of rows in join operations, and how it may be fixed without external extensions.
2016-04-20 at 21:49
Hi, thanks for a nice write up.
I’m puzzled by
> And what if we’d tell pg that under no circumstances it can use index scan?
How did we actually tell that? It is unclear from the query snippet.
2016-04-21 at 14:06
@Adam:
there is a set of enable_* parameters which can be used to disable a given type of scan/functionality.
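For example (a sketch – these are ordinary session settings, so they can be reset afterwards):
$ set enable_indexscan = off;
$ set enable_bitmapscan = off;
-- rerun the explain; the planner now has to pick some other scan type
$ reset enable_indexscan;
$ reset enable_bitmapscan;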
Explaining the unexplainable – part 2
This is the simplest possible operation – PostgreSQL opens the table file and reads rows, one by one, returning them to the user or to the upper node in the explain tree, for
example to a Limit, as in:
It is important to understand that the returned rows are not in any specific order. It's not "in order of insertion", or "last updated first", or anything like this.
Concurrent selects, updates, deletes, and vacuums can modify the order of rows at any time.
Seq Scan can filter rows – that is, reject some from being returned. This happens, for example, when you add a "WHERE" clause:
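The plan itself is not reproduced in this copy; a query of this shape (assuming the test table used later in this post, with no index on column i yet) would produce a Seq Scan with a Filter: line:
$ explain analyze select * from test where i < 100000;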
As you can see, now we have "Filter:" information. And because I'm on 9.2 or newer, I also got a "Rows removed by filter" line.
This type of scan seems to be very straightforward, and most people understand when it is used, at least in one case:
An Index Scan is also used when you want some data ordered using the order from an index, as in here:
There is no condition here, but we can add a condition easily, like this:
explain analyze select * from pg_class where oid > 1247 order by oid limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------
Limit (cost=0.15..4.03 rows=10 width=206) (actual time=0.021..0.035 rows=10 loops=1)
-> Index Scan using pg_class_oid_index on pg_class (cost=0.15..37.84 rows=97 width=206) (actual time=0.017..0.031 rows=10
loops=1)
Index Cond: (oid > 1247::oid)
Total runtime: 0.132 ms
(4 rows)
In these cases, Pg finds the starting point in the index (either the first row that is > 1247, or simply the smallest value in the index), and then returns subsequent rows/values until the Limit is
satisfied.
There is a version of Index Scan called "Index Scan Backward", which does the same thing but is used for scanning in descending order:
explain analyze select * from pg_class where oid < 1247 order by oid desc limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
---------------
Limit (cost=0.15..4.03 rows=10 width=206) (actual time=0.012..0.026 rows=10 loops=1)
-> Index Scan Backward using pg_class_oid_index on pg_class (cost=0.15..37.84 rows=97 width=206) (actual time=0.009..0.022
rows=10 loops=1)
Index Cond: (oid < 1247::oid)
Total runtime: 0.119 ms
(4 rows)
This is the same kind of operation – open the index, and for every row pointed to by the index, fetch the row from the table – it just happens not "from small to big" but "from big
to small".
\d test
Table "public.test"
Column | Type | Modifiers
--------+---------+---------------------------------------------------
id | integer | not null default nextval('test_id_seq'::regclass)
i | integer |
Indexes:
"test_pkey" PRIMARY KEY, btree (id)
So, if some conditions are met (more on that in a bit), I can get a plan like this:
This means that Pg realized that I select only data (columns) that are in the index, and it is possible that it doesn't need to check anything in the table files at all, so
it will return the data straight from the index.
These scans were the big change in PostgreSQL 9.2, as they can work much faster than normal Index Scans, because they don't have to verify
anything in the table data.
The problem is that, in order for this to work, the index has to know that the given rows are in pages that didn't have any changes "recently". This means
that in order to utilize Index Only Scans, you have to have your table well vacuumed. But with autovacuum running, it shouldn't be that big of a deal.
The final kind of table scan is the so-called Bitmap Index Scan. It looks like this:
(If you're reading and paying attention, you'll notice that it uses an index that I didn't talk about creating earlier; it's simple: create index i1 on test (i);.)
Bitmap Scans always come in (at least) two nodes. First (at the lower level) there is a Bitmap Index Scan, and then there is a Bitmap Heap Scan.
Let's assume your table has 100,000 pages (that would be ~ 780 MB). A Bitmap Index Scan creates a bitmap where there is one bit for every page in
your table. So in this case we'd get a memory block of 100,000 bits ~ 12.5 kB. All these bits start set to 0. Then the Bitmap Index Scan sets some bits to 1,
depending on which pages in the table might contain a row that should be returned.
This part doesn't touch the table data at all – just the index. After it is done – that is, once all pages that might contain a row that should be returned have been "marked" – the
bitmap is passed to the upper node, the Bitmap Heap Scan, which reads the marked pages in a more sequential fashion.
What is the point of such an operation? Well, (normal) Index Scans cause random I/O – that is, pages from disk are loaded in random fashion, which, at least on
spinning disks, is slow.
A sequential scan is faster at fetching a single page, but on the other hand – you don't always need all the pages.
Bitmap Index Scans join the two approaches: they are used when you need many rows from the table, but not all of them, and when the rows you'll be returning are not all in a single block
(which would be the case if I did "… where id < …"). Bitmap scans also have one more interesting feature: they can combine two operations, two indexes,
together. Like here:
explain analyze select * from test where i < 5000000 or i > 950000000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=107.36..630.60 rows=5323 width=8) (actual time=1.023..4.353 rows=5386 loops=1)
Recheck Cond: ((i < 5000000) OR (i > 950000000))
-> BitmapOr (cost=107.36..107.36 rows=5349 width=0) (actual time=0.922..0.922 rows=0 loops=1)
-> Bitmap Index Scan on i1 (cost=0.00..12.25 rows=527 width=0) (actual time=0.120..0.120 rows=491 loops=1)
Index Cond: (i < 5000000)
-> Bitmap Index Scan on i1 (cost=0.00..92.46 rows=4822 width=0) (actual time=0.799..0.799 rows=4895 loops=1)
Index Cond: (i > 950000000)
Total runtime: 4.765 ms
(8 rows)
Here we see two Bitmap Index Scans (there can be more of them), which are then combined (not as an SQL "JOIN"!) using BitmapOr.
As you remember – the output of a Bitmap Index Scan is a bitmap, that is, a memory block with some zeros and some ones. Having multiple such bitmaps means that
you can easily do logical operations on them: Or, And, or Not.
Here we see that two such bitmaps were combined using the Or operator, and the resulting bitmap was passed to the Bitmap Heap Scan, which loaded the appropriate rows
from the table.
While here both Bitmap Index Scans use the same index, that's not always the case. For example, let's quickly add some more columns:
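The statements were not preserved in this copy; a sketch that would produce the columns (j, h) and indexes (i2, i3) referenced in the plan below – the exact values are assumptions:
$ alter table test add column j int4, add column h int4;
$ update test set j = (random() * 1000000000)::int4, h = (random() * 1000000000)::int4;
$ create index i2 on test (j);
$ create index i3 on test (h);
$ analyze test;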
And now:
explain analyze select * from test where j < 50000000 and i < 50000000 and h > 950000000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=280.76..323.61 rows=12 width=16) (actual time=2.295..2.352 rows=11 loops=1)
Recheck Cond: ((h > 950000000) AND (j < 50000000) AND (i < 50000000))
-> BitmapAnd (cost=280.76..280.76 rows=12 width=0) (actual time=2.278..2.278 rows=0 loops=1)
-> Bitmap Index Scan on i3 (cost=0.00..92.53 rows=4832 width=0) (actual time=0.546..0.546 rows=4938 loops=1)
Index Cond: (h > 950000000)
-> Bitmap Index Scan on i2 (cost=0.00..93.76 rows=4996 width=0) (actual time=0.783..0.783 rows=5021 loops=1)
Index Cond: (j < 50000000)
-> Bitmap Index Scan on i1 (cost=0.00..93.96 rows=5022 width=0) (actual time=0.798..0.798 rows=4998 loops=1)
Index Cond: (i < 50000000)
Total runtime: 2.428 ms
(10 rows)
Three Bitmap Index Scans, each using a different index, the bitmaps combined using an "and" bit operation, and the result fed to a Bitmap Heap Scan.
In case you wonder why the BitmapAnd shows "Actual rows = 0" – it's simple. This node doesn't deal with rows at all (just a bitmap of disk pages), so it can't
return any rows.
That's about it for now – these are your possible table scans – how you get data from disk. Next time I'll talk about joining multiple sources together, and other
types of plans.
Posted on 2013-04-27|Tags bitmap, explain, heap, index, only, postgresql, scan, seq, unexplainable|
1. Anonymous says:
2013-04-28 at 19:33
Thank you!
2. jcd says:
2013-04-29 at 09:06
I should have read these two posts 5 years ago and it would have made my life so much easier. Thank you and keep up the awesomeness.
3. far says:
2013-04-30 at 12:49
Very useful. Can’t help thinking something like these posts should be in the official docs somewhere. EXPLAIN is such a useful tool.
2013-05-02 at 13:38
It was a long time ago that this blog became required reading for advanced PostgreSQL subjects.
And the good stuff just keeps piling up with regularity!
2013-05-08 at 18:38
Thanks for this helpful post. Understanding the output of EXPLAIN is hard, and you made it a bit easier.
7. depesz says:
2013-05-09 at 11:31
@kk: I think there will be more than one. I still have some operations to cover, and statistics will definitely need a separate blogpost.
8. Drew Taylor says:
2013-05-12 at 08:52
I had always wondered what the BitMap* functions were. Thank you for the informative post!
2015-12-23 at 12:09
Thank you so much for taking the time to explain such a complicated subject in such a simple way!
2016-06-30 at 10:54
Thank you so much for the post. I have tested the Index Only Scan on my system. It is not working for me. I am using Postgres version 9.5.
Time: 33.355 ms
postgres=# vacuum full analyze test;
VACUUM
Time: 506.563 ms
postgres=# explain analyze select id from test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.018..25.843 rows=100000 loops=1)
Planning time: 0.993 ms
Execution time: 31.164 ms
(3 rows)
Time: 33.195 ms
postgres=# \d+ test
Table "public.test"
Column | Type                   | Modifiers                                          | Storage  | Stats target | Description
--------+------------------------+----------------------------------------------------+----------+--------------+-------------
id     | integer                | not null default nextval('test_id_seq'::regclass) | plain    |              |
name   | character varying(200) |                                                    | extended |              |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
2016-06-30 at 12:13
@Srinivas:
index only scan will be used only in some cases; when exactly – it's not entirely clear to me. But getting 100k rows is clearly not a normal situation for it.
Hi depesz,
Could you please explain the Explain Analyze result below to me?
2018-06-06 at 09:24
2018-06-06 at 14:13
@Thanh:
it opens the page, and checks if the row is there, and is visible to current transaction.
Explaining the unexplainable – part 3
Function Scan
Example:
Generally it's so simple that I shouldn't need to describe it, but since I will use it in the next examples, I decided to write a bit about it.
Function Scan is a very simple node – it runs a function that returns a recordset – that is, it will not run a function like "lower()", but a function that returns (at least
potentially) multiple rows, or multiple columns. After the function returns its rows, they are passed to whatever is above Function Scan in the plan tree, or to the client,
if Function Scan is the top node.
The only additional logic it might have is the ability to filter returned rows, like in here:
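A sketch of such a case (the original example is not reproduced in this copy) – the plan contains a Function Scan node with a Filter: line:
$ explain analyze select * from generate_series(1, 10) as i where i < 4;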
Sort
This seems to be easy to understand – sort gets given records and returns them sorted in some way.
Example:
$ explain analyze select * from pg_class order by relname;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Sort (cost=22.88..23.61 rows=292 width=203) (actual time=0.230..0.253 rows=295 loops=1)
Sort Key: relname
Sort Method: quicksort Memory: 103kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.048 rows=295 loops=1)
Total runtime: 0.326 ms
(5 rows)
While it is simple, it has some cool logic inside. For starters – if the memory needed for sorting would be more than work_mem, it will switch to disk-based
sorting:
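The example plan didn't survive here, but the effect is easy to reproduce by lowering work_mem before sorting something sizable (the million-row table t from part 1 is assumed); the Sort node then reports a disk-based sort method instead of quicksort:
$ set work_mem = '64kB';
$ explain analyze select * from t order by payload;
$ reset work_mem;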
To handle such cases, Pg will use temporary files stored in the $PGDATA/base/pgsql_tmp/ directory. They will of course be removed as soon as they are no longer needed.
One additional feature is that Sort can change its method of working if it's called by a Limit operation, like here:
Normally, to sort a given dataset, you need to process it in whole. But Pg knows that if you need only some small number of rows, it doesn't have to sort the whole
dataset, and it's good enough to get just the first values.
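A sketch of a query that would typically trigger this (again assuming the t table from part 1); with a small LIMIT, the Sort node usually reports "top-N heapsort" as its method:
$ explain analyze select * from t order by id limit 10;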
In Big O notation, a general sort has a complexity of O(m * log(m)), but top-N has a complexity of O(m * log(n)) – where m is the number of rows in the table, and n is the
number of returned rows. What's most important – this kind of sort also uses much less memory (after all, it doesn't have to construct the whole dataset of sorted rows,
just a couple of rows), so it's less likely to use slow disk for temporary files.
Limit
I used limit many times, because it's so simple, but let's describe it fully. The Limit operation runs its sub-operation and returns just the first N rows of what it returned.
Usually it also stops the sub-operation afterwards, but in some cases (PL/pgSQL functions, for example), the sub-operation has already finished by the time it returns its first
row.
Simple example:
As you can see, using limit in the 2nd case caused the underlying Seq Scan to finish its work immediately after finding two rows.
HashAggregate
This operation is used basically whenever you are using GROUP BY and some aggregates, like sum(), avg(), min(), max() or others.
Example:
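The example plan didn't make it into this copy; a query of this shape (consistent with the description below, which groups pg_class rows by relkind) would typically use HashAggregate:
$ explain analyze select relkind, count(*) from pg_class group by relkind;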
HashAggregate does something like this: for every row it gets, it finds the GROUP BY "key" (in this case relkind). Then, in a hash (associative array, dictionary), it puts the
given row into a bucket designated by the given key.
After all rows have been processed, it scans the hash and returns a single row per key value, when necessary doing the appropriate calculations (sum, min, avg,
and so on).
It is important to understand that HashAggregate has to scan all rows before it can return even a single row.
Now, if you understand that, you should see a potential problem: well, what about the case when there are millions of rows? The hash will be too big to fit in memory.
And here, again, we'll be using work_mem. If the generated hash is too big, it will "spill" to disk (again in $PGDATA/base/pgsql_tmp).
This means that if we have a plan that has both HashAggregate and Sort – we can use up to 2 * work_mem. And such a plan is simple to get:
$ explain analyze select relkind, count(*) from pg_Class group by relkind order by relkind;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Sort (cost=12.46..12.47 rows=4 width=1) (actual time=0.260..0.261 rows=5 loops=1)
Sort Key: relkind
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=12.38..12.42 rows=4 width=1) (actual time=0.221..0.222 rows=5 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1) (actual time=0.006..0.044 rows=295 loops=1)
Total runtime: 0.312 ms
(6 rows)
In reality, a single query can use work_mem many times over, as work_mem is a limit per operation. So, if your query uses 1000 HashAggregates and Sorts (and
other work_mem-using operations), total memory usage can get pretty high.
Hash Join / Hash
This operation, unlike all the others that we previously discussed, has two sub-operations. One of them is always "Hash", and the other is something else.
Hash Join is used, as the name suggests, to join two recordsets. For example, like here:
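The query itself isn't shown in this copy; based on the description below (pg_class joined to pg_namespace on its OID), it was of this general shape – a sketch:
$ explain analyze select c.relname, n.nspname
    from pg_class c
    join pg_namespace n on c.relnamespace = n.oid;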
It works like this – first, Hash Join calls "Hash", which in turn calls something else (the Seq Scan on pg_namespace in our case). Then, Hash builds a memory (or
disk, depending on size) hash/associative-array/dictionary with the rows from the source, hashed using whatever is used to join the data (in our case, it's the OID column
in pg_namespace).
Of course – you can have many rows for a given join key (well, not in this case, as I'm joining using a primary key, but generally it's perfectly possible to have
multiple rows for a single hash key).
{
'123' => [ { data for row with OID = 123 }, ],
'256' => [ { data for row with OID = 256 }, ],
...
}
Then, Hash Join runs the second suboperation (the Seq Scan on pg_class in our case), and for each row from it, it does the following:
1. check whether the join key (pg_class.relnamespace in our case) is in the hash returned by the Hash operation
2. if it is not – the given row from the suboperation is ignored (will not be returned)
3. if it is – Hash Join fetches the rows from the hash, and based on the row from one side and the matching rows from the hash, it generates output rows
It is important to note that both sides are run only once (in our case, both are seq scans), but the first one (the one called by Hash) has to return all its rows, which have
to be stored in the hash, while the other is processed one row at a time, and some of its rows will get skipped if they don't exist in the hash from the other side (I hope the sentence
is clear – there are many "hash"es there).
Of course, since both subscans can be any type of operation, they can do filtering or index scans or whatever you can imagine.
A final note for Hash Join/Hash is that the Hash operation, just like Sort and HashAggregate, will use up to work_mem of memory.
Nested Loop
Since we're at joins – we have to discuss Nested Loop. Example:
$ explain analyze select a.* from pg_class c join pg_attribute a on c.oid = a.attrelid where c.relname in ( 'pg_class',
'pg_namespace' );
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------------------------
Nested Loop (cost=0.28..52.32 rows=16 width=203) (actual time=0.057..0.134 rows=46 loops=1)
-> Seq Scan on pg_class c (cost=0.00..11.65 rows=2 width=4) (actual time=0.043..0.080 rows=2 loops=1)
Filter: (relname = ANY ('{pg_class,pg_namespace}'::name[]))
Rows Removed by Filter: 291
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..20.25 rows=8 width=203) (actual
time=0.007..0.015 rows=23 loops=2)
Index Cond: (attrelid = c.oid)
Total runtime: 0.182 ms
This is a very interesting plan, as it can run given operations multiple times.
Just like Hash Join, Nested Loop has two "children". First it runs the "Seq Scan" (in our example; generally, it first runs whichever node is first), and then, for every
row that it returns (2 rows in our example), it runs the 2nd operation (the Index Scan on pg_attribute in our case).
You might notice that the Index Scan has "loops=2" in its actual-run meta information. This means that this operation has been run twice, and the other values (rows, time)
are averages across all runs.
Let's check this plan from explain.depesz.com. Note that the actual times for the categories index scan are 0.002 to 0.003 ms. But the total time for this node is
78.852 ms, because this index scan was run over 26k times.
So, the processing looks like this:
1. Nested Loop runs one side of the join, once. Let's name it "A".
2. For every row in "A", it runs the second operation (let's name it "B").
3. If "B" didn't return any rows – the data from "A" is ignored.
4. If "B" did return rows – for every row it returned, a new row is returned by Nested Loop, based on the current row from A and the current row from B.
Merge Join
Another method of joining data is called Merge Join. It is used if the joined datasets are (or can cheaply be) sorted by the join key.
I don't have a nice example of this, so I will force it by using subselects that sort the data before joining:
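A sketch of what is meant here (not the author's exact query; the planner may or may not actually pick a Merge Join for it on a given system):
$ explain analyze select *
    from (select oid, nspname from pg_namespace order by oid) as n
    join (select relname, relnamespace from pg_class order by relnamespace) as c
        on n.oid = c.relnamespace;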
Merge Join, like the other joins, runs two sub-operations (Sort and Materialize in this case). Because both of them return data sorted, and the sort order matches the
join condition, Pg can scan both result sets from the sub-operations at the same time and simply check whether the ids match. The procedure goes like this:
1. if join column on right side is the same as join column on left side:
o return new joined row, based on current rows on the right and left sides
o get next row from right side (or, if there are no more rows, on left side)
o go to step 1
2. if join column on right side is “smaller" than join column on left side:
o get next row from right side (if there are no more rows, finish processing)
o go to step 1
3. if join column on right side is “larger" than join column on left side:
o get next row from left side (if there are no more rows, finish processing)
o go to step 1
This is a very cool way of joining datasets, but it works only for sorted sources. Based on the current db of explain.depesz.com, there are:
Hash Join / Nested Loop / Merge Join modifiers
There is no Nested Loop Right Join, because Nested Loop always starts with the left side as the basis for looping. So a join that uses RIGHT JOIN and would use Nested
Loop gets internally transformed into a LEFT JOIN so that Nested Loop can work.
In all those cases the logic is simple – we have two sides of a join, left and right. And when a side is mentioned in the join name, then the join will return a new row even if the
other side doesn't have matching rows.
All other information for Hash Join/Merge Join or Nested Loop stays the same; it's just a slight change in the logic of when to generate an output row.
There is also a variant where the join generates a new output row regardless of which side is missing data (as long as the data is there for one side). This happens in the case of:
There are also so-called Anti Joins. Their operation names look like:
In these cases the join emits a row only if no matching row is found on the right side. This is useful when you're doing things like "WHERE not exists ()" or "left join … where
right_table.column is null".
Like in here:
$ explain analyze select * from pg_class c where not exists (select * from pg_attribute a where a.attrelid = c.oid and a.attnum =
10);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------
Hash Anti Join (cost=62.27..78.66 rows=250 width=203) (actual time=0.145..0.448 rows=251 loops=1)
Hash Cond: (c.oid = a.attrelid)
-> Seq Scan on pg_class c (cost=0.00..10.92 rows=292 width=207) (actual time=0.009..0.195 rows=293 loops=1)
-> Hash (cost=61.75..61.75 rows=42 width=4) (actual time=0.123..0.123 rows=42 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..61.75 rows=42 width=4) (actual
time=0.021..0.109 rows=42 loops=1)
Index Cond: (attnum = 10)
Heap Fetches: 0
Total runtime: 0.521 ms
(9 rows)
Here, Pg ran the right side (the Index Only Scan on pg_attribute), hashed it, and then ran the left side (Seq Scan on pg_class), returning only rows where there was no item in
the Hash for the given pg_class.oid.
Materialize
This operation showed up earlier, in the example for Merge Join, but it is also usable in other cases.
psql has many internal commands. One of them is \dTS – which lists all system datatypes. Internally, \dTS runs this query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
------------------------
Sort (cost=2783.00..2783.16 rows=65 width=68) (actual time=3.883..3.888 rows=87 loops=1)
Sort Key: n.nspname, (format_type(t.oid, NULL::integer))
Sort Method: quicksort Memory: 39kB
-> Nested Loop Left Join (cost=16.32..2781.04 rows=65 width=68) (actual time=0.601..3.657 rows=87 loops=1)
Join Filter: (n.oid = t.typnamespace)
Rows Removed by Join Filter: 435
-> Hash Anti Join (cost=16.32..2757.70 rows=65 width=8) (actual time=0.264..0.981 rows=87 loops=1)
Hash Cond: ((t.typelem = el.oid) AND (t.oid = el.typarray))
-> Seq Scan on pg_type t (cost=0.00..2740.26 rows=81 width=12) (actual time=0.012..0.662 rows=157 loops=1)
Filter: (pg_type_is_visible(oid) AND ((typrelid = 0::oid) OR (SubPlan 1)))
Rows Removed by Filter: 185
SubPlan 1
-> Index Scan using pg_class_oid_index on pg_class c (cost=0.15..8.17 rows=1 width=1) (actual
time=0.002..0.002 rows=1 loops=98)
Index Cond: (oid = t.typrelid)
-> Hash (cost=11.33..11.33 rows=333 width=8) (actual time=0.241..0.241 rows=342 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on pg_type el (cost=0.00..11.33 rows=333 width=8) (actual time=0.002..0.130 rows=342 loops=1)
-> Materialize (cost=0.00..1.09 rows=6 width=68) (actual time=0.000..0.001 rows=6 loops=87)
-> Seq Scan on pg_namespace n (cost=0.00..1.06 rows=6 width=68) (actual time=0.002..0.003 rows=6 loops=1)
Total runtime: 3.959 ms
Materialize is called by Nested Loop Left Join – operation #2. We know that Nested Loop causes the given operation to be run multiple times, in this case – 87 times.
The right side of the join is a Seq Scan on pg_namespace. So Pg, theoretically, should run the Sequential Scan on pg_namespace 87 times. Given that a single Seq Scan of
this table takes 0.003 ms, we could expect a total time of ~ 0.25 ms.
But Pg is smarter than that. It realized that it would be cheaper to scan the table just once and build a memory representation of all the rows in there. So that the next
time, it would not have to scan the table, check visibility information, or parse data pages – it would just get the data from memory.
Thanks to this, the total time of reading the table once, preparing the memory representation of the data, and scanning this representation 87 times was 0.087 ms.
You might then ask: OK, but why did the Merge Join earlier use Materialize – it was doing just one scan? Let's recall the plan:
Yes, it was run just once. The problem, though, is that the source of data for a Merge Join has to match several criteria. Some are obvious (the data has to be sorted) and
some are not so obvious, as they are more technical (the data has to be scrollable back and forth).
Because of this (these not-so-obvious criteria), Pg will sometimes have to Materialize the data coming from the source (the Index Scan in our case) so that it will have all
the necessary features when using it.
Long story short – Materialize gets data from the underlying operation and stores it in memory (or partially in memory) so that it can be used faster, or with additional
features that the underlying operation doesn't provide.
And that's it for today. I thought I would be done, but there are still many operations that need to be described, so we will have at least two more posts in the
series (the rest of the operations, and statistics info).
Posted on 2013-05-09|Tags aggregate, explain, function, hash, join, limit, loop, materialize, merge, nested, postgresql, scan, sort, unexplainable|
2. Victor says:
2013-05-27 at 00:20
As you speak of Anti joins, perhaps you should also mention Semi joins, the ones produced by “WHERE EXISTS()” or “… IN()” constructs?
3. depesz says:
2013-05-27 at 15:16
@Victor:
yes, I forgot about these. Interestingly – the query that you showed generates (in my pg 9.3) Hash join – https://explain.depesz.com/s/IqG .
Semi Join is basically the reverse of Anti Join. When doing a semi join of table "a" and table "b" using some comparison, only those rows from "a" for which
there is a row in "b" that matches the condition are emitted.
But, in case b.x had duplicates, the join above would duplicate rows from "a". A semi join would not, as it only checks whether a matching row is there on the "b side", and if
so, it emits the row from "a".
There can be all three variants of semi joins (Nested Loop Semi Join, Hash Semi Join and Merge Semi Join).
5. depesz says:
2013-05-28 at 14:49
@Victor: it could be related to statistics. This is what the next (and final) part of the series will be about, but I have yet to write it.
Explaining the unexplainable – part 4
Unique
The name seems to make clear what's going on here – it removes duplicate data.
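The example query is not preserved in this copy; the classic case is a plain DISTINCT, something like:
$ explain select distinct relkind from pg_class;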
But in newer Pgs this query will usually be handled using HashAggregate.
The problem with Unique is that it requires the data to be sorted. Not because it needs the data in any particular order – but it needs it so that all rows with the same value
are "together".
This makes it really cool (when it's possible to use) because it uses virtually no memory. It just checks whether the value in the previous row is the same as in the current one, and if
so – discards it. That's all.
So, we can force its usage by pre-sorting the data:
$ explain select distinct relkind from (select relkind from pg_class order by relkind) as x;
QUERY PLAN
-----------------------------------------------------------------------
Unique (cost=22.88..27.26 rows=4 width=1)
-> Sort (cost=22.88..23.61 rows=292 width=1)
Sort Key: pg_class.relkind
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1)
(4 rows)
Append
This plan simply runs multiple sub-operations, and returns all the rows that were returned as one resultset.
$ explain select oid from pg_class union all select oid from pg_proc union all select oid from pg_database;
QUERY PLAN
-----------------------------------------------------------------
Append (cost=0.00..104.43 rows=2943 width=4)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=4)
-> Seq Scan on pg_proc (cost=0.00..92.49 rows=2649 width=4)
-> Seq Scan on pg_database (cost=0.00..1.02 rows=2 width=4)
(4 rows)
In here you can see append running three scans on three tables and returning all the rows together.
Please note that I used UNION ALL. If I'd used UNION, we would get:
$ explain select oid from pg_class union select oid from pg_proc union select oid from pg_database;
QUERY PLAN
-----------------------------------------------------------------------
HashAggregate (cost=141.22..170.65 rows=2943 width=4)
-> Append (cost=0.00..133.86 rows=2943 width=4)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=4)
-> Seq Scan on pg_proc (cost=0.00..92.49 rows=2649 width=4)
-> Seq Scan on pg_database (cost=0.00..1.02 rows=2 width=4)
(5 rows)
This is because UNION removes duplicate rows – which is, in this case, done using HashAggregate operation.
Result
This happens mostly in very simple test queries. This operation is used when your query selects some constant value (or values):
$ explain select 1, 2;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0)
(1 row)
Aside from test queries, it can sometimes be seen in queries that do an "insert, but not if it would be a duplicate" kind of thing:
$ explain insert into t (i) select 1 where not exists (select * from t where i = 1);
QUERY PLAN
---------------------------------------------------------------------
Insert on t (cost=3.33..3.35 rows=1 width=4)
-> Result (cost=3.33..3.34 rows=1 width=0)
One-Time Filter: (NOT $0)
InitPlan 1 (returns $0)
-> Seq Scan on t t_1 (cost=0.00..40.00 rows=12 width=0)
Filter: (i = 1)
(6 rows)
Values Scan
Just like Result above, Values Scan is for returning simple data entered directly in the query, but this time it can be a whole recordset, based on the VALUES() functionality.
In case you don't know: you can select multiple rows with multiple columns, without any table, just by using the VALUES syntax, like here:
$ select * from ( values (1, 'hubert'), (2, 'depesz'), (3, 'lubaczewski') ) as t (a,b);
a | b
---+-------------
1 | hubert
2 | depesz
3 | lubaczewski
(3 rows)
QUERY PLAN
--------------------------------------------------------------
Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=36)
(1 row)
It is also most commonly used in INSERTs, but it has other uses too, like custom sorting.
GroupAggregate
This is similar to previously described HashAggregate.
The difference is that for GroupAggregate to work, the data has to be sorted by whatever column(s) you used in your GROUP BY clause.
Just like Unique – GroupAggregate uses very little memory, but forces an ordering of the data.
Example:
$ explain select relkind, count(*) from (select relkind from pg_class order by relkind) x group by relkind;
QUERY PLAN
-----------------------------------------------------------------------
GroupAggregate (cost=22.88..28.03 rows=4 width=1)
-> Sort (cost=22.88..23.61 rows=292 width=1)
Sort Key: pg_class.relkind
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1)
(4 rows)
HashSetOp
This operation is used by INTERSECT/EXCEPT operations (with the optional "ALL" modifier).
It works by running an Append sub-operation for a pair of sub-queries, and then, based on the result and the optional ALL modifier, it figures out which rows should be
returned. I haven't dug into the source code, so I can't tell you exactly how it works, but given the name and the operation, it looks like a simple counter-based
solution.
Here we can see that, unlike UNION, these operations work on two sources of data:
$ explain select * from (select oid from pg_Class order by oid) x intersect all select * from (select oid from pg_proc order by
oid) y;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashSetOp Intersect All (cost=0.15..170.72 rows=292 width=4)
-> Append (cost=0.15..163.36 rows=2941 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.15..18.37 rows=292 width=4)
-> Index Only Scan using pg_class_oid_index on pg_class (cost=0.15..12.53 rows=292 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.28..145.00 rows=2649 width=4)
-> Index Only Scan using pg_proc_oid_index on pg_proc (cost=0.28..92.02 rows=2649 width=4)
(6 rows)
$ explain select * from (select oid from pg_Class order by oid) x intersect all select * from (select oid from pg_proc order by
oid) y intersect all select * from (Select oid from pg_database order by oid) as w;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
HashSetOp Intersect All (cost=1.03..172.53 rows=2 width=4)
-> Append (cost=1.03..171.79 rows=294 width=4)
-> Subquery Scan on "*SELECT* 3" (cost=1.03..1.07 rows=2 width=4)
-> Sort (cost=1.03..1.03 rows=2 width=4)
Sort Key: pg_database.oid
-> Seq Scan on pg_database (cost=0.00..1.02 rows=2 width=4)
-> Result (cost=0.15..170.72 rows=292 width=4)
-> HashSetOp Intersect All (cost=0.15..170.72 rows=292 width=4)
-> Append (cost=0.15..163.36 rows=2941 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.15..18.37 rows=292 width=4)
-> Index Only Scan using pg_class_oid_index on pg_class (cost=0.15..12.53 rows=292 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.28..145.00 rows=2649 width=4)
-> Index Only Scan using pg_proc_oid_index on pg_proc (cost=0.28..92.02 rows=2649 width=4)
(13 rows)
CTE Scan
This is similar to the previously mentioned Materialize operation. It runs a part of the query and stores the output so that it can be used by another part (or parts) of the
query.
Example:
1. $ explain analyze with x as (select relname, relkind from pg_class) select relkind, count(*), (select count(*) from x) from
x group by relkind;
2. QUERY PLAN
3. -----------------------------------------------------------------------------------------------------------------
4. HashAggregate (cost=24.80..26.80 rows=200 width=1) (actual time=0.466..0.468 rows=6 loops=1)
5. CTE x
6. -> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=65) (actual time=0.009..0.127 rows=295 loops=1)
7. InitPlan 2 (returns $1)
8. -> Aggregate (cost=6.57..6.58 rows=1 width=0) (actual time=0.085..0.085 rows=1 loops=1)
9. -> CTE Scan on x x_1 (cost=0.00..5.84 rows=292 width=0) (actual time=0.000..0.055 rows=295 loops=1)
10. -> CTE Scan on x (cost=0.00..5.84 rows=292 width=1) (actual time=0.012..0.277 rows=295 loops=1)
11. Total runtime: 0.524 ms
12. (8 rows)
Please note that pg_class is scanned only once – line #6. But its results are stored in “x", and then scanned twice – inside the Aggregate (line #9) and the HashAggregate
(line #10).
How is it different from Materialize? To answer fully one would need to jump into the sources, but I would say that the difference stems from the simple fact that CTEs
are user-defined, while Materialize is a helper operation that Pg chooses to use when (it thinks) it makes sense.
The very important thing is that CTEs are run exactly as written. So they can be used to circumvent optimizations that the planner would normally apply, but that are not beneficial in a given case.
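A minimal sketch of what this fence means in practice (table and index names are assumptions; this is the classic pre-PostgreSQL-12 behaviour – newer versions need WITH ... AS MATERIALIZED to force it):

-- with an index on id, the planner can use it directly:
explain select * from big_table where id = 123;

-- wrapped in a CTE, the inner query runs exactly as written,
-- so the whole table is read before the filter is applied:
explain with x as (select * from big_table)
        select * from x where id = 123;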
InitPlan
This plan happens whenever there is a part of your query that can (or has to) be calculated before anything else, and that doesn't depend on anything in the rest of
your query.
$ explain select * from pg_class where relkind = (select relkind from pg_class order by random() limit 1);
QUERY PLAN
------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=13.11..24.76 rows=73 width=203)
Filter: (relkind = $0)
InitPlan 1 (returns $0)
-> Limit (cost=13.11..13.11 rows=1 width=1)
-> Sort (cost=13.11..13.84 rows=292 width=1)
Sort Key: (random())
-> Seq Scan on pg_class pg_class_1 (cost=0.00..11.65 rows=292 width=1)
(7 rows)
In this case – the limit/sort/seq-scan part needs to run before the normal seq scan on pg_class – because Pg has to compare the relkind value with the value
returned by the subquery.
In the next example, Pg additionally sees that the (select length('depesz')) column does not depend on any data from the pg_class table, so it can be run just once, and the length
calculation doesn't have to be redone for every row:
$ explain select *, (select length('depesz')) from pg_class where relkind = (select relkind from pg_class order by random() limit
1);
QUERY PLAN
------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=13.12..24.77 rows=73 width=203)
Filter: (relkind = $1)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0)
InitPlan 2 (returns $1)
-> Limit (cost=13.11..13.11 rows=1 width=1)
-> Sort (cost=13.11..13.84 rows=292 width=1)
Sort Key: (random())
-> Seq Scan on pg_class pg_class_1 (cost=0.00..11.65 rows=292 width=1)
(9 rows)
There is one important thing, though – the numbering of init plans within a single query is “global", and not “per operation".
SubPlan
SubPlans are a bit similar to Nested Loop, in that they can be called many times.
A SubPlan is called to calculate data from a subquery that actually does depend on the current row.
For example:
$ explain analyze select c.relname, c.relkind, (Select count(*) from pg_Class x where c.relkind = x.relkind) from pg_Class c;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on pg_class c (cost=0.00..3468.93 rows=292 width=65) (actual time=0.135..26.717 rows=295 loops=1)
SubPlan 1
-> Aggregate (cost=11.83..11.84 rows=1 width=0) (actual time=0.090..0.090 rows=1 loops=295)
-> Seq Scan on pg_class x (cost=0.00..11.65 rows=73 width=0) (actual time=0.010..0.081 rows=93 loops=295)
Filter: (c.relkind = relkind)
Rows Removed by Filter: 202
Total runtime: 26.783 ms
(7 rows)
For every row that is returned by the scan on “pg_class as c", Pg has to run the SubPlan, which checks how many rows in pg_class have the same value in the relkind column as the currently processed
row.
Please note the “loops=295" in the “Seq Scan on pg_class x" line, and the matching “rows=295" in the earlier “Seq Scan on pg_class c" node.
Other?
Yes. There are other operations. Some of them are too rare to care about (especially since you do have the ultimate source of knowledge: the sources), and some are (I
suspect) older versions of newer nodes.
If you have a plan with an operation I did not cover, and you don't understand it – please let me know in the comments: include a link to the explain output on explain.depesz.com, the
operation name, and the Pg version you saw it in. I will comment on such cases with whatever information I can find.
Posted on 2013-05-19|Tags append, cte, explain, groupaggregate, hashsetop, initplan, postgresql, result, setop, subplan, unexplainable, unique, values|
1. kim says:
2015-06-16 at 11:04
This makes it really cool (when possible to use) because it doesn’t use virtually any memory.
2. depesz says:
2015-06-16 at 12:59
@kim:
sorry, but I don’t understand what you’re asking about. The blogpost is rather long, and describes many different execution nodes, but none of them are
sorts, so I just don’t get the context you’re asking in.
3. Yang Liu says:
2016-06-16 at 00:18
I’m not sure if this is the right place to ask. It’s related to the source code.
I’m studying queries that involve initplan and subplan. Nodes such as hash joins and seq scans are executed by the function ExecProcNode(). But I can’t find
which function executes a subplan or initplan.
4. depesz says:
2016-06-16 at 11:31
@Yang Liu:
Sorry, can’t answer. But for source-code level questions, I can suggest that you ask on the pgsql-hackers mailing list.
Now, in this final post, I will try to explain how it happens that Pg chooses “Operation X" over “Operation Y".
You may have heard that PostgreSQL's planner chooses operations based on statistics. What statistics?
If all rows in a column have the same value – then using a (potentially existing) index on that column doesn't make sense.
On the other hand – if the column is unique (or almost unique) – using the index is a really good idea.
So, now I have a 100,000-row table, where the “all_the_same" column always has the same value (123), and the almost_unique column is, well, almost unique:
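A minimal sketch of how such a table could be built (index names i1/i2 match what is referenced later in the text; everything else is an assumption):

create table test (
    all_the_same  int4,
    almost_unique int4
);
insert into test (all_the_same, almost_unique)
    select 123, (random() * 1000000)::int4
      from generate_series(1, 100000);
create index i1 on test (all_the_same);
create index i2 on test (almost_unique);
analyze test;

-- the interesting comparison is then between plans like these:
explain select * from test where all_the_same = 123;   -- seq scan: every row matches
explain select * from test where almost_unique = 123;  -- index scan: values are almost unique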
As you can see Pg chose wisely. But the interesting thing is the “rows=" estimate. How does it know how many rows the query might return?
When doing “ANALYZE" of a table, Pg takes a “random sample" (more on it in a moment), and computes some stats from it. What are the stats, where are they, and can
we see them? Sure we can:
This table (pg_statistic) is, of course, described in the docs. But it is pretty cryptic. Of course, you can find a very precise explanation in the sources, but that's not (usually) the
best solution.
Luckily, there is a view over this table that contains the same data in a more readable way:
The schemaname, tablename and attname columns seem to be obvious. inherited simply says whether the statistics for this table also include values from any tables that
inherit this column.
That is – if I created a table z inheriting from test, and then put some data into table z, then the statistics for table test would have “inherited = true".
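A minimal sketch of such a setup, reusing the test table from above (the exact commands are an assumption):

create table z () inherits (test);   -- z gets all of test's columns
insert into z (all_the_same, almost_unique)
    select 123, (random() * 1000000)::int4
      from generate_series(1, 1000);
analyze test;

-- pg_stats now also contains rows for test with inherited = true,
-- describing test together with its children (here: z)
select attname, inherited, n_distinct
  from pg_stats
 where tablename = 'test';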
null_frac – what fraction of rows contains a null value in the given column. This is a fraction, so its value goes from 0 to 1.
avg_width – average width of data in this column. In the case of constant-width types (like int4 here) it is not really interesting, but in the case of any datatype with
variable width (like text/varchar/numeric) – it is potentially interesting.
n_distinct – a very interesting value. If it is positive (1+), it is simply the estimated number (not a fraction!) of distinct values – we can see this for the
all_the_same column, where n_distinct is correctly 1. If it is negative, though, its meaning is different: it is then the fraction of rows that have a unique
value. So, in the case of almost_unique, the stats suggest that 92.146% of rows have a unique value (which is a bit short of the 95.142% I showed earlier). The
values can be incorrect due to the “random sample" thing I mentioned earlier, and which I will explain more in a bit.
most_common_vals – array of the most common values in this column.
most_common_freqs – how frequent the values from most_common_vals are – again, it's a fraction, so it can be at most 1 (though then we'd have only one
value in most_common_vals). Here, for almost_unique, we see that Pg “thinks" that the values 21606, 27889, 120502, 289914, 417495, 951355 are the ones
that happen most often – which they are not, but this is, again, caused by the “random sample" effect.
histogram_bounds – array of values which divide (or should divide – again, the “random sample" thing) the whole recordset into groups with the same number of
rows. That is – the number of rows with almost_unique between 2 and 10560 is (more or less) the same as the number of rows with almost_unique between
931785 and 940716.
correlation – this is an interesting statistic – it shows whether there is a correlation between the physical ordering of rows on disk and the values. It can go from -1 to 1,
and generally the closer it is to -1/1, the more correlation there is. For example – after doing “CLUSTER test using i2" – that is, reordering the table in
almost_unique order – I got a correlation of -0.919358 – much better than the -0.000468686 shown above.
most_common_elems, most_common_elem_freqs and elem_count_histogram are like most_common_vals, most_common_freqs and histogram_bounds,
but for non-scalar datatypes (think: arrays, tsvectors and alike).
Based on this data, PostgreSQL can estimate how many rows will be returned by any given part of query, and based on this information it can decide whether it's
better to use seq scan or index scan or bitmap index scan. And when joining – which one should be faster – Hash Join, Merge Join or perhaps Nested Loop.
If you looked at the data above, you could have asked yourself: this is pretty wide output – there are many values in the
most_common_vals/most_common_freqs/histogram_bounds arrays. Why are there so many?
The reason is simple – it's configuration. In postgresql.conf you can find the default_statistics_target variable. This variable tells Pg how many values to keep in these
arrays. In my case (the default) it's 100. But you can easily change it – either by changing postgresql.conf, or even on a per-column basis, with:
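The per-column change is done with ALTER TABLE ... SET STATISTICS; the output below looks like the result of a very low per-column target, so a sketch of the command could be (the exact value is an assumption):

alter table test alter column almost_unique set statistics 5;
analyze test;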
select * from pg_stats where tablename = 'test' and not inherited and attname = 'almost_unique';
-[ RECORD 1 ]----------+---------------------------------------------------------
schemaname | public
tablename | test
attname | almost_unique
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -0.92112
most_common_vals | {114832,3185,3774,6642,11984}
most_common_freqs | {0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}
histogram_bounds | {2,199470,401018,596414,798994,999964}
correlation | 1
most_common_elems | [null]
most_common_elem_freqs | [null]
elem_count_histogram | [null]
Let me show you. First I'll revert the change to the statistics count that I did with ALTER TABLE:
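A sketch of what the revert could look like (setting statistics to -1 makes the column use the default target again):

alter table test alter column almost_unique set statistics -1;
analyze verbose test;
-- with default_statistics_target = 100, the INFO output reports a sample
-- of up to 300 * 100 = 30000 rows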
And now:
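A sketch of lowering the target for a second analyze run (the value of 10 follows from the 3000-row sample mentioned below):

set default_statistics_target = 10;
analyze verbose test;
-- the sample is now only 300 * 10 = 3000 rows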
Please note that the second analyze sampled only 3000 rows – not 30000 like the first one.
Analyzing all rows in a table would be prohibitively expensive for any medium or large table, so ANALYZE works on a sample.
First – it reads a random subset of the pages in the table (reminder: each page is 8kB of data). How many? 300 * statistics target.
Which means that in my case, with default_statistics_target = 100, it would read 30000 pages (my table doesn't have that many, so it read all of them instead).
From these pages, ANALYZE gets just the information about live and dead rows. Afterwards, it takes a random sample of rows – again 300 * statistics target –
and calculates the column statistics based on this data.
In my case – the table had 100,000 rows, but with default_statistics_target = 100, only 30,000 of them – about a third – were sampled. With a statistics target of 10, the number of sampled rows is
even lower – just 3000.
You could say: OK, but then these statistics are not accurate. It could be that some super-common value just doesn't happen to be in the scanned rows. Sure.
You're right. It's possible. Though not really likely. I mean – you are getting a random part of the data. The chances that you'll get exactly the x% of the table that just doesn't
happen to have any row with some value that exists in all the other rows are small.
This also means that sometimes running analyze will “break" your queries. For example – you'll get statistics from a different set of pages, and it will
happen that some values get skipped (or, on the contrary – you will get in most_common_vals things that aren't really all that common; Pg just happened to
pick the right pages/rows to see them). And based on such stats, Pg will generate suboptimal plans.
If such a case hits you, the solution is rather simple – bump the statistics target. This will make analyze work harder and scan more rows, so the chances of this
happening again get even smaller.
There is a drawback to setting large targets, though. ANALYZE has to work more, of course, but this is a maintenance thing, so we don't really care about it
(usually). The problem is that having more data in pg_statistic means that more data has to be taken into consideration by the Pg planner. So, while it might look
tempting to set default_statistics_target to its max of 10,000, in reality I haven't seen a database which had it set that high.
The current default of 100 has been there since 8.4. Before that it was set to 10, and it was pretty common to see suggestions on IRC to increase it. Now, with the default of 100,
you're more or less set.
One final thing I have to talk about, though I really don't want to, are the settings that make the Pg planner use different operations.
First – why I don't want to talk about it: I know for a fact that this can be easily abused. So please remember – these settings are for debugging problems, not for
solving them. An application that uses them in normal operation is at least suspect, if not outright broken. And yes, I know that sometimes you have
to. But that “sometimes" is very rare.
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_indexonlyscan = on
enable_material = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
For example – setting enable_seqscan to false (which can be done with the SET command in an SQL session; you don't have to modify postgresql.conf) will cause the
planner to use anything else it can, just to avoid a seq scan.
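A typical debugging session could look like this (a sketch; table and column names follow the earlier example):

set enable_seqscan = false;
explain select * from test where almost_unique = 123;
-- the planner now prefers an index or bitmap scan, if one is at all possible
reset enable_seqscan;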
Since sometimes it's not possible to avoid a seq scan (there are no indexes on the table, for example) – these settings don't actually disable the operations; they just associate a
huge cost with using them.
For example: with our test table, we know that searching with “all_the_same = 123" will use a seq scan, as it's cheap:
We see that the estimated cost of getting the same data with an index scan is ~ two times higher (3300.29 vs. 1693).
And now we see that when there is no other option than a seq scan (it's interesting that it didn't choose to do an index scan on i2 – after all, it has pointers to all rows in the
table), the cost skyrocketed to 10,000,000,000 – which is exactly what enable_* = false does.
I think that's about it. If you read the whole series you should have enough knowledge to understand what's going on, and, more importantly, why.
Posted on 2013-05-30|Tags analyze, cost, explain, pg_statistic, pg_stats, postgresql, random, sample, statistics, unexplainable, vacuum|
1. Nate says:
2013-05-30 at 16:15
how do you address ‘many_of_the_same’, for the same size table … when there are only 10 or so different values.
thank you.
2. depesz says:
2013-05-30 at 16:16
@Nate:
Sorry, not sure I understand your question. Can you show me (via a pastesite perhaps to avoid formatting issues) the problem/question?
3. Nate says:
2013-05-30 at 16:48
explain select max(many_the_same) from test where many_the_same Limit (cost=0.00..0.07 rows=1 width=4)
-> Index Scan Backward using test_i1 on test (cost=0.00..303585.45 rows=4496733 width=4)
Index Cond: ((many_the_same IS NOT NULL) AND (many_the_same Limit (cost=0.00..0.04 rows=1 width=4)
-> Index Scan Backward using test_i1 on test (cost=0.00..435477.47 rows=10000000 width=4)
Index Cond: (many_the_same IS NOT NULL)
(5 rows)
explain select max(many_the_same) from test where many_the_same Limit (cost=0.00..0.07 rows=1 width=4)
-> Index Scan Backward using test_i1 on test (cost=0.00..303585.45 rows=4496733 width=4)
Index Cond: ((many_the_same IS NOT NULL) AND (many_the_same < 5))
(5 rows)
4. depesz says:
2013-05-30 at 16:50
“explain select max(many_the_same) from test where many_the_same Limit (” looks like either bad copy/paste or something got broken somewhere else.
You should be able to use code html tags: < CODE >put your code in here < / CODE > to get better formatting. It should look like this:
whatever
5. boris says:
2013-07-02 at 13:35
How do these settings relate to each other? Seems that they have rather similar meanings, or not?
6. depesz says:
2013-07-02 at 15:21
They have very different meanings. Not sure how you came to the conclusion that they are similar. *_cost values are values attached to specific low-level
operations. enable_* settings are basically tools to skyrocket the cost of some high-level operations, so that Pg will not use them.
7. Éric says:
2014-10-16 at 23:45
Hi. Just want to thank and congratulate you for this series of articles. They are just invaluable!
8. depesz says:
2014-10-17 at 13:28
9. Santosh says:
2017-06-15 at 07:06
Just wanted to say Thank you! for writing such a detailed article.
I went through complete series and I feel now I’m better equipped to write optimal queries.
You don't see this part in a normal explain analyze – you have to specifically enable it: explain (analyze on, buffers on).
Well, technically you don't need the analyze part, but then you'll be shown an explain like this:
Which shows buffers info only for planning. This is interesting information on its own: it means that to plan this query the planner had to get 3 pages from
storage, and luckily all 3 were already in cache (shared_buffers). Since (by default) each page is 8kB, we know that PostgreSQL would read 24kB of data just to plan this
query.
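The invocation that produces this kind of output could look like this (the query is only an example; buffers-without-analyze output requires a reasonably recent PostgreSQL):

explain (buffers) select count(*) from test;
-- without analyze the query is only planned, not executed, so the output
-- ends with a Planning: section showing Buffers: shared hit=... for planning alone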
But if I run the same query with analyze, more info will be shown:
Please note that we now see buffers info both for the Seq Scan and for the top-level node: Aggregate.
There is also “I/O Timings:" information, which is there thanks to the track_io_timing config parameter being set to on.
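If it isn't already on, it can be enabled per session (a sketch; this requires appropriate privileges):

set track_io_timing = on;
explain (analyze, buffers) select count(*) from test;
-- Buffers lines are then accompanied by “I/O Timings:" whenever a node
-- actually had to read from or write to disk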
So, let's get down to details. First – are shared hit and read all that can be there, and what exactly do the values mean?
Luckily, I have a BIG database of plans, so I can easily find lots of information. For example, in this plan we can see:
Buffers: shared hit=13,538,373 read=85,237 dirtied=204 written=6,263, local hit=10,047,224 read=773,475 dirtied=271,617
written=271,617, temp read=620,674 written=604,150
shared: hit, read, dirtied, written
local: hit, read, dirtied, written
temp: read, written
All values are in blocks, where a block is usually 8192 bytes (you can check your block size by doing: show block_size;).
But what is the exact meaning of hit, read, dirtied, written, shared, local, and temp? And why does temp have only read and written?
PostgreSQL has some part of memory set aside as a cache. Its size is set by the shared_buffers GUC.
Whenever Pg has to read some data from disk, it first checks whether it's in shared_buffers. If it is there – the page can be returned from cache. If it's not there – it is read
from disk (and stored in shared_buffers for reuse). This leads to the first split in our parameters: the hit number tells how many pages were already in cache, and read
tells us how many pages were not in cache and had to be read from disk.
The next thing we can see is written. This is rather simple – this many pages have been written to disk. It is possible that a select will generate writes. For example, let's
assume shared_buffers is only 10 buffers in size, and two of these buffers are already used to cache some disk pages that were modified in memory,
but not yet on disk – for example because the writing transaction hasn't committed yet. These pages in memory are dirty, and if we want to reuse the
space they occupy, we first have to write them out.
This is what written generally is – how much data was written to disk because we needed to free some space in shared_buffers for other data.
And finally – dirtied. Sometimes it just so happens that there are new pages in a table/index, for example due to insert or update operations, and if we select at
just the right moment, PostgreSQL will do something called “updating hint bits". Finding out what hint bits actually are proved to be rather tricky. As far as I was able
to figure out, this is just information about whether the transactions that inserted or deleted a given row are already fully in the past. This information is generally updated on
vacuum, but sometimes it can happen at select time. If that happens (that is, the select updates hint bits), the page will be marked as dirty – to be written to disk
whenever possible.
So, we know what hit/read/written/dirtied are. What are shared/local/temp? That's WAY simpler. Pages that belong to normal objects, shared between many database
connections (tables, indexes, materialized views), are counted as shared. Pages that belong to temporary objects – that is, objects that belong only to the current session
– are counted as local. And temp is for temporary data access needed for sorting, hashing, or materializing.
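A quick way to see local buffers in action is to run the same kind of query against a temporary table (a sketch):

create temporary table tmp_test as select * from test;
explain (analyze, buffers) select count(*) from tmp_test;
-- the Seq Scan on tmp_test reports “Buffers: local hit=... read=..."
-- instead of “shared", because temporary tables live in session-local buffers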
With this information we can now read the data from the previously mentioned explain node with a bit more understanding. A refresher on what that Buffers line tells us:
Wanted to read 13,623,610 pages from normal db objects, found 13,538,373 of them in cache, and had to read 85,237 from disk.
Dirtied (it could have written new rows, updated existing ones, deleted some, or even just set hint bits – we don't know, as it's a function, so we don't know what it did)
204 pages in normal, shared db objects.
Wrote, to make space in memory, 6,263 buffers that belonged to normal, shared db objects.
Wanted to read 10,820,699 pages from temporary db objects, found 10,047,224 of them in cache, and had to read 773,475 from disk.
Dirtied 271,617 pages in temporary db objects.
Wrote, to make space in memory, 271,617 buffers that belonged to temporary db objects.
Read 620,674 blocks from temporary files, for operations that use temp files when work_mem is not enough.
Wrote 604,150 blocks to temporary files.
We can see that the Seq Scan node had only shared hit and read, but the Aggregate node had more. The thing is that buffers info is summarized: each node reports the total of
what it used itself, plus everything that its subnodes used.
In our case that means that the Aggregate node itself used only what is reported in the temp part, because the shared hit/read numbers are exactly the same as for its subnode: Seq
Scan.
This leaves the last two bits of information: what about loops=, and what about parallel queries?
$ explain (analyze on, buffers on) select i, (select count(*) from z where id + 100 > i) from generate_series(1,3) i;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Function Scan on generate_series i (cost=0.00..6079.06 rows=3 width=12) (actual time=41.353..68.639 rows=3 loops=1)
Buffers: shared hit=1329
SubPlan 1
-> Aggregate (cost=2026.33..2026.34 rows=1 width=8) (actual time=22.868..22.869 rows=1 loops=3)
Buffers: shared hit=1329
-> Seq Scan on z (cost=0.00..1943.00 rows=33333 width=0) (actual time=0.007..15.753 rows=100000 loops=3)
Filter: ((id + 100) > i.i)
Buffers: shared hit=1329
Planning:
Buffers: shared hit=2
Planning Time: 0.189 ms
Execution Time: 68.695 ms
(12 rows)
With this query I forced Pg to run 3 separate Seq Scans of table z. At the time of the test, z had 443 pages. The number shown, 1,329 shared hit pages, is exactly 3 * 443,
which shows that the number is a total across all loops.
I then made a larger version of the table (44,248 pages), and ran a test:
Now we see that there were 3 concurrent workers (2 workers launched + the main process). But the buffers reported for the Parallel Seq Scan are: hit=16,111 and
read=28,137, total: 44,248 – exactly the full size of the table.
This blogpost is a foundation I needed to write before I add buffers info parsing to Pg::Explain, and display it on explain.depesz.com. So now you know that
soon(ish) there will be buffers info somewhere there.
I would like to thank all the people who asked me about it over the years – it took some time, but I'm finally getting to do it.
1. Robert says:
2021-06-21 at 19:05
I always love when someone takes the time to really dig into the details and presents them in a digestible manner. Thank you for sharing this!