BigQuery Query Optimization With Troposphere PDF
Advanced BigQuery
[Figure: BigQuery architecture. Storage: replicated, distributed (99.9999999999% durability). Compute: high-availability cluster (Dremel). Interfaces: SQL:2011 compliant queries, REST API, streaming ingest.]
SELECT foo, COUNT(*) as cnt
FROM `...`
GROUP BY 1
ORDER BY 2 DESC
LIMIT 2

[Figure: execution tree for the query above. Shards read from distributed storage; rows are partitioned by key range ([A-M], [N-Z]); Stage 2: Sum (1 slot); Master.]

Intermediate rows (foo, cnt): A 10, Z 16, B 9
Final result (foo, cnt): Z 16, A 10
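The staged execution of the COUNT(*) ... GROUP BY query above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`partial_counts`, `route`, `run_query`) and the input rows are invented, the rows are chosen so the totals match the slide (A 10, Z 16, B 9), and real BigQuery scheduling is more involved.

```python
from collections import Counter

def partial_counts(rows):
    """Stage 1 on one shard: count the rows it read, per value of foo."""
    return Counter(rows)

def route(key):
    """Shuffle rule from the slide: keys A-M go to worker 0, N-Z to worker 1."""
    return 0 if key[0].upper() <= "M" else 1

def run_query(shards, limit=2):
    # Stage 1: each shard aggregates the rows it read locally.
    partials = [partial_counts(rows) for rows in shards]
    # Shuffle: send each partial count to the worker that owns its key range.
    workers = [Counter(), Counter()]
    for partial in partials:
        for key, cnt in partial.items():
            workers[route(key)][key] += cnt  # Stage 2: sum the partial counts
    # Master: merge worker output, ORDER BY cnt DESC, LIMIT.
    merged = [item for worker in workers for item in worker.items()]
    merged.sort(key=lambda kv: kv[1], reverse=True)
    return merged[:limit]

# Invented input: two shards whose rows add up to A=10, Z=16, B=9.
shards = [
    ["A"] * 6 + ["Z"] * 10 + ["B"] * 4,
    ["A"] * 4 + ["Z"] * 6 + ["B"] * 5,
]
result = run_query(shards)
print(result)
```

Because every row for a given key ends up on exactly one worker, the per-key sums are final before the master sorts and applies the limit.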
SELECT language,
MAX(views) as views
FROM
wikipedia_benchmark.Wiki1B
WHERE title LIKE "G%o%"
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
Shuffle
● Shuffle is invoked whenever the mapping of rows from stage N to stage N+1 isn’t statically determined
● Shuffle has quotas: shuffle too much data and you spill to disk
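A minimal sketch of what "not statically determined" means: the destination worker is computed from each row's key at run time, so it cannot be known before the data is read. The hash function, the `max_rows_in_memory` budget, and the spill report are all hypothetical stand-ins, not BigQuery's actual quota mechanism.

```python
import zlib
from collections import defaultdict

def destination(key, num_workers):
    """Pick the stage N+1 worker from the key's value, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_workers

def shuffle(rows, num_workers, max_rows_in_memory):
    """Route (key, value) rows to workers.

    Returns the partitions and the workers whose partition exceeds the
    hypothetical in-memory budget (those would have to spill to disk).
    """
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[destination(key, num_workers)].append((key, value))
    spilled = sorted(
        worker for worker, part in partitions.items()
        if len(part) > max_rows_in_memory
    )
    return dict(partitions), spilled

# Invented input: a skewed key ("en") overloads whichever worker owns it.
rows = [("en", 1)] * 5 + [("de", 1)] * 2
partitions, spilled = shuffle(rows, num_workers=4, max_rows_in_memory=3)
print(partitions, spilled)
```

The sketch also shows why skewed keys are costly: all rows with the same key must land on the same worker, so one hot key can exhaust that worker's budget on its own.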
Independent shuffles
SELECT
c.author.name a, c2.a m
FROM github_repos.commits c
JOIN (
SELECT
committer.name a, commit
FROM github_repos.commits) c2
ON c.commit = c2.commit
LIMIT 1000
[Figure: join execution. Master; shards reading the left table; shards reading the right table; distributed storage.]
SELECT c.author.name a,
c2.a m
FROM github_repos.commits c
JOIN (
SELECT
committer.name a, commit
FROM github_repos.commits) c2
ON c.commit = c2.commit
WHERE c2.a = 'tom'
LIMIT 1000
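One common way for a distributed engine to evaluate a self-join like this is a shuffled hash join: both sides are shuffled independently on the join key, and each worker joins only its own partitions. The sketch below assumes that strategy, and it also assumes the filter on the committer name is applied before the shuffle. The function names and sample rows are invented, and none of this is a statement about BigQuery's actual plan.

```python
import zlib
from collections import defaultdict

def partition(rows, key_fn, num_workers):
    """Shuffle one side of the join by a stable hash of its join key."""
    parts = defaultdict(list)
    for row in rows:
        worker = zlib.crc32(key_fn(row).encode("utf-8")) % num_workers
        parts[worker].append(row)
    return parts

def shuffled_join(left, right, num_workers=4, limit=1000):
    # Assumption: filter the right side first so fewer rows are shuffled.
    right = [r for r in right if r["committer"] == "tom"]
    # The two shuffles are independent of each other.
    left_parts = partition(left, lambda r: r["commit"], num_workers)
    right_parts = partition(right, lambda r: r["commit"], num_workers)
    out = []
    for worker in range(num_workers):
        # Build a hash table on the right partition, then probe with the left.
        table = defaultdict(list)
        for r in right_parts.get(worker, []):
            table[r["commit"]].append(r)
        for l in left_parts.get(worker, []):
            for r in table.get(l["commit"], []):
                out.append((l["author"], r["committer"]))
                if len(out) == limit:
                    return out
    return out

# Invented sample rows standing in for github_repos.commits.
commits = [
    {"commit": "c1", "author": "ann", "committer": "tom"},
    {"commit": "c2", "author": "bob", "committer": "sue"},
    {"commit": "c3", "author": "tom", "committer": "tom"},
]
pairs = shuffled_join(commits, commits)
print(sorted(pairs))
```

Matching rows always share a commit hash, so they always hash to the same worker; no worker needs to see another worker's partition.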
[Figure: Stage 1: Read (shards over distributed storage), then Shuffle, then Stage 2 (X shards), shown as Stage 2.1 (Y shards) and Stage 2.2 (Z shards).]
SELECT title
FROM ....Wiki10B
GROUP BY title
ORDER BY title
LIMIT 1000
[Figure: Master; independent shuffles; distributed storage.]
SELECT title
FROM Wiki1B
ORDER BY title
[Figure: nodes over distributed storage holding foo {A, D, F}, foo {A, C, H}, foo {B, K, L} and dropping {G, Q}, {Z}, {M}.]
Each node can drop values over the limit.
SELECT title
FROM Wiki1B
ORDER BY title
LIMIT 1000
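Why the LIMIT helps can be sketched as a per-node top-N: each node keeps only its `limit` smallest titles and discards the rest, so the master merges at most `limit` rows per node instead of sorting everything. The function names and the sample titles are invented for illustration.

```python
import heapq

def local_top_n(titles, limit):
    """On one node: keep the `limit` smallest titles, drop everything else."""
    return heapq.nsmallest(limit, titles)

def distributed_order_by_limit(shards, limit):
    # Each node trims its own data first.
    partials = [local_top_n(titles, limit) for titles in shards]
    # The master only sees at most limit * len(shards) rows.
    return heapq.nsmallest(limit, (t for part in partials for t in part))

# Invented input: two nodes with four titles each, LIMIT 3.
shards = [["H", "A", "Z", "C"], ["L", "M", "B", "K"]]
result = distributed_order_by_limit(shards, limit=3)
print(result)
```

Without the LIMIT no node can discard anything, because every row is part of the final ordered result.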
Size    Error
1M      0.59%
10M     0.94%
100M    0.39%
1B      0.51%
10B     0.32%
100B    0.32%