Ch11 Hash Indexes 1perpage Annotated
Based on:
• All of Chapter 11 of Ramakrishnan & Gehrke (textbook,
pages 370-386)
• Your hashing knowledge from CPSC 221 or equivalent
Some Learning Goals
Compare and contrast the performance of hash-based indexes versus
tree-based indexes (e.g., B+ tree) for equality and range searches.
Provide the best-case, average-case, and worst-case complexities for
such searches.
Explain how collisions are handled for open addressing and chaining
implementations of hash structures. [from CPSC 221]
Explain the advantages that dynamic hashing provides over static
hashing.
Show how insertions and deletions are handled in extendible hashing.
Build an extendible hash index using a given set of data.
Show how insertions and deletions are handled in linear hashing.
Build a linear hash index using a given set of data.
Describe some of the major differences between extendible hashing
and linear hashing (e.g., how the directory is handled, how skew is
handled).
Introduction
Hash-based indexes are usually the best choice for
equality selections.
No traversal of trees
Direct computation of where k* should be
The 3 alternatives for index data entries still apply.
Static and dynamic hashing techniques exist; their
trade-offs are similar to B+ trees.
Question: Hash-based indexes cannot support
range searches efficiently. Why not?
Motivation
Udi Manber, Chief Scientist, Yahoo! : “The three most
important algorithms at Yahoo”, he said, “were
hashing, hashing, and hashing.”
• Quoted in: Skiena, Steven S. The Algorithm Design Manual, 2nd
edition, Springer, 2008, p. 92.
Review from CPSC 221 or Equivalent
1. What are the characteristics of a good hash function?
Static Hashing
The # of primary pages is fixed; they are allocated sequentially and never
deallocated. Overflow pages are added if needed.
h(key) mod N = the bucket (primary bucket page + any overflow
page(s)) to which the data entry with key k belongs (N = # of
primary pages or “nodes”; they contain lots of <key, rid> pairs)
[Figure: a search key is fed to hash function h; h(key) mod N selects one of
the primary bucket pages 0 .. N−1, each of which may chain to overflow pages.]
Static Hashing (cont.)
A bucket (or bucket chain) contains data entries.
Sometimes we inconsistently call a single page a bucket. But, we’ll
try to use the term “bucket” to refer to both the primary page and
any overflow pages in the same chain.
Analogy: All the buckets in a hash table can be compared
to the leaf pages in a B+ tree.
A hash function works on a search key field of record r, and
must distribute values (ideally, uniformly) over the range 0
... [N-1].
h(key) = (a * key + b) mod N usually works well.
• The modulus operator acts as a compression map, so that the
(potentially long) hash value fits into the given range of
buckets in the hash structure.
a and b are constants; a lot is known about how to tune h.
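The static scheme above can be sketched in a few lines. This is a minimal illustration, not a real page-based implementation: the constants A and B are arbitrary picks, N is a hypothetical number of primary pages, and each bucket chain is modeled as one Python list.

```python
# Minimal static-hashing sketch. A, B, and N are illustrative values only;
# real systems tune a and b and size N from the expected data volume.
N = 8            # fixed number of primary bucket pages
A, B = 31, 7     # constants for the hash function

def h(key: int) -> int:
    """Compression map: (a*key + b) mod N picks a bucket in 0..N-1."""
    return (A * key + B) % N

# Each bucket = primary page plus any overflow chain, modeled as one list.
buckets = [[] for _ in range(N)]

def insert(key, rid):
    buckets[h(key)].append((key, rid))   # overflow pages would chain here

def search(key):
    return [rid for (k, rid) in buckets[h(key)] if k == key]
```

Note that two distinct keys (e.g., 5 and 13 here) can land in the same bucket; the search then scans the chain and filters by key, which is why long overflow chains hurt performance.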
Static Hashing (cont.)
Long overflow chains can develop and degrade
performance.
Extendible and Linear Hashing are dynamic techniques to
fix this problem.
7. In static hashing (e.g., CPSC 221), what is the solution
when:
a) A hash table becomes full or nearly full in open addressing?
Extendible Hashing (EH)
Let’s look at dynamic hashing in database systems.
First, let’s look at Extendible Hashing.
In EH, there are no overflow pages.
So, if a bucket (primary page) becomes full, why not reorganize the
file by doubling the total number of buckets?
Reading and writing all pages is expensive (slow)!
Idea: Use a directory of pointers to buckets. Double the number
of potential buckets by doubling the directory, and splitting just
the bucket that overflowed.
The directory is much smaller than the file; so, doubling it is
much cheaper. Typically, only one page of data entries is
split.
The trick lies in how the hash function is adjusted.
Example (Slide 1 of 3)
e.g., the directory might start as an array of size 4, with
GLOBAL DEPTH = 2.
To find the bucket for key r, take the last ‘global depth’ # of
bits of h(r). e.g., suppose h(r) = the binary value of r. If r = 5,
then h(r) = 101; so, the last 2 bits (01) of h(r) map to Bucket B.
[Figure: DIRECTORY entries 00, 01, 10, 11 point to Bucket A (4*, 12*, 32*, 16*),
Bucket B (1*, 5*, 21*, 13*), Bucket C (10*), and Bucket D (15*, 7*, 19*),
respectively; each bucket has LOCAL DEPTH 2.]
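The directory lookup and the split-with-doubling step can be sketched as follows. This is a hedged illustration, not the textbook's code: the class and method names are hypothetical, h(r) = r as in the slide example, and the page capacity of 4 entries matches the buckets shown above.

```python
# Extendible-hashing sketch. Assumes h(r) = r and 4 data entries per page,
# matching the slide example; names are illustrative only.
PAGE_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 2
        # directory of size 2**global_depth, one bucket per entry to start
        self.directory = [Bucket(2) for _ in range(4)]

    def _dir_index(self, key):
        # use the LAST global_depth bits of h(key) = key
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return [k for k in self.directory[self._dir_index(key)].entries
                if k == key]

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.entries) < PAGE_CAPACITY:
            bucket.entries.append(key)
            return
        # bucket is full: double the directory first if needed, then split
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)           # the 'split image'
        bit = 1 << (bucket.local_depth - 1)          # the newly examined bit
        for i, d in enumerate(self.directory):       # repoint half the slots
            if d is bucket and (i & bit):
                self.directory[i] = image
        old, bucket.entries = bucket.entries, []
        for k in old:                                # redistribute entries
            self.insert(k)
        self.insert(key)
```

Replaying the slide example, inserting 4*, 12*, 32*, 16* fills Bucket A, and the next insert (20*) doubles the directory to size 8 and splits just that one bucket, leaving 32*, 16* behind and moving 4*, 12*, 20* to the split image.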
Example: Slide 3 of 3
Insert Key r = 20 (cont.)
[Figure: after inserting 20*, the directory has doubled to size 8
(GLOBAL DEPTH 3), and the split bucket and its split image have
LOCAL DEPTH 3.]
Comments on Extendible Hashing (cont.)
How many pages (at full occupancy) do we get from
1,000,000 records, if we use Alt. 2 (meaning we store <key,
rid> pairs in the bucket, one data entry (DE) per record)?
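The answer depends on the page capacity, which the question leaves open. As a hedged illustration, assume a page holds 100 <key, rid> data entries; the capacity value here is purely an assumption:

```python
import math

records = 1_000_000
entries_per_page = 100   # assumed capacity; depends on page and entry sizes

# Alt. 2 stores one <key, rid> data entry per record, so at full occupancy:
pages = math.ceil(records / entries_per_page)
print(pages)   # 10,000 pages under this assumed capacity
```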
Comments on Extendible Hashing, cont.
Deletions: If the removal of a data entry makes a
bucket empty, the bucket can simply be merged
with its ‘split image’ to minimize the number of
pages we have to visit in the future.
If each directory element points to the same bucket as its
split image, we can halve the directory.
Former UBC CPSC professor Nick Pippenger was
on the team that developed extendible hashing.
Their research paper was published in 1979 in
ACM’s Transactions on Database Systems ([Fagin, et
al., 1979]).
Linear Hashing (LH)
This is another dynamic hashing scheme—an
alternative to EH.
LH handles the potential problem of long overflow
chains without using a directory; and handles
duplicates in a nice way.
Idea: Use a family of hash functions h_0, h_1, h_2, ... to
map the key to the appropriate bucket:
• h_i(key) = h(key) mod (2^i * N); N = initial # of buckets
• h_0 handles the case with N buckets
• h_1 handles the case with 2N buckets
• h_2 handles the case with 4N buckets
h_{i+1} doubles the range of h_i (similar to directory doubling)
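The family of hash functions is tiny to write down. As in the later slides, assume h(key) = key itself and N = 4 initial buckets (both are the example's assumptions, not requirements):

```python
N = 4  # initial number of buckets, as in the slide example

def h(key: int) -> int:
    return key            # identity hash, as the slides assume

def h_i(i: int, key: int) -> int:
    """h_i(key) = h(key) mod (2**i * N); each level doubles the range."""
    return h(key) % (2 ** i * N)
```

For example, h_0 sends both 5 and 9 to bucket 1, while h_1 separates them into buckets 5 and 1; note h_{i+1}(key) is always either h_i(key) or h_i(key) plus the old range, which is exactly why splitting one bucket at a time works.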
Linear Hashing (cont.)
Directory avoided in LH by using overflow pages,
and choosing bucket to split round-robin.
Important: The bucket being split isn’t necessarily
a full one!
Splitting proceeds in ‘rounds’. A round ends when all
N_R initial (for round R) buckets have been split. Buckets 0 to
Next−1 have been split; Next to N_R−1 are yet to be split.
The current round number is called Level.
Search: To find the bucket for data entry r, compute h_Level(r):
• If h_Level(r) is in the range Next to N_R − 1, r belongs there
• Else, r belongs to either bucket h_Level(r) or bucket
h_Level(r) + N_R; apply h_{Level+1}(r) to find out which
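The search rule is a three-line function. This sketch assumes h(key) = key and N = 4, as in the slide example; the function name is hypothetical:

```python
def lh_search_bucket(key: int, level: int, next_ptr: int, n: int = 4) -> int:
    """Return the linear-hashing bucket number for `key`."""
    n_level = (2 ** level) * n    # buckets at the start of round `level`
    b = key % n_level             # h_Level(key)
    if b >= next_ptr:             # bucket b has not been split this round
        return b
    return key % (2 * n_level)    # h_{Level+1} decides: b or b + n_level
```

With Level = 0, N = 4, Next = 1, keys 32 and 44 both give h_0 = 0, which is below Next; h_1 then keeps 32 in bucket 0 but sends 44 to split image 4.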
Linear Hashing (cont.)
In the middle of a round:
[Figure: the range of h_Level covers the N_Level buckets that existed at the
beginning of the round. Buckets 0 to Next−1 have already been split in this
round; bucket Next is the next to be split; the remaining buckets up to
N_Level−1 are yet to be split. The ‘split image’ buckets at the end were
created (through splitting of buckets) in this round. If
h_Level(search key value) falls in the already-split range, use
h_{Level+1}(search key value) to decide if the entry is in the
‘split image’ bucket.]
Linear Hashing (cont.)
Insert: Find the bucket by applying h_Level / h_{Level+1}:
If the bucket to insert into is full:
• Add an overflow page and insert the data entry
• Split the Next bucket and increment Next
The implementors can choose any criterion to
‘trigger’ a split.
Unless told otherwise, the rule we’ll follow in this course
is: “At most one split per insertion”. In particular:
• Only split when you have to allocate a new bucket, that is, when
the data entry that you are trying to insert will not fit in the
existing page(s) (i.e., because both the target bucket and the rest
of its overflow chain—if any—are full).
• We’ll assume that any cascading effects (e.g., reallocation of
overflow pages) won’t cause further splits for the same insertion.
Linear Hashing (cont.)
Since buckets are split round-robin, long overflow
chains typically don’t develop.
Over time, the chains will shrink, as the index “matures”.
A doubling of the directory in Extendible Hashing
was similar. There, the switching of hash functions
was implicit in how the number of bits examined was
increased.
Example of a Linear Hashing Index
On a split, h_{Level+1} is used to re-distribute entries.
e.g., 9 = (1001)_2
[Figure: Level = 0, N = 4, Next = 0. Primary pages, indexed by h_1/h_0:
000/00 → 32*, 44*, 36*; 001/01 → 9*, 25*, 5* (the data entry r with
h(r) = 5); 010/10 → 14*, 18*, 10*, 30*; 011/11 → 31*, 35*, 7*, 11*.]
Example of Linear Hashing (cont.)
Let’s insert 37*, 29*, 22*, 66*, and 34*.
37 = (100101)_2
29 = (11101)_2
9 = (1001)_2 … 25 = (11001)_2 … 5 = (101)_2
22 = (10110)_2
14 = (1110)_2 … 18 = (10010)_2 … 10 = (1010)_2
30 = (11110)_2
66 = (1000010)_2
34 = (100010)_2
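These binary expansions determine where each inserted key lands: with N = 4 and h_i(key) = key mod (2^i * 4), only the last 2 or 3 bits matter. A quick check (assuming h(key) = key, as above):

```python
# Map each key to its h_0 and h_1 buckets: the last 2 and last 3 bits.
placements = {key: (bin(key), key % 4, key % 8)
              for key in [37, 29, 22, 66, 34]}
for key, (bits, h0, h1) in placements.items():
    print(f"{key} = {bits}: h0 -> bucket {h0}, h1 -> bucket {h1}")
```

So 37* and 29* share bucket 5 under h_1, 22* goes to bucket 6, and 66* and 34* both stay in bucket 2.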
Example (cont.): End of a Round
Let’s insert 50* = (110010)_2.
[Figure, before the insert: Level = 0, Next = 3. Primary pages (h_1/h_0):
000/00 → 32*; 001/01 → 9*, 25*; 010/10 → 66*, 18*, 10*, 34*;
011/11 → 31*, 35*, 7*, 11*, plus an overflow page holding 43*;
split images 100/00 → 44*, 36*; and further split-image buckets.]
[Figure, after the insert: h_0(50) = 10 is in the already-split range, so
h_1(50) = 010 places 50* in bucket 010, which is full; an overflow page is
allocated for 50*, and bucket Next = 3 (011) is split, ending the round.
Now Level = 1, Next = 0: 000/00 → 32*; 001/01 → 9*, 25*;
010/10 → 66*, 18*, 10*, 34*, plus overflow page 50*;
011/11 → 43*, 35*, 11*; 100/00 → 44*, 36*; and the remaining buckets.]
What Happened?
Summary
Hash-based indexes: best for equality searches,
cannot support range searches.
Static hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by
splitting a full bucket when a new data entry is to be
added to it.
Lots of duplicates may require overflow pages. Why?
The directory keeps track of the buckets. It doubles
periodically.
The directory can become large if the data is heavily skewed.
• Additional I/Os are needed if the directory does not fit in main memory
Summary (cont.)
A variation of extendible hashing is extensible
hashing, where we use the leading bits (on the left)
to discriminate, rather than the last bits.
We’ll stick with extendible hashing in this course.
Linear Hashing avoids a directory by splitting buckets
round-robin, and using overflow pages.
Overflow chains are not likely to be long.
• Over time, they may disappear.
Duplicates are handled easily.
Space utilization could be lower than that of Extendible
Hashing, since splits are not concentrated on ‘dense’ data
areas.