Ch11 Hash Indexes 1perpage Annotated
Based on:
• All of Chapter 11 of Ramakrishnan & Gehrke (textbook,
pages 370-386)
• Your hashing knowledge from CPSC 221 or equivalent
Some Learning Goals
Compare and contrast the performance of hash-based indexes versus
tree-based indexes (e.g., B+ tree) for equality and range searches.
Provide the best-case, average-case, and worst-case complexities for
such searches.
Explain how collisions are handled for open addressing and chaining
implementations of hash structures. [from CPSC 221]
Explain the advantages that dynamic hashing provides over static
hashing.
Show how insertions and deletions are handled in extendible hashing.
Build an extendible hash index using a given set of data.
Show how insertions and deletions are handled in linear hashing.
Build a linear hash index using a given set of data.
Describe some of the major differences between extendible hashing
and linear hashing (e.g., how the directory is handled, how skew is
handled).
Introduction
Hash-based indexes are usually the best choice for
equality selections.
No traversal of trees
Direct computation of where k* should be
The 3 alternatives for index data entries still apply.
Static and dynamic hashing techniques exist; their
trade-offs are similar to B+ trees.
Question: Hash-based indexes cannot support
range searches efficiently. Why not?
Motivation
Udi Manber, Chief Scientist, Yahoo! : “The three most
important algorithms at Yahoo”, he said, “were
hashing, hashing, and hashing.”
• Quoted in: Skiena, Steven S. The Algorithm Design Manual, 2nd
edition, Springer, 2008, p. 92.
Review from CPSC 221 or Equivalent
1. What are the characteristics of a good hash function?
Static Hashing
The # of primary pages is fixed; they are allocated sequentially and never
deallocated. Overflow pages are added if needed.
h(key) mod N = the bucket (primary bucket page + any overflow
page(s)) to which the data entry with key k belongs (N = # of
primary pages or “nodes”; they contain lots of <key, rid> pairs)
[Figure: a search key is fed to hash function h; h(key) mod N selects one of
the primary bucket pages 0 .. N−1, each of which may chain to overflow pages.]
Static Hashing (cont.)
A bucket (or bucket chain) contains data entries.
Sometimes we inconsistently call a single page a bucket. But, we’ll
try to use the term “bucket” to refer to both the primary page and
any overflow pages in the same chain.
Analogy: All the buckets in a hash table can be compared
to the leaf pages in a B+ tree.
A hash function works on a search key field of record r, and
must distribute values (ideally, uniformly) over the range 0
... [N-1].
h(key) = (a * key + b) mod N usually works well.
• The modulus operator acts as a compression map, so that the
(potentially long) hash value fits into the given range of
buckets in the hash structure.
a and b are constants; a lot is known about how to tune h.
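The static scheme above can be sketched in a few lines. This is a minimal illustration, not a real page-based implementation: the constants A and B are arbitrary picks, N is a hypothetical number of primary pages, and each bucket chain is modeled as one Python list.

```python
# Minimal static-hashing sketch. A, B, and N are illustrative values only;
# real systems tune a and b and size N from the expected data volume.
N = 8            # fixed number of primary bucket pages
A, B = 31, 7     # constants for the hash function

def h(key: int) -> int:
    """Compression map: (a*key + b) mod N picks a bucket in 0..N-1."""
    return (A * key + B) % N

# Each bucket = primary page plus any overflow chain, modeled as one list.
buckets = [[] for _ in range(N)]

def insert(key, rid):
    buckets[h(key)].append((key, rid))   # overflow pages would chain here

def search(key):
    return [rid for (k, rid) in buckets[h(key)] if k == key]
```

Note that two distinct keys (e.g., 5 and 13 here) can land in the same bucket; the search then scans the chain and filters by key, which is why long overflow chains hurt performance.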
Static Hashing (cont.)
Long overflow chains can develop and degrade
performance.
Extendible and Linear Hashing are dynamic techniques to
fix this problem.
7. In static hashing (e.g., CPSC 221), what is the solution
when:
a) A hash table becomes full or nearly full in open addressing?
Extendible Hashing (EH)
Let’s look at dynamic hashing in database systems.
First, let’s look at Extendible Hashing.
In EH, there are no overflow pages.
So, if a bucket (primary page) becomes full, why not reorganize the
file by doubling the total number of buckets?
Reading and writing all pages is expensive (slow)!
Idea: Use a directory of pointers to buckets. Double the number
of potential buckets by doubling the directory, and splitting just
the bucket that overflowed.
The directory is much smaller than the file; so, doubling it is
much cheaper. Typically, only one page of data entries is
split.
The trick lies in how the hash function is adjusted.
Example (Slide 1 of 3)
e.g., the directory might start as an array of size 4, with
GLOBAL DEPTH = 2.
To find the bucket for key r, take the last ‘global depth’ # of
bits of h(r). e.g., suppose h(r) = the binary value of r. If r = 5,
then h(r) = 101; so, the last 2 bits (01) of h(r) map to Bucket B.
[Figure: DIRECTORY entries 00, 01, 10, 11 point to Bucket A (4*, 12*, 32*, 16*),
Bucket B (1*, 5*, 21*, 13*), Bucket C (10*), and Bucket D (15*, 7*, 19*),
respectively; each bucket has LOCAL DEPTH 2.]
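The directory lookup and the split-with-doubling step can be sketched as follows. This is a hedged illustration, not the textbook's code: the class and method names are hypothetical, h(r) = r as in the slide example, and the page capacity of 4 entries matches the buckets shown above.

```python
# Extendible-hashing sketch. Assumes h(r) = r and 4 data entries per page,
# matching the slide example; names are illustrative only.
PAGE_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 2
        # directory of size 2**global_depth, one bucket per entry to start
        self.directory = [Bucket(2) for _ in range(4)]

    def _dir_index(self, key):
        # use the LAST global_depth bits of h(key) = key
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return [k for k in self.directory[self._dir_index(key)].entries
                if k == key]

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        if len(bucket.entries) < PAGE_CAPACITY:
            bucket.entries.append(key)
            return
        # bucket is full: double the directory first if needed, then split
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)           # the 'split image'
        bit = 1 << (bucket.local_depth - 1)          # the newly examined bit
        for i, d in enumerate(self.directory):       # repoint half the slots
            if d is bucket and (i & bit):
                self.directory[i] = image
        old, bucket.entries = bucket.entries, []
        for k in old:                                # redistribute entries
            self.insert(k)
        self.insert(key)
```

Replaying the slide example, inserting 4*, 12*, 32*, 16* fills Bucket A, and the next insert (20*) doubles the directory to size 8 and splits just that one bucket, leaving 32*, 16* behind and moving 4*, 12*, 20* to the split image.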
Example: Slide 3 of 3
Insert Key r = 20 (cont.)
[Figure: after inserting 20*, the directory has doubled to size 8
(GLOBAL DEPTH 3), and the split bucket and its split image have
LOCAL DEPTH 3.]
Comments on Extendible Hashing (cont.)
How many pages (at full occupancy) do we get from
1,000,000 records, if we use Alt. 2 (meaning we store <key,
rid> pairs in the bucket, one data entry (DE) per record)?
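The answer depends on the page capacity, which the question leaves open. As a hedged illustration, assume a page holds 100 <key, rid> data entries; the capacity value here is purely an assumption:

```python
import math

records = 1_000_000
entries_per_page = 100   # assumed capacity; depends on page and entry sizes

# Alt. 2 stores one <key, rid> data entry per record, so at full occupancy:
pages = math.ceil(records / entries_per_page)
print(pages)   # 10,000 pages under this assumed capacity
```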
Comments on Extendible Hashing, cont.
Deletions: If the removal of a data entry makes a
bucket empty, the bucket can simply be merged
with its ‘split image’ to minimize the number of
pages we have to visit in the future.
If each directory element points to the same bucket as its
split image, we can halve the directory.
Former UBC CPSC professor Nick Pippenger was
on the team that developed extendible hashing.
Their research paper was published in 1979 in
ACM’s Transactions on Database Systems ([Fagin, et
al., 1979]).
Linear Hashing (LH)
This is another dynamic hashing scheme—an
alternative to EH.
LH handles the potential problem of long overflow
chains without using a directory; and handles
duplicates in a nice way.
Idea: Use a family of hash functions h_0, h_1, h_2, ... to
map the key to the appropriate bucket:
• h_i(key) = h(key) mod (2^i * N); N = initial # of buckets
• h_0 handles the case with N buckets
• h_1 handles the case with 2N buckets
• h_2 handles the case with 4N buckets
h_{i+1} doubles the range of h_i (similar to directory doubling)
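The family of hash functions is tiny to write down. As in the later slides, assume h(key) = key itself and N = 4 initial buckets (both are the example's assumptions, not requirements):

```python
N = 4  # initial number of buckets, as in the slide example

def h(key: int) -> int:
    return key            # identity hash, as the slides assume

def h_i(i: int, key: int) -> int:
    """h_i(key) = h(key) mod (2**i * N); each level doubles the range."""
    return h(key) % (2 ** i * N)
```

For example, h_0 sends both 5 and 9 to bucket 1, while h_1 separates them into buckets 5 and 1; note h_{i+1}(key) is always either h_i(key) or h_i(key) plus the old range, which is exactly why splitting one bucket at a time works.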
Linear Hashing (cont.)
Directory avoided in LH by using overflow pages,
and choosing bucket to split round-robin.
Important: The bucket being split isn’t necessarily
a full one!
Splitting proceeds in ‘rounds’. A round ends when all
N_R initial (for round R) buckets have been split. Buckets 0 to
Next−1 have been split; Next to N_R−1 are yet to be split.
The current round number is called Level.
Search: To find the bucket for data entry r, compute h_Level(r):
• If h_Level(r) is in the range Next to N_R − 1, r belongs there
• Else, r belongs to either bucket h_Level(r) or bucket
h_Level(r) + N_R; apply h_{Level+1}(r) to find out which
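The search rule is a three-line function. This sketch assumes h(key) = key and N = 4, as in the slide example; the function name is hypothetical:

```python
def lh_search_bucket(key: int, level: int, next_ptr: int, n: int = 4) -> int:
    """Return the linear-hashing bucket number for `key`."""
    n_level = (2 ** level) * n    # buckets at the start of round `level`
    b = key % n_level             # h_Level(key)
    if b >= next_ptr:             # bucket b has not been split this round
        return b
    return key % (2 * n_level)    # h_{Level+1} decides: b or b + n_level
```

With Level = 0, N = 4, Next = 1, keys 32 and 44 both give h_0 = 0, which is below Next; h_1 then keeps 32 in bucket 0 but sends 44 to split image 4.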
Linear Hashing (cont.)
In the middle of a round:
[Figure: the range of h_Level covers the N_Level buckets that existed at the
beginning of the round. Buckets 0 to Next−1 have already been split in this
round; bucket Next is the next to be split; the remaining buckets up to
N_Level−1 are yet to be split. The ‘split image’ buckets at the end were
created (through splitting of buckets) in this round. If
h_Level(search key value) falls in the already-split range, use
h_{Level+1}(search key value) to decide if the entry is in the
‘split image’ bucket.]
Linear Hashing (cont.)
Insert: Find the bucket by applying h_Level / h_{Level+1}:
If the bucket to insert into is full:
• Add an overflow page and insert the data entry
• Split the Next bucket and increment Next
The implementors can choose any criterion to
‘trigger’ a split.
Unless told otherwise, the rule we’ll follow in this course
is: “At most one split per insertion”. In particular:
• Only split when you have to allocate a new bucket, that is, when
the data entry that you are trying to insert will not fit in the
existing page(s) (i.e., because both the target bucket and the rest
of its overflow chain—if any—are full).
• We’ll assume that any cascading effects (e.g., reallocation of
overflow pages) won’t cause further splits for the same insertion.
Linear Hashing (cont.)
Since buckets are split round-robin, long overflow
chains typically don’t develop.
Over time, the chains will shrink, as the index “matures”.
A doubling of the directory in Extendible Hashing
was similar. There, the switching of hash functions
was implicit in how the number of bits examined was
increased.
Example of a Linear Hashing Index
On a split, h_{Level+1} is used to re-distribute entries.
e.g., 9 = (1001)_2
[Figure: Level = 0, N = 4, Next = 0. Primary pages, indexed by h_1/h_0:
000/00 → 32*, 44*, 36*; 001/01 → 9*, 25*, 5* (the data entry r with
h(r) = 5); 010/10 → 14*, 18*, 10*, 30*; 011/11 → 31*, 35*, 7*, 11*.]
Example of Linear Hashing (cont.)
Let’s insert 37*, 29*, 22*, 66*, and 34*.
37 = (100101)_2
29 = (11101)_2
9 = (1001)_2 … 25 = (11001)_2 … 5 = (101)_2
22 = (10110)_2
14 = (1110)_2 … 18 = (10010)_2 … 10 = (1010)_2
30 = (11110)_2
66 = (1000010)_2
34 = (100010)_2
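These binary expansions determine where each inserted key lands: with N = 4 and h_i(key) = key mod (2^i * 4), only the last 2 or 3 bits matter. A quick check (assuming h(key) = key, as above):

```python
# Map each key to its h_0 and h_1 buckets: the last 2 and last 3 bits.
placements = {key: (bin(key), key % 4, key % 8)
              for key in [37, 29, 22, 66, 34]}
for key, (bits, h0, h1) in placements.items():
    print(f"{key} = {bits}: h0 -> bucket {h0}, h1 -> bucket {h1}")
```

So 37* and 29* share bucket 5 under h_1, 22* goes to bucket 6, and 66* and 34* both stay in bucket 2.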
Example (cont.): End of a Round
Let’s insert 50* = (110010)_2.
[Figure, before the insert: Level = 0, Next = 3. Primary pages (h_1/h_0):
000/00 → 32*; 001/01 → 9*, 25*; 010/10 → 66*, 18*, 10*, 34*;
011/11 → 31*, 35*, 7*, 11*, plus an overflow page holding 43*;
split images 100/00 → 44*, 36*; and further split-image buckets.]
[Figure, after the insert: h_0(50) = 10 is in the already-split range, so
h_1(50) = 010 places 50* in bucket 010, which is full; an overflow page is
allocated for 50*, and bucket Next = 3 (011) is split, ending the round.
Now Level = 1, Next = 0: 000/00 → 32*; 001/01 → 9*, 25*;
010/10 → 66*, 18*, 10*, 34*, plus overflow page 50*;
011/11 → 43*, 35*, 11*; 100/00 → 44*, 36*; and the remaining buckets.]
What Happened?
Summary
Hash-based indexes: best for equality searches,
cannot support range searches.
Static hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by
splitting a full bucket when a new data entry is to be
added to it.
Lots of duplicates may require overflow pages. Why?
The directory keeps track of the buckets. It doubles
periodically.
The directory can become large if the data is heavily skewed.
• Additional I/Os are needed if the directory does not fit in main memory
Summary (cont.)
A variation of extendible hashing is extensible
hashing, where we use the leading bits (on the left)
to discriminate, rather than the last bits.
We’ll stick with extendible hashing in this course.
Linear Hashing avoids a directory by splitting buckets
round-robin, and using overflow pages.
Overflow chains are not likely to be long.
• Over time, they may disappear.
Duplicates are handled easily.
Space utilization could be lower than that of Extendible
Hashing, since splits are not concentrated on ‘dense’ data
areas.