Department of CSE
COURSE NAME: DBMS
COURSE CODE: 23AD2102R
Topic:
Index Structures, Indexing
and Hashing
Session-21
CREATED BY K. VICTOR BABU
What is Indexing?
Indexing is a data structure technique that allows you to quickly retrieve records from a database file.
Indexes are used to quickly locate data without having to search every record in multiple disk blocks.
The data that makes up a computerized database must be physically stored on some computer storage medium. The DBMS software can then retrieve, update, and process this data as needed.
Similar to Indexing in Textbooks
Indexes as Access Paths
A single-level index is an auxiliary file that makes it more efficient to search
for a record in the data file.
The index is usually specified on one field of the file (although it could be
specified on several fields)
One form of an index is a file of entries <field value, pointer to record>,
which is ordered by field value
The index is called an access path on the field.
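A minimal Python sketch of such an index follows. It assumes a hypothetical data file modeled as a list of records, with the list position standing in for a disk pointer; the field names and values are illustrative only.

```python
from bisect import bisect_left

# Hypothetical data file: records in arbitrary order. The list position
# stands in for a record pointer.
data_file = [
    {"eid": 30, "name": "Chen"},
    {"eid": 10, "name": "Asha"},
    {"eid": 20, "name": "Bao"},
]

def build_index(records, field):
    """Return index entries <field value, pointer>, ordered by field value."""
    return sorted((rec[field], ptr) for ptr, rec in enumerate(records))

def search(index, records, value):
    """Binary-search the index, then follow the pointer to the record."""
    # (value,) sorts just before any entry (value, ptr), so bisect_left
    # lands on the first entry whose field value is >= value.
    pos = bisect_left(index, (value,))
    if pos < len(index) and index[pos][0] == value:
        return records[index[pos][1]]
    return None

eid_index = build_index(data_file, "eid")
print(eid_index)
print(search(eid_index, data_file, 20))
```

The data file itself is never sorted; only the much smaller index is kept in order, which is what makes the binary search possible.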
What is an Index in a Database?
• An index is a small table with only two columns. The first column holds a copy of the primary or candidate key of a table. The second column holds a set of pointers, each giving the address of the disk block where that specific key value is stored.
An index:
• Takes a search key as input
• Efficiently returns a collection of matching records.
Types of Indexing
• Primary Indexing
  • Dense
  • Sparse
• Clustered Indexing
• Secondary Indexing
• Multilevel Indexing
Types of Indexing
Which indexing method is used?
Primary Indexing
• A primary index is an ordered file of fixed-length entries with two fields.
• The first field is the same as the primary key, and the second field is a pointer to the data block that holds that key.
• Primary indexing is further divided into two types:
• Dense Index
• Sparse Index
Dense Index
An index record is created for every search-key value in the database.
Searching is faster.
Requires more space to store the index records.
No. of records in the index table (IT) = No. of records in the data file on hard disk (HD)
Sparse Index
• A sparse index contains entries only for the anchor records (the first record of each block).
• To locate a record, we find the index record with the largest search-key value <= the search-key value we are looking for.
• We start at the record pointed to by that index record and follow the pointers in the file sequentially until we find the desired record.
No. of records in IT = No. of blocks in HD
Time complexity = log₂N + 1 block accesses (binary search over the index, plus one access to the data block)
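The sparse-index lookup can be sketched in Python as below. This is an illustration under assumed data: a hypothetical ordered data file of three blocks with three integer keys each, and block numbers standing in for block pointers.

```python
from bisect import bisect_right

# Hypothetical ordered data file, split into blocks of 3 records each.
blocks = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

# One index entry per block: (anchor key, block number).
sparse_index = [(block[0], n) for n, block in enumerate(blocks)]
anchors = [key for key, _ in sparse_index]

def lookup(key):
    """Return (block number, position in block), or None if the key is absent."""
    # Find the index entry with the largest anchor <= key.
    i = bisect_right(anchors, key) - 1
    if i < 0:
        return None  # key is smaller than every anchor
    block_no = sparse_index[i][1]
    # Scan the block sequentially until the key is found or passed.
    for pos, k in enumerate(blocks[block_no]):
        if k == key:
            return (block_no, pos)
        if k > key:
            break
    return None

print(lookup(50))
```

The index holds one entry per block rather than one per record, so it is smaller than a dense index, at the price of a short sequential scan inside the block.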
Indexes as Access Paths
The index file usually occupies considerably fewer disk blocks than the
data file because its entries are much smaller
A binary search on the index yields a pointer to the file record
Indexes can also be characterized as dense or sparse
A dense index has an index entry for every search key value
(and hence every record) in the data file.
A sparse (or nondense) index, on the other hand, has index
entries for only some of the search values
Clustered Index
• A clustering index is defined on an ordered data file. The data file is ordered on a non-key field.
• In some cases, the index is created on non-primary-key columns, which may not be unique for each record.
• In such cases, to identify the records faster, we group two or more columns together to get unique values and create an index on them. This method is known as the clustering index.
• Basically, records with similar characteristics are grouped together, and indexes are created for these groups.
Indexes as Access Paths
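• Each data block contains a block pointer to the next block that holds records with the same clustering field value.
• The search cost is slightly higher, because records with the same value may span several blocks.
• Uses a sparse index.
Time complexity = log₂N + 1 + 1 + ... block accesses (one more access for each additional block with the same clustering field value)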
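A rough Python sketch of a clustering index is given below. It assumes a hypothetical data file ordered on a non-key department number, two records per block, with one index entry per distinct value pointing to the first block that contains it.

```python
from bisect import bisect_left

# Hypothetical data file ordered on the non-key field dept (first element),
# stored 2 records per block.
blocks = [
    [(1, "e1"), (1, "e2")],
    [(1, "e3"), (2, "e4")],
    [(3, "e5"), (3, "e6")],
]

def build_clustering_index(blocks):
    """One entry per distinct dept value: (dept, first block holding it)."""
    index = []
    for block_no, block in enumerate(blocks):
        for dept, _ in block:
            if not index or index[-1][0] != dept:
                index.append((dept, block_no))
    return index

def find_all(index, blocks, dept):
    """Return every record name with the given dept value."""
    pos = bisect_left(index, (dept,))
    if pos == len(index) or index[pos][0] != dept:
        return []
    result = []
    # Start at the first block for this value and keep reading following
    # blocks while they still contain matching records.
    for block in blocks[index[pos][1]:]:
        matches = [name for d, name in block if d == dept]
        if not matches:
            break
        result.extend(matches)
    return result

clustering_index = build_clustering_index(blocks)
print(clustering_index)
print(find_all(clustering_index, blocks, 1))
```

Department 1 spans two blocks here, which is the reason for the extra block accesses in the cost estimate.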
Secondary Indexing
Unordered File with Secondary Key: Secondary Index Example
• File is ordered on Eid (primary key).
• The search is done using Pno.
• So the index table maintains Pno as its key and is kept in sorted order.
Time complexity = log₂N + 1
Secondary Indexing
Unordered File with Non-key: Secondary Index Example
• The search is done by Ename (non-key).
• The index file contains Ename as its key and is ordered.
• It maintains an intermediate layer that contains blocks of record pointers.
• A pointer in the IT points to a particular block, and the record pointers in that block point to the records in the HD.
Time complexity = log₂N + 1 + 1
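The extra level of indirection for a non-key secondary index might look like the following Python sketch. The employee data and helper names are hypothetical; a plain list of positions stands in for a block of record pointers.

```python
from bisect import bisect_left

# Hypothetical unordered data file of (eid, ename) records; the list
# position stands in for a record pointer.
data_file = [
    (3, "Bao"), (1, "Asha"), (4, "Bao"), (2, "Chen"), (5, "Asha"),
]

def build_secondary_index(records):
    """Ordered entries (ename, block of record pointers)."""
    pointer_blocks = {}
    for ptr, (_, ename) in enumerate(records):
        pointer_blocks.setdefault(ename, []).append(ptr)
    return sorted(pointer_blocks.items())

def find_by_name(index, records, ename):
    """Binary-search the index, then follow every pointer in the block."""
    pos = bisect_left(index, (ename,))
    if pos == len(index) or index[pos][0] != ename:
        return []
    return [records[ptr] for ptr in index[pos][1]]

ename_index = build_secondary_index(data_file)
print(find_by_name(ename_index, data_file, "Asha"))
```

Each name appears once in the index even when several records share it, so the index stays ordered on distinct values while the pointer block lists all matches.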
Types of Single-Level Indexes
Primary Index
• The index is an ordered file.
• The data file is ordered on a key field (distinct value for each record).
• File content: <key field, pointer>
• One index entry for each disk block. The key field value is that of the first record in the block, which is called the block anchor.
• Nondense (sparse) index.
Clustering Index
• The index is an ordered file.
• The data file is ordered on a non-key field (no distinct value for each record).
• File content: <key field, pointer>
• One index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.
• Nondense (sparse) index.
Secondary Index
• The index is an ordered file that provides a secondary means of accessing a file.
• The index may be on a candidate key with a unique value in every record, or on a non-key field with duplicate values.
• File content: <key field, pointer>
• The index is an ordered file with two fields: (1) a field value and (2) either a block pointer or a record pointer.
• If on a key, dense. If on a non-key, dense or sparse.
Multi-Level Indexes
A Two-level Primary Index
Dynamic Multilevel Indexes Using B-trees
and B+-trees
Multi-Level Indexes
• Because a single-level index is an ordered file, we can create a primary
index to the index itself;
• In this case, the original index file is called the first-level index and the
index to the index is called the second-level index.
• We can repeat the process, creating a third, fourth, ..., top level until all
entries of the top level fit in one disk block
• A multi-level index can be created for any type of first-level index
(primary, secondary, clustering) as long as the first-level index consists
of more than one disk block
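As a worked illustration of how the levels shrink, the short Python sketch below counts the blocks at each level. The entry count (30,000) and fan-out (68 entries per block) are assumed figures chosen for the example, not values from these slides.

```python
from math import ceil

def index_levels(num_entries, fan_out):
    """Blocks needed at each index level, first level first.

    fan_out is the number of index entries that fit in one disk block.
    Levels are added until the top level fits in a single block.
    """
    levels = []
    entries = num_entries
    while True:
        blocks = ceil(entries / fan_out)
        levels.append(blocks)
        if blocks <= 1:
            return levels
        # The next level up has one entry per block of this level.
        entries = blocks

print(index_levels(30000, 68))
```

With these assumed numbers the index needs three levels, so a search reads three index blocks plus one data block, compared with about nine block reads for a binary search over the 442 first-level blocks alone.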
A Two-Level Primary Index
Multi-Level Indexes
• Such a multi-level index is a form of search tree
• However, insertion and deletion of new index entries is a
severe problem because every level of the index is an
ordered file.
Multi-Level Indexes
Tree structure
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
• Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem.
• These data structures leave space in each tree node (disk block) to allow for new index entries.
• They are variations of search trees that allow efficient insertion and deletion of search values.
• In B-tree and B+-tree data structures, each node corresponds to a disk block.
• Each node is kept between half-full and completely full.
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
An insertion into a node that is not full is quite efficient
If a node is full the insertion causes a split into two nodes
Splitting may propagate to other tree levels
A deletion is quite efficient if a node does not become less than half
full
If a deletion causes a node to become less than half full, it must be
merged with neighboring nodes
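The split on insertion can be sketched for a single leaf node as follows. This assumes a B+-tree leaf that holds at most `capacity` keys and copies the first key of the right half up as the separator; that is one common convention, and the slides do not fix one. The function name is hypothetical.

```python
from bisect import insort

def insert_into_leaf(keys, key, capacity):
    """Insert key into a sorted leaf, splitting it if it overflows.

    Returns (leaf, None, None) when no split is needed, otherwise
    (left, separator, right), where separator is the key the parent
    node would receive.
    """
    keys = list(keys)
    insort(keys, key)  # keep the leaf in sorted order
    if len(keys) <= capacity:
        return keys, None, None
    # Overflow: split so that both halves are at least half full.
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right[0], right

print(insert_into_leaf([10, 20, 40], 30, capacity=4))
print(insert_into_leaf([10, 20, 30, 40], 25, capacity=4))
```

Passing the separator to the parent is what can make a split propagate: if the parent is also full, it splits in the same way.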
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
• A B-tree is a balanced tree.
• In multilevel indexing, inserting and deleting a record is difficult, because the corresponding entries in the index tables also need to be changed.
• B-trees make these tasks simpler.
• Elements are kept in sorted order.
What is Hashing?
• In a huge database structure, it is very inefficient to search all the index values to reach the desired data.
• The hashing technique is used to calculate the direct location of a data record on the disk without using an index structure.
• Data is stored in the data blocks whose addresses are generated by the hashing function.
• The memory locations where these records are stored are known as data buckets or data blocks.
Types of Hashing
Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically
a disk block).
In a hash file organization we obtain the bucket of a record directly from its
search-key value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of
all bucket addresses B.
The hash function is used to locate records for access, insertion, and deletion.
Records with different search-key values may be mapped to the same bucket; thus
the entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
Hash file organization of the account file, using branch-name as the key (see the figure on the next slide).
There are 10 buckets.
The binary representation of the ith character of the alphabet is assumed to be the integer i.
The hash function returns the sum of the binary representations of the characters modulo 10.
E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
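The example hash function can be written out in a few lines of Python. The sketch assumes that letters are counted case-insensitively (a = 1, ..., z = 26) and that spaces and other non-letters are ignored, which reproduces the three values quoted above.

```python
def h(branch_name, num_buckets=10):
    """Sum of letter positions (a=1 ... z=26) modulo the number of buckets."""
    total = sum(
        ord(c) - ord("a") + 1
        for c in branch_name.lower()
        if "a" <= c <= "z"  # assumption: skip spaces and non-letters
    )
    return total % num_buckets

for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, h(name))
```

Round Hill and Brighton land in the same bucket even though the keys differ, which is why the whole bucket has to be searched on lookup.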
Example of Hash File Organization
Hash file organization of account file, using branch-name as key (see previous slide for details).
Hash Functions
The worst hash function maps all search-key values to the same bucket; this makes
access time proportional to the number of search-key values in the file.
An ideal hash function is uniform, i.e., each bucket is assigned the same
number of search-key values from the set of all possible values.
An ideal hash function is also random, so each bucket will have the same number of
records assigned to it irrespective of the actual distribution of search-key values
in the file.
Typical hash functions perform computation on the internal binary
representation of the search key.
For example, for a string search key, the binary representations of all the
characters in the string could be added and the sum modulo the number of
buckets could be returned.
Handling of Bucket Overflows
Bucket overflow can occur because of:
• Insufficient buckets
• Skew in the distribution of records. This can occur for two reasons:
* multiple records have the same search-key value
* the chosen hash function produces a non-uniform distribution of key values
Although the probability of bucket overflow can be reduced, it cannot be
eliminated; it is handled by using overflow buckets.
Handling of Bucket Overflows
Overflow chaining – the overflow buckets of a given bucket are chained together in a
linked list.
Above scheme is called closed hashing.
An alternative, called open hashing, which does not use overflow buckets, is
not suitable for database applications.
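Overflow chaining can be sketched in Python as below. The class names, the bucket capacity of 2, and the hash function h(K) = K mod (number of buckets) on integer keys are all assumptions made for the illustration.

```python
class Bucket:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []     # (key, value) pairs stored in this bucket
        self.overflow = None  # next overflow bucket in the chain, if any

class StaticHashFile:
    def __init__(self, num_buckets=4, capacity=2):
        self.capacity = capacity
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _home(self, key):
        # Assumption: integer keys hashed with K mod (number of buckets).
        return self.buckets[key % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._home(key)
        # Walk the chain until a bucket with free space is found,
        # adding a new overflow bucket at the end if necessary.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(self.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, value))

    def search(self, key):
        bucket = self._home(key)
        while bucket is not None:  # scan the home bucket, then its chain
            for k, v in bucket.records:
                if k == key:
                    return v
            bucket = bucket.overflow
        return None

    def chain_length(self, key):
        """Number of buckets (home plus overflow) a search for key may visit."""
        n, bucket = 0, self._home(key)
        while bucket is not None:
            n += 1
            bucket = bucket.overflow
        return n

f = StaticHashFile(num_buckets=4, capacity=2)
for k in (0, 4, 8):  # all three keys hash to bucket 0
    f.insert(k, f"rec{k}")
print(f.search(8), f.chain_length(8))
```

The number of home buckets never changes; only the chains grow, which is the behaviour that degrades as the file grows.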
Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash
file structure.
Strictly speaking, hash indices are always secondary indices: if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary.
However, we use the term hash index to refer to both secondary index structures and
hash organized files.
Example of Hash Index
Deficiencies of Static Hashing
In static hashing, function h maps search-key values to a fixed set B of bucket
addresses.
Databases grow with time. If the initial number of buckets is too small, performance
will degrade due to too many overflows.
If the file size at some point in the future is anticipated and the number of buckets
is allocated accordingly, a significant amount of space will be wasted initially.
If the database shrinks, space will again be wasted.
One option is periodic re-organization of the file with a new hash function, but it
is very expensive.
These problems can be avoided by using techniques that allow the number of
buckets to be modified dynamically.
Dynamic Hashing
Good for a database that grows and shrinks in size
Allows the hash function to be modified dynamically
Extendible hashing – one form of dynamic hashing
Hash function generates values over a large range, typically b-bit integers, with b = 32.
At any time, use only a prefix of the hash value to index into a table of bucket addresses.
Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
Bucket address table size = 2^i. Initially i = 0.
Value of i grows and shrinks as the size of the database grows and shrinks.
Multiple entries in the bucket address table may point to the same bucket.
Thus, the actual number of buckets is at most 2^i, and is often smaller.
* The number of buckets also changes dynamically due to coalescing and splitting of buckets.
General Extendible Hash Structure
In this structure, i_2 = i_3 = i, whereas i_1 = i - 1 (see next slide for
details)
Use of Extendible Hash Structure
Each bucket j stores a value i_j; all the entries that point to the same bucket have the same values on the first i_j bits.
To locate the bucket containing search-key K_j:
1. Compute h(K_j) = X
2. Use the first i high-order bits of X as a displacement into the bucket address table, and follow the pointer to the appropriate bucket
To insert a record with search-key value K_j:
• Follow the same procedure as look-up and locate the bucket, say j.
• If there is room in bucket j, insert the record in the bucket.
• Else the bucket must be split and the insertion re-attempted (next slide).
* Overflow buckets are used instead in some cases (will see shortly)
Updates in Extendible Hash Structure
To split a bucket j when inserting a record with search-key value K_j:
If i > i_j (more than one pointer to bucket j):
• Allocate a new bucket z, and set i_j and i_z to the old i_j + 1.
• Make the second half of the bucket address table entries pointing to j point to z.
• Remove and reinsert each record in bucket j.
• Recompute the new bucket for K_j and insert the record in the bucket (further splitting is required if the bucket is still full).
If i = i_j (only one pointer to bucket j):
• Increment i and double the size of the bucket address table.
• Replace each entry in the table by two entries that point to the same bucket.
• Recompute the new bucket address table entry for K_j. Now i > i_j, so use the first case above.
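The look-up, insert, and split steps can be put together in a compact Python sketch. This is an illustration under stated assumptions rather than a reference implementation: the class and method names are hypothetical, CRC-32 stands in for the b-bit hash function by default, deletion and coalescing are left out, and a bucket that can no longer be split simply grows instead of chaining a real overflow bucket.

```python
import zlib

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # i_j: prefix bits shared by its keys
        self.keys = []

class ExtendibleHash:
    def __init__(self, bucket_size=2, bits=32, hash_fn=None):
        self.bucket_size = bucket_size
        self.bits = bits            # b: width of the hash value
        self.global_depth = 0       # i: prefix length currently in use
        self.table = [Bucket(0)]    # bucket address table with 2**i entries
        # Assumption: CRC-32 stands in for a real b-bit hash function.
        self.hash_fn = hash_fn or (lambda k: zlib.crc32(repr(k).encode()))

    def _slot(self, key):
        # Use the first i high-order bits of h(K) as the table offset.
        x = self.hash_fn(key) & ((1 << self.bits) - 1)
        return x >> (self.bits - self.global_depth)

    def lookup(self, key):
        return key in self.table[self._slot(key)].keys

    def insert(self, key):
        while True:
            bucket = self.table[self._slot(key)]
            if len(bucket.keys) < self.bucket_size:
                bucket.keys.append(key)
                return
            if bucket.local_depth == self.bits:
                # No hash bits left to split on: stand-in for an overflow bucket.
                bucket.keys.append(key)
                return
            if bucket.local_depth == self.global_depth:
                # Only one pointer to the bucket: double the table first,
                # replacing each entry by two entries to the same bucket.
                self.table = [b for b in self.table for _ in range(2)]
                self.global_depth += 1
            self._split(bucket)  # then retry the insertion

    def _split(self, bucket):
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # The entries pointing to the bucket are contiguous; redirect the
        # second half of them to the new bucket.
        slots = [s for s, b in enumerate(self.table) if b is bucket]
        for s in slots[len(slots) // 2:]:
            self.table[s] = new_bucket
        # Remove and reinsert each record of the split bucket.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.table[self._slot(k)].keys.append(k)

# Usage with 4-bit integer keys hashed to themselves (an assumed toy setup).
eh = ExtendibleHash(bucket_size=2, bits=4, hash_fn=lambda k: k)
for key in (0b0000, 0b0001, 0b1000, 0b0100):
    eh.insert(key)
print(eh.global_depth, [b.keys for b in eh.table])
```

In the toy run, the last two table entries still share one bucket, which shows how the number of buckets can stay below the size of the bucket address table.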
Updates in Extendible Hash Structure
When inserting a value, if the bucket is still full after several splits (that is, i reaches some limit b), create an overflow bucket instead of splitting the bucket address table further.
To delete a key value:
• Locate it in its bucket and remove it.
• The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).
• Coalescing of buckets can be done (a bucket can coalesce only with a "buddy" bucket having the same value of i_j and the same i_j - 1 prefix, if it is present).
• Decreasing the bucket address table size is also possible.
* Note: decreasing the bucket address table size is an expensive operation and should be done only if the number of buckets becomes much smaller than the size of the table.
Example
Initial Hash structure, bucket size = 2
Example
Hash structure after insertion of one Brighton and two Downtown records
Example
Hash structure after insertion of Mianus record
Example
Hash structure after insertion of three Perryridge records
Example
Hash structure after insertion of Redwood and Round Hill records
Extendible Hashing vs. Other Schemes
Benefits of extendible hashing:
• Hash performance does not degrade with growth of the file
• Minimal space overhead
Disadvantages of extendible hashing:
• Extra level of indirection to find the desired record
• Bucket address table may itself become very big (larger than memory)
* Need a tree structure to locate the desired record in the structure!
• Changing the size of the bucket address table is an expensive operation
Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows.
Comparison of Ordered Indexing and
Hashing
Cost of periodic re-organization
Relative frequency of insertions and deletions
Is it desirable to optimize average access time at the expense of worst-case access
time?
Expected type of queries:
Hashing is generally better at retrieving records having a specified value of the
key.
If range queries are common, ordered indices are to be preferred
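The query-type trade-off can be seen in a small Python sketch. It uses a `dict` (a hash table) as a stand-in for a hash index and a sorted list with `bisect` as a stand-in for an ordered index; the keys and record names are made up.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records keyed by an integer search key.
records = {17: "r17", 4: "r4", 42: "r42", 23: "r23", 8: "r8"}

hash_index = dict(records)        # hash-based: direct lookup by exact key
ordered_index = sorted(records)   # ordered: keys kept in sorted order

def point_lookup(key):
    """Exact-match query: one hash computation, no ordering needed."""
    return hash_index.get(key)

def range_query(lo, hi):
    """Range query lo <= key <= hi: two binary searches, then a scan."""
    start = bisect_left(ordered_index, lo)
    end = bisect_right(ordered_index, hi)
    return [records[k] for k in ordered_index[start:end]]

print(point_lookup(23))
print(range_query(5, 25))
```

The hash index gives no help with the range query, because neighbouring keys are scattered across buckets, whereas the ordered index returns the range as one contiguous slice.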
ACTIVITIES/ CASE STUDIES/ IMPORTANT FACTS
RELATED TO THE SESSION
Consider a dynamic hashing approach for 4-bit integer keys:
1. There is a main hash table of size 4.
2. The 2 least significant bits of a key are used to index into the main hash table.
3. Initially, the main hash table entries are empty.
4. Thereafter, when more keys are hashed into it, to resolve collisions, the set of all keys corresponding to a main hash table entry is organized as a binary tree that grows on demand.
5. First, the 3rd least significant bit is used to divide the keys into left and right subtrees.
6. To resolve more collisions, each node of the binary tree is further subdivided into left and right subtrees based on the 4th least significant bit.
7. A split is done only if it is needed, i.e., only when there is a collision.
Consider the following state of the hash table.
SUMMARY
Hashing is a DBMS technique for locating the needed data on disk
without using an index structure. Hashing is used to index and retrieve
items in a database because searching for a specific item using a shorter
hashed key is faster than searching using the original value.
SELF-ASSESSMENT QUESTIONS
1. What is hashing?
a) A data structure for storing key-value pairs
b) A technique for converting data of arbitrary size to a fixed size
c) A process of compressing data to save space
d) A method for encrypting data
2. Which of the following is not a suitable use case for hashing?
a) Password storage
b) Data validation
c) Data encryption
d) Sorting large datasets
TERMINAL QUESTIONS
1. Can you explain the difference between static and dynamic hashing,
and when each is appropriate?
2. What is collision handling, and how is it handled in hashing-based
index structures?
3. How does the choice of hash function affect the performance of a
hashing-based index?
4. What is a primary index, and how is it implemented using hashing in
DBMS?
5. How does extendible hashing differ from linear and quadratic
probing?
THANK YOU
Team – DBMS