Module 5 – Indexing and Searching
Prof. Pravin
Indexing and Searching
• Indexing techniques:
– Inverted files
– Suffix arrays
– Signature files
• Technique used to search each type of
index
• Other searching techniques
Overview
• Just as in traditional RDBMSs, searching for data
may be costly
• In an RDB one can take (a lot of) advantage of
the well-defined structure of (and constraints
on) the data
• A linear scan of the data is not feasible for
non-trivial (real-life) datasets
• Indices are not optional in IR (which is not to say
that they are optional in an RDBMS)
Overview (continued)
• Traditional indices, e.g., B-trees, are not well
suited for IR
• Main approaches:
– Inverted files (or lists)
– Suffix arrays
– Signature files
Inverted Files
• There are two main elements:
– vocabulary – set of unique terms
– Occurrences – where those terms appear
• The occurrences can be recorded as
term (word) offsets or byte offsets
• Term offsets make it easy to answer
proximity queries, whereas byte offsets
allow direct access to the text
Vocabulary    Occurrences (byte offset)
…             …
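To make the point about term offsets concrete, here is a minimal Python sketch of a proximity test over word positions. The function names (`build_word_index`, `within`) and the tokenisation rule are assumptions chosen for illustration, not part of the slides.

```python
import re
from collections import defaultdict

def build_word_index(text):
    """Map each lowercased term to the list of its word (term) offsets, 0-based."""
    index = defaultdict(list)
    for word_pos, match in enumerate(re.finditer(r"\w+", text)):
        index[match.group().lower()].append(word_pos)
    return index

def within(index, a, b, k):
    """True if some occurrence of a and some occurrence of b are at most k words apart."""
    return any(abs(i - j) <= k
               for i in index.get(a, [])
               for j in index.get(b, []))

index = build_word_index("The garden has many flowers. The flowers are beautiful")
print(within(index, "garden", "flowers", 3))
print(within(index, "garden", "beautiful", 3))
```

With byte offsets alone, the same test would need to look at the text to count the words between two matches, which is why term offsets suit proximity queries better.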
Inverted Files
• The vocabulary (the set of indexed terms) is often
several orders of magnitude smaller than the
documents themselves (MBs vs GBs)
• The space consumed by the occurrence lists is
not trivial: every occurrence of a term adds an
entry to that term's list in the inverted file
• That may lead to a considerable index
overhead
Inverted Files - layout
Figure: layout of an inverted file
– Vocabulary: the indexed terms, each with its number of
occurrences (this could be a tree-like structure!)
– Posting file: the occurrence lists
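One way to picture this layout is sketched below in Python. It is an assumption-laden illustration: the vocabulary is kept as a sorted list searched with binary search (standing in for the tree-like structure the slide mentions), and the posting file is modelled as one flat list. The names `build_layout` and `lookup` and the toy data are hypothetical.

```python
from bisect import bisect_left

def build_layout(occurrences):
    """occurrences: dict mapping term -> sorted list of positions.

    Returns (terms, counts, starts, postings): terms is the sorted vocabulary,
    counts[i] is the number of occurrences of terms[i], and starts[i] is where
    its occurrence list begins inside the flat postings list.
    """
    terms = sorted(occurrences)
    counts, starts, postings = [], [], []
    for term in terms:
        starts.append(len(postings))
        counts.append(len(occurrences[term]))
        postings.extend(occurrences[term])
    return terms, counts, starts, postings

def lookup(layout, term):
    """Binary-search the vocabulary, then slice the term's list out of the postings."""
    terms, counts, starts, postings = layout
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    return postings[starts[i]:starts[i] + counts[i]]

# Toy occurrence data, made up for this sketch.
layout = build_layout({"house": [6], "garden": [18, 29],
                       "flowers": [45, 58], "beautiful": [70]})
print(lookup(layout, "garden"))
```

Keeping only the small vocabulary in memory and the postings in a separate file is the usual motivation for splitting the structure this way.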
Example
• Text:
1    6     12  16 18      25  29     36  40   45       54  58      66  70
That house has a  garden. The garden has many flowers. The flowers are beautiful
• Inverted file
Vocabulary Occurrences
beautiful 70
flowers 45, 58
garden 18, 29
house 6
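A minimal Python sketch of how such an inverted file could be built. The stopword list and the tokenisation rule are assumptions made for illustration (the slide only shows the four indexed terms). Offsets are 1-based positions in the string as typed here, so they can differ by one from the slide's figures depending on how punctuation is counted.

```python
import re
from collections import defaultdict

# Assumed stopword list: the words of the example that the slide does not index.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_inverted_file(text):
    """Return {term: [1-based offsets]} with the vocabulary in sorted order."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in STOPWORDS:
            index[term].append(match.start() + 1)
    return dict(sorted(index.items()))

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
inverted = build_inverted_file(text)
for term, offsets in inverted.items():
    print(term, offsets)
```

Because each offset points straight at the start of a word, a match can be shown in context without rescanning the text.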
Inverted Files
• Coarser addressing may be used
Terms    Occurrences (block offset)
…        …
• All occurrences within a block (perhaps a whole
document) are identified by the same block offset
• Much smaller overhead
• Some searches will be less efficient, e.g., proximity
searches. A linear scan may be needed, though it is hardly
feasible (especially online)
Space Requirements
• The space required for the vocabulary is rather small.
According to Heaps’ law the vocabulary grows as O(n^β),
where n is the text size and β is a constant between 0.4
and 0.6 in practice
• On the other hand, the occurrences demand much
more space. Since each word appearing in the text is
referenced once in that structure, the extra space is
O(n)
• To reduce space requirements, a technique called block
addressing is used
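To get a feel for these growth rates, the short Python sketch below evaluates Heaps' law in the form V = K · n^β. The values K = 10 and β = 0.5 are illustrative assumptions, not figures from the slides.

```python
def heaps_vocabulary(n_words, k=10.0, beta=0.5):
    """Estimated vocabulary size V = k * n^beta (k and beta are illustrative)."""
    return k * n_words ** beta

for n in (10**6, 10**9):
    v = heaps_vocabulary(n)
    # The occurrences grow linearly with n, the vocabulary only as n^beta.
    print(f"n = {n:,} words: about {v:,.0f} terms, {v / n:.4%} of n")
```

Under these assumptions the vocabulary becomes a smaller and smaller fraction of the text as the collection grows, while the occurrence lists keep pace with it.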
Block Addressing
• The text is divided into blocks
• The occurrences point to the blocks
where the word appears
• Advantages:
– pointers are smaller, since there are fewer blocks than positions
– all the occurrences of a word inside a single block
are collapsed to one reference
• Disadvantages:
– an online search over the qualifying blocks is needed
if exact positions are required
Example
• Text:
That house has a garden. The garden has many flowers. The flowers are
beautiful
Block 1 Block 2 Block 3 Block 4
• Inverted file:
Vocabulary Occurrences
beautiful 4
flowers 3
garden 2
house 1
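The block-addressed index can be sketched in Python as follows. The sketch assumes blocks of a fixed number of words (four here), which is only one possible way to cut the text, and it reuses an assumed stopword list. `positions_in_blocks` illustrates the extra scan over qualifying blocks that is needed when exact positions are wanted.

```python
import re
from collections import defaultdict

# Assumed stopword list: the words of the example that the slide does not index.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}

def build_block_index(text, words_per_block):
    """Split the text into word blocks and record, per term, the blocks it occurs in."""
    words = re.findall(r"[A-Za-z]+", text)
    blocks = [words[i:i + words_per_block]
              for i in range(0, len(words), words_per_block)]
    index = defaultdict(list)
    for block_no, block in enumerate(blocks, start=1):
        # A set collapses repeated occurrences inside a block to one reference.
        for term in {w.lower() for w in block} - STOPWORDS:
            index[term].append(block_no)
    return blocks, dict(sorted(index.items()))

def positions_in_blocks(blocks, index, term):
    """Scan only the qualifying blocks; return (block number, word offset in block)."""
    hits = []
    for block_no in index.get(term, []):
        for offset, word in enumerate(blocks[block_no - 1]):
            if word.lower() == term:
                hits.append((block_no, offset))
    return hits

text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
blocks, block_index = build_block_index(text, words_per_block=4)
print(block_index)
print(positions_in_blocks(blocks, block_index, "garden"))
```

Both occurrences of a repeated term inside one block cost a single pointer, which is the space saving; recovering them individually costs the scan.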
Inverted Files - construction
• Building the whole index in main memory is not
feasible for large collections (it wouldn’t fit, and
swapping would be unbearable)
• Building it entirely on disk is not a good idea
either (it would take a long time)
• One idea is to build several partial indices in
main memory, one at a time, saving them to
disk and then merging all of them to obtain a
single index
Inverted Files - construction
• The procedure works as follows:
– Build and save partial indices I1, I2, …, In
– Merge Ij and Ij+1 into a single partial index Ij,j+1
• Merging indices means that their sorted vocabularies are
merged, and if a term appears in both indices then the
respective lists are merged (keeping the document
order)
– Then indices Ij,j+1 and Ij+2,j+3 are merged into
partial index Ij,j+3, and so on until a
single index is obtained
– Several partial indices can be merged together at once
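The procedure above can be sketched in Python as follows. This is an in-memory illustration only: a real build would write each partial index to disk and stream the merges. It also assumes that occurrences are document numbers and that each partial index covers earlier documents than the next one. The function names are hypothetical.

```python
from collections import defaultdict

def build_partial_index(docs):
    """docs: list of (doc_id, text) in increasing doc_id order.
    Returns {term: [doc_ids]} with a sorted vocabulary."""
    index = defaultdict(list)
    for doc_id, text in docs:
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(sorted(index.items()))

def merge_two(left, right):
    """Merge two partial indices by walking both sorted vocabularies in step.
    left must cover earlier documents than right, so concatenating the lists
    of a shared term keeps the document order."""
    lt, rt = list(left), list(right)
    i = j = 0
    merged = {}
    while i < len(lt) and j < len(rt):
        if lt[i] == rt[j]:
            merged[lt[i]] = left[lt[i]] + right[rt[j]]
            i += 1
            j += 1
        elif lt[i] < rt[j]:
            merged[lt[i]] = left[lt[i]]
            i += 1
        else:
            merged[rt[j]] = right[rt[j]]
            j += 1
    for term in lt[i:]:
        merged[term] = left[term]
    for term in rt[j:]:
        merged[term] = right[term]
    return merged

def merge_all(partials):
    """Merge neighbouring indices pairwise, level by level, until one is left."""
    while len(partials) > 1:
        partials = [
            merge_two(partials[i], partials[i + 1]) if i + 1 < len(partials)
            else partials[i]
            for i in range(0, len(partials), 2)
        ]
    return partials[0]

chunks = [[(1, "that house has a garden")],
          [(2, "the garden has many flowers")],
          [(3, "the flowers are beautiful")]]
final = merge_all([build_partial_index(chunk) for chunk in chunks])
print(final["garden"], final["flowers"])
```

Merging only neighbouring indices is what lets the lists be concatenated rather than re-sorted.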
Thank You