Merging Indices in Information Retrieval

This document discusses indexing and searching techniques for information retrieval. It names inverted files, suffix arrays, and signature files as the main indexing approaches. Inverted files are described in detail: their structure (a vocabulary plus occurrence lists), addressing schemes such as term, byte, and block offsets, their construction by building partial indexes in memory and merging them, and their advantages and disadvantages.

Uploaded by

Pravin Shinde

Module 5 – Indexing and Searching

Prof. Pravin [Link]

Indexing and Searching

• Indexing techniques:
– Inverted files
– Suffix arrays
– Signature files
• Technique used to search each type of index
• Other searching techniques

Overview

• Just as in traditional RDBMSs, searching for data may be costly
• In a relational database one can take (a lot of) advantage of the well-defined structure of (and constraints on) the data
• A linear scan of the data is not feasible for non-trivial (real-life) datasets
• Indices are not optional in IR (which is not to say that they are optional in an RDBMS)

Overview (continued)

• Traditional indices, e.g., B-trees, are not well suited for IR
• Main approaches:
– Inverted files (or lists)
– Suffix arrays
– Signature files

Inverted Files
• There are two main elements:
– Vocabulary – the set of unique terms
– Occurrences – where those terms appear
• The occurrences can be recorded as term offsets or byte offsets
• Term offsets are good for retrieving concepts such as proximity, whereas byte offsets allow direct access to the text

Vocabulary    Occurrences (byte offset)
…             …

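The difference between the two kinds of offsets can be illustrated with a minimal Python sketch. The function names `index_offsets` and `within_distance` are hypothetical, and the tokenization (runs of word characters, lowercased) and 1-based counting are assumptions made for the example, not something the slides prescribe.

```python
import re
from collections import defaultdict


def index_offsets(text):
    """Return (term_index, byte_index) for an ASCII text.

    term_index maps each word to its 1-based word positions;
    byte_index maps each word to the 1-based character offsets where it starts.
    """
    term_index = defaultdict(list)
    byte_index = defaultdict(list)
    for word_no, match in enumerate(re.finditer(r"\w+", text), start=1):
        word = match.group().lower()
        term_index[word].append(word_no)
        byte_index[word].append(match.start() + 1)
    return dict(term_index), dict(byte_index)


def within_distance(term_index, a, b, k):
    """True if some occurrence of a and some occurrence of b are at most k words apart."""
    return any(abs(i - j) <= k
               for i in term_index.get(a, [])
               for j in term_index.get(b, []))


text = "That house has a garden. The garden has many flowers."
term_index, byte_index = index_offsets(text)

# Term offsets answer proximity questions directly.
print(within_distance(term_index, "house", "garden", 3))

# Byte offsets let us jump straight to the occurrence in the text.
start = byte_index["garden"][0] - 1
print(text[start:start + len("garden")])
```

With term offsets the proximity test is a comparison of small integers; with byte offsets the same question would need the text itself to count the words in between, but fetching the occurrence is a single slice.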
Inverted Files

• The vocabulary (the set of indexed terms) is often several orders of magnitude smaller than the documents themselves (MBs vs. GBs)
• The space consumed by the occurrence lists is not trivial: each time a term appears, an entry must be added to its list in the inverted file
• This may lead to considerable index overhead

Inverted Files - layout
[Figure: Vocabulary (indexed terms, number of occurrences) → Occurrences lists (posting file). Figure note: "This could be a tree-like structure!"]

Example
• Text (the numbers are the byte offsets, counted from 1, at which each word starts):
1 6 12 16 18 26 30 37 41 46 55 59 67 71

That house has a garden. The garden has many flowers. The flowers are beautiful

• Inverted file
Vocabulary    Occurrences
beautiful     71
flowers       46, 59
garden        18, 30
house         6

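The example above can be reproduced with a short Python sketch. It is only an illustration: the stop list is an assumption (the slide's vocabulary simply omits those words), `build_inverted_file` is a hypothetical name, and offsets are counted from 1 with exactly one space between words.

```python
import re

# Hypothetical stop list: the slide's vocabulary omits these words.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_file(text, stopwords=STOPWORDS):
    """Map each non-stopword to the 1-based offsets at which it starts."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in stopwords:
            index.setdefault(term, []).append(match.start() + 1)
    # Keep the vocabulary sorted, as in the table.
    return dict(sorted(index.items()))


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
for term, offsets in build_inverted_file(text).items():
    print(term, offsets)
```

A dictionary of lists is the simplest in-memory stand-in for the vocabulary and its occurrence lists; a real system would keep the lists in a separate posting file.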
Inverted Files
• Coarser addressing may be used

Terms    Occurrences (block offset)
…        …

• All occurrences within a block (perhaps a whole document) are identified by the same block offset
• Much smaller overhead
• Some searches, e.g., proximity searches, will be less efficient: a linear scan of the candidate blocks may be needed, which is hardly feasible (especially on-line)

Space Requirements
• The space required for the vocabulary is rather small. According to Heaps' law, the vocabulary grows as O(n^β), where n is the size of the text and β is a constant between 0.4 and 0.6 in practice
• On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n)
• To reduce space requirements, a technique called block addressing is used

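To get a feel for how slowly the vocabulary grows compared with the text, Heaps' law can be evaluated as V = K · n^β. The sketch below is illustrative only: K = 30 and β = 0.5 are assumed values chosen for the example (β lies in the 0.4 to 0.6 range quoted above), and `heaps_vocabulary_size` is a hypothetical helper.

```python
def heaps_vocabulary_size(n_words, k=30.0, beta=0.5):
    """Estimate the vocabulary size V = k * n**beta (Heaps' law)."""
    return k * n_words ** beta


for n_words in (10**6, 10**8, 10**10):
    v = heaps_vocabulary_size(n_words)
    print(f"n = {n_words:>14,}  V ≈ {v:>12,.0f}  V/n = {v / n_words:.6f}")
```

Under these assumptions the ratio V/n shrinks as the text grows, which is why the vocabulary is cheap to store while the O(n) occurrence lists dominate the index size.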
Block Addressing
• The text is divided into blocks
• The occurrences point to the blocks where the word appears
• Advantages:
– the number of pointers is smaller than the number of positions
– all the occurrences of a word inside a single block are collapsed to one reference
• Disadvantages:
– an online search over the qualifying blocks is needed if exact positions are required

Example
• Text (divided into four blocks):
That house has a garden. The garden has many flowers. The flowers are beautiful
Block 1 Block 2 Block 3 Block 4

• Inverted file:
Vocabulary    Occurrences (block)
beautiful     4
flowers       3
garden        2
house         1

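Block addressing can be sketched in Python as follows. The block boundaries below are an assumption (the slide does not show where each block ends; these were chosen so that the result matches the table above), and the stop list and function names are hypothetical.

```python
import re

STOPWORDS = {"that", "has", "a", "the", "many", "are"}  # hypothetical stop list

# Assumed block boundaries; the slide does not show where each block ends.
blocks = [
    "That house has",
    "a garden. The garden",
    "has many flowers. The flowers",
    "are beautiful",
]


def build_block_index(blocks, stopwords=STOPWORDS):
    """Map each non-stopword to the list of 1-based block numbers containing it."""
    index = {}
    for block_no, block in enumerate(blocks, start=1):
        # A set collapses repeated occurrences inside one block to a single reference.
        for term in {w.lower() for w in re.findall(r"[A-Za-z]+", block)}:
            if term not in stopwords:
                index.setdefault(term, []).append(block_no)
    return dict(sorted(index.items()))


def find_positions(blocks, index, term):
    """Scan only the qualifying blocks to recover (block, offset-in-block) pairs."""
    hits = []
    for block_no in index.get(term, []):
        block = blocks[block_no - 1]
        for match in re.finditer(r"[A-Za-z]+", block):
            if match.group().lower() == term:
                hits.append((block_no, match.start() + 1))
    return hits


index = build_block_index(blocks)
print(index)
print(find_positions(blocks, index, "garden"))
```

The index stores one pointer for "garden" even though the word occurs twice; `find_positions` shows the price of that saving, namely a scan of the qualifying block whenever exact positions are needed.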
Inverted Files - construction

• Building the whole index in main memory is not feasible (it wouldn't fit, and swapping would be unbearable)
• Building it entirely on disk is not a good idea either (it would take a long time)
• One idea is to build several partial indices in main memory, one at a time, saving each to disk, and then merge all of them to obtain a single index

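The "build in memory, save to disk" half of this idea might look like the following Python sketch. Everything concrete here is an assumption made for illustration: the `max_postings` budget stands in for "as much as fits in main memory", partial indices are stored as JSON files, and documents are tokenized by whitespace.

```python
import json
import os
import tempfile


def build_partial_indices(docs, max_postings, out_dir):
    """Index (doc_id, text) pairs in memory, flushing a partial index to disk
    whenever max_postings entries have accumulated. Returns the file paths."""
    paths, index, n_postings = [], {}, 0

    def flush():
        nonlocal index, n_postings
        if not index:
            return
        path = os.path.join(out_dir, f"partial_{len(paths)}.json")
        with open(path, "w", encoding="utf-8") as f:
            # A sorted vocabulary makes the later merge step a simple sorted merge.
            json.dump(dict(sorted(index.items())), f)
        paths.append(path)
        index, n_postings = {}, 0

    for doc_id, text in docs:
        for term in sorted(set(text.lower().split())):
            index.setdefault(term, []).append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:
            flush()
    flush()  # save whatever is left in memory
    return paths


docs = [(1, "that house has a garden"),
        (2, "the garden has many flowers"),
        (3, "the flowers are beautiful")]
with tempfile.TemporaryDirectory() as tmp:
    for path in build_partial_indices(docs, max_postings=8, out_dir=tmp):
        with open(path, encoding="utf-8") as f:
            print(os.path.basename(path), json.load(f))
```

Because documents are processed in increasing `doc_id` order, every posting list in every partial index is already sorted, which is what the merge step relies on.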
Inverted Files - construction

• The procedure works as follows:

– Build and save partial indices I_1, I_2, …, I_n
– Merge I_j and I_{j+1} into a single partial index I_{j,j+1}
  • Merging indices means that their sorted vocabularies are merged, and if a term appears in both indices then the respective lists are merged (keeping the document order)
– Then indices I_{j,j+1} and I_{j+2,j+3} are merged into partial index I_{j,j+3}, and so on until a single index is obtained
– Several partial indices can be merged together at once

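The merging procedure can be sketched in Python as below. This is a simplified illustration under stated assumptions: partial indices are in-memory dictionaries with sorted keys (standing in for files on disk), posting lists hold document ids in increasing order, and `merge_two` and `merge_all` are hypothetical names.

```python
import heapq


def merge_two(index_a, index_b):
    """Merge two partial indices whose vocabularies are sorted and whose
    posting lists are sorted by document id."""
    merged = {}
    terms_a, terms_b = list(index_a), list(index_b)
    i = j = 0
    while i < len(terms_a) and j < len(terms_b):
        ta, tb = terms_a[i], terms_b[j]
        if ta == tb:
            # Term in both indices: merge the lists, keeping document order.
            merged[ta] = list(heapq.merge(index_a[ta], index_b[tb]))
            i += 1
            j += 1
        elif ta < tb:
            merged[ta] = list(index_a[ta])
            i += 1
        else:
            merged[tb] = list(index_b[tb])
            j += 1
    # Copy whatever remains of the longer vocabulary.
    for term in terms_a[i:]:
        merged[term] = list(index_a[term])
    for term in terms_b[j:]:
        merged[term] = list(index_b[term])
    return merged


def merge_all(partials):
    """Merge partial indices pairwise, level by level, until one remains."""
    level = list(partials)
    while len(level) > 1:
        level = [merge_two(level[k], level[k + 1]) if k + 1 < len(level) else level[k]
                 for k in range(0, len(level), 2)]
    return level[0] if level else {}


i1 = {"garden": [1], "house": [1]}
i2 = {"flowers": [2], "garden": [2]}
i3 = {"beautiful": [3], "flowers": [3]}
print(merge_all([i1, i2, i3]))
```

Each level halves the number of partial indices, so n partial indices need about log2(n) passes. Merging several indices at once, as the last step above suggests, would replace the two-way comparison with a multi-way one and reduce the number of passes.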
Thank You
