The Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
  – precise semantics
  – neat formalism
• Example: q = ka ∧ (kb ∨ ¬kc)
  = (ka ∧ kb) ∨ (ka ∧ ¬kc)   (applying the distributive law gives the disjunctive normal form, DNF)
• Terms are either present or absent, so w_{i,j} ∈ {0, 1}
Simple Query Language: Boolean
– Terms + Connectors (or operators)
– terms
  ● words
  ● normalized (stemmed) words
  ● phrases
  ● thesaurus terms
– connectors
  ● AND
  ● OR
  ● NOT
Boolean Queries
● Cat
● Cat OR Dog
● Cat AND Dog
● (Cat AND Dog)
● (Cat AND Dog) OR Collar
● (Cat AND Dog) OR (Collar AND Leash)
● (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries
● (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works: a document containing Cat and Collar, Cat and Leash, Dog and Collar, or Dog and Leash (with or without the other terms)
Boolean Queries
● (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations works: Cat alone, Dog alone, Cat and Dog, Collar alone, Leash alone, or Collar and Leash
Boolean Logic
(Venn diagrams: C = ¬A, C = A ∩ B, C = A ∪ B)
De Morgan's Laws:
¬(A ∩ B) = ¬A ∪ ¬B
¬(A ∪ B) = ¬A ∩ ¬B
Boolean Queries
– Usually expressed with INFIX operators in IR
  ● ((a AND b) OR (c AND b))
– NOT is a UNARY PREFIX operator
  ● ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  ● (a AND b AND c AND d)
– Some rules (De Morgan revisited; see the evaluation sketch below)
  ● NOT(a) AND NOT(b) = NOT(a OR b)
  ● NOT(a) OR NOT(b) = NOT(a AND b)
  ● NOT(NOT(a)) = a
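Since the model is set-theoretic, these connectors can be evaluated directly as set operations over an inverted index. The following is a minimal Python sketch with hypothetical postings and document ids, shown only to illustrate the mapping AND/OR/NOT -> intersection/union/complement; it is not a full query parser.

# Minimal sketch: Boolean retrieval over a toy inverted index,
# where each term maps to the set of documents that contain it.
all_docs = {1, 2, 3, 4, 5, 6}                      # hypothetical document ids
index = {                                          # hypothetical postings lists
    "a": {1, 2, 4},
    "b": {2, 3, 4},
    "c": {3, 5},
}

def AND(x, y): return x & y                        # set intersection
def OR(x, y):  return x | y                        # set union
def NOT(x):    return all_docs - x                 # complement w.r.t. the collection

# ((a AND b) OR (c AND (NOT b)))
result = OR(AND(index["a"], index["b"]),
            AND(index["c"], NOT(index["b"])))
print(result)                                      # {2, 4, 5}

# De Morgan revisited: NOT(a) AND NOT(b) = NOT(a OR b)
assert AND(NOT(index["a"]), NOT(index["b"])) == NOT(OR(index["a"], index["b"]))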
Boolean queries
● Small variations in a query can generate very different results
  – data AND compression AND retrieval
  – text AND compression AND retrieval
● The user should be able to pose complex queries like:
  – (text OR data OR image) AND
    (compression OR compaction OR decompression) AND
    (archiving OR retrieval OR storage)
  – ...but many users are not able (or willing)...
Ranked queries
● Rather than seeking exact Boolean answers, non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant
● Text, data, image, compression, compaction, archiving, storage, retrieval...
Ranked queries
● It would be useless to convert a list of words to a Boolean query
  – connecting with AND -> too few documents
  – connecting with OR -> too many documents
● Solution: a ranked response
  – A heuristic is applied to measure the similarity of each document to the query
  – Documents are ranked according to similarity
Processing ranked queries
● How to assign a similarity measure to each document that indicates how closely it matches a query?
Ranking strategies
● Simple techniques
  – Count the number of query terms that appear somewhere in the document
    ● A document that contains 5 query terms is ranked higher than a document that contains 3 query terms
● More advanced techniques
  – Cosine measure
    ● Takes into account the lengths of the documents, etc.
Coordinate matching
● Count the number of query terms that appear in each document
● The more terms that appear, the more likely it is that the document is relevant
● A hybrid between a conjunctive AND query (all terms) and a disjunctive OR query (≥ 1 term)
  – A document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them (see the sketch below)
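A minimal Python sketch of coordinate matching over assumed whitespace-tokenized toy documents; the score is simply the number of distinct query terms that the document shares with the query.

docs = {                                   # hypothetical toy collection
    "d1": "pease porridge hot pease porridge cold",
    "d2": "pease porridge in the pot",
    "d3": "nine days old",
}
query = "hot porridge pot"

def coordinate_match(query, text):
    q_terms = set(query.split())
    d_terms = set(text.split())
    return len(q_terms & d_terms)          # number of shared distinct terms

for d in sorted(docs, key=lambda d: coordinate_match(query, docs[d]), reverse=True):
    print(d, coordinate_match(query, docs[d]))
# d1 and d2 each match 2 of the 3 query terms, d3 matches none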
Inner product similarity
● Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
  – binary weights: term present / not present
● The similarity measure of a document d_j with a query q is expressed as
  sim(q, d_j) = q ⋅ d_j
● The inner product of two n-vectors X and Y:
  X ⋅ Y = ∑_{i=1..n} x_i y_i
Example document collection
j Document dj
1 Pease Porridge hot, pease porridge cold.
2 Pease porridge in the pot.
3 Nine days old.
4 In the pot cold, in the pot hot.
5 Pease porridge, pease porridge.
6 Eat the lot.
Example
j Document vectors (wi,j)
col day eat hot lot nin old pea por pot
1 1 0 0 1 0 0 0 1 1 0
2 0 0 0 0 0 0 0 1 1 1
3 0 1 0 0 0 1 1 0 0 0
4 1 0 0 1 0 0 0 0 0 1
5 0 0 0 0 0 0 0 1 1 0
6 0 0 1 0 1 0 0 0 0 0
Hot porridge 0 0 0 1 0 0 0 0 1 0
Inner product, example
● query vector ("hot porridge"):
  – (0,0,0,1,0,0,0,0,1,0)
● document vector (d1):
  – (1,0,0,1,0,0,0,1,1,0)
● sim("hot porridge", d1) = 2 (see the sketch below)
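The same computation for the whole example collection, as a small Python sketch; the binary vectors are copied from the table above (vocabulary order: col day eat hot lot nin old pea por pot) and the inner product is a plain sum of products.

docs = {
    1: (1, 0, 0, 1, 0, 0, 0, 1, 1, 0),   # Pease porridge hot, pease porridge cold.
    2: (0, 0, 0, 0, 0, 0, 0, 1, 1, 1),   # Pease porridge in the pot.
    3: (0, 1, 0, 0, 0, 1, 1, 0, 0, 0),   # Nine days old.
    4: (1, 0, 0, 1, 0, 0, 0, 0, 0, 1),   # In the pot cold, in the pot hot.
    5: (0, 0, 0, 0, 0, 0, 0, 1, 1, 0),   # Pease porridge, pease porridge.
    6: (0, 0, 1, 0, 1, 0, 0, 0, 0, 0),   # Eat the lot.
}
query = (0, 0, 0, 1, 0, 0, 0, 0, 1, 0)   # "hot porridge"

def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

for j, d in docs.items():
    print(j, inner_product(query, d))
# d1 scores 2; d2, d4 and d5 score 1; d3 and d6 score 0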
Drawbacks
● Takes no account of term frequency
  – documents with many occurrences of a term should be favored
● Takes no account of term scarcity
  – rare terms should have more weight
● Long documents with many terms are automatically favored
  – they are likely to contain more of any given list of query terms
Solutions
● Term frequency
  – The binary "present" / "not present" judgment can be replaced with an integer indicating how many times the term appears in the document
  – freq_{i,j}: within-document frequency of term k_i in document d_j
  – sim("hot porridge", d1) = (0,0,0,1,0,0,0,0,1,0) ⋅ (1,0,0,1,0,0,0,2,2,0) = 3
Solutions
● Term frequency
  – This favors long documents over short ones, so we usually use the normalized frequency of term k_i in document d_j:
    f_{i,j} = freq_{i,j} / max_l freq_{l,j}
  – where the max is computed over all terms mentioned in the text of document d_j
  – We will call this measure TF (term frequency); see the sketch below
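A small Python sketch of the raw and normalized within-document frequencies, using document d1 of the example collection as toy input (left unstemmed here for readability).

from collections import Counter

d1 = "pease porridge hot pease porridge cold".split()
freq = Counter(d1)                       # raw within-document frequencies freq_{i,j}
max_freq = max(freq.values())            # frequency of the most frequent term in d1

tf = {term: count / max_freq for term, count in freq.items()}   # f_{i,j}
print(freq)   # Counter({'pease': 2, 'porridge': 2, 'hot': 1, 'cold': 1})
print(tf)     # {'pease': 1.0, 'porridge': 1.0, 'hot': 0.5, 'cold': 0.5}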
Solutions
● More generally, a term k_i can be assigned
  – in a document d_j: a document-term weight w_{i,j}
  – in a query q: a query-term weight w_{i,q}
● The similarity measure is the inner product of the document vector and the query vector:
  sim(q, d_j) = ∑_{i=1..n} w_{i,q} ⋅ w_{i,j}
Solutions
● It is normal to assign w_{i,q} = 0 if k_i does not appear in q, so the measure can be stated as
  sim(q, d_j) = ∑_{k_i ∈ q} w_{i,q} ⋅ w_{i,j}
Inverse document frequency
● If only the term frequency is taken into account and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words.
● Solution: reduce the weights for terms that appear in many documents.
● The inverse document frequency (IDF) for term k_i is
  idf_i = log(N / n_i)
  where N is the total number of documents and n_i is the number of documents containing term k_i. The log is used to make the values of TF and IDF comparable.
Weighting terms: TF*IDF
● TF*IDF = term frequency × inverse document frequency
● Weight of term k_i in document d_j:
  w_{i,j} = f_{i,j} × log(N / n_i)
● Since usually f_{i,q} = 1, the query weights are typically
  w_{i,q} = log(N / n_i)
  (see the sketch below)
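A minimal numeric sketch of the TF*IDF weight; the values of N, n_i and f_{i,j} are assumed for illustration, and base-2 logarithm is an arbitrary choice (any base works, since it only rescales all weights uniformly).

from math import log2

N = 6          # total number of documents in the collection
n_i = 3        # documents containing term k_i, e.g. "porridge" above
f_ij = 1.0     # normalized within-document frequency of k_i in d_j

w_ij = f_ij * log2(N / n_i)
print(w_ij)    # 1.0 -- the term occurs in half of the documents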
Similarity of vectors
● Combining the weights above, and normalizing by the length of the document vector:
  sim(q, d_j) = ( ∑_{k_i ∈ q} w_{i,q} ⋅ w_{i,j} ) / √( ∑_{i=1..n} w_{i,j}² )
  (see the sketch below)
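A Python sketch of ranking the example collection with this normalized similarity. It assumes the stemmed toy documents shown earlier, takes w_{i,q} = idf_i for query weights (f_{i,q} = 1), and picks log base 2 arbitrarily; it illustrates the formula rather than an efficient implementation.

from collections import Counter
from math import log2, sqrt

docs = {
    1: "pea por hot pea por col".split(),
    2: "pea por pot".split(),
    3: "nin day old".split(),
    4: "pot col pot hot".split(),
    5: "pea por pea por".split(),
    6: "eat lot".split(),
}
N = len(docs)
n = Counter(t for d in docs.values() for t in set(d))       # document frequency n_i
idf = {t: log2(N / df) for t, df in n.items()}              # idf_i = log(N / n_i)

def weights(doc):
    freq = Counter(doc)
    m = max(freq.values())
    return {t: (c / m) * idf[t] for t, c in freq.items()}   # w_{i,j} = f_{i,j} * idf_i

def sim(query_terms, doc):
    w = weights(doc)
    norm = sqrt(sum(v * v for v in w.values()))             # length of the document vector
    dot = sum(idf.get(t, 0.0) * w.get(t, 0.0) for t in query_terms)
    return dot / norm if norm else 0.0

ranking = sorted(docs, key=lambda j: sim(["hot", "por"], docs[j]), reverse=True)
print(ranking)   # [1, 5, 4, 2, 3, 6] -- d1 is ranked first for "hot porridge"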
Document Vectors
● One location for each word:
  – "Nova" occurs 10 times in text A
  – "Galaxy" occurs 5 times in text A
  – "Heat" occurs 3 times in text A
  – (Blank means 0 occurrences.)
nova galaxy heat h’wood film role diet fur
A 10 5 3
B 5 10
C 10 8 7
D 9 10 5
E 10 10
F 9 10
G 5 7 9
H 6 10 2 8
I 7 5 1 3
● Example query: "What role did Shifu play in the Hollywood animated film Kung Fu Panda?"
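A minimal Python sketch of how such document (or query) vectors can be built from raw text, assuming a fixed vocabulary and whitespace tokenization; the input string is hypothetical.

from collections import Counter

vocab = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

def to_vector(text):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]         # 0 where the word does not occur

print(to_vector("nova nova galaxy heat"))     # [2, 1, 1, 0, 0, 0, 0, 0]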