Boolean Logic for IR Professionals

The Boolean Model uses set theory and boolean expressions to represent queries. Queries are expressed as terms combined with boolean operators like AND, OR, and NOT. Terms can either be present or absent in a document. This results in a simple yet precise representation of queries.


The Boolean Model

•Simple model based on set theory
•Queries specified as Boolean expressions
– precise semantics
– neat formalism
q = ka ∧ (kb ∨ ¬kc)
  = (ka ∧ kb) ∨ (ka ∧ ¬kc) (applying the distributive law; this is disjunctive normal form, DNF)
•Terms are either present or absent in a document. Thus wi,j ∈ {0,1}
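The set-theoretic view maps directly onto code: each term's posting list is a set of document ids, and AND/OR/NOT become intersection/union/complement. A minimal sketch in Python (the tiny three-document corpus is made up for illustration):

```python
# Boolean retrieval over an inverted index, using Python sets.
# Hypothetical three-document corpus for illustration.
docs = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "nine days old",
}

# Inverted index: term -> set of ids of documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def posting(term):
    return index.get(term, set())

# q = ka AND (kb OR NOT kc), expressed with set operations
def query(ka, kb, kc):
    return posting(ka) & (posting(kb) | (all_ids - posting(kc)))

print(query("pease", "hot", "old"))  # → {1, 2}
```

The DNF form (ka ∧ kb) ∨ (ka ∧ ¬kc) computes the same set of documents, which is easy to verify with the same set operations.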
Simple Query Language: Boolean
– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT
Boolean Queries

Cat

Cat OR Dog

Cat AND Dog

(Cat AND Dog)

(Cat AND Dog) OR Collar

(Cat AND Dog) OR (Collar AND Leash)

(Cat OR Dog) AND (Collar OR Leash)
Boolean Queries

(Cat OR Dog) AND (Collar OR Leash)
– Works for any document containing at least one of {Cat, Dog} together with at least one of {Collar, Leash}
– Does not work for documents missing both terms of either group
Boolean Logic
(illustrated with Venn diagrams on the original slide)
C = Aᶜ (complement)
C = A ∩ B (intersection)
C = A ∪ B (union)
De Morgan's Laws:
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
Boolean Queries

– Usually expressed as INFIX operators in IR
  ((a AND b) OR (c AND b))
– NOT is a UNARY PREFIX operator
  ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  (a AND b AND c AND d)
– Some rules (De Morgan revisited):
  NOT(a) AND NOT(b) = NOT(a OR b)
  NOT(a) OR NOT(b) = NOT(a AND b)
  NOT(NOT(a)) = a
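These identities are easy to sanity-check with Python sets, reading NOT as complement with respect to a (hypothetical) collection U of document ids:

```python
# De Morgan's laws checked on sets; NOT(s) is the complement relative to U.
U = set(range(10))   # hypothetical collection of doc ids
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def NOT(s):
    return U - s

assert NOT(A) & NOT(B) == NOT(A | B)   # NOT(a) AND NOT(b) = NOT(a OR b)
assert NOT(A) | NOT(B) == NOT(A & B)   # NOT(a) OR NOT(b)  = NOT(a AND b)
assert NOT(NOT(A)) == A                # NOT(NOT(a)) = a
print("all identities hold")
```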
Boolean Queries

• Small variations in a query can generate very different results
  – data AND compression AND retrieval
  – text AND compression AND retrieval
• The user should be able to pose complex queries like:
  – (text OR data OR image) AND (compression OR compaction OR decompression) AND (archiving OR retrieval OR storage)
  – ...but many users are not able (or willing) to do so
Ranked queries

• Rather than seeking exact Boolean answers, non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant
  – text, data, image, compression, compaction, archiving, storage, retrieval, ...
Ranked queries

• It would be useless to convert a list of words to a Boolean query
  – connect with AND -> too few documents
  – connect with OR -> too many documents
• Solution: a ranked response
  – A heuristic is applied to measure the similarity of each document to the query
  – Documents are ranked according to similarity
Processing ranked queries

• How do we assign a similarity measure to each document that indicates how closely it matches a query?
Ranking strategies

• Simple techniques
  – Count the number of query terms that appear somewhere in the document
  – A document that contains 5 query terms is ranked higher than a document that contains 3 query terms
• More advanced techniques
  – Cosine measure: takes into account the lengths of the documents, etc.
Coordinate matching

• Count the number of query terms that appear in each document
• The more terms that appear, the more likely it is that the document is relevant
• A hybrid between a conjunctive AND query (all terms) and a disjunctive OR query (≥ 1 term)
  – A document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them
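Coordinate matching is only a few lines of Python. A sketch using three documents from the pease-porridge collection that appears later in these slides; the score is simply the number of distinct query terms present:

```python
# Coordinate matching: score = number of query terms present in the document.
docs = {
    "d1": "pease porridge hot pease porridge cold",
    "d4": "in the pot cold in the pot hot",
    "d6": "eat the lot",
}
query_terms = {"hot", "porridge"}

def coord_score(text):
    return len(query_terms & set(text.split()))

# d1 contains both terms, d4 one, d6 none
ranking = sorted(docs, key=lambda d: coord_score(docs[d]), reverse=True)
print(ranking)  # → ['d1', 'd4', 'd6']
```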
Inner product similarity

• Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
  – binary weights: term present / not present
• The similarity measure of a document dj with a query q is expressed as
  – sim(q, dj) = q ⋅ dj
• The inner product of two n-vectors X and Y:
  – X ⋅ Y = ∑ᵢ₌₁ⁿ xᵢ yᵢ
Example document collection
j Document dj

1 Pease Porridge hot, pease porridge cold.


2 Pease porridge in the pot.
3 Nine days old.
4 In the pot cold, in the pot hot.
5 Pease porridge, pease porridge.
6 Eat the lot.
Example

j Document vectors (wi,j)


col day eat hot lot nin old pea por pot
1 1 0 0 1 0 0 0 1 1 0
2 0 0 0 0 0 0 0 1 1 1
3 0 1 0 0 0 1 1 0 0 0
4 1 0 0 1 0 0 0 0 0 1
5 0 0 0 0 0 0 0 1 1 0
6 0 0 1 0 1 0 0 0 0 0

Hot porridge 0 0 0 1 0 0 0 0 1 0
Inner product, example

• query vector ("hot porridge"):
  – (0,0,0,1,0,0,0,0,1,0)
• document vector (d1):
  – (1,0,0,1,0,0,0,1,1,0)
• sim("hot porridge", d1) = 2
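The example can be checked directly in code; the vector positions follow the slide's vocabulary order (col, day, eat, hot, lot, nin, old, pea, por, pot):

```python
# Inner product of the binary query and document vectors from the example.
def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

q  = (0, 0, 0, 1, 0, 0, 0, 0, 1, 0)   # "hot porridge"
d1 = (1, 0, 0, 1, 0, 0, 0, 1, 1, 0)   # Pease porridge hot, pease porridge cold.

print(inner(q, d1))  # → 2
```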
Drawbacks

• Takes no account of term frequency
  – documents with many occurrences of a term should be favored
• Takes no account of term scarcity
  – rare terms should have more weight
• Long documents with many terms are automatically favored
  – they are likely to contain more of any given list of query terms
Solutions

• Term frequency
  – The binary "present"/"not present" judgment can be replaced with an integer indicating how many times the term appears in the document
  – freq_d,t: within-document frequency of term t in document d
  – sim("hot porridge", d1) = (0,0,0,1,0,0,0,0,1,0) ⋅ (1,0,0,1,0,0,0,2,2,0) = 3
Solutions

• Term frequency
  – This favors long documents over short ones, so we usually use the normalized frequency of term ki in document dj:
    f_i,j = freq_i,j / max_l freq_l,j
  – where the max is computed over all terms that are mentioned in the text of document dj
  – We will call this measure TF (term frequency)
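The normalized TF can be sketched as follows (tokenization here is a crude whitespace split, an assumption for illustration):

```python
# Normalized term frequency: f_{i,j} = freq_{i,j} / max_l freq_{l,j}
from collections import Counter

def tf(text):
    freq = Counter(text.lower().split())
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}

d1 = "pease porridge hot pease porridge cold"
print(tf(d1))  # pease and porridge get 1.0; hot and cold get 0.5
```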
Solutions

• More generally, a term ki can be assigned
  – in a document dj: a document-term weight w_i,j
  – in a query q: a query-term weight w_i,q
• The similarity measure is the inner product of the document vector and the query vector:
  sim(q, dj) = ∑ᵢ₌₁ⁿ w_i,q ⋅ w_i,j
Solutions

• It is normal to assign w_i,q = 0 if ki does not appear in q, so the measure can be stated as
  sim(q, dj) = ∑_{ki ∈ q} w_i,q ⋅ w_i,j
Inverse document frequency

• If only the term frequency is taken into account, and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words
• Solution: reduce the weights for terms that appear in many documents
• The inverse document frequency (IDF) for term ki is
  idf_i = log(N / n_i)
  where N is the total number of documents and n_i is the number of documents containing term ki. The log is used to make the values of TF and IDF comparable.
Weighting terms: TF*IDF

• TF*IDF = term frequency × inverse document frequency
• Weight of term ki in document dj:
  w_i,j = f_i,j × log(N / n_i)
• Since usually f_i,q = 1, the query weights are typically
  w_i,q = log(N / n_i)
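Putting TF and IDF together over the pease-porridge collection from the earlier slides (tokenization and corpus layout are assumptions for illustration):

```python
# TF*IDF weights: w_{i,j} = f_{i,j} * log(N / n_i)
import math
from collections import Counter

docs = [
    "pease porridge hot pease porridge cold",
    "pease porridge in the pot",
    "nine days old",
    "in the pot cold in the pot hot",
    "pease porridge pease porridge",
    "eat the lot",
]
N = len(docs)

# document frequency n_i: number of documents containing term i
df = Counter()
for text in docs:
    df.update(set(text.split()))

def tfidf(text):
    freq = Counter(text.split())
    max_freq = max(freq.values())
    return {t: (c / max_freq) * math.log(N / df[t]) for t, c in freq.items()}

w = tfidf(docs[0])
print(round(w["hot"], 3))  # 0.5 * log(6/2) ≈ 0.549
```

"hot" appears in 2 of the 6 documents, so its IDF is log(6/2); its normalized TF in d1 is 1/2, giving the weight above.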
Similarity of vectors

• The inner product is normalized by the length of the document vector:

  sim(q, dj) = ( ∑_{ki ∈ q} w_i,q ⋅ w_i,j ) / √( ∑ᵢ₌₁ⁿ w_i,j² )
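The normalized similarity combines with the TF*IDF weights into a small end-to-end ranking sketch (corpus and tokenization as before; query weights are the IDF values, as on the previous slide):

```python
# Ranked retrieval: TF*IDF document weights, IDF query weights,
# inner product normalized by the document vector's length.
import math
from collections import Counter

docs = [
    "pease porridge hot pease porridge cold",
    "pease porridge in the pot",
    "nine days old",
    "in the pot cold in the pot hot",
    "pease porridge pease porridge",
    "eat the lot",
]
N = len(docs)
df = Counter()
for text in docs:
    df.update(set(text.split()))

def idf(t):
    return math.log(N / df[t]) if df[t] else 0.0

def weights(text):
    freq = Counter(text.split())
    m = max(freq.values())
    return {t: (c / m) * idf(t) for t, c in freq.items()}

def sim(query, doc_text):
    w_d = weights(doc_text)
    norm = math.sqrt(sum(w * w for w in w_d.values()))
    if norm == 0:
        return 0.0
    return sum(idf(t) * w_d.get(t, 0.0) for t in query.split()) / norm

ranked = sorted(range(N), key=lambda j: sim("hot porridge", docs[j]), reverse=True)
print(ranked)  # doc 0 ("Pease porridge hot ...") ranks first
```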
Document Vectors

• One vector location for each word
• "Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A (blank means 0 occurrences)
nova galaxy heat h’wood film role diet fur
A 10 5 3
B 5 10
C 10 8 7
D 9 10 5
E 10 10
F 9 10
G 5 7 9
H 6 10 2 8
I 7 5 1 3

Example query: "What role did Shifu play in the Hollywood animated film Kung Fu Panda?"
