The Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
  – precise semantics
  – neat formalism
• Example: q = ka ∧ (kb ∨ ¬kc)
  = (ka ∧ kb) ∨ (ka ∧ ¬kc)   (applying the distributive law gives the disjunctive normal form, DNF)
• Terms are either present or absent, so w_{i,j} ∈ {0, 1}
Simple Query Language: Boolean
– Terms + Connectors (or operators)
– terms
  ● words
  ● normalized (stemmed) words
  ● phrases
  ● thesaurus terms
– connectors
  ● AND
  ● OR
  ● NOT
Boolean Queries
● Cat
● Cat OR Dog
● Cat AND Dog
● (Cat AND Dog)
● (Cat AND Dog) OR Collar
● (Cat AND Dog) OR (Collar AND Leash)
● (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries
● (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works: a document containing Cat and Collar, Cat and Leash, Dog and Collar, or Dog and Leash (with or without the other terms)
Boolean Queries
● (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations works: Cat alone, Dog alone, Cat and Dog, Collar alone, Leash alone, or Collar and Leash
Boolean Logic
(Venn diagrams: C = ¬A, C = A ∩ B, C = A ∪ B)
De Morgan's Laws:
¬(A ∩ B) = ¬A ∪ ¬B
¬(A ∪ B) = ¬A ∩ ¬B
Boolean Queries
– Usually expressed with INFIX operators in IR
  ● ((a AND b) OR (c AND b))
– NOT is a UNARY PREFIX operator
  ● ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  ● (a AND b AND c AND d)
– Some rules (De Morgan revisited; see the evaluation sketch below)
  ● NOT(a) AND NOT(b) = NOT(a OR b)
  ● NOT(a) OR NOT(b) = NOT(a AND b)
  ● NOT(NOT(a)) = a
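Since the model is set-theoretic, these connectors can be evaluated directly as set operations over an inverted index. The following is a minimal Python sketch with hypothetical postings and document ids, shown only to illustrate the mapping AND/OR/NOT -> intersection/union/complement; it is not a full query parser.

# Minimal sketch: Boolean retrieval over a toy inverted index,
# where each term maps to the set of documents that contain it.
all_docs = {1, 2, 3, 4, 5, 6}                      # hypothetical document ids
index = {                                          # hypothetical postings lists
    "a": {1, 2, 4},
    "b": {2, 3, 4},
    "c": {3, 5},
}

def AND(x, y): return x & y                        # set intersection
def OR(x, y):  return x | y                        # set union
def NOT(x):    return all_docs - x                 # complement w.r.t. the collection

# ((a AND b) OR (c AND (NOT b)))
result = OR(AND(index["a"], index["b"]),
            AND(index["c"], NOT(index["b"])))
print(result)                                      # {2, 4, 5}

# De Morgan revisited: NOT(a) AND NOT(b) = NOT(a OR b)
assert AND(NOT(index["a"]), NOT(index["b"])) == NOT(OR(index["a"], index["b"]))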
Boolean queries
● Small variations in a query can generate very different results
  – data AND compression AND retrieval
  – text AND compression AND retrieval
● The user should be able to pose complex queries like:
  – (text OR data OR image) AND
    (compression OR compaction OR decompression) AND
    (archiving OR retrieval OR storage)
  – ...but many users are not able (or willing)...
Ranked queries
● Rather than seeking exact Boolean answers, non-professional users might prefer simply giving a list of words that are of interest and letting the retrieval system supply the documents that seem most relevant
● Text, data, image, compression, compaction, archiving, storage, retrieval...
Ranked queries
● It would be useless to convert a list of words to a Boolean query
  – connecting with AND -> too few documents
  – connecting with OR -> too many documents
● Solution: a ranked response
  – A heuristic is applied to measure the similarity of each document to the query
  – Documents are ranked according to similarity
Processing ranked queries
● How to assign a similarity measure to each document that indicates how closely it matches a query?
Ranking strategies
● Simple techniques
  – Count the number of query terms that appear somewhere in the document
    ● A document that contains 5 query terms is ranked higher than a document that contains 3 query terms
● More advanced techniques
  – Cosine measure
    ● Takes into account the lengths of the documents, etc.
Coordinate matching
● Count the number of query terms that appear in each document
● The more terms that appear, the more likely it is that the document is relevant
● A hybrid between a conjunctive AND query (all terms) and a disjunctive OR query (≥ 1 term)
  – A document that contains any of the terms is a potential answer, but preference is given to documents that contain all or most of them (see the sketch below)
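A minimal Python sketch of coordinate matching over assumed whitespace-tokenized toy documents; the score is simply the number of distinct query terms that the document shares with the query.

docs = {                                   # hypothetical toy collection
    "d1": "pease porridge hot pease porridge cold",
    "d2": "pease porridge in the pot",
    "d3": "nine days old",
}
query = "hot porridge pot"

def coordinate_match(query, text):
    q_terms = set(query.split())
    d_terms = set(text.split())
    return len(q_terms & d_terms)          # number of shared distinct terms

for d in sorted(docs, key=lambda d: coordinate_match(query, docs[d]), reverse=True):
    print(d, coordinate_match(query, docs[d]))
# d1 and d2 each match 2 of the 3 query terms, d3 matches none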
Inner product similarity
● Coordinate matching can be formalized as an inner product of a query vector with a set of document vectors
  – binary weights: term present / not present
● The similarity measure of a document d_j with a query q is expressed as
  sim(q, d_j) = q ⋅ d_j
● The inner product of two n-vectors X and Y:
  X ⋅ Y = ∑_{i=1..n} x_i y_i
Example document collection
j Document dj
1 Pease Porridge hot, pease porridge cold.
2 Pease porridge in the pot.
3 Nine days old.
4 In the pot cold, in the pot hot.
5 Pease porridge, pease porridge.
6 Eat the lot.
Example
j Document vectors (wi,j)
col day eat hot lot nin old pea por pot
1 1 0 0 1 0 0 0 1 1 0
2 0 0 0 0 0 0 0 1 1 1
3 0 1 0 0 0 1 1 0 0 0
4 1 0 0 1 0 0 0 0 0 1
5 0 0 0 0 0 0 0 1 1 0
6 0 0 1 0 1 0 0 0 0 0
Hot porridge 0 0 0 1 0 0 0 0 1 0
Inner product, example
● query vector ("hot porridge"):
  – (0,0,0,1,0,0,0,0,1,0)
● document vector (d1):
  – (1,0,0,1,0,0,0,1,1,0)
● sim("hot porridge", d1) = 2 (see the sketch below)
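The same computation for the whole example collection, as a small Python sketch; the binary vectors are copied from the table above (vocabulary order: col day eat hot lot nin old pea por pot) and the inner product is a plain sum of products.

docs = {
    1: (1, 0, 0, 1, 0, 0, 0, 1, 1, 0),   # Pease porridge hot, pease porridge cold.
    2: (0, 0, 0, 0, 0, 0, 0, 1, 1, 1),   # Pease porridge in the pot.
    3: (0, 1, 0, 0, 0, 1, 1, 0, 0, 0),   # Nine days old.
    4: (1, 0, 0, 1, 0, 0, 0, 0, 0, 1),   # In the pot cold, in the pot hot.
    5: (0, 0, 0, 0, 0, 0, 0, 1, 1, 0),   # Pease porridge, pease porridge.
    6: (0, 0, 1, 0, 1, 0, 0, 0, 0, 0),   # Eat the lot.
}
query = (0, 0, 0, 1, 0, 0, 0, 0, 1, 0)   # "hot porridge"

def inner_product(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

for j, d in docs.items():
    print(j, inner_product(query, d))
# d1 scores 2; d2, d4 and d5 score 1; d3 and d6 score 0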
Drawbacks
● Takes no account of term frequency
  – documents with many occurrences of a term should be favored
● Takes no account of term scarcity
  – rare terms should have more weight
● Long documents with many terms are automatically favored
  – they are likely to contain more of any given list of query terms
Solutions
● Term frequency
  – The binary "present" / "not present" judgment can be replaced with an integer indicating how many times the term appears in the document
  – freq_{i,j}: within-document frequency of term k_i in document d_j
  – sim("hot porridge", d1) = (0,0,0,1,0,0,0,0,1,0) ⋅ (1,0,0,1,0,0,0,2,2,0) = 3
Solutions
● Term frequency
  – This favors long documents over short ones, so we usually use the normalized frequency of term k_i in document d_j:
    f_{i,j} = freq_{i,j} / max_l freq_{l,j}
  – where the max is computed over all terms mentioned in the text of document d_j
  – We will call this measure TF (term frequency); see the sketch below
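A small Python sketch of the raw and normalized within-document frequencies, using document d1 of the example collection as toy input (left unstemmed here for readability).

from collections import Counter

d1 = "pease porridge hot pease porridge cold".split()
freq = Counter(d1)                       # raw within-document frequencies freq_{i,j}
max_freq = max(freq.values())            # frequency of the most frequent term in d1

tf = {term: count / max_freq for term, count in freq.items()}   # f_{i,j}
print(freq)   # Counter({'pease': 2, 'porridge': 2, 'hot': 1, 'cold': 1})
print(tf)     # {'pease': 1.0, 'porridge': 1.0, 'hot': 0.5, 'cold': 0.5}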
Solutions
● More generally, a term k_i can be assigned
  – in a document d_j: a document-term weight w_{i,j}
  – in a query q: a query-term weight w_{i,q}
● The similarity measure is the inner product of the document vector and the query vector:
  sim(q, d_j) = ∑_{i=1..n} w_{i,q} ⋅ w_{i,j}
Solutions
● It is normal to assign w_{i,q} = 0 if k_i does not appear in q, so the measure can be stated as
  sim(q, d_j) = ∑_{k_i ∈ q} w_{i,q} ⋅ w_{i,j}
Inverse document frequency
● If only the term frequency is taken into account and a query contains common words, a document with enough appearances of a common term is always ranked first, irrespective of other words.
● Solution: reduce the weights for terms that appear in many documents.
● The inverse document frequency (IDF) for term k_i is
  idf_i = log(N / n_i)
  where N is the total number of documents and n_i is the number of documents containing term k_i. The log is used to make the values of TF and IDF comparable.
Weighting terms: TF*IDF
● TF*IDF = term frequency × inverse document frequency
● Weight of term k_i in document d_j:
  w_{i,j} = f_{i,j} × log(N / n_i)
● Since usually f_{i,q} = 1, the query weights are typically
  w_{i,q} = log(N / n_i)
  (see the sketch below)
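A minimal numeric sketch of the TF*IDF weight; the values of N, n_i and f_{i,j} are assumed for illustration, and base-2 logarithm is an arbitrary choice (any base works, since it only rescales all weights uniformly).

from math import log2

N = 6          # total number of documents in the collection
n_i = 3        # documents containing term k_i, e.g. "porridge" above
f_ij = 1.0     # normalized within-document frequency of k_i in d_j

w_ij = f_ij * log2(N / n_i)
print(w_ij)    # 1.0 -- the term occurs in half of the documents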
Similarity of vectors
● Combining the weights above, and normalizing by the length of the document vector:
  sim(q, d_j) = ( ∑_{k_i ∈ q} w_{i,q} ⋅ w_{i,j} ) / √( ∑_{i=1..n} w_{i,j}² )
  (see the sketch below)
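A Python sketch of ranking the example collection with this normalized similarity. It assumes the stemmed toy documents shown earlier, takes w_{i,q} = idf_i for query weights (f_{i,q} = 1), and picks log base 2 arbitrarily; it illustrates the formula rather than an efficient implementation.

from collections import Counter
from math import log2, sqrt

docs = {
    1: "pea por hot pea por col".split(),
    2: "pea por pot".split(),
    3: "nin day old".split(),
    4: "pot col pot hot".split(),
    5: "pea por pea por".split(),
    6: "eat lot".split(),
}
N = len(docs)
n = Counter(t for d in docs.values() for t in set(d))       # document frequency n_i
idf = {t: log2(N / df) for t, df in n.items()}              # idf_i = log(N / n_i)

def weights(doc):
    freq = Counter(doc)
    m = max(freq.values())
    return {t: (c / m) * idf[t] for t, c in freq.items()}   # w_{i,j} = f_{i,j} * idf_i

def sim(query_terms, doc):
    w = weights(doc)
    norm = sqrt(sum(v * v for v in w.values()))             # length of the document vector
    dot = sum(idf.get(t, 0.0) * w.get(t, 0.0) for t in query_terms)
    return dot / norm if norm else 0.0

ranking = sorted(docs, key=lambda j: sim(["hot", "por"], docs[j]), reverse=True)
print(ranking)   # [1, 5, 4, 2, 3, 6] -- d1 is ranked first for "hot porridge"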
Document Vectors
● One location for each word:
  – "Nova" occurs 10 times in text A
  – "Galaxy" occurs 5 times in text A
  – "Heat" occurs 3 times in text A
  – (Blank means 0 occurrences.)
nova galaxy heat h’wood film role diet fur
A 10 5 3
B 5 10
C 10 8 7
D 9 10 5
E 10 10
F 9 10
G 5 7 9
H 6 10 2 8
I 7 5 1 3
● Example query: "What role did Shifu play in the Hollywood animated film Kung Fu Panda?"
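A minimal Python sketch of how such document (or query) vectors can be built from raw text, assuming a fixed vocabulary and whitespace tokenization; the input string is hypothetical.

from collections import Counter

vocab = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

def to_vector(text):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]         # 0 where the word does not occur

print(to_vector("nova nova galaxy heat"))     # [2, 1, 1, 0, 0, 0, 0, 0]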