Vector Space Model
• The Vector Space Model is an algebraic model used in
Information Retrieval (IR) where documents and queries are
represented as vectors in a common multi-dimensional space.
• Each document and query is transformed into a vector of
term weights, with one component per vocabulary term.
• The dimension of the space equals the number of distinct
terms (vocabulary) in the document corpus.
• Each vector component corresponds to the weight of a term
in a document.
• Common weighting schemes:
• Binary (presence/absence)
• Term Frequency (TF)
• TF-IDF (Term Frequency-Inverse Document Frequency)
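The first two weighting schemes can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `binary_vector` and `tf_vector`, the vocabulary, and the token list are all made up for the example.

```python
from collections import Counter

def binary_vector(tokens, vocabulary):
    """Binary weighting: 1 if the term occurs in the token list, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def tf_vector(tokens, vocabulary):
    """Term-frequency weighting: raw count of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Illustrative data (hypothetical, not taken from the slides)
vocabulary = ["cat", "dog", "mat"]
tokens = ["cat", "sat", "cat", "mat"]

print(binary_vector(tokens, vocabulary))  # [1, 0, 1]
print(tf_vector(tokens, vocabulary))      # [2, 0, 1]
```

Tokens outside the vocabulary ("sat" here) are simply ignored, since the vector has no component for them.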
• Term Frequency (TF): The number of times a term appears in
a document.
• Inverse Document Frequency (IDF): Measures how rare, and
therefore how informative, a term is across the corpus. It is
computed as:
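A commonly used form, assuming no smoothing (variants differ in the logarithm base and in smoothing terms), is sketched below. Here $N$ is the number of documents in the corpus and $\mathrm{df}(t)$ is the number of documents that contain term $t$; both symbols are introduced here for illustration.

```latex
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Under this form, a term that appears in every document gets $\log 1 = 0$, and the rarer the term, the larger its IDF.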
TF-IDF Weighting:
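In its most common form, the TF-IDF weight of term $t$ in document $d$ is the product of the term frequency and the inverse document frequency. The notation $w_{t,d}$ and $\mathrm{tf}(t, d)$ is assumed here for illustration:

```latex
w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```

A term scores highly when it is frequent in the document but rare in the corpus.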
Vector Representation
• Each document or query is represented as a vector of TF-IDF
weights:
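Written out, a document $d$ becomes $(w_{1,d}, w_{2,d}, \ldots, w_{n,d})$, with one weight per vocabulary term. The sketch below builds such vectors under two stated assumptions: TF is the raw count, and IDF is $\log(N / \mathrm{df}(t))$ with the natural logarithm. The function name `tfidf_vectors` and the tiny corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per tokenized document.

    Assumes raw-count TF and idf(t) = ln(N / df(t)); a vocabulary term
    that occurs in no document gets weight 0.
    """
    n_docs = len(docs)
    # df[t] = number of documents containing term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary
        ])
    return vectors

# Illustrative corpus (hypothetical)
docs = [["cat", "sat", "mat"], ["dog", "sat", "cat"]]
vocabulary = ["cat", "sat", "dog", "mat"]
vectors = tfidf_vectors(docs, vocabulary)
for vector in vectors:
    print([round(w, 3) for w in vector])
```

In this corpus "cat" and "sat" occur in both documents, so their weights are 0; only "mat" and "dog" distinguish the two documents.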
Similarity Measures
• To retrieve relevant documents for a query, we calculate similarity between the
query vector and each document vector.
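The two measures used in the examples that follow are cosine similarity and Euclidean distance (abbreviated ED). For a query vector $\vec{q}$ and a document vector $\vec{d}$ over an $n$-term vocabulary, their standard definitions are:

```latex
\cos(\vec{q}, \vec{d})
  = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}

\mathrm{ED}(\vec{q}, \vec{d}) = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^2}
```

A higher cosine similarity means a closer match, whereas a lower Euclidean distance means a closer match.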
Example
• D1: “cat sat on the mat”
• D2: “dog sat behind the cat”
• Query Q: “cat sat behind dog”
Solution
Create vocabulary
Ignore stopwords like “on”, “the”, etc.
Vocabulary = [cat, sat, behind, dog, mat]
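The vocabulary step can be sketched as follows. The stopword set is an assumption containing just the two words needed for this example, and the helper names `tokenize` and `build_vocabulary` are hypothetical. The query is scanned first only so that the term order matches the vocabulary listed above.

```python
# Assumed stopword list, just large enough for this example
STOPWORDS = {"on", "the"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]

def build_vocabulary(texts):
    """Collect distinct terms in order of first appearance."""
    vocabulary = []
    for text in texts:
        for token in tokenize(text):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

texts = [
    "cat sat behind dog",      # Query Q
    "cat sat on the mat",      # D1
    "dog sat behind the cat",  # D2
]
print(build_vocabulary(texts))  # ['cat', 'sat', 'behind', 'dog', 'mat']
```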
Binary Term-Document Matrix
Term    D1  D2  Q
cat     1   1   1
sat     1   1   1
behind  0   1   1
dog     0   1   1
mat     1   0   0
So the document and query vectors become:
• D1 = [1, 1, 0, 0, 1]
• D2 = [1, 1, 1, 1, 0]
• Q = [1, 1, 1, 1, 0]
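Compute the Euclidean distance (ED) between each document and the query
ED(D1, Q) = √(0² + 0² + 1² + 1² + 1²) = √3 ≈ 1.73
ED(D2, Q) = √(0² + 0² + 0² + 0² + 0²) = 0
Final Answer: ED(D1, Q) ≈ 1.73, ED(D2, Q) = 0
A smaller distance means a closer match, so D2 is more similar to the query Q than D1.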
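The same result can be checked with a short script. This is a minimal sketch using the binary vectors from the example; the function name `euclidean_distance` is chosen here for illustration.

```python
import math

def euclidean_distance(u, v):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Binary vectors over [cat, sat, behind, dog, mat]
d1 = [1, 1, 0, 0, 1]
d2 = [1, 1, 1, 1, 0]
q = [1, 1, 1, 1, 0]

print(round(euclidean_distance(d1, q), 2))  # 1.73
print(euclidean_distance(d2, q))            # 0.0
```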
Rank the documents for the given query from most to least
relevant. Use cosine similarity as the retrieval metric, and
apply appropriate preprocessing where required.
Query: "fire save stories"
Documents:
• D1: "A man and a woman in fire."
• D2: "A man saves a woman in fire."
• D3: "Men and women and the baby a good movie."
• D4: "A man saved the baby in fire."
Solution
Preprocessing: lowercase the text, remove stopwords (a, and, in, the), and reduce each word to its base form (e.g. "saves" and "saved" → save, "stories" → stori).
• Query: "fire save stories" → [fire, save, stori]
• D1: "A man and a woman in fire." → [man, woman, fire]
• D2: "A man saves a woman in fire." → [man, save, woman, fire]
• D3: "Men and women and the baby a good movie." → [man, woman, baby, good, movi]
• D4: "A man saved the baby in fire." → [man, save, baby, fire]
• Vocabulary: [fire, save, stori, man, woman, baby, good, movi]
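The preprocessing above can be imitated with the sketch below. Both the stopword set and the `NORMALIZE` lookup table are assumptions, hand-written only to reproduce the tokens listed for this example; a real system would typically use a stemmer or lemmatizer from an NLP library instead. The function name `preprocess` is hypothetical.

```python
import re

# Assumed stopword list, covering only the words in this example
STOPWORDS = {"a", "and", "in", "the"}

# Hand-written normalization table (an assumption standing in for a
# real stemmer or lemmatizer), matching the base forms used above
NORMALIZE = {
    "saves": "save",
    "saved": "save",
    "men": "man",
    "women": "woman",
    "stories": "stori",
    "movie": "movi",
}

def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, normalize words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [NORMALIZE.get(t, t) for t in tokens if t not in STOPWORDS]

print(preprocess("fire save stories"))
print(preprocess("A man saves a woman in fire."))
print(preprocess("Men and women and the baby a good movie."))
```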
Term   Q  D1  D2  D3  D4
fire   1  1   1   0   1
save   1  0   1   0   1
stori  1  0   0   0   0
man    0  1   1   1   1
woman  0  1   1   1   0
baby   0  0   0   1   1
good   0  0   0   1   0
movi   0  0   0   1   0
Calculate Cosine Similarity
D3: Dot product: 0 (D3 shares no term with Q; "fire", "save", and "stori" are all absent)
Cosine: 0
Scores (|Q| = √3):
• D2: 2 / (√3 · √4) ≈ 0.577
• D4: 2 / (√3 · √4) ≈ 0.577
• D1: 1 / (√3 · √3) ≈ 0.333
• D3: 0.000
Final ranking: D2, D4, D1, D3 (D2 and D4 tie for first place)
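The scores and ranking can be reproduced with a short script. This is a minimal sketch over the binary vectors from the term-document matrix; the function name `cosine_similarity` and the dictionary layout are chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Term order: fire, save, stori, man, woman, baby, good, movi
query = [1, 1, 1, 0, 0, 0, 0, 0]
documents = {
    "D1": [1, 0, 0, 1, 1, 0, 0, 0],
    "D2": [1, 1, 0, 1, 1, 0, 0, 0],
    "D3": [0, 0, 0, 1, 1, 1, 1, 1],
    "D4": [1, 1, 0, 1, 0, 1, 0, 0],
}

scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
# sorted() is stable, so tied documents keep their original order
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Because D2 and D4 receive the same score, their relative order depends on the tie-breaking rule; here the stable sort leaves D2 ahead of D4.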