Vector Space Model


• The Vector Space Model is an algebraic model used in
Information Retrieval (IR) where documents and queries are
represented as vectors in a common multi-dimensional space.
• Each document and query is transformed into a vector of
terms.
• The dimension of the space equals the number of distinct
terms (vocabulary) in the document corpus.
• Each vector component corresponds to the weight of a term
in a document.
• Common weighting schemes:
• Binary (presence/absence)
• Term Frequency (TF)
• TF-IDF (Term Frequency-Inverse Document Frequency)
Vector Space Model (VSM)
• Term Frequency (TF): The number of times a term appears in
a document.
• Inverse Document Frequency (IDF): Measures how informative a term is
across the corpus. For a corpus of N documents, it is computed as:

IDF(t) = log(N / df(t)), where df(t) is the number of documents containing term t.

TF-IDF Weighting: w(t, d) = TF(t, d) × IDF(t)
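The weighting above can be sketched in a few lines of Python. This is a minimal illustration using the natural logarithm; production systems typically add smoothing to the IDF term:

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF weight of a term in one document of a corpus (list of token lists)."""
    tf = doc.count(term)                             # raw term frequency
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0  # inverse document frequency
    return tf * idf

corpus = [["cat", "sat", "mat"], ["dog", "sat", "cat"]]
print(tf_idf("mat", corpus[0], corpus))  # "mat" occurs in 1 of 2 docs -> 1 * ln(2)
print(tf_idf("cat", corpus[0], corpus))  # "cat" occurs in every doc -> IDF = 0
```

Note that a term appearing in every document gets weight 0: it cannot distinguish one document from another.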
Vector Representation
• Each document or query is represented as a vector of TF-IDF weights:

d = [w(t1, d), w(t2, d), …, w(tn, d)], where n is the vocabulary size.
Similarity Measures
• To retrieve relevant documents for a query, we calculate the similarity between the
query vector and each document vector, most commonly cosine similarity:

cos(Q, D) = (Q · D) / (|Q| |D|)
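A minimal cosine-similarity helper, assuming plain Python lists of equal length as vectors:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # a zero vector shares no terms with anything
    return dot / (norm_q * norm_d)
```

Cosine similarity depends only on the angle between vectors, not their lengths, so a long document is not automatically favoured over a short one.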
Example
• D1: “cat sat on the mat”
• D2: “dog sat behind the cat”
• Query Q: “cat sat behind dog”
Solution
Create vocabulary
Ignore stopwords like “on”, “the”, etc.
Vocabulary = [cat, sat, behind, dog, mat]
Binary Term-Document Matrix

Term     D1  D2  Q
cat       1   1  1
sat       1   1  1
behind    0   1  1
dog       0   1  1
mat       1   0  0
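The matrix above can be reproduced with a short sketch. The stopword set here is a minimal assumption covering just this example, not a standard list:

```python
# Assumed minimal stopword list for this example
STOPWORDS = {"a", "an", "and", "in", "on", "the"}

def binary_vector(text, vocab):
    """1 if the vocabulary term occurs in the text, else 0."""
    tokens = {t for t in text.lower().split() if t not in STOPWORDS}
    return [1 if term in tokens else 0 for term in vocab]

vocab = ["cat", "sat", "behind", "dog", "mat"]
print(binary_vector("cat sat on the mat", vocab))      # [1, 1, 0, 0, 1]
print(binary_vector("dog sat behind the cat", vocab))  # [1, 1, 1, 1, 0]
print(binary_vector("cat sat behind dog", vocab))      # [1, 1, 1, 1, 0]
```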

So the document vectors become:


• D1 = [1, 1, 0, 0, 1]
• D2 = [1, 1, 1, 1, 0]
• Q  = [1, 1, 1, 1, 0]
Compute the Euclidean Distance between documents and query

ED(D1, Q) = √((1−1)² + (1−1)² + (0−1)² + (0−1)² + (1−0)²) = √3 ≈ 1.73
ED(D2, Q) = √((1−1)² + (1−1)² + (1−1)² + (1−1)² + (0−0)²) = 0

Final Answer: ED(D1, Q) ≈ 1.73, ED(D2, Q) = 0


Since a smaller Euclidean distance means greater similarity, D2 is more similar to the query Q than D1.
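The distances above can be checked with a short sketch over the same binary vectors:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

D1 = [1, 1, 0, 0, 1]
D2 = [1, 1, 1, 1, 0]
Q  = [1, 1, 1, 1, 0]
print(round(euclidean(D1, Q), 2))  # 1.73  (= sqrt(3))
print(euclidean(D2, Q))            # 0.0   (identical vectors)
```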
Rank the documents for the given query in order of relevance. Apply cosine
similarity as the retrieval metric. Use appropriate preprocessing
(stopword removal and stemming) wherever required.
Query: "fire save stories"
Documents:
• D1: "A man and a woman in fire."
• D2: "A man saves a woman in fire."
• D3: "Men and women and the baby a good movie."
• D4: "A man saved the baby in fire."
Solution
Query "fire save stories" → after stemming: [fire, save, stori]
• D1: "A man and a woman in fire" → [man, woman, fire]
• D2: "A man saves a woman in fire" → [man, save, woman, fire]
• D3: "Men and women and the baby a good movie" → [man, woman, baby, good, movi]
• D4: "A man saved the baby in fire" → [man, save, baby, fire]
• Vocabulary: [fire, save, stori, man, woman, baby, good, movi]
Term    Q  D1  D2  D3  D4
fire    1   1   1   0   1
save    1   0   1   0   1
stori   1   0   0   0   0
man     0   1   1   1   1
woman   0   1   1   1   0
baby    0   0   0   1   1
good    0   0   0   1   0
movi    0   0   0   1   0
Calculate Cosine Similarity
|Q| = √3, |D1| = √3, |D2| = 2, |D3| = √5, |D4| = 2

• D1: Q · D1 = 1 (only "fire" shared) → cos = 1 / (√3 · √3) = 1/3 ≈ 0.333
• D2: Q · D2 = 2 ("fire", "save") → cos = 2 / (√3 · 2) = 1/√3 ≈ 0.577
• D3: Q · D3 = 0 (no term in common with Q) → cos = 0
• D4: Q · D4 = 2 ("fire", "save") → cos = 2 / (√3 · 2) = 1/√3 ≈ 0.577

Scores:
• D2: 0.577
• D4: 0.577
• D1: 0.333
• D3: 0.000

Final ranking: D2, D4, D1, D3
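Putting it together, a sketch that reproduces the ranking directly from the binary vectors in the table above (the tokenisation and stemming step is skipped; the vectors are taken as given):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length binary vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Binary vectors over the vocabulary [fire, save, stori, man, woman, baby, good, movi]
Q = [1, 1, 1, 0, 0, 0, 0, 0]
docs = {
    "D1": [1, 0, 0, 1, 1, 0, 0, 0],
    "D2": [1, 1, 0, 1, 1, 0, 0, 0],
    "D3": [0, 0, 0, 1, 1, 1, 1, 1],
    "D4": [1, 1, 0, 1, 0, 1, 0, 0],
}

# Sort document names by descending similarity to the query
ranking = sorted(docs, key=lambda name: cosine(Q, docs[name]), reverse=True)
for name in ranking:
    print(name, round(cosine(Q, docs[name]), 3))
# D2 0.577, D4 0.577, D1 0.333, D3 0.0
```

D2 and D4 tie exactly (both share "fire" and "save" with the query and have the same vector length), so their relative order is arbitrary; the stable sort keeps D2 first here.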
