Information Retrieval
Chapter 5:
Retrieval Evaluation
IR Evaluation
• Once an IR system has been designed, measuring its
performance and accuracy is essential.
• According to Singhal (2001), there are two main things
to measure in an IR system: its effectiveness and its
efficiency.
Cont..
• Effectiveness: the quality of being able to bring about the
intended effect
• How well can the system retrieve relevant documents from
the collection?
• It is about user satisfaction
• Efficiency: the ratio of a system's output to its input
• Skill in avoiding wasted time and effort
• It is about time and space
…cont
To measure the effectiveness of ad hoc information retrieval
(one-off user queries posed against a fixed collection) in the
standard way, we need a test collection consisting of three things:
1. A document collection
2. A test suite (set) of information needs, expressible as
queries
3. A set of relevance judgments, standardly a binary
assessment of either relevant or nonrelevant for each
query-document pair
Document collection
• Specific questions that might be considered when gathering
documents include:
1. How many items should be gathered?
2. What items should be sampled to create the document
collection?
3. What about copyright constraints?
Example (N=128)
….cont
The standard approach to information retrieval system
evaluation revolves around the notion of relevant and
nonrelevant documents.
With respect to a user information need, each document in
the test collection is given a binary classification as
either relevant or nonrelevant.
This decision is referred to as the gold standard or
ground truth judgment of relevance.
Mind Break
A document is relevant if it addresses the stated
information need, not because it just happens to
contain all the words in the query.
How ?
Types of Evaluation Strategies
•System-centered studies
– Given documents, queries, and relevance judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
•User-centered studies
– Given several users, and at least two retrieval systems
• Have each user try the same task on both systems
• Measure which system works “best” for the users' information needs
Performance measures (Recall, Precision, etc.)
• The two most frequently used basic measures of
information retrieval effectiveness are:
1. Precision
2. Recall
Precision
Precision (P) is the fraction of retrieved documents that
are relevant.
– The ability to retrieve top-ranked documents
that are mostly relevant
– Precision is the percentage of retrieved documents
that are relevant to the query (i.e. the number of retrieved
documents that are relevant divided by the total number of
retrieved documents).
Precision Formula
Precision = relevant docs retrieved / retrieved docs
Recall
Recall (R) is the fraction of relevant documents that are
retrieved.
– The ability of the search to find all of the
relevant items in the corpus
– Recall is the percentage of the relevant documents in the
database that are retrieved in response to the user's query.
Recall Formula
Recall = relevant docs retrieved / relevant docs
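The two definitions can be sketched with Python sets. This is a minimal illustration, not part of the original slides: the function names and the document IDs are hypothetical. Returning None for the undefined cases is one possible convention.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return None  # undefined when no document is retrieved
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return None  # undefined when the collection has no relevant document
    return len(retrieved & relevant) / len(relevant)


# Hypothetical document IDs, for illustration only
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = ["d1", "d2", "d9"]

print(precision(retrieved_docs, relevant_docs))  # 2 of the 3 retrieved are relevant
print(recall(retrieved_docs, relevant_docs))     # 2 of the 4 relevant were retrieved
```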
Question
• When do you think precision or recall reaches
100% (that is, 1)? Sometimes both precision and
recall can be 100%. How can we justify such a
value?
Example
Examples
An IR system returns 8 relevant documents and 10
nonrelevant documents. There are a total of 20
relevant documents in the collection.
a. What is the precision of the system on this search?
b. What is its recall?
c. What is its F-measure?
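The arithmetic for this example can be checked with a few lines of Python. The sketch assumes that "F-measure" here means the balanced F1 (the plain harmonic mean of precision and recall). The variable names are illustrative.

```python
relevant_retrieved = 8        # relevant documents the system returned
nonrelevant_retrieved = 10    # nonrelevant documents the system returned
relevant_in_collection = 20   # all relevant documents in the collection

total_retrieved = relevant_retrieved + nonrelevant_retrieved  # 18

p = relevant_retrieved / total_retrieved          # 8/18
r = relevant_retrieved / relevant_in_collection   # 8/20
f1 = 2 * p * r / (p + r)                          # assumed: balanced F1

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# prints: precision=0.444 recall=0.400 F1=0.421
```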
R- Precision
Precision at the R-th position in the ranking of results for a
query, where R is the total number of relevant documents for
that query.
It requires a set of known relevant documents, from
which we calculate the precision of the top R
documents returned
– Calculate precision after R documents are seen
– Can be averaged over all queries
Example
Example 2:
Exercise
• Given a query q, for which the relevant documents are d1,
d6, d10, d15, d22, d26, an IR system retrieves the following
ranking: d6, d2, d11, d3, d10, d1, d14, d15, d7, d23.
• Compute the precision and recall for this ranking after each
retrieved document.
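One way to work through this exercise is to walk down the ranking and update the hit count at each position. The sketch below uses the documents from the exercise. The function name and the row layout are illustrative choices.

```python
relevant = {"d1", "d6", "d10", "d15", "d22", "d26"}
ranking = ["d6", "d2", "d11", "d3", "d10", "d1", "d14", "d15", "d7", "d23"]


def precision_recall_at_each_rank(ranking, relevant):
    """Return (rank, doc, is_relevant, precision, recall) for every position."""
    rows = []
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        is_relevant = doc in relevant
        if is_relevant:
            hits += 1
        rows.append((k, doc, is_relevant, hits / k, hits / len(relevant)))
    return rows


rows = precision_recall_at_each_rank(ranking, relevant)
for k, doc, is_relevant, p, r in rows:
    label = "R" if is_relevant else "N"
    print(f"{k:>2} {doc:<4} {label} P={p:.2f} R={r:.2f}")
```

Precision can go up or down as the ranking is read, while recall never decreases.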
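• Relevant documents were found at positions 1, 5, 6, and 8.
The average precision is the sum of the precision values at
those positions divided by the 6 relevant documents (the two
relevant documents that were never retrieved contribute 0):
(1.0+0.40+0.50+0.50)/6=0.40. The R-precision is the
precision at position 6, which is 3/6=0.50.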
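Both figures can be reproduced with two short functions. This is a sketch under the convention used in the calculation above: average precision divides by the total number of relevant documents, so relevant documents that are never retrieved count as zero. The function names are illustrative.

```python
def average_precision(ranking, relevant):
    """Sum of precision at each relevant hit, divided by all relevant docs."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def r_precision(ranking, relevant):
    """Precision after R documents are seen, where R = number of relevant docs."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


relevant = {"d1", "d6", "d10", "d15", "d22", "d26"}
ranking = ["d6", "d2", "d11", "d3", "d10", "d1", "d14", "d15", "d7", "d23"]

print(f"{average_precision(ranking, relevant):.2f}")  # 0.40
print(f"{r_precision(ranking, relevant):.2f}")        # 0.50
```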
Problems with both precision and recall
The number of irrelevant documents in the collection is not
taken into account.
Recall is undefined when there is no relevant document
in the collection.
Precision is undefined when no document is retrieved.
Other measures
Noise = retrieved irrelevant docs / retrieved docs
Silence/Miss = non-retrieved relevant docs / relevant
docs
Noise = 1 – Precision; Silence = 1 – Recall
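F-measure
• A single measure that trades off precision versus
recall is the F measure, which is the weighted
harmonic mean of precision and recall:
F = (β² + 1)PR / (β²P + R)
• β > 1 gives more weight to recall; β < 1 gives more weight
to precision.
• With equal weights (β = 1) it is the plain harmonic mean of
recall and precision, one performance measure that takes both
into account:
F1 = 2PR / (P + R)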
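The weighted harmonic mean is commonly written as F = (β² + 1)PR / (β²P + R), where β = 1 gives the balanced F1 = 2PR / (P + R). A minimal sketch follows. The function name, the input values, and the choice to return 0 when both inputs are 0 are assumptions made for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention: avoids division by zero
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical values: precision is higher than recall
p, r = 0.5, 0.25
print(f_measure(p, r))            # balanced F1
print(f_measure(p, r, beta=2.0))  # recall-weighted, pulled toward the lower recall
print(f_measure(p, r, beta=0.5))  # precision-weighted, pulled toward the higher precision
```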
Exercise
• The following list of Rs and Ns represents relevant (R) and
nonrelevant (N) returned documents in a ranked list of 20
documents retrieved in response to a query from a collection
of 10,000 documents. The top of the ranked list (the document
the system thinks is most likely to be relevant) is on the left of
the list. The list contains 6 relevant documents. Assume that
there are 8 relevant documents in total in the collection.
RRNNNNNNRNRNNNRNNNNR
Questions
• Calculate the following:
a) What is the precision of the system on the top 20?
b) What is the recall?
c) What is P@10 (precision at rank 10)?
d) What is the F-measure on the top 20?
e) Assume that these 20 documents are the complete result set
of the system. What is the MAP for the query?
f) What is the noise?
g) What is the silence?
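The answers can be checked by scanning the judgment string. The sketch assumes the balanced F1 for part d. It also uses the fact that, with a single query, MAP is just the average precision of that query, divided here by all 8 relevant documents. The variable names are illustrative.

```python
judgments = "RRNNNNNNRNRNNNRNNNNR"  # ranked list from the exercise, best first
total_relevant = 8                  # relevant documents in the whole collection

flags = [c == "R" for c in judgments]
hits = sum(flags)

precision = hits / len(flags)               # a) precision on the top 20
recall = hits / total_relevant              # b) recall
p_at_10 = sum(flags[:10]) / 10              # c) precision at rank 10
f1 = 2 * precision * recall / (precision + recall)  # d) assumed balanced F1

# e) average precision: precision at each relevant hit, over all 8 relevant docs
seen = 0
precision_sum = 0.0
for k, is_relevant in enumerate(flags, start=1):
    if is_relevant:
        seen += 1
        precision_sum += seen / k
average_precision = precision_sum / total_relevant

noise = 1 - precision    # f) share of retrieved docs that are irrelevant
silence = 1 - recall     # g) share of relevant docs that were missed

for name, value in [("precision", precision), ("recall", recall),
                    ("P@10", p_at_10), ("F1", f1), ("MAP", average_precision),
                    ("noise", noise), ("silence", silence)]:
    print(f"{name}: {value:.3f}")
```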
Difficulties in Evaluating IR System
IR systems essentially facilitate communication between a
user and document collections
Relevance is a measure of the effectiveness of
communication
– Effectiveness is related to the relevance of the retrieved
items.
– Relevance relates a problem, information need, or
query to a document or its surrogate
……..cont
Relevance judgments are made by
– The user who posed the retrieval problem
– An external judge
– Are the relevance judgments made by the user and by an
external judge the same?
Relevance judgment is usually:
– Subjective: Depends upon a specific user’s judgment.
– Situational: Relates to the user’s current needs.
– Cognitive: Depends on human perception and
behavior.
– Dynamic: Changes over time.
Information Retrieval
Chapter 6:
Query Languages and Operations
Introduction
• Information is the main asset of the Information Society.
• Depending on the particular application scenario and on the
type of information that has to be managed and searched,
different techniques need to be devised.
• The dictionary definition of a query is a set of instructions passed
to a database to retrieve particular data.
Cont….
• A query is the formulation of a user information need.
• A query is composed of keywords, and the documents
containing those keywords are retrieved.
• Keyword queries are popular because they are intuitive,
easy to express, and allow fast ranking.
Cont…
• A query language (QL) is any computer language used to
request and retrieve data from databases and information
systems by sending queries.
• Query language: a source language consisting of procedural
operators that invoke functions to be executed.
Keyword-based queries
Queries are combinations of words.
The document collection is searched for documents that
contain these words.
Word queries are intuitive, easy to express and provide fast
ranking.
Popular keyword-based queries are:
1. Single-word queries:
A query is a single word.
This is the simplest form of query.
All documents that include this word are retrieved.
Documents may be ranked by the frequency of this word in the
document.
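A single-word query with frequency ranking can be sketched as follows. The toy collection, the tokenizer, and the function names are hypothetical. A real system would look the word up in an inverted index instead of scanning every document.

```python
from collections import Counter

# Hypothetical toy collection, for illustration only
docs = {
    "d1": "retrieval systems rank documents",
    "d2": "evaluation of retrieval: retrieval effectiveness and retrieval efficiency",
    "d3": "query languages",
}


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [w.strip(".,:;!?\"'()").lower() for w in text.split()]


def single_word_query(word, docs):
    """Return (doc_id, frequency) pairs, most frequent first."""
    word = word.lower()
    scores = {}
    for doc_id, text in docs.items():
        frequency = Counter(tokenize(text))[word]
        if frequency > 0:
            scores[doc_id] = frequency
    # Sort by descending frequency, then by document ID for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(single_word_query("retrieval", docs))  # [('d2', 3), ('d1', 1)]
```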
Con’ted
2. Phrase queries:
A query is a sequence of words treated as a single unit, also
called a “literal string” or “exact phrase” query.
The phrase is usually surrounded by quotation marks.
All documents that include this phrase are retrieved.
Usually, separators (commas, colons, etc.) and “trivial words”
(e.g., “a”, “the”, or “of”) in the phrase are ignored.
In effect, this query is for a set of words that must appear in
sequence.
It allows users to specify a context and thus gain precision.
Example: “United States of America”.
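Phrase matching with ignored separators and trivial words can be sketched as below. The stopword set contains only the three words named in the text. The function names and sample sentences are hypothetical.

```python
TRIVIAL_WORDS = {"a", "the", "of"}  # the "trivial words" named in the text


def normalize(text):
    """Lowercase, strip separators, and drop trivial words."""
    words = (w.strip(".,:;!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in TRIVIAL_WORDS]


def contains_phrase(text, phrase):
    """True if the normalized phrase occurs as a contiguous word sequence."""
    doc_words = normalize(text)
    phrase_words = normalize(phrase)
    n = len(phrase_words)
    if n == 0:
        return False
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))


phrase = "United States of America"
print(contains_phrase("She moved to the United States of America.", phrase))   # True
print(contains_phrase("America and the United Nations, many states.", phrase))  # False
```

The second sentence contains all the words of the phrase but not in sequence, so it does not match.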
Con’ted
3. Multiple-word queries:
A query is a set of words (or phrases).
Two interpretations:
• A document is retrieved if it includes any of the query words.
• A document is retrieved if it includes all of the query
words.
Documents may be ranked by the number of query words they
contain: a document containing n query words is ranked higher
than a document containing n-1 query words.
Documents containing all the query words are ranked at the top.
Documents containing only one query word are ranked at the
bottom.
Frequency counts may still be used to break ties among documents
that contain the same query words.
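Both interpretations and the ranking rule can be sketched in one function. The toy collection and the names are hypothetical. The require_all flag switches between the "any" and "all" interpretations, and total frequency breaks ties.

```python
from collections import Counter

# Hypothetical toy collection, for illustration only
docs = {
    "d1": "nuclear cleanup plan",
    "d2": "nuclear power nuclear safety",
    "d3": "cleanup crew",
    "d4": "nuclear cleanup and more cleanup",
    "d5": "weather report",
}


def multi_word_query(query_words, docs, require_all=False):
    """Return (doc_id, matched_words, total_frequency), best match first."""
    terms = {w.lower() for w in query_words}
    results = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        matched = [t for t in terms if counts[t] > 0]
        if not matched:
            continue
        if require_all and len(matched) < len(terms):
            continue
        total_frequency = sum(counts[t] for t in matched)
        results.append((doc_id, len(matched), total_frequency))
    # More matched query words first; frequency counts break ties
    results.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return results


print(multi_word_query(["nuclear", "cleanup"], docs))
print(multi_word_query(["nuclear", "cleanup"], docs, require_all=True))
```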
Cont…
4. Proximity queries:
Restrict the distance within a document between two search
terms.
Important for large documents in which the two search words
may appear in different contexts.
Proximity specifications limit the acceptable occurrences and
hence increase the precision of the search.
Cont….
General format: Word1 within m units of Word2. The unit may be
a character, word, paragraph, etc.
Examples:
• nuclear within 0 paragraphs of cleanup
Finds documents that discuss “nuclear” and “cleanup” in the
same paragraph.
• united within 5 words of american
Finds documents in which “united” and “american” occur at most
5 words apart.
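A proximity check for two units (words and paragraphs) can be sketched as follows. It makes two assumptions not stated in the slides: "within m units" means the positions differ by at most m, and paragraphs are separated by blank lines. The function names and sample texts are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [w.strip(".,:;!?\"'()").lower() for w in text.split()]


def within_words(text, word1, word2, m):
    """True if word1 and word2 occur at most m word positions apart."""
    tokens = tokenize(text)
    positions1 = [i for i, t in enumerate(tokens) if t == word1.lower()]
    positions2 = [i for i, t in enumerate(tokens) if t == word2.lower()]
    return any(abs(i - j) <= m for i in positions1 for j in positions2)


def within_paragraphs(text, word1, word2, m):
    """True if the words occur in paragraphs at most m apart (0 = same one)."""
    paragraphs = [tokenize(p) for p in text.split("\n\n") if p.strip()]
    ids1 = [i for i, p in enumerate(paragraphs) if word1.lower() in p]
    ids2 = [i for i, p in enumerate(paragraphs) if word2.lower() in p]
    return any(abs(i - j) <= m for i in ids1 for j in ids2)


sentence = "The united front of the american workers"
print(within_words(sentence, "united", "american", 5))  # True: 4 positions apart
print(within_words(sentence, "united", "american", 3))  # False

two_paragraphs = "Nuclear waste is stored on site.\n\nThe cleanup started later."
print(within_paragraphs(two_paragraphs, "nuclear", "cleanup", 0))  # False
print(within_paragraphs(two_paragraphs, "nuclear", "cleanup", 1))  # True
```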
Structural queries
So far, we assumed documents that are entirely free of
structure.
Structured documents would allow more powerful queries.
Queries could combine text queries with structural queries:
queries that relate to the structure of the document.
Mixing contents and structure in queries:
• Contents: words, phrases, or patterns
• Structural constraints: containment, proximity, or other
restrictions on structural elements
Example
• Example: Retrieve documents that contain a page in which the
phrase “terrorist attack” appears in the text and a photo whose
caption contains the phrase “World Trade Center”.
• The corresponding query could be: same page (“terrorist
attack”, photo (caption (“World Trade Center”))).
Types
Three main structures
Fixed structure
Hypertext structure
Hierarchical structure
Fixed structure
The document is divided into a fixed set of fields, much like a
filled-in form.
Fields may be associated with types, such as date.
Each field has text, and fields cannot nest or overlap.
Queries (multiple-word, Boolean, proximity, pattern, etc.) are
targeted at particular fields.
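A fixed structure can be pictured as flat records with the same set of fields, some of them typed. The sketch below is illustrative only: the field names, the records, and the two query helpers are hypothetical. Each query is targeted at one named field.

```python
from datetime import date

# Hypothetical mail-like records: a fixed, flat set of fields (no nesting)
documents = [
    {"id": "m1", "author": "alice", "sent": date(2024, 1, 10),
     "subject": "retrieval evaluation"},
    {"id": "m2", "author": "bob", "sent": date(2024, 3, 2),
     "subject": "query languages"},
    {"id": "m3", "author": "alice", "sent": date(2024, 3, 20),
     "subject": "evaluation of query languages"},
]


def field_query(docs, field, word):
    """IDs of documents whose given text field contains the word."""
    return [d["id"] for d in docs if word.lower() in str(d[field]).lower().split()]


def date_range_query(docs, field, start, end):
    """IDs of documents whose date-typed field falls within [start, end]."""
    return [d["id"] for d in docs if start <= d[field] <= end]


by_subject = field_query(documents, "subject", "evaluation")
in_march = date_range_query(documents, "sent", date(2024, 3, 1), date(2024, 3, 31))
both = [doc_id for doc_id in by_subject if doc_id in in_march]

print(by_subject)  # ['m1', 'm3']
print(in_march)    # ['m2', 'm3']
print(both)        # ['m3']
```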
Hypertext structure
Hierarchical structure
An intermediate model between fixed structure and hypertext.
The “anarchic” hypertext network is restricted to a hierarchical
structure.
The model allows recursive decomposition of documents.
Queries may combine:
• Regular text queries, which are targeted at particular areas
(the target area is defined by a “path expression”)
• Queries on the structure itself, for example
“retrieve documents with at least 5 sections”
Relevance feedback
After the initial retrieval results are presented, the user can
provide feedback on the relevance of one or more of the
retrieved documents.
The system uses this feedback to reformulate the
query and produces new results based on the reformulated query.
This allows a more interactive, multi-pass process.
RF
The idea of relevance feedback (RF) is to involve the user in
the retrieval process so as to improve the final result set.
In particular, the user gives feedback on the relevance of
documents in an initial set of results.
The basic procedure is:
1. The user issues a (short, simple) query.
2. The system returns an initial set of retrieval results.
3. The user marks some returned documents as relevant or
nonrelevant.
4. The system computes a better representation of the
information need based on the user feedback.
5. The system displays a revised set of retrieval results.
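The slides do not say how the "better representation" is computed. One well-known option is the Rocchio algorithm, used here purely as an illustration. It moves the query vector toward the centroid of the documents marked relevant and away from the centroid of those marked nonrelevant. The term weights, the sample vectors, and the parameter values alpha=1.0, beta=0.75, gamma=0.15 are assumptions.

```python
from collections import defaultdict


def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Reformulate a query from user feedback (Rocchio-style sketch).

    query and each document are dicts mapping term -> weight.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:          # pull toward relevant centroid
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in nonrelevant_docs:       # push away from nonrelevant centroid
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant_docs)
    # Terms whose weight drops to zero or below are removed
    return {term: w for term, w in new_query.items() if w > 0}


# Hypothetical feedback round: the user wants the car, not the animal
query = {"jaguar": 1.0}
marked_relevant = [{"jaguar": 1.0, "car": 2.0},
                   {"jaguar": 1.0, "car": 1.0, "speed": 1.0}]
marked_nonrelevant = [{"jaguar": 1.0, "cat": 3.0}]

revised = rocchio(query, marked_relevant, marked_nonrelevant)
print(revised)
```

The revised query gains "car" and "speed" from the relevant documents, while "cat" ends up with a negative weight and is dropped.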
Architecture
THE END OF:
Chapters 5 and 6