IR Effectiveness Metrics

Effectiveness metrics turn a ranked list of results into a number that reflects how well it satisfies an information need, enabling systems to be compared and averaged across queries.

Definition

An IR effectiveness metric is a function that maps a system's ranked output for one or more queries, together with relevance judgments, to a score quantifying retrieval quality, with different metrics emphasizing recall, early precision, or graded gain at top ranks.

Scope

This topic covers the measures used to score retrieval output: set-based precision and recall and their F-measure combination, rank-sensitive measures including precision at k, average precision and mean average precision, reciprocal rank, and gain-based measures such as discounted cumulative gain and its normalized form. It addresses what each metric rewards, how metrics handle graded relevance and incomplete judgments, and how scores are aggregated and tested for significance. It excludes the collections and judgments that supply the relevance data.
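The simpler measures named in this scope can be sketched in a few lines. The following is a minimal illustration, assuming binary relevance and document identifiers represented as strings; the function names (`precision_recall_f1`, `precision_at_k`, `reciprocal_rank`) are hypothetical and not taken from any particular library.

```python
from typing import Hashable, Sequence, Set, Tuple


def precision_recall_f1(
    retrieved: Set[Hashable], relevant: Set[Hashable], beta: float = 1.0
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F-measure for a single query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean; beta = 1 gives the balanced F1.
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


def precision_at_k(
    ranking: Sequence[Hashable], relevant: Set[Hashable], k: int
) -> float:
    """Fraction of the top k ranks occupied by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

For the ranking `["d3", "d1", "d7", "d2"]` with relevant set `{"d1", "d2", "d5"}`, this sketch gives precision 0.5, recall 2/3, precision at 2 of 0.5, and reciprocal rank 0.5. Returning only `{"d1"}` would give precision 1.0 but recall 1/3, which is why precision is rarely reported on its own.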

Core questions

How do precision and recall capture complementary aspects of retrieval quality?
Why are rank-sensitive metrics needed when users scan results top-down?
How does average precision summarize a ranked list into a single number?
How do gain-based metrics such as nDCG use graded relevance and rank discounting?
How are metrics affected by incomplete relevance judgments?
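To make the question about average precision concrete, here is a minimal sketch under the usual binary-relevance assumption. The names `average_precision` and `mean_average_precision`, and the dictionary layout for runs and judgments, are illustrative choices rather than a standard API.

```python
from typing import Dict, Hashable, Sequence, Set


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average of precision values taken at each rank holding a relevant document.

    Dividing by the total number of relevant documents means that relevant
    documents never retrieved contribute a precision of zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, Sequence[Hashable]], qrels: Dict[str, Set[Hashable]]
) -> float:
    """Mean of per-query AP over all judged queries.

    Assumption: a query with judgments but no ranking scores 0.
    """
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)
```

With the ranking `["d3", "d1", "d7", "d2"]` and relevant set `{"d1", "d2", "d5"}`, relevant documents appear at ranks 2 and 4 with precision 1/2 at each, so AP is (1/2 + 1/2) / 3 = 1/3. The unretrieved `d5` lowers the score without appearing in the list, which is how AP folds recall into a rank-sensitive number.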

Key concepts

precision and recall
F-measure
precision at k
average precision and MAP
mean reciprocal rank (MRR)
discounted cumulative gain (DCG / nDCG)
graded relevance
robust metrics for incomplete judgments (bpref)

Key theories

Precision, recall, and average precision: Precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. Average precision averages the precision values at each rank where a relevant document appears, taken over all relevant documents for a query, which approximates the area under the precision-recall curve; its mean over queries (MAP) is a standard summary for ranked retrieval.
Discounted cumulative gain: Gain-based evaluation assigns each result a gain according to its graded relevance and discounts gains at lower ranks, then normalizes against the ideal ranking, yielding nDCG, which rewards placing highly relevant items near the top.
Evaluation with incomplete judgments: When not all documents are judged, naive metrics can be biased, motivating measures such as bpref and inferred AP that are more robust to unjudged documents in large or pooled collections.
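The idea behind bpref can be sketched as follows. This is one commonly used formulation, not necessarily the exact definition in any given paper or tool: the `min(R, N)` denominator in particular is an assumption, since the normalization differs between published variants. The function name `bpref` and its parameters are illustrative.

```python
from typing import Hashable, Sequence, Set


def bpref(
    ranking: Sequence[Hashable],
    judged_relevant: Set[Hashable],
    judged_nonrelevant: Set[Hashable],
) -> float:
    """Preference-based score that uses only judged documents.

    Each retrieved relevant document is penalized by the number of judged
    nonrelevant documents ranked above it. Unjudged documents are skipped
    rather than treated as nonrelevant.
    """
    num_rel = len(judged_relevant)
    num_nonrel = len(judged_nonrelevant)
    if num_rel == 0:
        return 0.0
    denom = min(num_rel, num_nonrel)  # assumed normalization, see lead-in
    nonrel_seen = 0
    total = 0.0
    for doc in ranking:
        if doc in judged_nonrelevant:
            nonrel_seen += 1
        elif doc in judged_relevant:
            if nonrel_seen == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrel_seen, num_rel) / denom
        # Any other document is unjudged and does not affect the score.
    return total / num_rel
```

In this sketch, the ranking `["d1", "u1", "n1", "d2", "u2", "n2", "d3"]` with relevant set `{"d1", "d2", "d3"}` and nonrelevant set `{"n1", "n2"}` scores 0.5, and so does the same ranking with the unjudged `u1` and `u2` removed. A measure that counted unjudged documents as nonrelevant would score the two lists differently.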

Practical relevance

Effectiveness metrics are the yardstick by which retrieval research and industry measure progress and choose between systems. nDCG and MAP in particular are routine in evaluation campaigns and production offline testing, and metric choice shapes which behaviors a ranking system is optimized to produce.

History

Precision and recall date to the earliest IR experiments, and average precision became the workhorse of TREC ad hoc evaluation. Järvelin and Kekäläinen's 2002 cumulated-gain measures introduced graded-relevance, rank-discounted evaluation, giving nDCG, which became dominant for web-style ranking. Later work on incomplete judgments, notably Buckley and Voorhees's bpref in 2004, produced metrics that remain usable when large collections can only be partially judged.

Key figures

Kalervo Järvelin
Jaana Kekäläinen
Ellen M. Voorhees
Chris Buckley

Seminal works

manning2008
jarvelin2002
buckley2004

Frequently asked questions

Why is precision alone not enough to evaluate a search system?
Precision measures how many retrieved results are relevant but ignores how many relevant documents were missed, which is what recall captures. A system can achieve perfect precision by returning a single obviously relevant result while missing many others, so the two are usually reported together, combined into the F-measure, or replaced by rank-sensitive measures.

What advantage does nDCG offer over mean average precision?
nDCG uses graded relevance, distinguishing highly relevant from marginally relevant results, whereas MAP treats relevance as binary. nDCG also explicitly discounts gains at lower ranks. This makes it well suited to web search, where users care most about the very top results and relevance is not simply yes or no.
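
The contrast with binary measures can be illustrated with a small nDCG sketch. It assumes one common variant, with linear gain and a `log2(rank + 1)` discount; other variants use an exponential gain or a different logarithm base. The names `dcg` and `ndcg` and the grade scale are illustrative.

```python
import math
from typing import Hashable, Mapping, Optional, Sequence


def dcg(gains: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    top = gains if k is None else gains[:k]
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(top, start=1))


def ndcg(
    ranking: Sequence[Hashable],
    grades: Mapping[Hashable, float],
    k: Optional[int] = None,
) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering.

    Documents absent from `grades` are treated as having gain 0.
    """
    gains = [grades.get(doc, 0.0) for doc in ranking]
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


grades = {"d1": 3, "d2": 1}           # d1 highly relevant, d2 marginally relevant
best_first = ndcg(["d1", "d2"], grades)
worst_first = ndcg(["d2", "d1"], grades)
```

Under binary relevance both orderings retrieve two relevant documents at ranks 1 and 2, so average precision cannot tell them apart. In this sketch `best_first` is 1.0 while `worst_first` is roughly 0.80, because the highly relevant document has been pushed to the second rank.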