Evaluation in Information Retrieval

Evaluation in information retrieval is the methodology for measuring how well a retrieval system satisfies information needs, using test collections, relevance judgments, and effectiveness metrics.

Definition

Information retrieval evaluation is the set of experimental methods and metrics used to quantify a system's effectiveness at returning relevant results for stated information needs, encompassing offline test-collection experiments and online user-based experiments.

Scope

This area covers how retrieval quality is measured: the Cranfield test-collection paradigm of documents, queries, and relevance judgments; effectiveness metrics such as precision, recall, mean average precision, and normalized discounted cumulative gain; pooling and assessment methods for gathering judgments at scale; and user-centered and online evaluation through studies and controlled experiments such as A/B testing and interleaving. It treats the science of measuring effectiveness, distinct from the models and systems being measured.
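The two set-based metrics named above can be illustrated with a minimal sketch. It assumes binary relevance and a cutoff k. The function name and the document identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall_at_k(ranking, relevant, k):
    """Return (precision@k, recall@k) for a single query.

    ranking  -- list of document ids in the order the system returned them
    relevant -- set of document ids judged relevant for the query
    k        -- rank cutoff
    """
    top_k = ranking[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: five returned documents, three judged relevant.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p, r = precision_recall_at_k(ranking, relevant, k=5)
```

Here two of the five returned documents are relevant, so precision at 5 is 2/5, and two of the three relevant documents were found, so recall at 5 is 2/3.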

Core questions

How can the quality of a ranked list be quantified objectively?
What constitutes a reusable test collection, and how is relevance judged?
Which metrics capture the user-perceived quality of rankings?
How can relevance judgments be gathered affordably for large collections?
How do online experiments measure real user satisfaction?
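One common answer to the question of affordable judgments is pooling: assessors judge only the union of the top-ranked documents from several systems, and documents outside the pool are usually treated as non-relevant. The sketch below shows pool construction only. The function name, the run format (a mapping from query id to ranked document ids), and the pool depth are assumptions made for illustration.

```python
def build_pool(runs, depth):
    """Return, per query, the union of the top-`depth` documents of every run.

    runs  -- list of runs, each a dict mapping query id to a ranked list of doc ids
    depth -- how many top-ranked documents each run contributes to the pool
    """
    pool = {}
    for run in runs:
        for query_id, ranking in run.items():
            pool.setdefault(query_id, set()).update(ranking[:depth])
    return pool


# Hypothetical runs from two systems for one query.
run_a = {"q1": ["d1", "d2", "d3", "d4"]}
run_b = {"q1": ["d2", "d5", "d1", "d6"]}
pool = build_pool([run_a, run_b], depth=2)
```

With a depth of 2, only d1, d2, and d5 are sent to assessors for q1, rather than all six documents that the two systems returned.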

Key concepts

test collection
relevance judgments (qrels)
precision and recall
mean average precision (MAP)
normalized discounted cumulative gain (nDCG)
pooling
interleaving and A/B testing
statistical significance of results
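The last concept, statistical significance, can be illustrated with a paired randomization (sign-flip) test on per-query scores. This is a minimal sketch of one possible test, not a prescribed procedure. It enumerates every sign assignment exactly, which is feasible only for a handful of queries, so larger query sets would normally sample random assignments instead. The function name and the score values are hypothetical.

```python
from itertools import product


def paired_randomization_test(scores_a, scores_b):
    """Exact two-sided sign-flip test on per-query score differences.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-query difference is equally likely to be + or -.
    Returns the fraction of sign assignments whose absolute summed
    difference is at least as large as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical per-query average precision for two systems on six queries.
ap_a = [0.50, 0.70, 0.40, 0.90, 0.60, 0.80]
ap_b = [0.40, 0.50, 0.30, 0.60, 0.50, 0.60]
p_value = paired_randomization_test(ap_a, ap_b)
```

System A beats system B on every query in this example, so only the all-positive and all-negative assignments are as extreme as the observed outcome, giving a p-value of 2/64.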

Key theories

Cranfield test-collection paradigm: Retrieval systems can be compared reproducibly by fixing a document collection, a set of queries, and human relevance judgments, then scoring each system's output against the judgments, allowing controlled, repeatable experiments.
Effectiveness as a measurable construct: Defining metrics over system output, from set-based precision and recall to rank-sensitive measures such as average precision and discounted cumulative gain, turns the vague notion of search quality into per-query scores that can be averaged across queries and compared statistically.
Offline and online evaluation complementarity: Test-collection experiments offer reproducibility and control but rely on judged relevance, whereas online experiments such as A/B tests and interleaving measure real user behavior, and the two together give a fuller picture of system quality.
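The rank-sensitive measures mentioned above can be sketched as follows. This is an illustrative implementation under stated assumptions: average precision uses binary relevance and divides by the total number of relevant documents, and nDCG uses the graded judgment directly as gain with a log2 rank discount (other formulations use an exponential gain such as 2^grade - 1). All function names and example data are hypothetical.

```python
import math


def average_precision(ranking, relevant):
    """Average of precision values at the ranks where relevant documents occur."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of average precision over all judged queries.

    run   -- dict mapping query id to a ranked list of doc ids
    qrels -- dict mapping query id to the set of relevant doc ids
    """
    aps = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, grades, k):
    """DCG of the top k results divided by the DCG of an ideal ordering.

    grades -- dict mapping doc id to a graded relevance value (unjudged = 0)
    """
    gains = [grades.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical single-query example.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d5"})
map_score = mean_average_precision(
    {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d8"]},
    {"q1": {"d1", "d2", "d5"}, "q2": {"d8"}},
)
grades = {"d1": 2, "d2": 1}
ndcg_ideal = ndcg_at_k(["d1", "d2"], grades, k=2)
ndcg_swapped = ndcg_at_k(["d2", "d1"], grades, k=2)
```

In the example, relevant documents appear at ranks 2 and 4, so average precision is (1/2 + 2/4) / 3 = 1/3; the unretrieved relevant document d5 lowers the score. Placing the more relevant document first yields an nDCG of 1, and swapping the two documents yields a lower value.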

Practical relevance

Rigorous evaluation is what lets the field measure progress and compare systems fairly; shared test collections and evaluation campaigns such as TREC have driven decades of advances. Online evaluation methods such as A/B testing and interleaving are core tools for improving production search and recommendation systems.

History

Systematic IR evaluation began with Cleverdon's Cranfield experiments in the late 1950s and 1960s, which established the test-collection paradigm. The Text REtrieval Conference (TREC), launched in 1992 by NIST, scaled this approach to large collections and many tasks, standardizing metrics and pooling. Online evaluation through controlled experiments became widespread as web-scale interactive systems made it practical to compare system variants on live user traffic.

Key figures

Cyril Cleverdon
Ellen M. Voorhees
Karen Spärck Jones
Mark Sanderson

Seminal works

cleverdon1967
voorhees2005
sanderson2010

Frequently asked questions

Why are test collections so central to IR research? A test collection of documents, queries, and relevance judgments lets different systems be scored on exactly the same task, making comparisons reproducible and fair. Reusable collections also let new systems be evaluated without gathering fresh judgments each time.
Why use online evaluation if test collections exist? Test collections measure effectiveness against fixed judgments but cannot fully capture real user satisfaction, context, or behavior. Online experiments such as A/B tests and interleaving observe how actual users respond, complementing offline metrics with behavioral evidence.
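As an illustration of how interleaving gathers behavioral evidence, the sketch below implements team-draft interleaving, one commonly used variant. Two rankings are merged into a single list shown to the user, each document is credited to the system that contributed it, and clicks are then counted per system. The function names, the tie-breaking details, and the example data are assumptions made for illustration, not a production design.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; return (interleaved list, doc id -> "A" or "B")."""
    rankings = {"A": ranking_a, "B": ranking_b}
    all_docs = set(ranking_a) | set(ranking_b)
    interleaved = []
    team = {}
    picks = {"A": 0, "B": 0}
    while len(team) < len(all_docs):
        # The side with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] < picks["B"]:
            side = "A"
        elif picks["B"] < picks["A"]:
            side = "B"
        else:
            side = rng.choice(["A", "B"])
        candidates = [d for d in rankings[side] if d not in team]
        if not candidates:
            # This side has nothing new to offer, so the other side picks.
            side = "B" if side == "A" else "A"
            candidates = [d for d in rankings[side] if d not in team]
        doc = candidates[0]  # highest-ranked document not yet placed
        team[doc] = side
        picks[side] += 1
        interleaved.append(doc)
    return interleaved, team


def credit_clicks(team, clicked):
    """Count clicks on documents contributed by each side."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            wins[team[doc]] += 1
    return wins


# Hypothetical rankings from two systems for the same query.
rng = random.Random(0)
interleaved, team = team_draft_interleave(
    ["d1", "d2", "d3"], ["d2", "d4", "d1"], rng
)
wins = credit_clicks(team, clicked=interleaved)
```

Over many impressions, the system whose contributed documents attract more clicks is preferred. An A/B test would instead show each user the output of only one system and compare aggregate behavior between the two user groups.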