Test Collections and Relevance Judgments

A test collection bundles a document set, a set of queries, and human relevance judgments so that retrieval systems can be scored and compared reproducibly.

Definition

A test collection is a fixed dataset comprising a corpus of documents, a set of query or topic statements describing information needs, and relevance judgments specifying which documents are relevant to each topic, together enabling reproducible measurement of retrieval effectiveness.
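The three components in this definition map naturally onto a small data structure. The following is a minimal sketch, not a standard API: the class names `Topic` and `IRTestCollection`, the field names, the document IDs, and the sample texts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Topic:
    """One information need (hypothetical layout)."""
    topic_id: str
    title: str        # short, query-like form
    description: str  # fuller statement of the information need


@dataclass
class IRTestCollection:
    """Corpus, topics, and relevance judgments bundled together."""
    corpus: Dict[str, str]      # doc_id -> document text
    topics: Dict[str, Topic]    # topic_id -> Topic
    # topic_id -> {doc_id -> relevance grade}; 0 means judged non-relevant
    qrels: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def relevant_docs(self, topic_id: str, min_grade: int = 1) -> Set[str]:
        """Return doc IDs judged at or above min_grade for a topic."""
        judged = self.qrels.get(topic_id, {})
        return {doc_id for doc_id, grade in judged.items() if grade >= min_grade}


# Illustrative toy data only.
collection = IRTestCollection(
    corpus={
        "d1": "How test collections are built for retrieval evaluation.",
        "d2": "A recipe for sourdough bread.",
        "d3": "A brief note on relevance assessment.",
    },
    topics={
        "t1": Topic(
            "t1",
            "test collections",
            "Find documents that explain how IR test collections are built.",
        )
    },
    qrels={"t1": {"d1": 2, "d2": 0, "d3": 1}},
)
```

Keeping grades as integers lets the same judgments serve both graded and binary use: `relevant_docs("t1")` treats any grade of 1 or higher as relevant, while `relevant_docs("t1", min_grade=2)` keeps only the highly relevant documents.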

Scope

This topic covers the construction and use of reusable IR test collections following the Cranfield paradigm: the document corpus, topic statements that define information needs, and the relevance judgments (qrels) that record which documents are relevant to each topic. It addresses graded versus binary relevance, judgment consistency, reusability of collections for new systems, and the role of large-scale efforts such as TREC. It excludes the metrics computed from judgments and the pooling procedures used to gather them, which are adjacent topics.

Core questions

What are the three components of a Cranfield-style test collection?
How do topics, as fuller statements of information needs, differ from the short queries given to systems?
How is relevance defined and recorded, and when is graded relevance used?
How consistent are human relevance judgments, and does inconsistency affect comparisons?
What makes a test collection reusable for systems that did not contribute to it?

Key concepts

document corpus
topic / information need statement
relevance judgments (qrels)
binary vs. graded relevance
assessor agreement
collection reusability
TREC test collections
ground truth for evaluation

Key theories

Cranfield paradigm: Fixing documents, queries, and relevance judgments creates a controlled laboratory setting in which any system's ranked output can be scored against the judgments, making retrieval experiments reproducible and comparable.
Robustness of comparisons to judge disagreement: Although human assessors disagree about individual relevance decisions, studies show that the relative ranking of systems on a collection is largely stable across assessors, supporting the validity of test-collection comparisons.
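The two ideas above can be sketched together: score each system's ranked output against a fixed set of judgments, repeat with a second assessor's judgments, and check whether the ordering of systems agrees. This is a hedged illustration under stated assumptions. The runs and judgments are invented toy data, precision at k stands in for whatever metric a real study would use, and rank agreement is measured with Kendall's tau-a, which is one common choice rather than the method of any particular study.

```python
from itertools import combinations


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def mean_score(run, qrels, k):
    """Average precision-at-k over all topics in a run.

    run:   topic_id -> ranked list of doc IDs
    qrels: topic_id -> set of relevant doc IDs
    """
    per_topic = [precision_at_k(run[t], qrels.get(t, set()), k) for t in run]
    return sum(per_topic) / len(per_topic)


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score maps over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / total_pairs


# Toy runs for three hypothetical systems on two topics.
runs = {
    "A": {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6", "d7", "d8"]},
    "B": {"t1": ["d2", "d4", "d1", "d3"], "t2": ["d6", "d8", "d5", "d7"]},
    "C": {"t1": ["d4", "d3", "d2", "d1"], "t2": ["d8", "d7", "d6", "d5"]},
}

# Two assessors who disagree on several individual documents (invented data).
qrels_1 = {"t1": {"d1", "d2"}, "t2": {"d5"}}
qrels_2 = {"t1": {"d1"}, "t2": {"d5", "d6", "d8"}}

scores_1 = {name: mean_score(run, qrels_1, k=2) for name, run in runs.items()}
scores_2 = {name: mean_score(run, qrels_2, k=2) for name, run in runs.items()}
tau = kendall_tau_a(scores_1, scores_2)
```

In this toy case the absolute scores for systems B and C change between assessors, yet both sets of judgments order the systems A, B, C, so `tau` is 1.0. Real collections will not agree perfectly; the claim in the text is only that agreement tends to be high.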

Practical relevance

Shared test collections are the common currency of IR research, letting researchers worldwide compare systems on identical tasks and reproduce results. Collections from evaluation campaigns such as TREC, CLEF, and NTCIR have shaped decades of progress and remain standard benchmarks for new retrieval methods.

History

The test-collection methodology originated with Cleverdon's Cranfield experiments in the late 1950s and 1960s, which compared indexing approaches using fixed queries and judgments. The launch of TREC in 1992 scaled the paradigm to large, realistic collections and many tasks, producing the standardized, reusable collections that anchor modern IR evaluation.

Key figures

Cyril Cleverdon
Ellen M. Voorhees
Donna Harman

Seminal works

cleverdon1967
voorhees2005

Frequently asked questions

What are 'qrels'?: Qrels (query relevance judgments) are the records that state, for each topic in a test collection, which documents have been judged relevant and at what grade. Evaluation tools compare a system's ranked output against the qrels to compute effectiveness metrics.
Do disagreements between human judges invalidate test collections?: Assessors do disagree on individual documents, but research has repeatedly shown that the relative ordering of systems remains largely stable across different assessors. Absolute scores shift, but conclusions about which system is better are generally robust.
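As a concrete illustration of qrels, the sketch below reads judgments in the four-column, whitespace-separated layout used by TREC-style qrels files (topic ID, an iteration field that is normally ignored, document ID, relevance grade). The function names `parse_qrels` and `binarize`, the document IDs, and the grades are assumptions for illustration, and the threshold for collapsing graded judgments to binary is a choice the experimenter makes, not something the format dictates.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse 'topic iteration doc_id grade' lines into nested dicts."""
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, _iteration, doc_id, grade = line.split()
        qrels[topic_id][doc_id] = int(grade)
    return dict(qrels)


def binarize(qrels, min_grade=1):
    """Collapse graded judgments to sets of relevant doc IDs per topic."""
    return {
        topic_id: {d for d, g in judged.items() if g >= min_grade}
        for topic_id, judged in qrels.items()
    }


# Invented sample judgments; the IDs are placeholders.
sample = """\
301 0 DOC-A 2
301 0 DOC-B 0
301 0 DOC-C 1
302 0 DOC-A 0
""".splitlines()

graded = parse_qrels(sample)
binary = binarize(graded)
strict = binarize(graded, min_grade=2)
```

Here `binary["301"]` contains `DOC-A` and `DOC-C`, while `strict["301"]` keeps only `DOC-A`. Documents that never appear in the qrels for a topic are unjudged, which most evaluation setups treat as non-relevant.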