Pooling and Relevance Assessment

Pooling is the method that makes large-scale IR evaluation feasible by judging only the documents that participating systems rank highly, rather than every document in the collection.

Definition

Pooling is a sampling strategy for relevance assessment: the highest-ranked documents from a set of contributing retrieval runs are merged, with duplicates removed, into a pool that human assessors judge. Documents outside the pool are conventionally treated as non-relevant.
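As a minimal sketch of the definition above, pool construction for a single topic can be written as a set union over the top-ranked documents of each run. The function name `build_pool`, the run names, and the document identifiers are hypothetical and chosen only for illustration; real campaigns add steps such as per-topic handling and assessor assignment that are not shown here.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Merge the top-`depth` documents of every run into one pool.

    `runs` maps a (hypothetical) run name to its ranked list of document
    ids for a single topic. Using a set removes duplicates automatically.
    """
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy rankings for one topic (illustrative data, not from any real campaign).
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d1", "d6"],
    "run_c": ["d7", "d1", "d2", "d8"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['d1', 'd2', 'd5', 'd7']
```

With three runs and a depth of 2 there are six ranked slots, but only four distinct documents reach the assessors, because the runs overlap at the top of their rankings.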

Scope

This topic covers how relevance judgments are gathered efficiently for large collections, principally the pooling method used in TREC and similar campaigns, where the top-ranked documents from many systems are merged into a pool that assessors judge. It addresses pool depth, the treatment of unjudged documents as non-relevant, the reusability and potential bias of pooled collections, and assessor effort and agreement. It excludes the metrics computed afterward and the definition of the collection itself.

Core questions

How does pooling reduce the number of documents that must be judged?
How is pool depth chosen, and how does it affect coverage of relevant documents?
Why are unjudged documents usually treated as non-relevant, and what bias can that introduce?
How reusable are pooled collections for systems that did not contribute to the pool?
How are assessor effort, agreement, and quality managed?
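The trade-off behind the pool-depth question can be illustrated with a toy calculation: as depth grows, the pool (and the judging effort) grows, and so does the share of relevant documents it contains. The sketch below assumes a hypothetical, fully known set of relevant documents, which is exactly what is unavailable in practice; the names `coverage_by_depth` and `relevant` and all data are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def coverage_by_depth(
    runs: Dict[str, List[str]],
    relevant: Set[str],
    depths: List[int],
) -> List[Tuple[int, int, float]]:
    """For each depth, return (depth, pool size, fraction of relevant docs pooled)."""
    rows = []
    for depth in depths:
        pool: Set[str] = set()
        for ranking in runs.values():
            pool.update(ranking[:depth])
        found = len(pool & relevant)
        fraction = found / len(relevant) if relevant else 0.0
        rows.append((depth, len(pool), fraction))
    return rows


# Toy data: three runs and an assumed "true" relevant set for one topic.
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d1", "d6"],
    "run_c": ["d7", "d1", "d2", "d8"],
}
relevant = {"d1", "d5", "d8"}

for depth, size, fraction in coverage_by_depth(runs, relevant, [1, 2, 4]):
    print(f"depth={depth} pool_size={size} relevant_found={fraction:.2f}")
```

On this toy data the pool grows from 3 to 4 to 8 documents while the fraction of relevant documents found rises from one third to two thirds to all of them, which is the kind of effort-versus-coverage curve that informs the choice of depth.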

Key concepts

pooling method
pool depth
contributing runs
unjudged-as-non-relevant assumption
pool bias and reusability
assessor agreement
incomplete relevance information
crowdsourced relevance assessment

Key theories

Pooling for scalable assessment: By judging only the union of the top-ranked documents from many diverse systems, pooling makes it practical to evaluate large collections while still finding most of the relevant documents that any reasonable system would surface.
Reliability and reusability concerns: Pooling can under-represent relevant documents found only by future systems, raising questions about bias and reusability that motivate deeper pools, diverse contributors, and robust metrics for incomplete judgments.
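To make the bias concern concrete, the sketch below scores a run that did not contribute to the pool in two ways: precision at k with unjudged documents counted as non-relevant, and precision at k on a condensed ranking from which unjudged documents are removed first, one option discussed in work on incomplete judgments. The function names, the `qrels` dictionary layout, and the data are assumptions made for illustration; neither number is the true score, and the condensed variant can err in the optimistic direction.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """P@k where any document missing from `qrels` counts as non-relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if qrels.get(doc, 0) > 0) / k


def condensed_precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """P@k after dropping unjudged documents from the ranking."""
    judged_only = [doc for doc in ranking if doc in qrels]
    top = judged_only[:k]
    return sum(1 for doc in top if qrels[doc] > 0) / k


# Judgments from the original pool: 1 = relevant, 0 = judged non-relevant.
qrels = {"d1": 1, "d2": 0, "d5": 1, "d7": 0}

# A later run that surfaces documents (d9, d10) the pool never contained.
new_run = ["d9", "d1", "d10", "d5", "d2"]

print(precision_at_k(new_run, qrels, 3))            # 1 of the top 3 is judged relevant
print(condensed_precision_at_k(new_run, qrels, 3))  # 2 of the top 3 judged docs are relevant
```

The gap between the two values (one third versus two thirds here) is entirely due to unjudged documents, which is why the treatment of missing judgments matters when comparing new systems on old collections.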

Practical relevance

Pooling is what makes shared, reusable test collections affordable, and it underlies the judgments behind decades of benchmark results. Understanding its assumptions matters when reusing old collections to evaluate new methods, especially neural systems that may surface relevant documents the original pools never judged.
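One simple diagnostic when reusing an old collection is to check how much of a new system's top-ranked output was ever judged. The sketch below computes the fraction of judged documents in the top k; the function name `judged_at_k`, the `qrels` layout, and the data are illustrative assumptions, and a low value is only a warning sign, not a measurement of how much the score is distorted.

```python
from typing import Dict, List


def judged_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the top-k positions filled by documents that have a judgment."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in qrels) / k


# Judgments from the original pool (relevant or not, both count as judged).
qrels = {"d1": 1, "d2": 0, "d5": 1, "d7": 0}

old_run = ["d1", "d2", "d3", "d4"]         # contributed to the pool at depth 2
new_run = ["d9", "d1", "d10", "d5", "d2"]  # did not contribute

print(judged_at_k(old_run, qrels, 2))  # 1.0
print(judged_at_k(new_run, qrels, 2))  # 0.5
```

A contributing run has all of its pooled positions judged by construction, whereas the newer run has unjudged documents near the top, so its measured effectiveness depends heavily on how those documents are treated.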

History

Pooling was adopted by TREC from its start in 1992 to make judging large collections tractable. Zobel's 1998 analysis examined the reliability and reusability of pooled collections. Subsequent work on incomplete judgments produced metrics designed to tolerate missing judgments, such as bpref, along with deeper or more selective pooling strategies, to mitigate bias as collections and system populations evolved.

Key figures

Ellen M. Voorhees
Justin Zobel
Chris Buckley

Seminal works

voorhees2005
zobel1998
buckley2004

Frequently asked questions

Why not judge every document in the collection? Large collections contain millions of documents, so judging all of them for every topic is infeasible. Pooling judges only the documents that contributing systems rank highly, which captures most relevant documents while keeping assessment effort manageable.
What is the risk of treating unjudged documents as non-relevant? A later system might retrieve relevant documents that were never in the pool and are therefore counted as non-relevant, unfairly lowering its measured score. This pool bias is why deeper, more diverse pools and judgment-robust metrics are used when reusing collections.