User and Online Evaluation

User and online evaluation measure retrieval quality through real or simulated user interaction, using studies, click data, A/B tests, and interleaving rather than fixed relevance judgments.

Definition

User and online evaluation comprises methods that assess retrieval systems through user interaction, ranging from controlled laboratory studies of task performance and satisfaction to large-scale online experiments such as A/B tests and interleaving that compare systems by observing the behavior of real users.

Scope

This topic covers evaluation that centers on users and their behavior: interactive user studies of task success and satisfaction, the use of implicit signals such as clicks and dwell time, click models that interpret behavior, and controlled online experiments including A/B testing and interleaving. It addresses how to measure real user benefit, the biases of behavioral signals, and the design and analysis of online experiments. It complements the offline test-collection evaluation covered in adjacent topics.

Core questions

How can real user satisfaction and task success be measured rather than just relevance against judgments?
What implicit signals do users provide, and how reliable are they?
How do click models account for position and presentation bias?
How do A/B testing and interleaving compare systems online?
Why is interleaving often more sensitive than A/B testing for ranking comparisons?

Key concepts

interactive user study
task success and satisfaction
implicit feedback (clicks, dwell time)
click models (position, cascade)
position and presentation bias
A/B testing
interleaving
online metrics and sensitivity

Key theories

Implicit feedback and click models: User clicks and other interactions provide abundant but biased relevance signals; click models such as the position and cascade models formalize how users examine results so that clicks can be interpreted as evidence of relevance.
Controlled online experimentation: A/B testing randomly assigns users to system variants and compares outcome metrics across the groups, while interleaving blends two rankings into one list and attributes each click to the system that contributed the clicked result, often yielding more sensitive within-user comparisons of ranking quality.
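
The blending and click attribution described above can be sketched in code. The example below uses team-draft interleaving, one common interleaving variant chosen here for illustration; the text does not prescribe a specific variant. The function names, the "A"/"B" labels, and the document identifiers are hypothetical, and each input ranking is assumed to contain no duplicates.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Blend two rankings; return (interleaved list, doc -> contributing team)."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            picker = "A" if picks["A"] < picks["B"] else "B"
        else:
            picker = rng.choice(["A", "B"])
        # The picker contributes its highest-ranked document not yet shown.
        doc = next((d for d in sources[picker] if d not in team), None)
        if doc is None:
            # The picker has nothing left, so the other team picks instead.
            picker = "B" if picker == "A" else "A"
            doc = next(d for d in sources[picker] if d not in team)
        interleaved.append(doc)
        team[doc] = picker
        picks[picker] += 1
    return interleaved, team

def credit_clicks(clicked_docs, team):
    """Count clicks per team; the team with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

# Hypothetical rankings from two systems for the same query.
shown, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"],
                                    rng=random.Random(0))
credit = credit_clicks(["d3"], team)
```

Aggregating the per-impression winners over many users gives the preference between the two systems. Because every user sees results from both, differences between users largely cancel out, which is one intuition for the sensitivity advantage.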

Practical relevance

Online evaluation is the primary way large search, recommendation, and e-commerce systems decide which changes to ship, because it measures real user impact. A/B testing and interleaving, interpreted through click models that correct for bias, drive continuous improvement of production ranking at scale.
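
As a rough illustration of how an A/B comparison might feed a ship decision, the sketch below applies a two-proportion z-test to a binary success metric. The metric (for example, whether a session had at least one click), the counts, and the function name are assumptions made for this example, and the test treats observations as independent, which real experiments randomized by user may need to account for.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: sessions with a click, out of sessions assigned.
z, p_value = two_proportion_z_test(success_a=100, n_a=1000,
                                   success_b=150, n_b=1000)
```

A small p-value suggests the observed difference is unlikely under equal performance. In practice the decision would also weigh effect size, multiple metrics, and experiment duration.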

History

User-centered IR evaluation has long studied interactive search behavior, but the rise of web search made large-scale online evaluation practical. Joachims's 2002 work (joachims2002) established clickthrough data as a relevance signal and introduced interleaving. Controlled web experimentation matured in industry through the 2000s (kohavi2009), and the 2016 survey by Hofmann and colleagues (hofmann2016) consolidated online evaluation methods.

Key figures

Thorsten Joachims
Filip Radlinski
Katja Hofmann
Ron Kohavi

Seminal works

hofmann2016
joachims2002
kohavi2009

Frequently asked questions

What is interleaving and why is it used?: Interleaving merges the results of two ranking systems into a single list shown to each user and attributes clicks to whichever system contributed each clicked result. Because each user effectively compares both systems at once, interleaving is often more sensitive than A/B testing for detecting ranking improvements.
Why can't clicks be taken at face value as relevance?: Users tend to click higher-ranked results regardless of true relevance (position bias) and are influenced by how results are presented. Click models correct for these biases so that clicks can be interpreted as more reliable evidence of relevance.
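
To make position bias concrete, here is a minimal sketch of the two click models named in this topic. The attractiveness and examination values are made-up numbers for illustration, and the function names are hypothetical. In the position-based model, a click requires that the user examines the rank and finds the result attractive. In the cascade model, the user scans from the top and stops at the first click.

```python
def position_based_click_probs(attractiveness, examination):
    """Position-based model: P(click at rank k) = examination[k] * attractiveness[k]."""
    return [e * a for e, a in zip(examination, attractiveness)]

def cascade_click_probs(attractiveness):
    """Cascade model: the user scans top-down and stops after the first click."""
    probs = []
    p_reach = 1.0  # probability that the user gets as far as this rank
    for a in attractiveness:
        probs.append(p_reach * a)
        p_reach *= 1.0 - a
    return probs

def debiased_attractiveness(click_rate, examination_prob):
    """Under the position-based model, divide out the examination probability."""
    return click_rate / examination_prob

# Three equally attractive results (assumed values) at ranks 1 to 3.
attractiveness = [0.5, 0.5, 0.5]
examination = [1.0, 0.5, 0.25]  # assumed decay of attention with rank

pbm = position_based_click_probs(attractiveness, examination)
cascade = cascade_click_probs(attractiveness)
recovered = debiased_attractiveness(pbm[2], examination[2])
```

With these numbers both models predict click probabilities of 0.5, 0.25, and 0.125, even though the three results are equally attractive. Raw click counts would therefore understate the quality of the lower-ranked results, while dividing by the examination probability recovers the assumed attractiveness of 0.5.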