Recommender Evaluation

Recommender evaluation measures how good recommendations are, spanning predictive accuracy, ranking quality, and beyond-accuracy properties such as diversity, novelty, and coverage.

Definition

Recommender evaluation is the set of methodologies and metrics for assessing the quality of a recommender system, including offline accuracy and ranking measures computed on held-out data, beyond-accuracy properties of the recommendation set, and user-centered and online experiments.

Scope

This topic covers how recommender systems are assessed: offline experiments using held-out interaction data, accuracy measures for rating prediction and for top-N ranking, and beyond-accuracy criteria including diversity, novelty, serendipity, and catalog coverage, as well as user studies and online experiments. It addresses experimental design pitfalls specific to recommendation, such as data splitting and popularity bias, and connects to the broader online-evaluation methods used across information access.

Core questions

How is recommendation quality measured for rating prediction versus top-N ranking?
Why are accuracy metrics alone insufficient to judge a recommender?
How are diversity, novelty, serendipity, and coverage quantified?
How should interaction data be split to avoid leakage and popularity bias?
How do offline, user-study, and online evaluations complement one another?

Key concepts

rating-prediction accuracy (MAE, RMSE)
top-N ranking metrics (precision, recall, nDCG)
diversity and novelty
serendipity
catalog coverage
offline vs. online evaluation
data splitting and leakage
popularity bias
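
To make the accuracy and ranking concepts above concrete, here is a minimal Python sketch of MAE, RMSE, precision@k, recall@k, and nDCG@k. The function names and the toy data are illustrative assumptions, not part of the source, and the nDCG variant shown assumes binary relevance (an item is either relevant or not); graded-relevance formulations also exist.

```python
import math

def mae(predicted, actual):
    """Mean absolute error between paired predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical toy data, for illustration only.
predicted_ratings = [4.0, 3.0, 5.0]
actual_ratings = [5.0, 3.0, 3.0]
ranked_items = ["a", "b", "c", "d"]   # system's ranking, best first
relevant_items = {"a", "c", "e"}      # held-out items the user liked

print(mae(predicted_ratings, actual_ratings))
print(rmse(predicted_ratings, actual_ratings))
print(precision_at_k(ranked_items, relevant_items, 2))
print(recall_at_k(ranked_items, relevant_items, 2))
print(ndcg_at_k(ranked_items, relevant_items, 3))
```

Unlike precision and recall, nDCG rewards placing relevant items near the top of the list, which is why it is often preferred for top-N evaluation.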

Key theories

Accuracy and ranking evaluation: Recommenders are scored either on how well they predict ratings, using error measures such as MAE and RMSE, or on how well they rank items, using top-N measures such as precision, recall, and normalized discounted cumulative gain (nDCG). Ranking measures align better with how recommendations are consumed, because users typically see only a short list of top items.
Beyond-accuracy evaluation: Because accurate but redundant or obvious recommendations may not satisfy users, evaluation also considers diversity, novelty, serendipity, and coverage, recognizing that recommendation quality is multidimensional.
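
The beyond-accuracy properties can be quantified in several ways, and the source does not commit to specific formulas. The sketch below assumes three common choices: intra-list diversity as the average pairwise Jaccard distance between item feature sets, novelty as the mean self-information of the recommended items, and catalog coverage as the share of the catalog recommended to at least one user. All names and data are hypothetical. Serendipity is omitted because it additionally requires a model of what the user would have expected.

```python
import math
from itertools import combinations

def intra_list_diversity(items, features):
    """Average pairwise Jaccard distance between the items' feature sets."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0

    def jaccard_distance(a, b):
        union = features[a] | features[b]
        if not union:
            return 0.0
        return 1.0 - len(features[a] & features[b]) / len(union)

    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def novelty(items, interaction_counts, n_users):
    """Mean self-information -log2(p), where p is the share of users who
    interacted with the item. Rarely consumed items score higher."""
    return sum(-math.log2(interaction_counts[i] / n_users) for i in items) / len(items)

def catalog_coverage(recommendation_lists, catalog):
    """Share of catalog items that appear in at least one user's list."""
    recommended = set().union(*recommendation_lists)
    return len(recommended & catalog) / len(catalog)

# Hypothetical toy data, for illustration only.
genres = {"a": {"rock"}, "b": {"rock", "pop"}, "c": {"jazz"}}
counts = {"a": 50, "b": 25}   # users who interacted with each item
lists_per_user = [["a", "b"], ["b", "c"]]

print(intra_list_diversity(["a", "b", "c"], genres))
print(novelty(["a", "b"], counts, n_users=100))
print(catalog_coverage(lists_per_user, {"a", "b", "c", "d"}))
```

Diversity and novelty here are computed per recommendation list, while coverage is a system-level property computed over the lists of all users.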

Practical relevance

Sound evaluation determines which recommendation changes are deployed and guards against optimizing the wrong objective. Beyond-accuracy concerns such as diversity and novelty directly affect user satisfaction and engagement, and connect to broader issues of filter bubbles and fairness in recommendation.

History

Herlocker and colleagues' 2004 article established a rigorous framework for evaluating collaborative-filtering recommenders, clarifying tasks and metrics. The Netflix Prize popularized RMSE-based accuracy evaluation, after which the field broadened toward ranking and beyond-accuracy measures, consolidated in handbook chapters that stress matching evaluation to the intended user task.

Key figures

Jonathan Herlocker
Joseph Konstan
Guy Shani
Asela Gunawardana

Seminal works

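Herlocker, Konstan, Terveen, and Riedl (2004), "Evaluating collaborative filtering recommender systems," ACM Transactions on Information Systems [herlocker2004]
Shani and Gunawardana (2011), "Evaluating recommendation systems," in Recommender Systems Handbook [shani2011]
Ricci, Rokach, and Shapira, eds. (2015), Recommender Systems Handbook, 2nd edition [ricci2015]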

Frequently asked questions

Why is accuracy not enough to evaluate a recommender? A recommender can be accurate yet unhelpful, for example by suggesting items the user already knows or near-duplicates. Properties such as diversity, novelty, serendipity, and coverage capture aspects of usefulness that accuracy misses, so good evaluation considers multiple dimensions.
Why is data splitting tricky in recommender evaluation? Recommendation data is time-ordered and skewed toward popular items, so naive random splits can leak future information or reward simply recommending popular items. Careful time-based splits and bias-aware metrics are needed to make offline results predictive of real performance.
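
As a rough illustration of the splitting concern, the sketch below separates interactions at a global time cutoff so that no test interaction precedes a training interaction, and builds a most-popular baseline from the training data only. The tuple layout `(user, item, timestamp)`, the function names, and the data are assumptions made for this example. Other schemes, such as per-user leave-last-out, are also used in practice.

```python
from collections import Counter

def temporal_split(interactions, cutoff):
    """Split (user, item, timestamp) tuples at a global time cutoff.
    Everything strictly before the cutoff is training data; the rest is test data."""
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test

def most_popular(train, k):
    """Non-personalized baseline: the k items with the most training interactions."""
    counts = Counter(item for _, item, _ in train)
    return [item for item, _ in counts.most_common(k)]

# Hypothetical toy log, for illustration only.
log = [
    ("u1", "a", 1),
    ("u2", "a", 2),
    ("u1", "b", 3),
    ("u3", "c", 5),
    ("u2", "b", 6),
]

train, test = temporal_split(log, cutoff=4)
print(len(train), len(test))
print(most_popular(train, k=1))
```

Scoring this popularity baseline with the same metrics as the candidate model gives a reference point: a model that barely beats it may be benefiting mostly from popularity skew rather than from personalization.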