Critical Appraisal Tools and Checklists

Critical appraisal tools are structured instruments — checklists, scales, and signalling-question frameworks — that guide a reviewer through the validity, results, and applicability of a study in an explicit and repeatable way. By turning expert judgement into a defined set of questions, they make appraisal more transparent, more consistent between reviewers, and easier to report.

Definition

A critical appraisal tool is a predefined set of items or domains, often phrased as questions, that a reviewer applies to an individual study to make an explicit, criteria-based judgement about its risk of bias, the interpretability of its results, and its applicability.

Scope

This topic covers the families of appraisal instruments and the rationale behind them: generic checklists (such as the CASP series and the Users' Guides), design-specific risk-of-bias tools (such as RoB 2 for randomised trials and QUADAS-2 for diagnostic accuracy studies), and the difference between simple checklists, summary quality scales, and domain-based judgement tools. It is intended as educational reference material and does not endorse any single tool for clinical decisions.

Core questions

What kinds of appraisal instruments exist, and how do checklists, scales, and domain-based tools differ?
Why are most appraisal tools design-specific rather than universal?
What is the difference between a summary quality score and a domain-based risk-of-bias judgement?
How much does the choice of tool affect the appraisal of the same study?

Key concepts

Generic appraisal checklist (CASP, Users' Guides)
Design-specific risk-of-bias tool (RoB 2, QUADAS-2)
Signalling questions
Domain-based judgement versus summary quality score
Inter-rater reliability of appraisal
Reproducibility of appraisal judgements

Mechanisms

Appraisal tools operationalise the generic validity-results-applicability logic of evidence-based medicine into concrete items keyed to a particular design. Generic checklists such as CASP and the JAMA Users' Guides walk a reader through the same three questions for any paper (Guyatt 1993; Greenhalgh 1997). Modern domain-based tools go further by grouping items into bias domains. RoB 2, for example, evaluates randomised trials across five domains (the randomisation process, deviations from intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result); answers to signalling questions lead to a judgement for each domain, and the domain judgements in turn inform an overall judgement (Sterne 2019). QUADAS-2 applies the same domain-and-signalling-question architecture to diagnostic accuracy studies (Whiting 2011). The shift from numeric summary scales to domain-based judgement reflects evidence that arbitrary weighting of checklist items can mislead, and that transparent per-domain reasoning is more defensible.
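
The domain-to-overall step can be sketched in a few lines of Python. This is an illustration only, not the official RoB 2 algorithm: the names Judgement, ROB2_DOMAINS and overall_judgement are hypothetical, the example trial is invented, and the roll-up rule (overall equals the worst domain) is a deliberate simplification. The published RoB 2 guidance additionally allows a high-risk overall rating when several domains raise some concerns, which requires reviewer judgement and is not modelled here.

```python
from enum import IntEnum


class Judgement(IntEnum):
    """Risk-of-bias levels, ordered so that a larger value means more concern."""
    LOW = 0
    SOME_CONCERNS = 1
    HIGH = 2


# The five RoB 2 domains named in the text above.
ROB2_DOMAINS = (
    "randomisation process",
    "deviations from intended interventions",
    "missing outcome data",
    "measurement of the outcome",
    "selection of the reported result",
)


def overall_judgement(domains: dict[str, Judgement]) -> Judgement:
    """Simplified roll-up (assumption): the overall rating is the worst domain rating."""
    missing = [d for d in ROB2_DOMAINS if d not in domains]
    if missing:
        raise ValueError(f"no judgement recorded for: {missing}")
    return max(domains[d] for d in ROB2_DOMAINS)


# Invented example: every domain is low risk except missing outcome data.
trial = {d: Judgement.LOW for d in ROB2_DOMAINS}
trial["missing outcome data"] = Judgement.SOME_CONCERNS

for domain in ROB2_DOMAINS:
    print(f"{domain}: {trial[domain].name}")
print("overall:", overall_judgement(trial).name)  # overall: SOME_CONCERNS
```

Keeping the per-domain judgements alongside the overall rating, rather than collapsing them into a number, is what makes the reasoning auditable: a reader can see which domain drove the result.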

Clinical relevance

These tools are used by clinicians, students, and systematic reviewers to make the appraisal of individual studies explicit and auditable. They describe how the trustworthiness of research is assessed; they characterise evidence and are not themselves a basis for diagnosing or treating any individual patient.

Evidence & guidelines

A systematic review of more than a hundred appraisal tools found substantial heterogeneity in content and no single validated gold standard for any study design, underscoring that tool choice is itself a methodological decision (Katrak 2004). Contemporary practice favours design-specific, domain-based instruments — RoB 2 for randomised trials and QUADAS-2 for diagnostic accuracy studies are widely endorsed in Cochrane and other systematic-review guidance (Sterne 2019; Whiting 2011) — and discourages converting these judgements into a single summary quality score.

History

Early appraisal aids were narrative reading guides; the McMaster Users' Guides of the 1990s and the CASP checklists that followed gave clinicians explicit, study-type-specific question sets (Guyatt 1993; Greenhalgh 1997). As systematic reviewing matured, the field moved from simple checklists and numeric quality scales toward domain-based risk-of-bias tools, exemplified by QUADAS-2 for diagnostic studies (Whiting 2011) and the revised RoB 2 for randomised trials (Sterne 2019), reflecting accumulating evidence that summary scores could be unreliable.

Debates

Quality scores versus domain-based judgement: Collapsing many appraisal items into a single numeric quality score depends on arbitrary weighting and can produce misleading rankings; current methodological consensus favours transparent, per-domain risk-of-bias judgements over summary scales.
Lack of a universal gold-standard tool: The proliferation of tools with divergent content and no validated reference instrument for any design means the same study can be appraised differently depending on the tool, raising concerns about reproducibility.
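
The weighting problem in the first debate can be made concrete with a small Python sketch. Everything in it is hypothetical: the two studies, the three checklist items, and both weighting schemes are invented for illustration and do not correspond to any published scale. It shows only that two equally arbitrary sets of weights can reverse the ranking of the same pair of studies.

```python
# Hypothetical checklist results: 1 = item satisfied, 0 = not satisfied.
studies = {
    "Study A": {"randomised": 1, "blinded": 0, "complete_follow_up": 1},
    "Study B": {"randomised": 0, "blinded": 1, "complete_follow_up": 1},
}


def summary_score(items: dict[str, int], weights: dict[str, int]) -> int:
    """Weighted sum of satisfied checklist items."""
    return sum(weights[name] * met for name, met in items.items())


def ranking(weights: dict[str, int]) -> list[str]:
    """Study names ordered from highest to lowest summary score."""
    return sorted(
        studies,
        key=lambda name: summary_score(studies[name], weights),
        reverse=True,
    )


# Two invented weighting schemes that differ only in which item is emphasised.
weights_1 = {"randomised": 3, "blinded": 1, "complete_follow_up": 1}
weights_2 = {"randomised": 1, "blinded": 3, "complete_follow_up": 1}

print(ranking(weights_1))  # ['Study A', 'Study B']  (scores 4 and 2)
print(ranking(weights_2))  # ['Study B', 'Study A']  (scores 2 and 4)
```

Neither weighting is demonstrably more correct than the other, which is the core objection to summary scores: the ranking reflects the choice of weights as much as the studies themselves.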

Key figures

Julian Higgins
Jonathan Sterne
Penny Whiting
Gordon Guyatt
Trisha Greenhalgh

Seminal works

Katrak 2004
Sterne 2019 (RoB 2)
Whiting 2011 (QUADAS-2)

Frequently asked questions

Is one critical appraisal tool best for every study? No. Because different designs are prone to different biases, most appraisal is done with design-specific tools, and a systematic review found no single gold-standard instrument that works across all study types.
Why have many fields moved away from quality scores? Summary quality scores combine items with arbitrary weights and can rank studies misleadingly. Domain-based tools such as RoB 2 and QUADAS-2 instead give a transparent judgement for each kind of bias, which is more defensible and reproducible.