Hypothesis test

Fleiss' Kappa for Multiple Rater Agreement

Also known as: multi-rater kappa, Fleiss kappa, Fleiss' Kappa (Multiple Rater Agreement)

Fleiss' Kappa is a non-parametric statistic for measuring the degree of agreement among three or more raters who classify items into mutually exclusive nominal categories. Introduced by Joseph L. Fleiss in 1971, it is commonly described as a generalization of Cohen's Kappa beyond two raters (strictly, it generalizes Scott's pi, so with two raters it does not reduce exactly to Cohen's Kappa). It corrects observed agreement for the level of agreement expected by chance alone, which has made it a widely used reliability index in medical diagnosis studies, content analysis, and multi-coder research.
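For reference, the chance correction described above can be written out as follows. The symbols are a common notation rather than something fixed by this page: N is the number of items, n the number of ratings per item, k the number of categories, and n_{ij} the number of raters who assigned item i to category j.

```latex
\[
p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij}, \qquad
P_i = \frac{1}{n(n-1)}\sum_{j=1}^{k} n_{ij}\,(n_{ij}-1),
\]
\[
\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i, \qquad
\bar{P}_e = \sum_{j=1}^{k} p_j^{2}, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.
\]
```

Here \bar{P} is the observed agreement (the average share of agreeing rater pairs per item) and \bar{P}_e is the agreement expected by chance from the overall category proportions.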

When to use it

Use Fleiss' Kappa when three or more independent raters each assess the same set of items using a fixed set of nominal (unordered) categories, and the primary question is whether rater agreement exceeds chance. The key assumptions are: (1) the number of raters assigned to each item is constant across items, (2) categories are fixed and mutually exclusive, and (3) ratings are independent. The statistic is designed for nominal categories only; for ordered categories, a weighted variant that penalises larger disagreements more heavily is preferable. With only two raters, Cohen's Kappa is the appropriate choice.
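The setup above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fleiss_kappa`, the input layout (one row per item, one column per category, each cell a count of raters) and the example table are all assumptions made for this sketch.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from an items-by-categories table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("each item needs at least two ratings")
    # Assumption (1) from the text: a constant number of ratings per item.
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must receive the same number of ratings")

    n_categories = len(counts[0])
    # Share of all ratings that fell into each category.
    p = [sum(row[j] for row in counts) / (n_items * n_raters)
         for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
                for row in counts]
    observed = sum(per_item) / n_items
    expected = sum(q * q for q in p)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 4 items, 3 raters, 2 categories.
table = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(table)
```

For this hypothetical table the observed agreement is 2/3 and the chance agreement is 1/2, so κ comes out at 1/3.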

Strengths & limitations

Strengths

Extends Cohen's Kappa to any number of raters without requiring the same raters to evaluate every item, making it highly practical for large annotation projects.
Corrects for chance agreement, unlike simple percentage agreement, providing a more honest reliability estimate.
Widely reported benchmark conventions (Landis and Koch, 1977) make results interpretable across disciplines.
Makes no normality assumption about the data, so it applies directly to nominal classification tasks.

Limitations

Applies only to nominal (unordered) categories; for ordinal ratings a weighted Fleiss Kappa is needed.
The κ value is sensitive to the prevalence of categories — when one category dominates, the expected agreement is high and κ can be paradoxically low even when raters often agree.
Assumes a fixed number of raters per item; variable numbers of raters require alternative approaches such as Krippendorff's Alpha.
Large-sample normal approximation underlies the z-test; with fewer than roughly 20 items the p-value may be unreliable.
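The prevalence sensitivity listed above is easiest to see on a toy comparison. The sketch below uses two hypothetical tables (20 items, 3 raters, 2 categories) with identical raw agreement but different category balance; the helper `fleiss_kappa` and the data are illustrative assumptions, not part of the original method description.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters placing item i in category j."""
    n_items, n_raters = len(counts), sum(counts[0])
    p = [sum(col) / (n_items * n_raters) for col in zip(*counts)]
    observed = sum(
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    expected = sum(q * q for q in p)
    return (observed - expected) / (1.0 - expected)


def raw_agreement(counts):
    """Average share of agreeing rater pairs per item (no chance correction)."""
    n_raters = sum(counts[0])
    return sum(
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ) / len(counts)


# Skewed prevalence: 18 unanimous items in category A, 2 split items.
skewed = [[3, 0]] * 18 + [[2, 1]] * 2
# Balanced prevalence: same number of unanimous and split items.
balanced = [[3, 0]] * 9 + [[0, 3]] * 9 + [[2, 1]] * 2

kappa_skewed = fleiss_kappa(skewed)
kappa_balanced = fleiss_kappa(balanced)
agreement_skewed = raw_agreement(skewed)
agreement_balanced = raw_agreement(balanced)
```

Both tables have a raw agreement of about 0.93, yet κ is about 0.87 for the balanced table and slightly negative (about -0.03) for the skewed one, because chance agreement is already about 0.94 when one category dominates.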

Frequently asked

What is the difference between Fleiss' Kappa and Cohen's Kappa?

Cohen's Kappa (1960) was designed for exactly two raters assessing the same items. Fleiss' Kappa (1971) generalises the statistic to three or more raters and additionally allows different subsets of raters to evaluate different items, as long as the number of raters per item is constant. When you have only two raters, use Cohen's Kappa; with three or more, use Fleiss' Kappa.
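Because only the number of ratings per item matters, not who gave them, raw labels can be collapsed into per-item category counts before computing κ. The following sketch shows one way to do that; the function name `ratings_to_counts`, the dictionary layout and the example labels are assumptions made for illustration.

```python
from collections import Counter


def ratings_to_counts(ratings, categories):
    """Turn {item: [label, ...]} into an items-by-categories count table.

    Rater identities are deliberately dropped: different items may have been
    labelled by different people, as long as each item has the same number
    of labels.
    """
    rows = []
    n_expected = None
    for item, labels in ratings.items():
        if n_expected is None:
            n_expected = len(labels)
        elif len(labels) != n_expected:
            raise ValueError(f"item {item!r} has {len(labels)} ratings, "
                             f"expected {n_expected}")
        tally = Counter(labels)
        unknown = set(tally) - set(categories)
        if unknown:
            raise ValueError(f"unknown categories for {item!r}: {unknown}")
        rows.append([tally[c] for c in categories])
    return rows


# Hypothetical annotation data: three labels per document, possibly from
# different annotators for each document.
ratings = {
    "doc1": ["spam", "spam", "ham"],
    "doc2": ["ham", "ham", "ham"],
    "doc3": ["spam", "ham", "ham"],
}
counts = ratings_to_counts(ratings, ["spam", "ham"])
```

The resulting table has one row per document and one column per category, which is the input shape the statistic works from.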

How do I interpret the κ value?

Landis and Koch (1977) proposed the most widely cited benchmarks: below 0.00 = poor, 0.00–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, 0.81–1.00 = almost perfect. These are conventions, not laws. In high-stakes fields such as clinical diagnosis many practitioners require κ ≥ 0.80 before a rating scheme is considered adequately reliable.
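A small lookup helper can attach these verbal labels to a computed value. This is a sketch under the Landis and Koch bands, including their "poor" label for values below zero; the function name `landis_koch_label` is hypothetical, and the labels remain conventions rather than thresholds with any statistical meaning.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) verbal benchmark."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    # Upper bounds of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


label = landis_koch_label(0.45)
```

For example, a κ of 0.45 falls in the "moderate" band.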

Can Fleiss' Kappa handle ordinal categories?

The standard formula treats all disagreements equally and is therefore appropriate only for nominal (unordered) categories. For ordinal ratings — where disagreeing by one step is less serious than disagreeing by three — a weighted version of Fleiss' Kappa that penalises larger disagreements more heavily should be used instead.

What minimum sample size is recommended?

As a rule of thumb the analysis requires at least 20 items (subjects or cases) to yield stable estimates and a reliable large-sample z-test. With very few items the asymptotic normal approximation for the standard error breaks down and bootstrapped confidence intervals are advisable.
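One way to obtain such an interval is a percentile bootstrap that resamples items with replacement. The sketch below is a simple illustration under that assumption, not a prescribed procedure: the function names, the number of resamples, the seed and the example table are all hypothetical, and resamples in which only one category occurs (where κ is undefined) are skipped.

```python
import random


def fleiss_kappa(counts):
    """Fleiss' kappa, or None when only one category occurs."""
    n_items, n_raters = len(counts), sum(counts[0])
    p = [sum(col) / (n_items * n_raters) for col in zip(*counts)]
    observed = sum(
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    expected = sum(q * q for q in p)
    if expected == 1.0:
        return None
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling whole items."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_items = len(counts)
    stats = []
    for _ in range(n_boot):
        sample = [counts[rng.randrange(n_items)] for _ in range(n_items)]
        k = fleiss_kappa(sample)
        if k is not None:  # skip degenerate resamples
            stats.append(k)
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[max(int((1 - alpha / 2) * len(stats)) - 1, 0)]
    return lower, upper


# Hypothetical data: 12 items, 4 raters, 3 categories.
table = [
    [4, 0, 0], [3, 1, 0], [0, 4, 0], [0, 3, 1],
    [0, 0, 4], [1, 0, 3], [2, 2, 0], [0, 2, 2],
    [4, 0, 0], [0, 4, 0], [0, 0, 4], [1, 1, 2],
]
point = fleiss_kappa(table)
lower, upper = bootstrap_ci(table, n_boot=500, seed=1)
```

Items are resampled as whole rows so that the ratings given to one item stay together. With a small number of items the interval will typically be wide, which is itself a useful signal about how stable the estimate is.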

Sources

Fleiss, J.L. (1971). Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin, 76(5), 378–382. DOI: 10.1037/h0031619

How to cite this page

ScholarGate. (2026, June 1). Fleiss' Kappa for Multiple Rater Agreement. ScholarGate. https://scholargate.app/en/statistics/fleiss-kappa

