Categorical Data Analysis

Categorical data analysis is the branch of biostatistics concerned with data that fall into discrete categories rather than taking continuous numeric values — a disease being present or absent, a tumour being benign or malignant, a patient being assigned to one of several treatment arms. Its central object is the contingency table of counts, and its methods test and quantify associations between categorical variables while controlling for others.

Definition

Categorical data analysis is the set of statistical methods for describing, testing, and modelling associations among variables whose values fall into unordered (nominal) or ordered (ordinal) discrete categories. The data are the counts of observations in each category, typically organised as contingency tables of frequencies.

Scope

This area orients the reader to the core ideas that recur across the topic pages below it: how categorical observations are arranged into contingency tables, how association in a table is tested (chi-squared and exact tests), how an association is summarised by an effect measure (risk ratios and odds ratios), and how a confounding categorical variable is handled by stratification (Mantel-Haenszel methods). It frames these as methodological tools for reading and producing health research, not as clinical guidance.
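As a minimal sketch of the first and third ideas above, the following Python snippet cross-classifies a handful of records into a two-by-two table and derives the usual effect measures from it. The records, the variable names, and the cell labels `a`, `b`, `c`, `d` are invented for illustration and are not taken from any study.

```python
from collections import Counter

# Hypothetical records, one per subject: (exposed?, had the event?)
records = [
    (True, True), (True, True), (True, True), (True, False), (True, False),
    (False, True), (False, False), (False, False), (False, False), (False, False),
]

# Cross-classify the observations into the four cells of a 2x2 table
counts = Counter(records)
a = counts[(True, True)]    # exposed, event
b = counts[(True, False)]   # exposed, no event
c = counts[(False, True)]   # unexposed, event
d = counts[(False, False)]  # unexposed, no event

# Effect measures derived from the table
risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
risk_ratio = risk_exposed / risk_unexposed
risk_difference = risk_exposed - risk_unexposed
odds_ratio = (a * d) / (b * c)

print(f"table: [[{a}, {b}], [{c}, {d}]]")
print(f"risk ratio {risk_ratio:.2f}, risk difference {risk_difference:.2f}, "
      f"odds ratio {odds_ratio:.2f}")
```

The point estimates alone say nothing about sampling uncertainty; with ten subjects, as here, any confidence interval would be very wide.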

Core questions

Is there an association between two categorical variables, or are they independent?
How large is the association, expressed as a ratio or difference of risks or odds?
Does an apparent association persist after stratifying on a third categorical variable, or is it confounded or modified by it?
When cell counts are small, which exact procedure replaces the large-sample approximation?
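
For the last question, one commonly used exact procedure is Fisher's exact test. The sketch below computes a two-sided p-value for a two-by-two table by enumerating every table with the same margins. It assumes one particular two-sided convention (summing the probabilities of all tables no more likely than the observed one); other conventions exist. The function name and the example counts are illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)  # number of ways to fill the first column

    def weight(x):
        # Unnormalised hypergeometric probability that the top-left cell is x,
        # given fixed row and column totals
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Add up every table that is no more likely than the observed one
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / total

# A sparse table with only eight observations
p = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided exact p-value: {p:.4f}")
```

Working with integer weights and dividing only once avoids floating-point ties when comparing table probabilities.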

Key concepts

Contingency table of counts
Independence of categorical variables
Chi-squared test of association
Exact tests for sparse tables
Effect measures: risk ratio and odds ratio
Stratification and the Mantel-Haenszel estimator
Confounding and effect modification across strata
Log-linear and logistic models for tables

Mechanisms

Categorical observations are cross-classified into a table whose cells hold frequencies. A test of association compares the observed cell counts with those expected if the row and column variables were independent. Pearson's chi-squared statistic sums, over all cells, the squared difference between observed and expected counts divided by the expected count; in large samples it is referred to a chi-squared distribution, whose correct degrees of freedom for contingency tables were established by Fisher. Exact tests instead enumerate the conditional distribution of tables with the observed margins and are used when counts are too small for the large-sample approximation. The strength of association is then summarised by an effect measure derived from the table, such as a risk ratio or an odds ratio. When a third variable threatens to confound the association, the data are split into strata defined by that variable and a pooled estimate is formed across strata; the Mantel-Haenszel procedure provides such a stratified test and summary estimate. These pieces generalise into log-linear and logistic regression models that handle several categorical predictors at once.
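
The comparison of observed with expected counts can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, without a continuity correction; the function names and the example table are hypothetical. The p-value helper covers only the one-degree-of-freedom case (a two-by-two table), where the chi-squared tail probability reduces to a normal tail.

```python
from math import erfc, sqrt

def pearson_chi_squared(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

def p_value_one_df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))

# Hypothetical 2x2 table: rows are exposure groups, columns are outcomes
table = [[20, 30], [30, 20]]
statistic, df = pearson_chi_squared(table)
print(f"chi-squared = {statistic:.2f} on {df} df, p = {p_value_one_df(statistic):.4f}")
```

Every expected count in this example is 25, so the large-sample approximation is reasonable; with very small expected counts an exact test would be the safer choice.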

Clinical relevance

Most diagnostic, prognostic, and risk-factor evidence in the health sciences is reported as associations between categorical variables — exposed versus unexposed, event versus no event — so the methods in this area underlie how that evidence is generated and appraised. They describe how associations are measured and tested; they are tools for interpreting research and not a basis for individual diagnostic or treatment decisions.

Epidemiology

Contingency-table methods are the everyday machinery of epidemiology: cohort, case-control, and cross-sectional studies all reduce, at their simplest, to a two-by-two table of exposure against outcome, and stratified (Mantel-Haenszel) analysis is the classical way to control confounding without fitting a regression model. The same methods recur in clinical trials reporting binary endpoints and in diagnostic-test evaluation.
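
To illustrate how stratification can change the picture, the sketch below compares a crude odds ratio with the Mantel-Haenszel pooled odds ratio, computed as the sum over strata of a·d/n divided by the sum of b·c/n. The two strata are invented so that exposure is unrelated to outcome within each stratum while the collapsed table suggests a strong association; the function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical strata of a confounder, each as
# (exposed events, exposed non-events, unexposed events, unexposed non-events)
strata = [
    (40, 40, 10, 10),  # high-risk stratum, mostly exposed
    (2, 18, 8, 72),    # low-risk stratum, mostly unexposed
]

# Crude table: add the strata together cell by cell, ignoring the confounder
crude = odds_ratio(*(sum(cells) for cells in zip(*strata)))
pooled = mantel_haenszel_or(strata)

print(f"crude odds ratio:           {crude:.2f}")
print(f"Mantel-Haenszel odds ratio: {pooled:.2f}")
```

In this constructed example the stratum-specific odds ratios are both 1, so the pooled estimate is 1 and the crude association is produced entirely by the confounder. Pooling in this way presumes the odds ratio is roughly constant across strata; if it differs markedly, that is effect modification and a single summary is misleading.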

History

The field began with Karl Pearson's chi-squared statistic at the turn of the twentieth century and Fisher's 1922 correction of its degrees of freedom for contingency tables, followed by Fisher's exact test for small samples. Mid-century epidemiology supplied the effect-measure framework — Cornfield's odds-ratio argument and the Mantel-Haenszel stratified estimator of 1959 — and the later twentieth century unified these methods within the generalised-linear-model framework, synthesised in Agresti's textbook treatment.

Key figures

Karl Pearson
Ronald A. Fisher
Jerome Cornfield
Nathan Mantel
William Haenszel
Alan Agresti
Joseph Fleiss

Seminal works

Fisher (1922)
Mantel and Haenszel (1959)
Agresti (2013)

Frequently asked questions

What makes data “categorical”?

Data are categorical when each observation falls into one of a set of discrete classes, such as diseased/healthy or treatment arm A/B/C, rather than taking a measured numeric value; the analysis works with the counts in each class.

How does this area differ from regression for continuous outcomes?

The outcome here is a category or a count, not a continuous measurement, so the methods centre on contingency tables, ratios of risks and odds, and models such as logistic and log-linear regression rather than on means and ordinary linear regression.