Population Stratification and Ancestry in GWAS

Population stratification is the systematic difference in ancestry between the groups compared in a genetic study. When cases and controls differ in ancestral background, any variant whose frequency differs between those ancestries will look associated with the trait even if it has no causal role. This confounding can produce false positives across the whole genome, so detecting and adjusting for ancestry is a core safeguard of valid association testing.
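
The confounding described above can be reproduced in a small simulation. The sketch below is illustrative only: the two-population design, the sample sizes, the disease risks (0.2 and 0.5) and the function names `simulate` and `trend_chi2` are all assumptions made for this example, and no variant is causal. It uses the Cochran-Armitage trend statistic, computed as N times the squared genotype-phenotype correlation, and compares the share of null variants passing p < 0.05 in a pooled analysis with the share inside a single population.

```python
import numpy as np

CRITICAL = 3.841  # chi-square (1 df) cut-off for p = 0.05


def simulate(n_per_pop=2000, n_snps=500, seed=0):
    """Toy case-control data: two populations, no causal variant (assumed design)."""
    rng = np.random.default_rng(seed)
    pop = np.repeat([0, 1], n_per_pop)
    # Allele frequencies drift apart between the two populations.
    freq0 = rng.uniform(0.1, 0.9, n_snps)
    freq1 = np.clip(freq0 + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)
    freqs = np.where(pop[:, None] == 0, freq0, freq1)
    genotypes = rng.binomial(2, freqs)
    # Disease risk depends on ancestry only, never on genotype.
    case = rng.binomial(1, np.where(pop == 0, 0.2, 0.5))
    return pop, genotypes, case


def trend_chi2(genotypes, phenotype):
    """Cochran-Armitage trend statistic per variant, computed as N * r^2."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    r = (g * y[:, None]).sum(axis=0) / np.sqrt((g**2).sum(axis=0) * (y**2).sum())
    return len(y) * r**2


pop, genotypes, case = simulate()

# Pooled analysis ignores ancestry; the second analysis stays inside one population.
pooled = np.mean(trend_chi2(genotypes, case) > CRITICAL)
within = np.mean(trend_chi2(genotypes[pop == 0], case[pop == 0]) > CRITICAL)

print(f"share of null variants with p < 0.05, pooled:     {pooled:.2f}")
print(f"share of null variants with p < 0.05, within pop: {within:.2f}")
```

Under these assumptions the pooled share should sit far above the nominal 5%, while the within-population share should stay near it, because inside one population ancestry no longer links genotype to disease.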

Definition

Population stratification is the confounding of genotype-phenotype association by systematic ancestry differences between the groups being compared. Controlling it means applying methods, principally ancestry principal components and mixed models, that adjust association tests so that signals reflect within-ancestry effects rather than ancestry itself.

Scope

This topic covers why ancestry differences confound association tests, how stratification is detected (genomic inflation, principal-components analysis), how it is corrected (principal-component covariates, mixed models, genomic control), and the broader equity concern that the European-ancestry skew of GWAS limits the transferability of findings and polygenic scores. It is a methods reference, not clinical guidance.

Core questions

How do ancestry differences between cases and controls create spurious associations?
How is stratification detected, and what does an inflated genomic-control factor indicate?
How does principal-components analysis correct for ancestry?
When are mixed models preferred for handling structure and relatedness?
Why does the European-ancestry skew of GWAS limit generalisability?

Key concepts

Confounding by ancestry
Genomic control and the inflation factor (lambda)
Principal-components analysis of genotypes
Ancestry-informative markers
Linear mixed models for structure and relatedness
Admixture and continuous ancestry
Transferability of findings and polygenic scores across ancestries
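
Three of the concepts listed above have compact conventional forms, sketched here in LaTeX. The notation is an assumption introduced for this illustration and does not come from the source: T_j is the 1-df association statistic for variant j, g_ij the genotype of individual i at variant j, PC_ik that individual's score on the k-th principal component, and G a genetic relationship matrix.

```latex
% Genomic-control inflation factor over M variants;
% 0.4549 approximates the median of a chi-square distribution with 1 df.
\lambda_{GC} = \frac{\operatorname{median}(T_1, \ldots, T_M)}{0.4549}

% Principal-component adjustment: test \beta_j with K leading PCs as covariates.
y_i = \alpha + \beta_j g_{ij} + \sum_{k=1}^{K} \gamma_k \, \mathrm{PC}_{ik} + \varepsilon_i

% Linear mixed model: the random effect u has covariance proportional to G,
% which absorbs both population structure and cryptic relatedness.
\mathbf{y} = \alpha \mathbf{1} + \beta_j \mathbf{g}_j + \mathbf{u} + \boldsymbol{\varepsilon},
\qquad \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \sigma_g^2 \mathbf{G}),
\qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2 \mathbf{I})
```

A value of \lambda_{GC} near 1 is what a well-calibrated analysis would be expected to show; for case-control traits the linear form in the second line is often replaced by a logistic link.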

Mechanisms

If subgroups of differing ancestry are unequally represented among cases and controls, and if both disease risk and allele frequencies differ between those subgroups, allele frequency will track the trait through ancestry rather than causation, inflating test statistics genome-wide. Detection relies on this genome-wide signature: the genomic-control inflation factor summarises how much the median test statistic exceeds its null expectation, and principal-components analysis of genome-wide genotypes reveals axes of ancestry variation among samples. Correction typically includes leading principal components as covariates in the regression, which absorbs the ancestry signal, or uses linear mixed models that jointly account for structure and cryptic relatedness via a genetic relationship matrix. Reference panels such as the 1000 Genomes Project help place samples on a global ancestry map and inform imputation. Because most GWAS samples are of European ancestry, even well-corrected analyses yield effect estimates and polygenic scores that transfer imperfectly to other populations.
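
The detection and correction steps described above can be sketched end to end on simulated data. This is a minimal illustration under stated assumptions, not a production pipeline: the two-population design, the sample sizes, the disease risks, the linear (rather than logistic) model for a binary trait, and the helper names `simulate`, `lambda_gc`, `top_pcs`, `residualise` and `assoc_chi2` are all hypothetical choices for this example. Real analyses would typically use dedicated GWAS software and compute principal components from pruned, quality-controlled variants.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549  # approximate median of a chi-square distribution with 1 df


def simulate(n_per_pop=1000, n_snps=1000, seed=1):
    """Toy two-population case-control data in which no variant is causal."""
    rng = np.random.default_rng(seed)
    pop = np.repeat([0, 1], n_per_pop)
    freq0 = rng.uniform(0.1, 0.9, n_snps)
    freq1 = np.clip(freq0 + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)
    freqs = np.where(pop[:, None] == 0, freq0, freq1)
    genotypes = rng.binomial(2, freqs).astype(float)
    # Risk differs by ancestry only (0.3 versus 0.5).
    case = rng.binomial(1, np.where(pop == 0, 0.3, 0.5)).astype(float)
    return genotypes, case


def lambda_gc(chi2):
    """Genomic-control inflation factor: observed median over null median."""
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)


def top_pcs(genotypes, k):
    """Leading k principal components of the standardised genotype matrix."""
    z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :k]


def residualise(values, covariates):
    """Remove an intercept and the covariates by ordinary least squares."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def assoc_chi2(genotypes, phenotype, covariates=None):
    """Per-variant squared t statistic from a linear model with optional covariates."""
    n = len(phenotype)
    if covariates is None:
        covariates = np.empty((n, 0))
    g = residualise(genotypes, covariates)
    y = residualise(phenotype, covariates)
    # Partial correlation between each variant and the phenotype.
    r = (g * y[:, None]).sum(axis=0) / np.sqrt((g**2).sum(axis=0) * (y**2).sum())
    dof = n - covariates.shape[1] - 2
    return dof * r**2 / (1.0 - r**2)


genotypes, case = simulate()
lam_raw = lambda_gc(assoc_chi2(genotypes, case))
pcs = top_pcs(genotypes, k=2)
lam_adj = lambda_gc(assoc_chi2(genotypes, case, pcs))

print(f"lambda without PC covariates: {lam_raw:.2f}")
print(f"lambda with 2 PC covariates:  {lam_adj:.2f}")
```

In this toy setting the unadjusted inflation factor should be well above 1 and the adjusted one much closer to 1, because the first principal component recovers most of the population split. Computing the components from the same variants that are then tested is a simplification made here to keep the example short.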

Clinical relevance

Adjusting for ancestry is essential to the validity of the genetic evidence used in disease research, and the ancestry composition of studies bears directly on whose biology is represented in genomic findings and scores. This topic is descriptive of methods and equity considerations; it is not a basis for individual genetic testing or clinical interpretation.

Evidence & guidelines

Standards here come from the methodological literature rather than clinical guidelines. Price et al. (2006) introduced principal-components correction (the EIGENSTRAT approach) as a scalable solution; Price et al. (2010) reviewed and extended strategies including mixed models; the 1000 Genomes Project (2015) provided the diverse reference needed to characterise ancestry; and Visscher et al. (2017) highlighted the generalisability and equity consequences of ancestry imbalance.

History

Concern that ancestry could confound genetic association predates GWAS, and early approaches such as genomic control and structured association were developed to address it. The 2006 introduction of principal-components analysis gave a fast, genome-wide way to model continuous ancestry and became standard practice, later complemented by mixed-model methods that also handle relatedness. As GWAS scaled into biobanks, the field increasingly recognised that controlling stratification within predominantly European samples does not solve the larger problem of under-representation of other ancestries.

Debates

Do ancestry corrections fully remove confounding, or can they also remove real signal? Principal components and mixed models control stratification effectively in most settings, but distinguishing confounding from genuine ancestry-correlated biology, and avoiding over-correction that erases real effects, remains a methodological judgement, especially for traits with subtle geographic structure.
Does the European-ancestry skew of GWAS undermine equity and validity? Findings and polygenic scores derived mostly from European-ancestry samples transfer imperfectly to other populations, raising scientific concerns about generalisability and equity concerns about the distribution of genomic-medicine benefits.

Key figures

Alkes Price
David Reich
Nick Patterson
Noah Zaitlen
Peter Visscher

Seminal works

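Price et al. (2006): principal-components correction for stratification (the EIGENSTRAT approach)
Price et al. (2010): review and extension of correction strategies, including mixed models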

Frequently asked questions

How does population stratification create false GWAS results? If cases and controls differ in ancestry, variants whose frequency differs between those ancestries appear associated with the trait through ancestry rather than causation, producing spurious associations across the genome.
How is stratification usually corrected? The standard approach includes leading principal components of genome-wide genotypes as covariates, or uses a linear mixed model, so that association tests reflect effects within ancestry rather than ancestry differences themselves.