Linkage Disequilibrium and SNP Tagging

Linkage disequilibrium (LD) is the non-random co-occurrence of alleles at different positions in the genome: variants close together tend to be inherited together as haplotype blocks. This correlation is what makes genome-wide association studies affordable - a genotyping array need only type a subset of carefully chosen 'tag' SNPs, because each tag stands in statistically for the untyped variants with which it is in strong LD.

Definition

Linkage disequilibrium is the statistical association between alleles at two or more loci - their co-occurrence on haplotypes more or less often than expected if they were independent - and SNP tagging is the use of a subset of variants that, through LD, capture the variation of untyped neighbouring sites.
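The "more or less often than expected" in the definition above can be made concrete with a small calculation. The following is a minimal sketch, assuming two biallelic loci (alleles A/a and B/b) and phased haplotype counts; the function name, the `LDStats` container and the example counts are hypothetical and chosen only for illustration. It computes the raw coefficient D (observed AB frequency minus the frequency expected under independence), D' (|D| divided by its maximum given the allele frequencies) and r-squared.

```python
from dataclasses import dataclass


@dataclass
class LDStats:
    d: float        # raw disequilibrium coefficient D
    d_prime: float  # |D| scaled by its maximum given the allele frequencies
    r2: float       # squared correlation between the two loci


def ld_from_haplotype_counts(n_AB, n_Ab, n_aB, n_ab):
    """Compute D, |D'| and r-squared from counts of the four phased haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n   # frequency of allele B at locus 2
    p_AB = n_AB / n           # observed frequency of the AB haplotype
    if min(p_A, 1 - p_A, p_B, 1 - p_B) == 0:
        raise ValueError("both loci must be polymorphic")

    # D: observed AB frequency minus the frequency expected under independence.
    d = p_AB - p_A * p_B

    # Largest |D| attainable with these allele frequencies; depends on the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))

    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return LDStats(d=d, d_prime=d_prime, r2=r2)


# Hypothetical counts: allele A is rare and always found alongside allele B,
# so only three of the four possible haplotypes are observed.
rare_on_common = ld_from_haplotype_counts(n_AB=10, n_Ab=0, n_aB=40, n_ab=50)
print(rare_on_common)
```

With these illustrative counts D' comes out at 1 while r-squared is only about 0.11, which shows why the two measures answer different questions: no haplotype suggests recombination between the sites, yet one variant predicts the other poorly because their allele frequencies differ.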

Scope

This topic explains what LD is, how it is measured (D' and r-squared), why it forms blocks shaped by recombination and population history, how tag SNPs are selected to capture common variation efficiently, and how LD both enables association mapping and complicates the localisation of causal variants. It is a methodological reference, not clinical guidance.

Core questions

What does it mean for two variants to be in linkage disequilibrium?
How are D' and r-squared used to quantify LD, and how do they differ?
Why does the genome fall into haplotype blocks, and what determines their boundaries?
How are tag SNPs chosen so an array captures most common variation?
Why does LD make it hard to identify the actual causal variant within an associated region?

Key concepts

Haplotype and haplotype block
D' (normalised disequilibrium coefficient)
r-squared (correlation between markers)
Recombination hotspots
Tag SNP selection
Reference haplotype panels (HapMap, 1000 Genomes)
Fine-mapping and causal-variant ambiguity

Mechanisms

Alleles at nearby loci are inherited together until recombination separates them, so over generations LD decays with genetic distance and breaks down at recombination hotspots, leaving blocks of high internal correlation between them. Two measures are in common use. D' rescales the raw disequilibrium coefficient by its maximum possible value given the allele frequencies; a value of 1 means that at most three of the four possible two-locus haplotypes are observed, which is consistent with no historical recombination between the sites. r-squared is the squared correlation between the two variants; it measures how well one predicts the other and directly governs the power lost when a tag SNP proxies an untyped causal variant. The two can disagree: a rare variant sitting entirely on the background of a common one has D' of 1 but low r-squared. Because variants within a block are strongly correlated, an array can genotype a chosen set of tag SNPs and recover most common variation, and untyped variants can be statistically imputed from reference haplotype panels such as HapMap and the 1000 Genomes Project. The same correlation that enables tagging also means an association signal is shared across many variants in a block, so identifying the true causal variant requires additional fine-mapping rather than simply taking the most significant marker.
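Tag selection as described above can be sketched as a greedy set-cover over pairwise r-squared values. This is a simplified illustration, not the procedure used by any particular array design or software package: it assumes phased 0/1 haplotypes as input, a pairwise threshold of 0.8 (a commonly quoted convention, treated here as an assumption), and a toy haplotype matrix invented for the example. The function names are hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared correlation between two 0/1 allele vectors observed on the same haplotypes."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        return 0.0  # a monomorphic site carries no information about any other site
    d = p_xy - p_x * p_y
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tags until every SNP is a tag or has r-squared >= threshold with a tag."""
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]

    # captures[j]: the SNPs that j would cover if chosen as a tag (itself included).
    captures = {j: {j} for j in range(n_snps)}
    for i, j in combinations(range(n_snps), 2):
        if r_squared(columns[i], columns[j]) >= threshold:
            captures[i].add(j)
            captures[j].add(i)

    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Choose the SNP covering the most still-untagged SNPs; ties go to the lowest index.
        best = max(range(n_snps), key=lambda j: len(captures[j] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Toy data (invented): rows are haplotypes, columns are SNPs 0..4.
# SNPs 0, 1 and 2 form one perfectly correlated block; SNPs 3 and 4 form another.
haplotypes = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
print(greedy_tag_snps(haplotypes))
```

On this toy matrix two tags, one per block, are enough to cover all five SNPs. Real tag selection works from reference panel data with many more sites and typically adds constraints such as assay design and minor allele frequency, none of which are modelled here.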

Clinical relevance

LD structure underlies how genome-wide genetic evidence is generated and how association regions are interpreted in disease research. This topic is descriptive of method and population genetics; it is not a basis for individual genetic testing or clinical interpretation.

Evidence & guidelines

Knowledge of human LD structure rests on large reference resources rather than clinical guidelines. The International HapMap Project mapped genome-wide LD and tag SNPs (its second-generation map was published in 2007), the 1000 Genomes Project extended reference haplotypes across diverse populations (final phase published in 2015), and reviews such as Slatkin (2008) and Bush and Moore (2012) explain how LD measures and tagging are applied in association mapping.

History

The concept of allelic association predates genomics, but its practical importance grew with the discovery in the early 2000s that the human genome has a block-like haplotype structure shaped by recombination hotspots. The HapMap Project then catalogued LD genome-wide and made tag-SNP selection feasible, which directly enabled the first affordable GWAS arrays. The 1000 Genomes Project later broadened reference panels to many populations, improving imputation and revealing how LD patterns differ by ancestry.

Debates

Do LD patterns transfer across populations?: Haplotype structure and LD vary with population history, so tag SNPs and imputation panels optimised in one ancestry capture variation imperfectly in another. This contributes to the reduced performance of European-derived arrays and polygenic scores in other populations.

Key figures

Montgomery Slatkin
Mark Daly
David Altshuler
Gonçalo Abecasis
William Bush

Seminal works

Slatkin (2008)
International HapMap Consortium (2007)
1000 Genomes Project Consortium (2015)

Frequently asked questions

How does linkage disequilibrium let a GWAS type only some variants?: Because variants in a haplotype block are strongly correlated, a genotyped tag SNP carries information about its untyped neighbours, so an array of well-chosen tags captures most common variation in the genome.
What is the difference between D' and r-squared?: D' measures whether recombination has historically separated two alleles, while r-squared measures how well one variant statistically predicts another; r-squared is the quantity most relevant to the power of tag-SNP-based association testing.
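
The link between r-squared and power in the answer above is often summarised by a standard approximation, sketched here under the assumption of a single causal variant tested indirectly through one tag SNP. The symbols are introduced only for this illustration: lambda is the non-centrality parameter of the association test, and N is the sample size needed to reach a given power.

```latex
\[
\lambda_{\text{tag}} \approx r^{2}\,\lambda_{\text{causal}},
\qquad
N_{\text{tag}} \approx \frac{N_{\text{causal}}}{r^{2}}
\]
```

For example, a tag with r-squared of 0.8 to the causal variant would need roughly 1/0.8 = 1.25 times the sample size, about 25% more, to reach the same power as typing the causal variant directly.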