Nucleotide Diversity and Variant Classification

Nucleotide diversity measures how much two randomly chosen sequences from a population differ on average, while variant classification organises the many kinds of DNA differences (single-nucleotide substitutions, small insertions and deletions, and larger structural changes) into a consistent vocabulary. Together they describe both how much variation a genome carries and what that variation looks like.

Definition

Nucleotide diversity (commonly denoted pi) is the average number of nucleotide differences per site between two sequences sampled from a population; variant classification is the systematic categorisation of observed sequence differences (e.g., single-nucleotide variants, indels, structural variants).
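
The definition above can be made concrete with a small calculation. The following is a minimal sketch, assuming the sequences are already aligned, equal in length, and free of gaps or missing data; the function name `nucleotide_diversity` and the toy sample are illustrative and not part of any standard library.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average number of differences per site over all pairs of sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned, non-empty, and equal in length")

    # Every unordered pair of sampled sequences contributes equally.
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    # Divide by the number of pairs (average per pair) and by length (per site).
    return total_differences / (len(pairs) * length)


# Hypothetical toy alignment: four sequences, eight sites.
sample = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]
pi = nucleotide_diversity(sample)
print(f"pi per site = {pi:.4f}")
```

In this toy sample the six pairs differ at 7 positions in total, so pi is 7 / (6 x 8) per site. Real data sets usually need explicit handling of missing genotypes and uncalled sites, which this sketch leaves out.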

Scope

The entry covers the standard summary measures of within-population sequence variation, especially nucleotide diversity and the number of segregating sites, and the classification of variant types by size and by predicted effect on sequence. It treats these as descriptive and methodological concepts; it does not assign clinical significance to particular variants.

Core questions

How is the amount of sequence variation in a sample summarised?
How do nucleotide diversity and the number of segregating sites differ as estimators?
What are the main classes of genetic variant by size and type?
How are variants represented and exchanged in a standard file format?

Key concepts

Nucleotide diversity (pi)
Segregating sites and Watterson's theta
Single-nucleotide variant (SNV/SNP)
Insertion-deletion (indel)
Structural variant
Reference and alternate alleles
Variant Call Format (VCF)
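
The variant classes and the reference/alternate representation listed above can be tied together in a short sketch. It assumes a VCF data line whose first five tab-separated columns are CHROM, POS, ID, REF and ALT, with multiple alternate alleles separated by commas. The function names, the category labels, and the 50 bp cut-off for calling something a structural variant are assumptions made for illustration (50 bp is a common convention, not a value stated in this entry).

```python
def classify_allele(ref, alt, sv_min_length=50):
    """Label one REF/ALT pair by form and size (illustrative categories)."""
    if alt in {".", "*"}:
        # "." means no alternate allele; "*" marks an allele removed by an
        # upstream deletion. Neither is classified in this sketch.
        return "not classified"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        # Symbolic alleles such as <DEL> and breakend notation.
        return "structural variant"
    if abs(len(ref) - len(alt)) >= sv_min_length:
        # Assumed size threshold separating indels from structural variants.
        return "structural variant"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "multi-nucleotide substitution"
    return "insertion" if len(alt) > len(ref) else "deletion"


def classify_vcf_record(line):
    """Classify every alternate allele on one VCF data line (not a header line)."""
    chrom, pos, _variant_id, ref, alts = line.rstrip("\n").split("\t")[:5]
    return [
        (chrom, int(pos), ref, alt, classify_allele(ref, alt))
        for alt in alts.split(",")
    ]


# Hypothetical example records.
records = [
    "20\t14370\trs6054257\tG\tA\t29\tPASS\t.",
    "20\t1234567\t.\tGTC\tG,GTCT\t50\tPASS\t.",
    "1\t2827694\t.\tC\t<DEL>\t.\tPASS\tSVTYPE=DEL",
]
for record in records:
    for chrom, pos, ref, alt, label in classify_vcf_record(record):
        print(chrom, pos, ref, alt, label)
```

Classifying by allele length alone is deliberately crude: it does not normalise or left-align indels, and production pipelines generally rely on a dedicated VCF library for parsing.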

Key theories

Infinite-sites model and theta: Under the infinite-sites assumption each new mutation falls at a previously unmutated site, so every segregating site records exactly one mutation. The population mutation parameter theta can then be estimated either from the number of segregating sites (Watterson's estimator) or from average pairwise differences (nucleotide diversity); a systematic discrepancy between the two signals a departure from the neutral, constant-size model.
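
As a sketch of the first of these estimators, Watterson's theta divides the number of segregating sites S by the harmonic number a1 = 1 + 1/2 + ... + 1/(n-1), where n is the sample size. The code below assumes aligned, gap-free sequences; the helper names and the toy sample are illustrative.

```python
def segregating_sites(sequences):
    """Count alignment columns in which more than one nucleotide is present."""
    return sum(len(set(column)) > 1 for column in zip(*sequences))


def watterson_theta(num_segregating, sample_size):
    """Watterson's estimator of theta for the whole locus: S / a1."""
    if sample_size < 2:
        raise ValueError("need at least two sequences")
    a1 = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating / a1


# Hypothetical toy alignment: four sequences, eight sites.
sample = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]
s = segregating_sites(sample)
theta_w = watterson_theta(s, len(sample))
theta_w_per_site = theta_w / len(sample[0])
print(f"S = {s}, theta_W per site = {theta_w_per_site:.4f}")
```

Here two columns vary, a1 is 11/6, and theta_W is 12/11 for the locus. Dividing by the sequence length puts it on the same per-site scale as nucleotide diversity, which is what makes the two estimates directly comparable.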

Mechanisms

Variation is first detected by aligning sequenced reads to a reference genome and identifying positions that differ; differences are then classified by size and form. Summary statistics condense this into population-level measures: the number of segregating sites underlies Watterson's estimator of theta, while average pairwise differences define nucleotide diversity. Because both estimate the same parameter under a neutral, constant-size model, their normalised difference (Tajima's D) flags demographic change or selection: an excess of rare variants pulls it negative, as after population expansion or a selective sweep, while an excess of intermediate-frequency variants pushes it positive, as under balancing selection or population contraction. Standardised representation in the Variant Call Format allows variants to be stored, shared, and compared across studies.
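
The comparison of the two estimators can be sketched as a direct computation of Tajima's D, using the variance constants from Tajima's 1989 formulation. The code assumes aligned, gap-free sequences and a sample of at least four (the variance term is zero for smaller samples); the function name and the toy data are illustrative.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D: (mean pairwise differences - S / a1) / standard deviation."""
    n = len(sequences)
    if n < 4:
        raise ValueError("need at least four sequences")

    # S: number of alignment columns with more than one nucleotide.
    s = sum(len(set(column)) > 1 for column in zip(*sequences))
    if s == 0:
        return float("nan")  # undefined when nothing segregates

    # k: mean number of differences per pair, for the whole locus.
    pairs = list(combinations(sequences, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignments.
balanced = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]
singletons = ["AAAAAA", "CAAAAA", "ACAAAA", "AACAAA"]
print(tajimas_d(balanced), tajimas_d(singletons))
```

In the second toy sample every variant is carried by a single sequence, so pairwise differences fall below the expectation from S and D comes out negative. With samples this small the values are only illustrative; meaningful inference needs larger samples and an appropriate null distribution.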

Clinical relevance

A consistent variant vocabulary and reliable diversity estimates are prerequisites for interpreting genomic data in health settings, because the same descriptive categories are used when a sequenced genome is screened for clinically relevant variants. This entry explains how variants are described and counted and is not a basis for individual diagnostic or treatment decisions.

Evidence & guidelines

Foundational estimators of sequence diversity were established by Watterson and by Tajima, while large surveys such as the early human SNP map and the 1000 Genomes Project reference provide the empirical scale of human variation. The Variant Call Format and its tooling are the de facto community standard for representing classified variants.

History

Early molecular population genetics quantified variation through allozyme and restriction-site surveys, then through DNA sequencing. Watterson's 1975 and Tajima's 1989 work gave the estimators still used today, and the 2001 human SNP map and later sequencing consortia turned variant cataloguing into a genome-wide enterprise, accompanied by standard formats such as VCF for representing the resulting variants.

Key figures

G. A. Watterson
Fumio Tajima
Richard Durbin
Gonçalo Abecasis

Seminal works

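Watterson, G. A. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7: 256-276.
Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
The International SNP Map Working Group (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.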

Frequently asked questions

What is the difference between nucleotide diversity and the number of segregating sites?
The number of segregating sites counts how many positions vary in a sample, while nucleotide diversity averages the differences between pairs of sequences; both estimate the same underlying parameter under a simple neutral model, and their discrepancy is itself informative.

Is a SNP the same thing as a mutation?
A SNP is a single-nucleotide variant observed segregating in a population; it originates from a point mutation, but the term emphasises that the variant is present at appreciable frequency rather than being a newly arisen change in one individual.