Pathway Enrichment and Network Analysis

A genomic experiment often ends with a list of dozens or hundreds of genes — too many to interpret one at a time. Pathway enrichment analysis asks a sharper question: given this gene list, are any known biological pathways or processes represented more than would be expected by chance? It is the standard route from a gene list to a biological interpretation.

Definition

Pathway enrichment analysis is a family of statistical methods that test whether the genes annotated to a defined biological pathway or gene set are over-represented among the genes implicated by an experiment. The test is applied either to a thresholded gene list (over-representation analysis) or to the full ranked list, where it asks whether the set's members concentrate toward the top or bottom of the ranking (gene set enrichment analysis).

Scope

This topic covers the two main families of enrichment methods — over-representation analysis on a selected gene list and gene set enrichment across a fully ranked list — together with the curated pathway resources they draw on and the statistical pitfalls that affect their validity. It is a methodological reference and does not provide clinical interpretation of results.

Core questions

  • Given a list of genes, which pathways or processes are statistically over-represented?
  • How does ranking-based enrichment differ from threshold-based over-representation?
  • Which background (reference) gene set should a test be evaluated against?
  • How are multiple testing and gene-length or selection biases controlled?

Key concepts

  • Over-representation analysis (ORA)
  • Gene set enrichment analysis (GSEA)
  • Gene sets and pathway databases (KEGG, Reactome, GO terms)
  • Background or reference gene set
  • Multiple-testing correction
  • Selection and length bias in RNA-seq enrichment

Mechanisms

Over-representation analysis takes a list of genes already selected by a threshold — for example, the genes called differentially expressed — and asks, typically with a hypergeometric or Fisher's exact test, whether any pathway contains more of those genes than expected given the background. Gene set enrichment analysis instead uses the whole ranked list of genes and tests whether members of a pathway tend to cluster toward the top or bottom of the ranking, avoiding the need to choose a hard threshold. Both rely on curated gene sets drawn from resources such as the Gene Ontology, KEGG, and Reactome. Validity depends on choosing an appropriate background and correcting for the many pathways tested; for RNA-seq data, methods must also account for the tendency of longer or more highly expressed genes to be detected as significant, a selection bias that uncorrected enrichment tests can mistake for biological signal.
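The over-representation test and the multiple-testing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular tool: the function names (hypergeom_upper_tail, benjamini_hochberg, over_representation) and the toy gene identifiers are hypothetical, the p-value is the one-sided hypergeometric upper tail, and Benjamini-Hochberg is assumed as the correction (the text does not prescribe one).

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from a background of N containing K pathway genes."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def over_representation(selected, background, gene_sets):
    """Map each gene-set name to (raw p-value, BH-adjusted p-value)."""
    background = set(background)
    selected = set(selected) & background  # ignore genes outside the background
    names, pvals = [], []
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = len(members & selected)
        names.append(name)
        pvals.append(
            hypergeom_upper_tail(overlap, len(members), len(selected), len(background))
        )
    return dict(zip(names, zip(pvals, benjamini_hochberg(pvals))))


# Hypothetical toy data: 20 background genes and two made-up gene sets.
background = [f"g{i}" for i in range(20)]
gene_sets = {"pathway_A": background[:5], "pathway_B": background[10:]}
selected = ["g0", "g1", "g2", "g3", "g10", "g11"]

results = over_representation(selected, background, gene_sets)
for name, (p, q) in results.items():
    print(f"{name}: p={p:.4f}, BH-adjusted={q:.4f}")
```

In this toy case four of the five pathway_A genes are in the selected list, so its p-value is far smaller than that of pathway_B, whose two hits are about what chance would give. With real data the number of gene sets tested is in the hundreds or thousands, which is where the adjustment matters.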

Clinical relevance

Pathway enrichment is the interpretive step that turns a differential-expression or variant result into a statement about biological processes, and it is widely used in translational genomics to generate mechanistic hypotheses. This entry describes how gene-level results are summarised at the pathway level and is intended as reference orientation, not as a basis for individual diagnostic or treatment decisions.

History

Early functional interpretation counted how many genes from a list fell into each annotation category, an approach formalised in over-representation tools such as DAVID. Gene set enrichment analysis (2005) reframed the problem around the full ranked gene list, which proved more sensitive to coordinated, subtle changes across a pathway. As RNA-seq replaced microarrays, methods such as GOseq (2010) corrected for the length and count biases specific to sequencing data, and curated pathway resources including KEGG and Reactome became the standard gene-set inputs.

Debates

Over-representation versus ranking-based enrichment
Over-representation analysis requires a significance threshold and so discards information below the cut-off, whereas gene set enrichment uses the entire ranking; each has different sensitivity and assumptions, and the choice can change which pathways are reported.
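To make the ranking-based side of this contrast concrete, the sketch below computes a weighted running-sum enrichment score over a ranked list: the sum steps up at each pathway member (in proportion to its score) and down at every other gene, and the score is the largest deviation from zero. It is a simplified illustration, not the published GSEA procedure. The function names and toy data are hypothetical, and the null distribution here comes from random gene sets of the same size with a two-sided comparison, which is an assumption made for brevity and may differ from how a given tool builds its null.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum statistic; ranked_genes and scores are sorted by descending score."""
    gene_set = set(gene_set)
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total = sum(hit_weights)
    if total == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for w, h in zip(hit_weights, hits):
        # Step up at pathway members, step down at every other gene.
        running += w / total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_pvalue(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Two-sided p-value from random gene sets of equal size (a simplification)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = len(set(gene_set) & set(ranked_genes))
    exceed = 0
    for _ in range(n_perm):
        null_set = rng.sample(ranked_genes, size)
        if abs(enrichment_score(ranked_genes, scores, null_set)) >= abs(observed):
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical toy ranking: ten genes ordered by a made-up differential score.
ranked = [f"g{i}" for i in range(10)]
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]

es_top, p_top = permutation_pvalue(ranked, scores, ["g0", "g1", "g2"], n_perm=200)
es_bottom = enrichment_score(ranked, scores, ["g7", "g8", "g9"])
print(f"top set: ES={es_top:.2f}, p={p_top:.3f}; bottom set: ES={es_bottom:.2f}")
```

No threshold appears anywhere in this calculation: every gene contributes through its position in the ranking, which is the property that distinguishes this family from over-representation analysis.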
Bias in enrichment from sequencing data
In RNA-seq, longer and more highly expressed genes are more likely to be called significant, so naive enrichment tests can report pathways enriched for long genes rather than for genuine biology unless this selection bias is corrected.
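One way to see what a length-aware correction does is to replace the uniform null with one in which longer genes are more likely to be drawn. The sketch below is a simplified stand-in written for illustration and is not the GOseq implementation: it estimates a differential-expression rate per length bin, then builds a null distribution by weighted sampling without replacement. All names (length_bin_weights, weighted_sample, length_aware_pvalue), the binning scheme, the pseudo-count, and the synthetic data are assumptions.

```python
import random


def length_bin_weights(lengths, is_de, n_bins=5):
    """Per-gene weight = DE rate of the gene's length bin (crude weighting function)."""
    genes = sorted(lengths, key=lengths.get)
    bin_size = -(-len(genes) // n_bins)  # ceiling division
    weights = {}
    for start in range(0, len(genes), bin_size):
        chunk = genes[start:start + bin_size]
        # Pseudo-count keeps every weight strictly positive.
        rate = (sum(is_de[g] for g in chunk) + 0.5) / (len(chunk) + 1.0)
        for g in chunk:
            weights[g] = rate
    return weights


def weighted_sample(weights, k, rng):
    """Weighted sampling without replacement using random keys u ** (1 / weight)."""
    keyed = sorted(
        weights, key=lambda g: rng.random() ** (1.0 / weights[g]), reverse=True
    )
    return set(keyed[:k])


def length_aware_pvalue(pathway, lengths, is_de, n_perm=2000, seed=0):
    """Resampling p-value for pathway overlap under a length-weighted null."""
    rng = random.Random(seed)
    weights = length_bin_weights(lengths, is_de)
    de_genes = {g for g, flag in is_de.items() if flag}
    pathway = set(pathway) & set(lengths)
    observed = len(pathway & de_genes)
    exceed = sum(
        len(pathway & weighted_sample(weights, len(de_genes), rng)) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)


# Synthetic data: 100 genes, and only the 30 longest are called significant.
lengths = {f"g{i}": 100 * i for i in range(1, 101)}
is_de = {g: length > 7000 for g, length in lengths.items()}
long_gene_pathway = [f"g{i}" for i in range(91, 101)]  # made up of the longest genes

p_value = length_aware_pvalue(long_gene_pathway, lengths, is_de)
print(f"length-aware p-value: {p_value:.3f}")
```

In this synthetic example the pathway consists entirely of long genes, all of which are called significant. A uniform null would treat that overlap as extremely unlikely, whereas the length-weighted null expects long genes to be drawn often and so assigns a much less extreme p-value.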

Key figures

  • Aravind Subramanian
  • Jill Mesirov
  • Da Wei Huang
  • Minoru Kanehisa

Seminal works

  • Subramanian et al. (2005)
  • Huang et al. (2009)
  • Kanehisa and Goto (2000)
  • Young et al. (2010)

Frequently asked questions

What is the difference between over-representation analysis and gene set enrichment analysis?
Over-representation analysis tests a pre-selected list of genes (for example, those above a significance threshold) for pathway over-representation, while gene set enrichment analysis uses the entire ranked list of genes and asks whether a pathway's members cluster toward the extremes of the ranking, avoiding a hard cut-off.
Why does the choice of background gene set matter?
Enrichment is judged relative to a reference set of genes; using an inappropriate background (for example, all genes when only a subset could have been detected) can make pathways appear enriched or depleted for statistical rather than biological reasons.
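The effect of the background can be checked with a small calculation. The sketch below evaluates the same overlap against two reference sets using a one-sided hypergeometric test. The function name ora_pvalue and all the counts are hypothetical, chosen only to illustrate the point, and the example assumes every gene in the pathway is detectable so that the pathway size is the same in both cases.

```python
from math import comb


def ora_pvalue(overlap, pathway_size, list_size, background_size):
    """One-sided hypergeometric P(X >= overlap) for an over-representation test."""
    total = comb(background_size, list_size)
    upper = min(pathway_size, list_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical scenario: 200 selected genes, 8 of them in a 40-gene pathway.
p_all_genes = ora_pvalue(8, 40, 200, 20000)  # background = every annotated gene
p_detectable = ora_pvalue(8, 40, 200, 2000)  # background = genes that could be detected

print(f"all genes as background:  p = {p_all_genes:.2e}")
print(f"detectable genes only:    p = {p_detectable:.2e}")
```

Against the larger background the expected overlap is well under one gene, so eight hits look highly significant. Against the restricted background the expected overlap is four genes, and the same eight hits are far less surprising.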
