ScholarGate

Genome Assembly Algorithms and Methods

Genome assembly is the computational problem of reconstructing a genome from the many overlapping short or long reads produced by sequencing, since no current technology reads a whole chromosome end to end. The algorithms that solve it determine how completely and accurately a genome can be recovered from raw sequence data.


Definition

Genome assembly is the algorithmic reconstruction of a genome's sequence by detecting overlaps among sequencing reads and merging them into longer contiguous sequences (contigs), which may then be ordered and oriented into scaffolds. Assembly is called de novo when it uses the reads alone and reference-guided when it draws on an existing reference genome.
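
The overlap-and-merge step in this definition can be illustrated with a minimal Python sketch. The function names, the `min_overlap` threshold, and the example reads are hypothetical choices for illustration. The sketch assumes error-free reads on a single strand, which real assemblers cannot assume.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the longest overlap.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two overlapping reads into one contig, or return None if they do not overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[length:]


# Illustrative reads (made up for this example).
contig = merge_reads("ATGCGTAC", "GTACCTGA")
```

Here the two reads share the four bases `GTAC`, so the merged contig is `ATGCGTACCTGA`.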

Scope

The entry covers the two dominant algorithmic paradigms, overlap-layout-consensus and the de Bruijn graph, the distinction between de novo assembly and reference-guided assembly, and the concepts of contigs and scaffolds. It is a methodological topic focused on the computational reconstruction step and does not address laboratory protocols or clinical use.

Core questions

  • Why must sequencing reads be assembled rather than read directly as whole chromosomes?
  • How do overlap-layout-consensus and de Bruijn graph approaches differ?
  • What limits assembly completeness, and how do repeats and read length matter?

Key concepts

  • Overlap-layout-consensus assembly
  • De Bruijn graph assembly
  • k-mers
  • Contigs and scaffolds
  • De novo versus reference-guided assembly
  • Repeat resolution
  • Assembly contiguity (e.g., N50)
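
As a worked illustration of the contiguity metric named in the list above, N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal Python sketch. The function name `n50` and the example lengths are illustrative assumptions, not output from any particular assembler.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given the lengths of its contigs.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of all bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if running * 2 >= total:
            return length


# Illustrative contig lengths (made up for this example): total is 290 bases.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

In this example the two longest contigs (80 + 70 = 150) already cover more than half of the 290 bases, so the N50 is 70.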

Mechanisms

Assembly algorithms reconstruct a genome by exploiting the overlaps between reads. Overlap-layout-consensus methods compute pairwise overlaps among reads, arrange the reads into a layout, and derive a consensus sequence; this approach suits longer reads and underpinned early whole-genome shotgun assemblies. De Bruijn graph methods instead break reads into fixed-length subsequences (k-mers) and represent the genome as paths through a graph of overlapping k-mers, which scales efficiently to the very large numbers of short reads produced by high-throughput sequencing. Repetitive regions longer than the read length create ambiguities that fragment assemblies, so longer reads and paired-read information are used to resolve them and to link contigs into scaffolds.
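
The de Bruijn graph idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by Velvet or any other assembler. The function names and example reads are hypothetical, reads are assumed error-free and single-stranded, and the walk checks only the number of successors of each node, whereas a real assembler would also check predecessors, handle reverse complements, and correct errors.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer becomes an edge between its
    (k-1)-mer prefix and its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break  # stop rather than loop forever on a cycle
        contig += next_node[-1]  # each step adds one new base
        visited.add(next_node)
        node = next_node
    return contig


# Illustrative overlapping reads (made up for this example).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
```

With these reads and k = 4, every node has a single successor, so the walk from `ATG` recovers `ATGGCGTGCAAT`. Note that the reads are never compared pairwise: the work grows with the number of k-mers rather than with the number of read pairs, which is the property that makes the approach scale to short-read data.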

Clinical relevance

Genome assembly is the computational foundation that turns raw sequencing data into the contiguous sequences used to build reference genomes and to study previously uncharacterised organisms. This entry is reference and educational material describing how assembly works and is not guidance for any clinical or diagnostic procedure.

Evidence & guidelines

The methodological literature is primary and review-based rather than guideline-based: Idury and Waterman (1995) introduced a graph formulation foreshadowing de Bruijn assembly, Zerbino and Birney (2008) established de Bruijn graph assembly for short reads with Velvet, and the whole-genome shotgun assembly of the human genome (Venter et al., 2001) exemplifies the overlap-layout-consensus paradigm at scale.

History

Early assemblers used overlap-layout-consensus methods well suited to the relatively long reads of Sanger sequencing, as in the whole-genome shotgun assembly of the human genome in 2001. The shift to short-read high-throughput sequencing made de Bruijn graph methods, anticipated by graph formulations from the mid-1990s and realised in tools such as Velvet (2008), the dominant paradigm, while the later return of long reads renewed interest in overlap-based approaches for resolving repeats.

Key figures

  • Michael Waterman
  • Daniel Zerbino
  • Ewan Birney
  • Eugene Myers

Seminal works

  • Idury and Waterman (1995)
  • Zerbino and Birney (2008)
  • Venter et al. (2001)

Frequently asked questions

What is the difference between de novo and reference-guided assembly?
De novo assembly reconstructs a genome from reads alone, without using a prior sequence, whereas reference-guided assembly aligns or scaffolds reads against an existing reference genome to assist the reconstruction.
Why are repetitive regions hard to assemble?
When a repeat is longer than the reads, no single read spans it, so the algorithm cannot tell which copy a read came from. This creates ambiguous paths that break the assembly into shorter fragments; longer reads that span the whole repeat help resolve it.
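
The effect of repeats can be made concrete with a small Python sketch that uses k-mer size as a stand-in for read length. The function name `branching_nodes` and the toy sequence are assumptions made for illustration. The sketch lists the (k-1)-mers that are followed by more than one distinct (k-1)-mer, which is where an assembler could not choose a unique path.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the sorted (k-1)-mers that have more than one distinct successor
    when `sequence` is decomposed into k-mers."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)


# Toy sequence (made up) in which "ACG" occurs twice with different continuations.
toy = "TTACGTGGACGATC"
ambiguous_short = branching_nodes(toy, k=4)  # k-mers no longer than the repeat context
ambiguous_long = branching_nodes(toy, k=6)   # k-mers long enough to span the repeat
```

With k = 4 the repeated `ACG` is followed by both `CGT` and `CGA`, so it is reported as ambiguous; with k = 6 each window includes enough flanking sequence to be unique and no branching remains.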
