ScholarGate
Asistents

String Indexing Structures

String indexing structures preprocess a text into a searchable form — tries, suffix trees, or suffix arrays — so that many later pattern queries, and questions about repeats and common substrings, can be answered quickly.

Atrast tematu ar PaperMindDrīzumāFind papers & topics
Tools & resources
Lejupielādēt slaidus
Learn & explore
VideoDrīzumā

Definition

A string indexing structure is a data structure built from a text (or set of strings) that supports efficient queries about the text — such as whether and where a pattern occurs, the longest repeated substring, or the longest common substring of two texts — typically by representing the text's prefixes or suffixes.

Scope

This topic covers data structures that index strings for fast repeated querying: tries (prefix trees) for sets of strings, suffix trees that represent all suffixes of a text for linear-time pattern search, suffix arrays as their space-efficient counterpart, and supporting structures such as the longest-common-prefix array. It treats their construction algorithms and the queries they accelerate, and excludes single-shot pattern matching, which does not preprocess the text.

Core questions

  • How does a trie organize a set of strings for prefix-based search?
  • How does a suffix tree represent all suffixes so that pattern queries take time proportional to the pattern length?
  • How do suffix arrays achieve similar power with less memory?
  • What construction algorithms build these structures efficiently, ideally in linear time?
  • Which text problems (repeats, common substrings) do these indexes solve elegantly?

Key concepts

  • trie (prefix tree)
  • suffix tree
  • suffix array
  • longest-common-prefix array
  • on-line construction
  • longest repeated substring
  • longest common substring
  • space-time trade-offs

Key theories

Suffix trees and linear-time construction
A suffix tree compactly represents every suffix of a text, allowing a pattern to be searched in time proportional to its length; Ukkonen's algorithm constructs it on-line in time linear in the text length.
Suffix arrays as a space-efficient index
A suffix array stores the sorted order of all suffixes in a plain integer array; combined with a longest-common-prefix array it supports fast pattern search using far less memory than a suffix tree.

Clinical relevance

Text indexes power large-scale search and bioinformatics: suffix arrays and related compressed indexes underlie genome read aligners and full-text search engines, tries support autocomplete and IP routing tables, and these structures make repeated queries over massive texts practical where re-scanning would be prohibitive.

History

Peter Weiner introduced suffix trees in 1973, with simpler linear-time constructions by McCreight (1976) and Ukkonen (1995). Manber and Myers introduced suffix arrays in 1990 as a more space-efficient alternative, and later work on compressed and FM-indexes extended these ideas to very large texts and genomes.

Key figures

  • Peter Weiner
  • Edward McCreight
  • Esko Ukkonen
  • Udi Manber
  • Gene Myers
  • Dan Gusfield

Related topics

Seminal works

  • ukkonen1995
  • manber1993
  • gusfield1997

Frequently asked questions

When should I use a suffix array instead of a suffix tree?
Suffix arrays use considerably less memory than suffix trees and have good cache behavior, making them preferable for large texts such as genomes. Suffix trees expose more structure directly and can make some algorithms conceptually simpler, but at higher space cost; suffix arrays plus a longest-common-prefix array recover most of that power.
What is a trie good for?
A trie stores a set of strings by shared prefixes, giving fast prefix search, autocompletion, and ordered traversal. It is well suited to dictionaries, autocomplete suggestions, and longest-prefix matching such as in IP routing, where keys share common beginnings.

Methods for this concept

Related concepts