ScholarGate
Assistant

Query Processing and Optimization

Query processing and optimization is the part of a database system that turns a declarative query into an efficient execution plan, choosing physical operators and access paths to minimize cost over large data sets.

Definition

Query processing is the set of activities by which a database system parses, optimizes, and executes a query; query optimization is the search for an execution plan of low estimated cost among the many plans that are logically equivalent to the query.

Scope

This area covers the pipeline from a parsed query to results: translation of SQL into relational-algebra expressions, the physical operators that implement each algebraic operation (scans, joins, sorting, aggregation), the indexes and access methods that provide fast paths to data, and the cost-based optimizer that uses statistics to choose among equivalent plans. It excludes transaction and concurrency control and the design of schemas; it focuses on how a single query is evaluated efficiently.

Sub-topics

Core questions

  • How is a declarative query translated into a tree of physical operators?
  • What physical algorithms implement selection, join, sorting, and aggregation?
  • How do indexes and access methods speed up data retrieval?
  • How does a cost-based optimizer estimate plan cost and choose among plans?
  • Why can logically equivalent queries differ enormously in execution cost?

Key concepts

  • physical query plan
  • relational-algebra equivalences
  • join algorithms
  • sorting and external merge sort
  • indexes and access methods
  • selectivity and cardinality estimation
  • cost model
  • join-order enumeration
  • pipelining and materialization

Key theories

Logical-to-physical plan generation
A query is first rewritten using relational-algebra equivalences into candidate logical plans, then each algebraic operator is mapped to a physical algorithm, yielding executable plans whose costs can be compared.
Cost-based optimization
The optimizer uses statistics on table sizes and value distributions to estimate the cost of alternative plans — especially join orders and access paths — and selects the cheapest, an approach pioneered by System R's optimizer.
Access-path selection
Choosing whether to scan a table, use an index, or exploit clustering for each operation is central to performance; the optimizer weighs selectivity estimates and I/O cost to pick the best access path.

Clinical relevance

Query optimization is what makes relational databases usable at scale: the same SQL query can run in milliseconds or hours depending on the plan chosen, so the optimizer's quality directly determines the performance of business analytics, transaction processing, and data-intensive applications.

History

The cost-based optimizer of IBM's System R (Selinger et al., 1979) established the dominant approach: enumerate plans, estimate costs from statistics, and use dynamic programming for join ordering. Decades of refinement added better cardinality estimation, new operators, and adaptive techniques, but the basic framework remains the foundation of modern query optimizers.

Debates

Reliability of cardinality estimation
Cost-based optimizers depend on estimates of intermediate result sizes, which can be badly wrong for correlated or skewed data; researchers debate how much to invest in better statistics and learned estimators versus adaptive, runtime re-optimization.

Key figures

  • Patricia Selinger
  • Jeffrey D. Ullman
  • Michael Stonebraker

Related topics

Seminal works

  • selinger1979
  • garciamolina2008
  • silberschatz2019

Frequently asked questions

Why does the same query run fast on one system and slow on another?
Performance depends on the execution plan the optimizer chooses, which in turn depends on available indexes, statistics about the data, and the cost model. Different systems, or the same system with stale statistics, can pick very different plans — for example a different join order or an index scan versus a full table scan.
What is the hardest part of query optimization?
Accurately estimating the sizes of intermediate results (cardinality estimation). Errors compound across joins, so a small mis-estimate early in a plan can lead the optimizer to choose a drastically worse join order or access path. This is why cardinality estimation remains an active research problem.

Methods for this concept

Related concepts