Why does an optimizer sometimes choose a bad plan?

Cost-based optimizers rely on estimates of how many rows each operation will produce. When data is skewed or columns are correlated, these estimates can be far off, and the error compounds across joins. A wrong estimate can lead the optimizer to pick a poor join order or access path, even though the plan looked cheap on paper.

Why use dynamic programming for join ordering?

The number of possible join orders grows combinatorially with the number of tables, so exhaustive search is infeasible. Dynamic programming builds optimal plans for larger sets of relations from optimal plans for smaller subsets, dramatically reducing the work while still finding a good (often optimal) join order for typical query sizes.

Cost-Based Query Optimization

Cost-based query optimization searches the space of equivalent execution plans for a query and selects the one with the lowest estimated cost, using statistics about the data and a model of execution cost.

ค้นหาหัวข้อด้วย PaperMindเร็ว ๆ นี้Find papers & topics

Tools & resources

ดาวน์โหลดสไลด์

Learn & explore

วิดีโอเร็ว ๆ นี้

Definition

Cost-based query optimization is the process of estimating, from data statistics and a cost model, the execution cost of alternative equivalent plans for a query and selecting the plan of least estimated cost, typically with join ordering chosen by dynamic programming.

Scope

This topic covers how an optimizer chooses a plan: the plan search space generated by relational-algebra equivalences and alternative physical operators, the cost model that scores plans (chiefly in I/O and CPU), cardinality and selectivity estimation from statistics and histograms, and the dynamic-programming approach to join-order enumeration introduced by System R. It excludes the implementation of the operators themselves and rule-driven logical rewriting beyond what feeds plan generation.

Core questions

How is the space of equivalent execution plans generated and pruned?
How does the optimizer estimate the cardinality and selectivity of operations?
How does dynamic programming enumerate join orders efficiently?
What does the cost model account for, and how are I/O and CPU costs combined?
Why do cardinality-estimation errors cause poor plan choices?

Key concepts

plan search space
cost model (I/O and CPU)
selectivity and cardinality estimation
histograms and statistics
join-order enumeration
dynamic programming
interesting orders
transformation-based optimization

Key theories

System R dynamic-programming optimization: Selinger's optimizer enumerates join orders bottom-up with dynamic programming, keeping the cheapest plan for each subset of relations (and each interesting sort order), which makes searching the large join-order space tractable.
Cardinality and selectivity estimation: The optimizer estimates the number of tuples each operation produces using table statistics, histograms, and selectivity formulas; these estimates drive the cost model and are the main source of optimization error.
Cost models and search strategies: Plans are scored by a cost model combining I/O, CPU, and memory effects; beyond System R's dynamic programming, transformation-based optimizers explore plans via rewrite rules to extend optimization to richer operators and distributed settings.

Clinical relevance

The optimizer is the component that lets users write declarative SQL without specifying how to execute it; the quality of its cost estimates and search directly governs whether queries on large databases are fast, so cost-based optimization is among the most studied and commercially important parts of database systems.

History

The 1979 System R optimizer by Selinger and colleagues introduced cost-based optimization with dynamic-programming join ordering and selectivity-based cost estimation, defining the field. Later transformation-based optimizers such as Volcano and Cascades generalized the search, and ongoing work on cardinality estimation, including learned models, addresses the framework's central weakness.

Debates

Cardinality estimation versus adaptive execution: Because cost-based optimization is only as good as its size estimates, which are often wrong for correlated data, researchers debate whether to invest in better statistics and machine-learned estimators or to rely more on adaptive, runtime re-optimization that corrects bad plans during execution.

Key figures

Patricia Selinger
Goetz Graefe
Surajit Chaudhuri

Seminal works

selinger1979
graefe1993

Frequently asked questions

Why does an optimizer sometimes choose a bad plan?: Cost-based optimizers rely on estimates of how many rows each operation will produce. When data is skewed or columns are correlated, these estimates can be far off, and the error compounds across joins. A wrong estimate can lead the optimizer to pick a poor join order or access path, even though the plan looked cheap on paper.
Why use dynamic programming for join ordering?: The number of possible join orders grows combinatorially with the number of tables, so exhaustive search is infeasible. Dynamic programming builds optimal plans for larger sets of relations from optimal plans for smaller subsets, dramatically reducing the work while still finding a good (often optimal) join order for typical query sizes.