Why is network cost so important in distributed query processing?

In a distributed database the slowest and most contended resource is usually the network between nodes. Shipping large intermediate results across nodes can dominate total query time, so optimizers and techniques such as the semijoin focus on transferring as little data as possible, even at the cost of extra local computation.

When is a broadcast join used instead of a repartition join?

A broadcast join sends a copy of one input to every node and is efficient when that input is small. A repartition (shuffle) join redistributes both inputs by the join key across nodes and is used when both relations are large. The optimizer compares the communication cost of broadcasting versus shuffling to choose.

Distributed Query Processing

Distributed query processing evaluates queries over data spread across many nodes, exploiting parallelism for speed and minimizing the network communication that dominates cost in a distributed setting.

Tìm chủ đề với PaperMindSắp ra mắtFind papers & topics

Tools & resources

Tải xuống bản trình chiếu

Learn & explore

VideoSắp ra mắt

Definition

Distributed query processing is the decomposition, optimization, and execution of a query over data located at multiple sites or partitions, where the plan must coordinate work across nodes and minimize both computation and inter-node data transfer.

Scope

This topic covers how queries run across partitioned and replicated data: forms of parallelism (partitioned, pipelined, and independent); parallel and distributed join strategies such as repartition and broadcast joins; communication-reducing techniques such as the semijoin; and the extension of cost-based optimization to account for network transfer and data placement. It treats how a logical query is decomposed and scheduled across nodes. It excludes data placement decisions and the commit protocols for distributed transactions.

Core questions

What forms of parallelism (partitioned, pipelined, independent) can a distributed plan exploit?
How are joins executed when the inputs are partitioned across nodes?
How does the semijoin reduce the amount of data shipped between sites?
How does optimization change when network cost dominates?
How does data placement affect which plan is cheapest?

Key concepts

partitioned parallelism
pipelined parallelism
independent parallelism
repartition (shuffle) join
broadcast join
semijoin reduction
communication cost
data localization

Key theories

Parallelism in query execution: Distributed plans exploit partitioned parallelism (the same operator runs on disjoint data partitions), pipelined parallelism (operators in a chain run concurrently), and independent parallelism (unrelated subplans run at once) to reduce response time.
Distributed and parallel joins: Joins over partitioned data use repartitioning (shuffling both inputs by the join key) or broadcasting a small input to all nodes; choosing between them depends on relation sizes and the existing partitioning.
Semijoin and communication minimization: The semijoin reduces a relation to only the tuples that can match before shipping it across the network, cutting communication cost; this technique was central to early distributed query processors such as SDD-1.

Clinical relevance

Distributed query processing is what lets analytic systems answer queries over data far larger than any single machine could hold, and the techniques for minimizing network traffic and maximizing parallelism directly determine the speed of large-scale data warehouses and query engines.

History

Early distributed query processing was pioneered in the SDD-1 system around 1980, which introduced semijoin-based communication reduction. Shared-nothing parallel databases of the 1980s and 1990s, surveyed by DeWitt and Gray, established repartition and broadcast joins and the parallelism taxonomy that modern distributed query engines still use.

Key figures

Philip Bernstein
David DeWitt
M. Tamer Özsu
Patrick Valduriez

Seminal works

bernstein1981
dewitt1992
ozsu2011

Frequently asked questions

Why is network cost so important in distributed query processing?: In a distributed database the slowest and most contended resource is usually the network between nodes. Shipping large intermediate results across nodes can dominate total query time, so optimizers and techniques such as the semijoin focus on transferring as little data as possible, even at the cost of extra local computation.
When is a broadcast join used instead of a repartition join?: A broadcast join sends a copy of one input to every node and is efficient when that input is small. A repartition (shuffle) join redistributes both inputs by the join key across nodes and is used when both relations are large. The optimizer compares the communication cost of broadcasting versus shuffling to choose.