ScholarGate
Asistent

Distributed and Parallel Databases

Distributed and parallel databases spread data and query processing across multiple machines to achieve scalability, availability, and high performance while preserving a coherent view of the data.

Găsește o temă cu PaperMindÎn curândFind papers & topics
Tools & resources
Descarcă prezentarea
Learn & explore
VideoÎn curând

Definition

A distributed database stores data across multiple networked sites that appear to users as a single database, and a parallel database uses multiple processors and disks (typically shared-nothing) to execute database operations concurrently for higher throughput and lower latency.

Scope

This area covers managing data across many nodes: how data is partitioned (fragmented) and replicated; how queries are processed in parallel across partitions and across distributed sites; and how transactions commit atomically and replicas stay consistent through commit and consensus protocols. It treats the architectural distinction between shared-nothing parallel databases and geographically distributed databases. It is the database-specific complement to general distributed-computing topics, which it cites but does not duplicate; it excludes general-purpose consensus and distributed-systems theory beyond their database use.

Sub-topics

Core questions

  • How is data partitioned and replicated across nodes, and why?
  • How are queries executed in parallel across partitions and sites?
  • How is a transaction committed atomically when it spans multiple nodes?
  • How do replicas remain consistent in the presence of failures?
  • How do parallel (shared-nothing) and geographically distributed designs differ?

Key concepts

  • horizontal and vertical fragmentation
  • replication
  • shared-nothing architecture
  • partitioned and pipelined parallelism
  • distributed query processing
  • two-phase commit
  • consensus and replica consistency
  • speedup and scaleup

Key theories

Data partitioning and replication
Tables are horizontally or vertically fragmented and distributed across nodes for scalability, and copies are replicated for availability and read performance; the placement strategy determines load balance and fault tolerance.
Parallel query processing
Shared-nothing parallel databases achieve near-linear speedup and scaleup by partitioning data and executing operators such as scans and joins in parallel across nodes, exploiting partitioned and pipelined parallelism.
Distributed commit and replica consistency
Atomic commitment protocols such as two-phase commit ensure all-or-nothing outcomes across sites, and consensus and replication protocols keep replicas consistent despite node and network failures.

Clinical relevance

Distributed and parallel databases are what let data systems scale to internet workloads: parallel data warehouses run analytics over petabytes, geographically distributed databases keep global services available and low-latency, and the partitioning, replication, and commit techniques here underpin nearly every large-scale data platform.

History

Distributed database research began in the late 1970s with systems such as SDD-1 and distributed Ingres. The 1980s saw shared-nothing parallel databases (Gamma, Teradata) that DeWitt and Gray argued in 1992 were the future of high-performance data management. Internet-scale demands later drove the partitioned, replicated systems that define modern cloud data platforms.

Key figures

  • M. Tamer Özsu
  • Patrick Valduriez
  • David DeWitt
  • Jim Gray

Related topics

Seminal works

  • ozsu2011
  • dewitt1992
  • silberschatz2019

Frequently asked questions

What is the difference between a distributed database and a parallel database?
A parallel database uses many tightly coupled processors and disks, usually in one location with a fast interconnect (often a shared-nothing cluster), to run queries faster. A distributed database spreads data across separate, often geographically dispersed sites for availability and locality. The line blurs, but parallel databases emphasize performance and distributed databases emphasize distribution and autonomy.
Why is shared-nothing the dominant parallel architecture?
In a shared-nothing design each node has its own CPU, memory, and disk, so there is no central resource to become a bottleneck as nodes are added. This lets the system achieve near-linear speedup and scaleup, which is why it became the standard architecture for scalable parallel and analytic databases.

Methods for this concept

Related concepts