What is the difference between partitioning and replication?

Partitioning divides the data so each node holds a different subset, which spreads storage and load for scalability. Replication keeps copies of the same data on multiple nodes for availability and faster reads. Most large systems do both: data is sharded across nodes, and each shard is replicated several times.

Why use consistent hashing instead of plain hash partitioning?

With ordinary modulo-based hash partitioning, changing the number of nodes remaps almost every key, forcing massive data movement. Consistent hashing arranges keys and nodes on a ring so that adding or removing a node only reassigns the keys near that node, keeping rebalancing cheap as the cluster grows or shrinks.

Data Partitioning and Replication

Data partitioning splits a database across multiple nodes for scalability, while replication keeps copies of data on several nodes for availability and read performance; together they determine how a distributed database scales and tolerates failure.

Temukan Topik dengan PaperMindSegeraFind papers & topics

Tools & resources

Unduh salindia

Learn & explore

VideoSegera

Definition

Partitioning (fragmentation or sharding) divides a relation's rows or columns across multiple nodes so each holds a portion of the data; replication stores copies of the same data on multiple nodes; placement and replication policies jointly govern scalability, availability, and load balance.

Scope

This topic covers how data is placed across nodes: horizontal partitioning (sharding) by range, hash, or list and vertical partitioning by column; partitioning strategies including consistent hashing; and replication models — synchronous versus asynchronous, primary-backup versus multi-primary — together with the consistency-availability trade-offs they imply. It treats how partitioning enables parallelism and how replication enables fault tolerance. It excludes the commit and consensus protocols that keep replicas in agreement, which are an adjacent topic.

Core questions

How do range, hash, and list partitioning distribute rows across nodes?
When is vertical partitioning preferable to horizontal partitioning?
How does consistent hashing limit data movement when nodes are added or removed?
What are the trade-offs between synchronous and asynchronous replication?
How do primary-backup and multi-primary replication differ in consistency and availability?

Key concepts

horizontal partitioning (sharding)
vertical partitioning
range, hash, and list partitioning
consistent hashing
synchronous versus asynchronous replication
primary-backup replication
multi-primary replication
partition key and load balance

Key theories

Horizontal and vertical partitioning: Horizontal partitioning (sharding) distributes a table's rows across nodes by a partition key to spread load and enable parallel processing, while vertical partitioning splits a table by columns; the partitioning function critically affects load balance and query locality.
Consistent hashing: Consistent hashing maps keys and nodes onto a ring so that adding or removing a node moves only a small, bounded fraction of keys, making it a foundational technique for partitioning in elastic distributed data stores.
Replication models and trade-offs: Synchronous replication keeps copies identical at the cost of latency and availability under partitions, while asynchronous replication is faster but can serve stale data; primary-backup centralizes writes whereas multi-primary allows writes anywhere at the cost of conflict resolution.

Clinical relevance

Partitioning and replication are the levers that make data systems scale and stay available: sharding lets a single logical database serve workloads no one machine could handle, and replication keeps services running and fast across failures and regions, making these techniques central to every large-scale data platform.

History

Fragmentation and replication were studied in early distributed-database systems of the late 1970s and 1980s. Consistent hashing, introduced by Karger and colleagues in 1997 for web caching, was later adopted by scalable key-value stores as a partitioning scheme, and large internet services popularized aggressive sharding and replication for elasticity and availability.

Key figures

M. Tamer Özsu
Patrick Valduriez
David Karger

Seminal works

ozsu2011
karger1997

Frequently asked questions

What is the difference between partitioning and replication?: Partitioning divides the data so each node holds a different subset, which spreads storage and load for scalability. Replication keeps copies of the same data on multiple nodes for availability and faster reads. Most large systems do both: data is sharded across nodes, and each shard is replicated several times.
Why use consistent hashing instead of plain hash partitioning?: With ordinary modulo-based hash partitioning, changing the number of nodes remaps almost every key, forcing massive data movement. Consistent hashing arranges keys and nodes on a ring so that adding or removing a node only reassigns the keys near that node, keeping rebalancing cheap as the cluster grows or shrinks.