What problem does consistent hashing solve?

When data is partitioned across nodes by hashing keys, naive hashing reshuffles almost everything when a node is added or removed. Consistent hashing arranges keys and nodes on a ring so that such a change relocates only a small, bounded fraction of keys, which is essential for elastic, churning clusters.

Scalable Storage Systems

Scalable storage systems spread data across many machines to provide capacity, throughput, and availability beyond any single server while masking the failures of individual nodes.

Definition

A scalable storage system stores data across a cluster of machines, partitioning it for capacity and throughput and replicating it for durability and availability, so that the aggregate system scales with the number of nodes while tolerating individual node failures.

Scope

This topic covers distributed file systems designed for commodity clusters, distributed key-value and wide-column stores, and the structured-overlay techniques—consistent hashing and distributed hash tables—used to partition and locate data at scale. It covers data partitioning (sharding), replication for durability, and the consistency and availability trade-offs that distinguish strongly consistent from highly available stores.

Core questions

How is data partitioned and located across a large, changing set of nodes?
How is durability and availability achieved despite frequent node failures?
What consistency guarantees can a scalable store provide, and at what cost?

Key theories

Cluster file systems: Systems like the Google File System store huge files as chunks replicated across commodity servers, optimizing for large sequential access and treating failures as the norm rather than the exception.
Distributed structured stores: Wide-column and key-value stores such as Bigtable and Dynamo partition data by key across nodes and replicate it, trading query expressiveness and consistency for horizontal scalability and availability.
Consistent hashing and distributed hash tables: Consistent hashing maps keys and nodes onto a ring so that adding or removing a node moves only a small fraction of keys, and distributed hash tables like Chord provide scalable, decentralized key lookup with logarithmic routing.

Clinical relevance

Scalable storage is the durable foundation of cloud platforms and large web services: object stores, databases, and analytics pipelines all rest on distributed file systems and key-value stores whose partitioning and replication choices set the system's durability and consistency guarantees.

History

Peer-to-peer distributed hash tables such as Chord (2001) showed scalable decentralized lookup; Google's File System (2003) and Bigtable (2006-2008) demonstrated cluster-scale storage for structured data; and Amazon's Dynamo (2007) popularized highly available key-value storage, together founding the modern scalable-storage and NoSQL landscape.

Debates

Strong consistency versus high availability in storage: Strongly consistent stores simplify application logic but must sacrifice availability under partitions, while highly available stores like Dynamo accept temporary divergence and push conflict resolution to the application; the right choice depends on the data's tolerance for staleness.

Key figures

Sanjay Ghemawat
Werner Vogels
Ion Stoica
Hari Balakrishnan

Seminal works

ghemawat2003
decandia2007
stoica2001

Frequently asked questions

What problem does consistent hashing solve?: When data is partitioned across nodes by hashing keys, naive hashing reshuffles almost everything when a node is added or removed. Consistent hashing arranges keys and nodes on a ring so that such a change relocates only a small, bounded fraction of keys, which is essential for elastic, churning clusters.