RAID and Storage Reliability
RAID (redundant array of inexpensive disks, later also expanded as "independent disks") combines multiple storage devices using striping, mirroring, and parity to deliver greater performance, capacity, and fault tolerance than a single device, and is a foundation of reliable storage in data centers.
Definition
RAID is a storage architecture that combines multiple physical drives into a single logical unit using techniques such as data striping, mirroring, and parity to improve performance and, at every level except pure striping (RAID 0), to tolerate the failure of one or more drives without data loss.
Scope
This topic covers storage reliability through redundancy: the standard RAID levels and their trade-offs among performance, capacity, and fault tolerance; striping, mirroring, and parity; reliability metrics such as mean time to failure and the limits of redundancy; and how RAID complements but does not replace backups. It excludes the storage devices themselves (secondary storage devices) and the file-system layer (file systems).
Core questions
- How do striping, mirroring, and parity provide performance and fault tolerance?
- How do the common RAID levels trade off capacity, performance, and reliability?
- How is array reliability quantified, and what are the limits of redundancy?
- Why is RAID not a substitute for backups?
Key concepts
- data striping
- mirroring (RAID 1)
- parity (RAID 5/6)
- RAID levels and trade-offs
- fault tolerance
- mean time to failure (MTTF)
- rebuild and degraded mode
- redundancy is not backup
Key theories
- Redundancy for reliable storage
- Combining many commodity disks with redundant information (mirroring or parity) yields an array that is faster and far more reliable than a single disk; the RAID levels formalize how striping and redundancy are arranged to balance performance, usable capacity, and tolerance to failures.
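The capacity and fault-tolerance side of this trade-off can be sketched in a few lines of Python. This is a minimal illustration, not a definitive model: the function name raid_summary is hypothetical, all drives are assumed identical, RAID 1 is modeled as an n-way mirror of one drive's worth of data, and performance is not modeled at all.

```python
def raid_summary(level: int, drives: int, drive_tb: float) -> dict:
    """Usable capacity (TB) and guaranteed drive-failure tolerance.

    Assumes identical drives and textbook layouts (an illustrative
    simplification; real arrays may differ).
    """
    if level == 0:      # striping only: all capacity, no redundancy
        minimum, usable_drives, tolerated = 2, drives, 0
    elif level == 1:    # n-way mirror: one drive of capacity
        minimum, usable_drives, tolerated = 2, 1, drives - 1
    elif level == 5:    # single parity: one drive's worth of redundancy
        minimum, usable_drives, tolerated = 3, drives - 1, 1
    elif level == 6:    # double parity: two drives' worth of redundancy
        minimum, usable_drives, tolerated = 4, drives - 2, 2
    else:
        raise ValueError(f"unsupported RAID level in this sketch: {level}")
    if drives < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} drives")
    return {"usable_tb": usable_drives * drive_tb,
            "tolerated_failures": tolerated}


# Compare four 2 TB drives arranged at each level.
for lvl in (0, 1, 5, 6):
    print(lvl, raid_summary(lvl, drives=4, drive_tb=2.0))
```

With four drives, striping keeps all raw capacity but tolerates no failure, while double parity gives up two drives' worth of capacity to survive any two failures.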
Mechanisms
Striping spreads data across drives to parallelize access and raise throughput. Mirroring keeps full copies on multiple drives so the array survives a drive loss. Parity schemes store computed redundancy that lets data be reconstructed when a drive fails, using less capacity than mirroring. Standard RAID levels combine these techniques differently; when a drive fails the array runs degraded and rebuilds onto a replacement using the surviving data and redundancy.
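The parity mechanism described above can be illustrated with XOR, the operation used for single parity. The sketch below is a simplified model under stated assumptions: one stripe, equal-sized blocks, and a fixed parity position. The helper names (xor_blocks, make_stripe, reconstruct) are hypothetical; real RAID 5 implementations rotate parity across drives, and RAID 6 adds a second, different code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three data "drives" plus one parity "drive".
stripe = make_stripe([b"RAID", b"disk", b"data"])

# Simulate losing drive 1 and rebuilding its block from the rest.
recovered = reconstruct(stripe, failed_index=1)
assert recovered == b"disk"
```

Because XOR is its own inverse, the same routine rebuilds any single missing block, whether it held data or parity. Losing two blocks of the same stripe leaves single parity with too little information to recover either.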
Practical relevance
RAID is ubiquitous in servers, storage systems, and data centers, where drive failures are routine at scale and continuous availability is required. Choosing the right RAID level balances cost, speed, and resilience, but RAID protects only against device failure, so it complements rather than replaces backups against deletion, corruption, and disasters.
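A rough calculation shows why failures are routine at scale and how redundancy changes the picture. The sketch below assumes independent drives with a constant failure rate (an idealization), and the MTTF and repair-time figures are illustrative placeholders, not measurements. The function names are hypothetical. The last function uses a commonly quoted approximation for an array that survives exactly one failure.

```python
HOURS_PER_YEAR = 8760


def expected_failures_per_year(drives: int, mttf_hours: float) -> float:
    """Expected drive failures per year in a fleet of identical drives."""
    return drives * HOURS_PER_YEAR / mttf_hours


def striped_array_mttf(drives: int, mttf_hours: float) -> float:
    """MTTF of an array with no redundancy: any one failure loses data."""
    return mttf_hours / drives


def single_redundancy_mttdl(drives: int, mttf_hours: float,
                            mttr_hours: float) -> float:
    """Approximate mean time to data loss when one failure is tolerated.

    Data is lost only if a second drive fails before the first is
    repaired: MTTF^2 / (N * (N - 1) * MTTR).
    """
    return mttf_hours ** 2 / (drives * (drives - 1) * mttr_hours)


mttf = 1_000_000  # hours; an assumed, illustrative per-drive MTTF
print(expected_failures_per_year(10_000, mttf))   # fleet-wide failures per year
print(striped_array_mttf(100, mttf))              # hours, no redundancy
print(single_redundancy_mttdl(8, mttf, 24))       # hours, one failure tolerated
```

Under these assumptions, striping alone divides the MTTF by the number of drives, whereas tolerating a single failure raises the mean time to data loss by orders of magnitude, provided repairs are fast.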
History
The RAID concept was introduced in a 1988 paper by Patterson, Gibson, and Katz at Berkeley, which proposed using arrays of inexpensive disks with redundancy to match the reliability and performance of costly large drives. Its taxonomy of RAID levels was widely adopted, and RAID became standard practice in enterprise and data-center storage.
Debates
- Parity RAID versus mirroring at scale
- As drive capacities grew, the long rebuild times of parity RAID raised the risk of a second failure during reconstruction, prompting debate over higher redundancy (such as double-parity) versus mirroring or alternative erasure-coding schemes for large arrays.
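The rebuild-window concern can be made concrete with a back-of-the-envelope estimate. This sketch assumes the rebuild must read or write one full drive at a sustained rate, that surviving drives fail independently at a constant rate, and that unrecoverable read errors are ignored; all numbers and function names are illustrative assumptions.

```python
import math


def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float) -> float:
    """Time to rewrite one full drive at a sustained rebuild rate."""
    megabytes = drive_tb * 1e6          # decimal units: 1 TB = 1e6 MB
    return megabytes / rebuild_mb_per_s / 3600


def p_second_failure(surviving_drives: int, window_hours: float,
                     mttf_hours: float) -> float:
    """Probability that at least one surviving drive fails in the window.

    Exponential model: 1 - exp(-n * t / MTTF).
    """
    return -math.expm1(-surviving_drives * window_hours / mttf_hours)


# Same 8-drive array and rebuild rate, small versus large drives.
for tb in (1, 16):
    hours = rebuild_hours(tb, rebuild_mb_per_s=100.0)
    risk = p_second_failure(7, hours, mttf_hours=1_000_000)
    print(f"{tb} TB drive: rebuild ~{hours:.1f} h, second-failure risk ~{risk:.2e}")
```

In this model the exposure grows roughly in proportion to drive capacity when the rebuild rate stays fixed, which is the trend behind the case for double parity or erasure coding in large arrays.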
Key figures
- David A. Patterson
- Garth Gibson
- Randy H. Katz
- John L. Hennessy
Related topics
Seminal works
- patterson1988raid
- hennessy2019
Frequently asked questions
- Does RAID replace the need for backups?
- No. RAID protects against drive failure by storing redundant data, but it does not guard against accidental deletion, file corruption, malware, more simultaneous drive failures than the RAID level tolerates, or site disasters. Independent backups remain essential: RAID improves availability; it does not protect against data loss in general.
- What is the difference between mirroring and parity?
- Mirroring keeps complete duplicate copies of data on separate drives, giving simple, fast recovery at the cost of at least half the raw capacity (exactly half for a two-way mirror). Parity stores computed redundancy that can reconstruct lost data using less space (one drive's worth in RAID 5, two in RAID 6), but rebuilding is slower and more computationally involved because every surviving drive must be read.