Cluster and Grid Computing
Cluster computing aggregates networked machines into a single high-performance system, while grid computing federates resources across organizations into shared virtual infrastructure.
Definition
A cluster is a collection of interconnected computers managed as a single resource for parallel or high-throughput computing; a grid extends this to a federation of autonomously administered, distributed resources shared among a virtual organization through common protocols.
Scope
This topic covers the architecture and management of compute clusters (interconnects, batch schedulers, and resource managers) and the grid computing paradigm, which federates heterogeneous, geographically distributed resources across administrative domains into virtual organizations. It also addresses job scheduling, resource discovery and allocation, and high-throughput computing for parameter-sweep and embarrassingly parallel workloads.
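The parameter-sweep workloads mentioned above can be sketched as a set of independent tasks, one per parameter combination. This is a minimal illustration under stated assumptions: `simulate` is a hypothetical stand-in for a real model, and a local thread pool stands in for the many cluster or grid nodes that a real high-throughput system would dispatch to.

```python
"""Minimal sketch of a parameter sweep run as independent tasks.

Assumptions (illustrative, not from the source): `simulate` is a
hypothetical workload, and a local thread pool stands in for the
cluster or grid nodes a real high-throughput system would use.
"""
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def simulate(temperature: float, pressure: float) -> float:
    # Hypothetical workload: each call depends only on its own parameters,
    # so tasks can run in any order, on any worker, with no communication.
    return temperature * pressure


def parameter_sweep(temperatures, pressures, workers=4):
    """Run every (temperature, pressure) pair; return results keyed by parameters."""
    tasks = list(product(temperatures, pressures))  # Cartesian product = the sweep
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(simulate, t, p): (t, p) for t, p in tasks}
        # Collect in completion order: total throughput, not per-task latency, matters.
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


if __name__ == "__main__":
    out = parameter_sweep([280.0, 300.0, 320.0], [1.0, 2.0])
    print(f"completed {len(out)} independent tasks")
```

Because no task waits on another, adding workers (or, in a real deployment, nodes) raises the number of completed tasks per unit time without any change to the task code.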
Core questions
- How are jobs scheduled and resources allocated across a shared cluster?
- How can resources owned by different organizations be federated and shared securely?
- What workloads are best served by high-throughput rather than tightly coupled parallel computing?
Key theories
- Virtual organizations and grid architecture
- Grid architecture defines protocols for sharing computing, storage, and data resources across organizational boundaries to form virtual organizations, layering services for security, resource management, and discovery.
- Batch scheduling and resource management
- Cluster resource managers queue and place jobs onto nodes according to policies balancing utilization, fairness, and priority, a function central to both clusters and grids.
- High-throughput computing
- For workloads composed of many independent tasks, high-throughput systems harvest idle and distributed capacity to maximize the number of jobs completed over long periods, rather than to minimize the latency of any single computation.
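The batch-scheduling idea above (queue jobs, then place them onto nodes according to policy) can be illustrated with one pass of a toy scheduler. This is a sketch under simplifying assumptions, not how any particular resource manager works: a job is only a core count, a node is only a count of free cores, a larger `priority` number is more urgent, and placement is plain first-fit. The function name `schedule_pass` and the job and node names are hypothetical.

```python
"""One pass of a toy batch scheduler: priority order, first-fit placement.

Illustrative assumptions (not from the source): a job is just a core
count, a node is just a count of free cores, and a larger `priority`
number means more urgent. Production resource managers layer fair-share,
reservations, and backfilling on top of a loop like this.
"""
import heapq
from typing import Dict, List, Tuple

# (name, cores_requested, priority, submit_time)
Job = Tuple[str, int, int, int]


def schedule_pass(queue: List[Job], free_cores: Dict[str, int]) -> Tuple[Dict[str, str], List[str]]:
    """Place queued jobs onto nodes; return (placements, jobs still waiting)."""
    # heapq is a min-heap, so negate priority; ties go to the earlier submission.
    heap = [(-prio, submitted, name, cores) for name, cores, prio, submitted in queue]
    heapq.heapify(heap)
    free = dict(free_cores)  # work on a copy so the caller's dict is unchanged
    placements: Dict[str, str] = {}
    waiting: List[str] = []
    while heap:
        _, _, name, cores = heapq.heappop(heap)
        # First fit: the first node (in listing order) with enough free cores.
        node = next((n for n, c in free.items() if c >= cores), None)
        if node is None:
            waiting.append(name)  # nothing fits right now; the job stays queued
            continue
        free[node] -= cores
        placements[name] = node
    return placements, waiting


if __name__ == "__main__":
    nodes = {"node1": 8, "node2": 4}
    jobs = [("sim-a", 6, 1, 0), ("sim-b", 4, 5, 1), ("sim-c", 4, 1, 2), ("sim-d", 2, 1, 3)]
    placed, queued = schedule_pass(jobs, nodes)
    print(placed, queued)
```

In this example `sim-b` is placed first because of its higher priority; after that no node has six free cores, so `sim-a` waits while the smaller `sim-c` and `sim-d` start ahead of it. Left unchecked, that behavior can starve large jobs, which is one reason production schedulers add reservations and fair-share policies.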
Practical relevance
Clusters and grids underpin scientific computing—from physics and bioinformatics to large collaborations sharing data and compute—and their scheduling and resource-management ideas carry directly into today's cloud and container orchestration platforms.
History
Clusters of commodity workstations emerged in the 1990s as a cost-effective alternative to supercomputers; Foster and Kesselman's grid vision (late 1990s, formalized in 2001) extended resource sharing across institutions; and systems such as Condor demonstrated large-scale high-throughput computing that prefigured the cloud.
Key figures
- Ian Foster
- Carl Kesselman
- Miron Livny
Related topics
Seminal works
- foster2001
- foster2004
- thain2005
Frequently asked questions
- How does a grid differ from a single cluster?
- A cluster is usually homogeneous and under a single administrative authority, whereas a grid federates heterogeneous resources owned by different organizations. Grids therefore must solve harder problems of cross-domain security, trust, and resource discovery that a single cluster avoids.
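The cross-domain resource-discovery problem described in this answer can be illustrated with a toy two-sided matchmaker: the job states what it needs, each site states which jobs it will accept, and only mutually acceptable pairs are candidates. This is a sketch under loud assumptions. It is loosely inspired by Condor-style matchmaking, but it uses plain Python callables rather than any real ad language, and all attribute names (`arch`, `memory_gb`, `vo`, `policy`, `rank`) and machine names are hypothetical.

```python
"""Toy two-sided matchmaker for resources owned by different sites.

Assumption: loosely inspired by Condor-style matchmaking, in which both
the job and the machine owner state requirements. Plain Python callables
stand in for a real ad language, and every attribute name here is
hypothetical.
"""
from typing import Any, Dict, List, Optional

Ad = Dict[str, Any]


def matches(job: Ad, machine: Ad) -> bool:
    # Both sides must accept: the job's requirements on the machine AND the
    # owning site's local policy on the job (e.g. which virtual organization may use it).
    return bool(job["requirements"](machine)) and bool(machine["policy"](job))


def matchmake(job: Ad, machines: List[Ad]) -> Optional[str]:
    """Return the name of the job's highest-ranked acceptable machine, or None."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None  # nothing suitable was discovered; the job keeps waiting
    return str(max(candidates, key=job["rank"])["name"])


if __name__ == "__main__":
    machines = [
        {"name": "uni-a/node7", "arch": "x86_64", "memory_gb": 64,
         "policy": lambda job: True},                       # open to any VO
        {"name": "lab-b/big1", "arch": "x86_64", "memory_gb": 128,
         "policy": lambda job: job["vo"] == "physics-vo"},  # site restricts access
        {"name": "uni-c/arm1", "arch": "aarch64", "memory_gb": 256,
         "policy": lambda job: True},
    ]
    job = {
        "vo": "bio-vo",
        "requirements": lambda m: m["arch"] == "x86_64" and m["memory_gb"] >= 32,
        "rank": lambda m: m["memory_gb"],                   # prefer more memory
    }
    print(matchmake(job, machines))                         # uni-a/node7
    print(matchmake(dict(job, vo="physics-vo"), machines))  # lab-b/big1
```

The two-sided check is the point of the design: a single cluster can simply assign any free node, whereas in a grid each owning organization keeps its own policy, so a match requires consent from both the job and the resource.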