Parallel and Multicore Architecture

Parallel and multicore architecture concerns hardware that executes many operations at once — multiple cores on a chip, vector and SIMD units, and massively parallel GPUs — together with the memory and communication structures that let parallel work proceed correctly and efficiently.

Definition

Parallel and multicore architecture is the design of computer hardware that performs multiple computations simultaneously through replicated cores, wide data-parallel units, or specialized accelerators, along with the interconnect and memory mechanisms that coordinate them.

Scope

This area covers hardware organizations for parallelism: chip multiprocessors and many-core designs, shared-memory systems and the coherence and consistency they require, SIMD and vector processors for data-level parallelism, and GPU architectures. It treats how parallel hardware is built and how its performance scales. It excludes the software side of parallel and distributed programming and cluster-scale distributed systems, both covered under distributed and parallel computing, as well as the single-core execution engine, which is covered under processor microarchitecture.

Sub-topics

Core questions

How does parallel hardware scale performance, and what limits that scaling?
How are multiple cores integrated on a chip and connected to shared memory?
What memory consistency and coherence guarantees must shared-memory hardware provide?
How do SIMD, vector, and GPU designs exploit data-level parallelism?
How are parallel architectures matched to workloads to maximize useful throughput per watt?

Key concepts

chip multiprocessor
thread-level parallelism
data-level parallelism
SIMD and vector processing
GPU and many-core
shared memory and coherence
memory consistency
interconnection network
Amdahl's law and scalability
synchronization hardware

Key theories

Amdahl's law: The speedup from parallelizing a computation is limited by the fraction that must run sequentially. Even with unlimited processors, overall speedup cannot exceed the reciprocal of the serial fraction, a bound that shapes how parallel architectures are designed and evaluated.
Flynn-style parallelism taxonomy: Parallel hardware is organized by how instruction and data streams combine — for example single-instruction multiple-data (SIMD) for data parallelism and multiple-instruction multiple-data (MIMD) for multicore and multiprocessor systems — a classification that frames architectural choices.
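
Amdahl's law, as stated in the list above, can be made concrete with a small calculation. The sketch below assumes the standard form, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of processors. The function names and the sample value p = 0.95 are illustrative choices, not taken from this entry.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Ideal speedup under Amdahl's law (ignores communication and sync overhead)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the processor count grows without bound."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    p = 0.95  # hypothetical workload: 95% of the work parallelizes
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} processors -> speedup {amdahl_speedup(p, n):6.2f}")
    print(f"limit as n grows -> {amdahl_limit(p):.2f}")
```

Under these assumptions, a workload that is 95% parallel tops out at a speedup of about 20 however many cores are added, which is why reducing the serial fraction often matters more than adding processors.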

Mechanisms

Multicore processors place several cores on one chip, where they share one or more cache levels and a memory interface and are connected by an on-chip interconnect. Coherence protocols ensure that all cores observe a single, agreed sequence of values for each memory location, while a memory consistency model defines which orderings of memory operations may be observed across cores. Data-parallel hardware (vector units, SIMD lanes, and GPUs with many lightweight cores) applies one operation across many data elements, and synchronization primitives coordinate parallel threads.
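
What a memory consistency model permits can be illustrated with the classic store-buffering litmus test. The sketch below is a toy enumeration, not a hardware model: it assumes sequential consistency, meaning every execution is some interleaving of the threads' operations that preserves each thread's program order. The two-thread program, the variable names, and the helper functions are all hypothetical.

```python
from itertools import combinations

# Each thread is a program-ordered list of operations on shared variables.
# ("store", var, value) writes memory; ("load", var, reg) reads into a register.
THREAD_0 = [("store", "x", 1), ("load", "y", "r0")]
THREAD_1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        slot_set = set(slots)
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slot_set else next(it_b) for i in range(total)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, var, arg in schedule:
        if kind == "store":
            memory[var] = arg
        else:
            registers[arg] = memory[var]
    return registers["r0"], registers["r1"]


def sc_outcomes():
    """Set of (r0, r1) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD_0, THREAD_1)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))  # [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) never appears in this enumeration, so sequential consistency forbids it. Hardware with per-core store buffers, such as x86 under its total-store-order model, can produce (0, 0) for this program unless a fence separates each store from the following load, which is the kind of difference a consistency model exists to specify.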

Practical relevance

After single-core clock scaling stalled, parallel and multicore architecture became the primary path to higher performance, so virtually all modern general-purpose processors are multicore. GPUs and SIMD units now power graphics, scientific computing, and the matrix operations at the heart of deep learning, making parallel hardware central to high-performance and artificial-intelligence workloads.

History

Parallel machines date to vector supercomputers such as the Cray-1 in the 1970s and to research multiprocessors of the 1980s and 1990s. The end of frequency scaling around the mid-2000s pushed the industry toward multicore chips as the default. GPUs evolved from fixed-function graphics pipelines into programmable many-core accelerators, and data-parallel architectures became foundational to modern machine learning.

Debates

General-purpose multicore versus specialized accelerators: With diminishing returns from homogeneous multicore, there is debate over how far to favor domain-specific accelerators (GPUs, tensor units) versus general-purpose cores, trading programmability and flexibility against efficiency for particular workloads.

Key figures

Gene Amdahl
Michael J. Flynn
John L. Hennessy
David A. Patterson
David E. Culler

Seminal works

hennessy2019
amdahl1967
patterson2020

Frequently asked questions

Why did processors move to multiple cores?: Attempts to keep raising single-core clock frequency ran into power and heat limits in the mid-2000s. Adding cores raised total throughput within the same power budget, so multicore became the dominant way to keep performance growing, though it shifts the burden of speedup onto parallel software.
How is a GPU different from a multicore CPU?: A CPU has a few powerful cores optimized for low-latency, general-purpose execution. A GPU has many simpler cores optimized for high-throughput data-parallel work, executing the same operation across many data elements, which suits graphics and dense numerical computation but not all workloads.
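
The phrase "executing the same operation across many data elements" in the last answer can be sketched with a toy instruction-count model. The code below does not run in parallel and does not model any real CPU or GPU. It only counts how many add instructions would be issued if each instruction handled one element (scalar) or a group of `lanes` elements in lockstep (data-parallel). The function names and the lane width of 4 are hypothetical.

```python
def scalar_add(a, b):
    """One add 'instruction' per element, as a single scalar core would issue."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a, b, lanes=4):
    """One add 'instruction' per group of `lanes` elements (hypothetical width)."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    if lanes < 1:
        raise ValueError("lanes must be >= 1")
    out, issued = [], 0
    for start in range(0, len(a), lanes):
        # Conceptually, every lane in this group performs the same add in lockstep.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        issued += 1
    return out, issued


if __name__ == "__main__":
    a, b = list(range(10)), [10] * 10
    print(scalar_add(a, b)[1], simd_add(a, b, lanes=4)[1])  # 10 3
```

Both functions return the same sums. Only the issue count differs, which is the sense in which data-parallel hardware trades per-element control flexibility for throughput.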