Instruction-Level Parallelism
Instruction-level parallelism (ILP) is the potential for executing multiple instructions from a single program simultaneously, exploited by superscalar and VLIW processors that issue and complete several instructions per cycle.
Definition
Instruction-level parallelism is the degree to which the instructions of a single program can be executed in parallel; processors exploit it by issuing, executing, and completing more than one instruction per clock cycle, subject to data and control dependencies.
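One way to make "degree of parallelism" concrete is to divide the number of instructions by the length of the longest chain of true (read-after-write) dependencies. The sketch below is illustrative only: the six-instruction `program`, the `(destination, sources)` encoding, and the function name `ideal_ilp` are assumptions, and it presumes unit latency, perfect branch prediction, and no name dependencies.

```python
# Each instruction is (destination register, source registers).
# Hypothetical example program; unit latency is assumed throughout.
program = [
    ("r1", ()),            # i0: r1 = load
    ("r2", ()),            # i1: r2 = load
    ("r3", ("r1", "r2")),  # i2: r3 = r1 + r2
    ("r4", ()),            # i3: r4 = load
    ("r5", ("r3", "r4")),  # i4: r5 = r3 * r4
    ("r6", ("r1",)),       # i5: r6 = r1 + 1
]

def ideal_ilp(instrs):
    """Return (instruction count / longest RAW chain length, per-instruction depth)."""
    last_writer = {}  # register -> index of the most recent instruction writing it
    depth = []        # depth[i] = earliest cycle (1-based) in which instruction i can run
    for i, (dest, srcs) in enumerate(instrs):
        producers = [last_writer[s] for s in srcs if s in last_writer]
        depth.append(1 + max((depth[p] for p in producers), default=0))
        last_writer[dest] = i
    return len(instrs) / max(depth), depth

ilp, depth = ideal_ilp(program)
print(depth)  # [1, 1, 2, 1, 3, 2]
print(ilp)    # 2.0
```

Here the chain i0 -> i2 -> i4 is three instructions long, so six instructions need at least three cycles and the ideal ILP is 2.0, however many functional units are available.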
Scope
This topic covers the parallelism available within one instruction stream and how hardware and compilers extract it: superscalar issue of multiple instructions per cycle, very long instruction word (VLIW) designs, dependence analysis, register renaming, and speculation. It also covers the limits on available ILP. It excludes thread- and data-level parallelism across cores and lanes (parallel and multicore architecture) and the basic single-issue pipeline (pipelining and hazards).
Core questions
- How much parallelism is inherently available among the instructions of a typical program?
- How do superscalar processors decide which instructions can issue together?
- How do VLIW designs shift the burden of finding parallelism to the compiler?
- What dependencies and hazards limit how much ILP can be realized?
Key concepts
- superscalar issue
- very long instruction word (VLIW)
- data and name dependencies
- register renaming
- multiple functional units
- speculation
- ILP limits
- instructions per cycle (IPC)
Key theories
- Dynamic instruction scheduling
- Hardware can discover and exploit ILP at run time by tracking dependencies and scheduling ready instructions onto multiple functional units; Tomasulo's algorithm, with reservation stations and register renaming, is the canonical mechanism enabling out-of-order execution across those units.
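The register-renaming idea can be sketched in a few lines. This is a simplified model, not Tomasulo's original design (which renames through reservation-station tags): it assumes an unlimited supply of physical registers, and the `rename` function, the `p0, p1, ...` names, and the sample `program` are hypothetical.

```python
from itertools import count

def rename(instrs, arch_regs):
    """Give every destination a fresh physical register; sources read the current mapping."""
    fresh = (f"p{n}" for n in count())             # assumed unlimited physical registers
    mapping = {r: next(fresh) for r in arch_regs}  # initial architectural state
    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(mapping[s] for s in srcs)  # read the mapping before updating it
        mapping[dest] = next(fresh)                 # fresh name removes WAR and WAW conflicts
        renamed.append((mapping[dest], new_srcs))
    return renamed

program = [
    ("r1", ("r2", "r3")),  # i0: r1 = r2 + r3
    ("r4", ("r1",)),       # i1: r4 = r1 * 2   (true dependence on i0)
    ("r1", ("r5",)),       # i2: r1 = r5 + 1   (WAW with i0, WAR with i1)
    ("r6", ("r1",)),       # i3: r6 = r1 - 4   (true dependence on i2)
]

renamed = rename(program, ["r1", "r2", "r3", "r4", "r5", "r6"])
for instr in renamed:
    print(instr)
# ('p6', ('p1', 'p2'))
# ('p7', ('p6',))
# ('p8', ('p4',))
# ('p9', ('p8',))
```

After renaming, i2 writes `p8` instead of reusing the name written by i0 and read by i1, so only the true dependencies (i0 to i1 and i2 to i3) remain and i0 and i2 could execute in the same cycle.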
Mechanisms
Superscalar processors fetch and decode several instructions per cycle, check their dependencies, rename registers to remove false (name) dependencies, and issue independent instructions to multiple functional units. VLIW designs instead rely on the compiler to bundle independent operations into wide instruction words. True data dependencies and control flow limit the parallelism that can be realized: speculation relaxes the control-flow constraint and larger instruction windows expose more independent instructions, but chains of true data dependencies still bound the result.
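The interaction between issue width and dependencies can be illustrated with a toy issue model. It is a sketch under strong assumptions (unit latency, an unbounded instruction window, renaming already applied so only true dependencies matter, oldest-first selection); `simulate_issue` and the sample `program` are hypothetical names, not a model of any real core.

```python
def simulate_issue(instrs, width):
    """Issue up to `width` ready instructions per cycle, oldest first; return (cycles, IPC)."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Record the true (RAW) producers of each instruction.
    last_writer, producers = {}, []
    for i, (dest, srcs) in enumerate(instrs):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    issue_cycle = {}  # instruction index -> cycle in which it issued
    cycle = 0
    while len(issue_cycle) < len(instrs):
        cycle += 1
        issued_now = 0
        for i in range(len(instrs)):
            if i in issue_cycle or issued_now == width:
                continue
            # Ready only if every producer issued in an earlier cycle.
            if all(issue_cycle.get(p, cycle) < cycle for p in producers[i]):
                issue_cycle[i] = cycle
                issued_now += 1
    return cycle, len(instrs) / cycle

program = [
    ("r1", ()),            # i0
    ("r2", ()),            # i1
    ("r3", ("r1", "r2")),  # i2 depends on i0, i1
    ("r4", ()),            # i3
    ("r5", ("r3", "r4")),  # i4 depends on i2, i3
    ("r6", ("r1",)),       # i5 depends on i0
]

print(simulate_issue(program, 1))  # (6, 1.0)
print(simulate_issue(program, 2))  # (3, 2.0)
print(simulate_issue(program, 4))  # (3, 2.0)
```

Doubling the width from 2 to 4 gains nothing for this program: the three-instruction dependence chain i0 -> i2 -> i4 already fixes the minimum at three cycles.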
Practical relevance
ILP techniques drove decades of single-thread performance growth and remain central to high-performance CPU cores. Their diminishing returns, as the parallelism available in typical programs was largely exhausted and complexity and power limits were reached, were a key reason the industry turned to multicore and explicit parallelism for further scaling.
History
Multiple functional units and dynamic scheduling appeared in the CDC 6600 and IBM System/360 Model 91 in the 1960s. Superscalar designs became mainstream in the 1990s, and VLIW-derived architectures such as Itanium, developed by Intel and HP under the name EPIC, pursued compiler-driven parallelism. Studies of the limits of ILP in the early 1990s clarified why parallelism within a single thread is bounded.
Debates
- Hardware-driven versus compiler-driven ILP
- Superscalar out-of-order hardware finds parallelism dynamically at the cost of complexity and power, whereas VLIW relies on the compiler to schedule parallelism statically; experience showed dynamic approaches more robust across workloads, while static approaches remain attractive for predictable, low-power designs.
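The static side of this debate can be sketched as a compiler pass that packs operations into fixed-width bundles before the program runs. The example is a minimal greedy list scheduler under stated assumptions: unit latency, each register written once (so only true dependencies exist), and a hypothetical three-slot machine. The names `pack_bundles` and `ops` are illustrative; production VLIW compilers use more elaborate techniques such as trace scheduling and software pipelining.

```python
def pack_bundles(ops, width):
    """Place each op in the earliest non-full bundle that follows all of its producers."""
    last_writer = {}  # register -> index of the op that writes it
    bundle_of = []    # bundle_of[i] = index of the bundle holding op i
    bundles = []      # each bundle is a list of op names
    for i, (name, dest, srcs) in enumerate(ops):
        earliest = 0
        for s in srcs:
            if s in last_writer:
                earliest = max(earliest, bundle_of[last_writer[s]] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) == width:
            b += 1  # bundle is full, try the next one
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(name)
        bundle_of.append(b)
        last_writer[dest] = i
    # Pad unused slots with explicit no-ops, as a fixed-width encoding would require.
    return [b + ["nop"] * (width - len(b)) for b in bundles]

ops = [
    ("load r1", "r1", ()),
    ("load r2", "r2", ()),
    ("add r3", "r3", ("r1", "r2")),
    ("load r4", "r4", ()),
    ("mul r5", "r5", ("r3", "r4")),
    ("inc r6", "r6", ("r1",)),
]

bundles = pack_bundles(ops, 3)
for bundle in bundles:
    print(bundle)
# ['load r1', 'load r2', 'load r4']
# ['add r3', 'inc r6', 'nop']
# ['mul r5', 'nop', 'nop']
```

Three of the nine slots hold `nop`: when the compiler cannot find enough independent operations, the unused width is fixed into the code, and the schedule cannot adapt at run time if an operation takes longer than the compiler assumed.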
Key figures
- Robert Tomasulo
- Yale Patt
- John L. Hennessy
- Joseph A. Fisher
- James E. Smith
Related topics
- parallel and multicore architecture
- pipelining and hazards
Seminal works
- hennessy2019
- tomasulo1967
Frequently asked questions
- What is the difference between superscalar and VLIW?
- Both execute multiple operations per cycle. A superscalar processor decides in hardware, at run time, which instructions can issue together; a VLIW processor relies on the compiler to group independent operations into wide instructions ahead of time, simplifying hardware but demanding more from the compiler.
- Why is there a limit to instruction-level parallelism?
- Real programs have true data dependencies and frequent branches that constrain how many instructions can run in parallel. Beyond a certain instruction-window size, the achievable parallelism saturates, so extracting more ILP yields diminishing returns relative to the added hardware complexity and power.
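The saturation described in the last answer can be reproduced with a toy model. The sketch assumes unit latency, unlimited issue width, and a window that holds the oldest not-yet-issued instructions; the synthetic `program` (four independent chains of three dependent instructions, laid out one chain after another) and the name `cycles_with_window` are hypothetical.

```python
def cycles_with_window(instrs, window):
    """Cycles needed when only the oldest `window` un-issued instructions may issue."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Record the true (RAW) producers of each instruction.
    last_writer, producers = {}, []
    for i, (dest, srcs) in enumerate(instrs):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    issue_cycle = {}
    pending = list(range(len(instrs)))  # un-issued instructions in program order
    cycle = 0
    while pending:
        cycle += 1
        in_window = pending[:window]
        # Ready only if every producer issued in an earlier cycle.
        ready = [i for i in in_window
                 if all(issue_cycle.get(p, cycle) < cycle for p in producers[i])]
        for i in ready:
            issue_cycle[i] = cycle
        pending = [i for i in pending if i not in issue_cycle]
    return cycle

# Four independent chains, each of three dependent instructions.
program = []
for chain in range(4):
    reg = f"r{chain}"
    program.append((reg, ()))      # chain head: no inputs
    program.append((reg, (reg,)))  # depends on the head
    program.append((reg, (reg,)))  # depends on the previous instruction

results = {w: cycles_with_window(program, w) for w in (1, 2, 4, 8, 12, 16)}
print(results)  # {1: 12, 2: 9, 4: 6, 8: 4, 12: 3, 16: 3}
```

IPC rises from 1.0 (12 instructions in 12 cycles) to 4.0 (3 cycles) as the window grows to 12 entries, then stops improving: each chain is three instructions long, so no window size can finish in fewer than three cycles.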