ScholarGate
Asistents

GPU Architecture

A graphics processing unit is a throughput-oriented, massively parallel processor built from many simple cores that execute thousands of threads to hide memory latency, originally for graphics and now widely used for general-purpose and machine-learning computation.

Atrast tematu ar PaperMindDrīzumāFind papers & topics
Tools & resources
Lejupielādēt slaidus
Learn & explore
VideoDrīzumā

Definition

A GPU architecture is a highly parallel processor design composed of many lightweight cores grouped into multiprocessors that execute large numbers of threads in lockstep groups, optimized for high arithmetic throughput and latency hiding rather than single-thread speed.

Scope

This topic covers GPU hardware organization: streaming multiprocessors, the single-instruction multiple-thread (SIMT) execution model, warps and thread blocks, the memory hierarchy of registers, shared memory, and global memory, and how GPUs hide latency through massive multithreading. It treats GPUs as parallel architecture. It excludes CPU SIMD extensions (SIMD and vector processors) and the software frameworks for programming GPUs beyond architectural implications.

Core questions

  • How do GPUs achieve high throughput with many simple cores instead of few complex ones?
  • What is the single-instruction multiple-thread execution model, and how do warps work?
  • How does massive multithreading hide memory latency?
  • How is the GPU memory hierarchy organized for parallel workloads?

Key concepts

  • streaming multiprocessor
  • single-instruction multiple-thread (SIMT)
  • warps and thread blocks
  • massive multithreading
  • latency hiding
  • shared, global, and register memory
  • memory coalescing
  • throughput-oriented design

Key theories

Latency hiding through massive multithreading
Rather than minimizing latency for one thread, a GPU keeps thousands of threads in flight and switches among ready groups whenever some stall on memory, so abundant parallelism keeps the many arithmetic units busy and hides long memory delays.

Mechanisms

A GPU comprises many streaming multiprocessors, each executing groups of threads (warps) in lockstep under the SIMT model. Threads are organized into blocks that share fast on-chip memory and synchronize locally. The hardware schedules many warps and switches among them with near-zero overhead, so when one warp waits on memory another runs, keeping the arithmetic units occupied; coalesced memory access patterns maximize bandwidth utilization.

Clinical relevance

GPUs have become central to high-performance and artificial-intelligence computing: their throughput-oriented design makes them the dominant platform for training and running deep neural networks, as well as for scientific simulation and data analytics. Their architecture rewards data-parallel algorithms and coalesced memory access, shaping how performance-critical software is written.

History

GPUs evolved from fixed-function graphics pipelines in the 1990s into programmable shaders and then fully programmable many-core processors. The introduction of general-purpose GPU computing frameworks in the late 2000s opened them to scientific and data workloads, and their throughput made them the engine of the deep-learning era from the 2010s onward.

Debates

GPU generality versus specialized accelerators
GPUs are flexible throughput engines, but increasingly specialized accelerators (such as tensor and matrix units) offer higher efficiency for specific workloads; designers weigh GPU programmability and broad applicability against the efficiency of dedicated hardware.

Key figures

  • David B. Kirk
  • Wen-mei Hwu
  • John L. Hennessy
  • David A. Patterson

Related topics

Seminal works

  • hennessy2019
  • kirk2016

Frequently asked questions

Why are GPUs so good for deep learning?
Training and running neural networks consists largely of dense matrix and tensor operations with abundant data parallelism. A GPU's many cores and high memory bandwidth execute these operations with very high throughput, far exceeding general-purpose CPUs on such regular, parallel workloads.
What is a warp in GPU execution?
A warp is a group of threads (commonly 32) that execute the same instruction in lockstep under the SIMT model. Grouping threads into warps lets the hardware amortize instruction control across many data elements, though divergent branches within a warp reduce efficiency.

Methods for this concept

Related concepts