ScholarGate
アシスタント

MapReduce and Data-Parallel Processing

MapReduce and its successors are programming models and frameworks that process very large data sets in parallel across clusters of commodity machines, automatically handling distribution, scheduling, and fault tolerance.

PaperMindでテーマを探す近日公開Find papers & topics
Tools & resources
スライドをダウンロード
Learn & explore
動画近日公開

Definition

MapReduce is a programming model in which a computation is expressed as a map function that emits key-value pairs and a reduce function that aggregates the values for each key, executed in parallel over data partitioned across a cluster with framework-managed distribution and fault tolerance.

Scope

This topic covers the MapReduce model — expressing computation as map and reduce functions over key-value pairs, with an automatic shuffle in between — and the surrounding execution machinery of partitioning, scheduling, and fault tolerance via re-execution. It covers the evolution to general dataflow and in-memory engines (such as the resilient distributed dataset model) and the distributed file systems that underpin them. It excludes the NoSQL storage models and the consistency theory, which are adjacent topics.

Core questions

  • How do the map, shuffle, and reduce phases parallelize a computation?
  • How does the framework handle data partitioning, scheduling, and stragglers?
  • How is fault tolerance achieved through re-execution of failed tasks?
  • Why did general dataflow and in-memory engines succeed MapReduce for many workloads?
  • What role do distributed file systems play in data-parallel processing?

Key concepts

  • map and reduce functions
  • shuffle and sort phase
  • data partitioning
  • task scheduling and stragglers
  • fault tolerance by re-execution
  • distributed file system
  • dataflow graphs
  • resilient distributed datasets

Key theories

The MapReduce model
Programmers write a map function that transforms input records into intermediate key-value pairs and a reduce function that combines all values for a key; the framework handles parallel execution, the shuffle that groups values by key, and recovery from failures.
Fault tolerance by re-execution
Because tasks are deterministic functions over partitioned input, the framework tolerates machine failures simply by re-running failed map or reduce tasks, and it mitigates slow nodes by launching backup (speculative) copies of straggling tasks.
In-memory dataflow engines
Later systems generalized MapReduce to arbitrary dataflow graphs and kept intermediate data in memory; the resilient distributed dataset abstraction recovers lost partitions by recomputing them from lineage, greatly speeding iterative and interactive workloads.

Clinical relevance

Data-parallel processing made cluster-scale computation accessible to ordinary programmers: MapReduce and its successors process logs, build search indexes, train models, and run analytics over petabytes, and they are foundational tools of data engineering and large-scale data science.

History

Google introduced MapReduce (2004) atop the Google File System (2003) to index the web on commodity clusters, and the open-source Hadoop reimplemented both, popularizing the model. By the early 2010s, in-memory dataflow engines built on the resilient distributed dataset abstraction (2012) superseded MapReduce for iterative and interactive analytics while keeping its fault-tolerance ideas.

Key figures

  • Jeffrey Dean
  • Sanjay Ghemawat
  • Matei Zaharia

Related topics

Seminal works

  • dean2008
  • ghemawat2003
  • zaharia2012

Frequently asked questions

Why was MapReduce so influential despite its simplicity?
Its power was in what it hid. By restricting computation to map and reduce functions, the framework could automatically partition data, schedule work across thousands of machines, recover from failures by re-running tasks, and balance load — letting programmers process enormous data sets without writing distributed-systems code.
Why did newer engines replace MapReduce for many tasks?
Classic MapReduce writes intermediate results to disk between every stage, which is slow for multi-step and iterative jobs such as machine learning. In-memory dataflow engines keep data in memory across stages and express richer computation graphs, offering large speedups while retaining lineage-based fault tolerance, so they became preferred for many analytic workloads.

Methods for this concept

Related concepts