Big Data Processing Frameworks

Big data frameworks let programmers process data sets far larger than any single machine by expressing computation as parallel dataflows that the runtime distributes and makes fault-tolerant.

Definition

A big data processing framework is a system that lets a programmer express a computation over a very large data set as high-level dataflow operators, and that automatically partitions the data, schedules parallel execution across a cluster, and tolerates node failures.
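The three responsibilities named in the definition (partitioning, parallel scheduling, and failure tolerance) can be illustrated on a single machine. The sketch below is a toy under stated assumptions, not how any real framework is implemented: threads stand in for cluster nodes, and the names partition, run_with_retry, flaky_sum_of_squares, and run_job are hypothetical. The failure is injected artificially, and retrying is safe only because the task is deterministic and has no side effects.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions chunks (round-robin)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def run_with_retry(task, part, attempts=3):
    """Re-run a failed task on the same partition.

    This is only correct because the task is deterministic and side-effect free.
    """
    for attempt in range(attempts):
        try:
            return task(part, attempt)
        except RuntimeError:
            continue  # a real scheduler would reschedule on another node
    raise RuntimeError("task failed after all attempts")


def flaky_sum_of_squares(part, attempt):
    # Hypothetical failure injection: the first attempt on the partition
    # that contains the value 3 simulates a crashed node.
    if attempt == 0 and 3 in part:
        raise RuntimeError("simulated node failure")
    return sum(x * x for x in part)


def run_job(records, num_partitions=4):
    parts = partition(records, num_partitions)
    # Threads stand in for worker nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(
            pool.map(lambda p: run_with_retry(flaky_sum_of_squares, p), parts)
        )
    # Combine per-partition results into the final answer.
    return sum(partials)


total = run_job(list(range(10)))
```

The caller writes only the per-partition task and the final combine step; splitting, parallel execution, and retry stay inside run_job, which is the division of labor the definition describes.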

Scope

This topic covers the dataflow programming model for cluster-scale data processing: the MapReduce paradigm and its limitations, in-memory dataflow engines built on resilient distributed datasets, and unified batch-and-stream processing with windowing, event-time semantics, and exactly-once guarantees. It treats how massive, possibly unbounded, data is partitioned, processed in parallel, and recovered after failures.

Core questions

How can a computation over data too large for one machine be expressed and executed in parallel?
How do in-memory and streaming engines improve on batch MapReduce?
How are correctness, latency, and fault tolerance balanced for unbounded, out-of-order streams?

Key theories

MapReduce: MapReduce expresses data processing as a map step that transforms records into key-value pairs and a reduce step that aggregates by key, with the runtime handling parallelization, data shuffling, and re-execution of failed tasks.
Resilient distributed datasets: RDDs provide a fault-tolerant in-memory abstraction whose lineage of deterministic transformations allows lost partitions to be recomputed, enabling iterative and interactive cluster computing far faster than disk-based MapReduce.
Unified batch and stream dataflow: Modern engines treat batch as a special case of streaming, using event-time windowing and watermarks plus consistent snapshots to deliver exactly-once results over unbounded, out-of-order data.
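The map, shuffle, and reduce steps described under MapReduce can be sketched with a word count, the customary introductory example. This is a minimal single-process illustration, not the Hadoop or Google API: the function names map_phase, shuffle, reduce_phase, and word_count are assumptions, and a real runtime would run the map and reduce calls on different machines and move data over the network during the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one input record into (key, value) pairs."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = [kv for doc in documents for kv in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())


counts = word_count(["the quick fox", "the lazy dog", "the fox"])
```

Because map_phase looks at one record at a time and reduce_phase looks at one key at a time, both can be parallelized and re-executed independently, which is what lets the runtime hide distribution and task failure from the programmer.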

Practical relevance

These frameworks process the data behind search, analytics, recommendation, and machine-learning pipelines, and stream engines power real-time monitoring and event-driven applications, making them core infrastructure for data-intensive computing.

History

Google's MapReduce paper (presented in 2004, republished in Communications of the ACM in 2008) popularized cluster-scale data processing; Spark's resilient distributed datasets (2012) brought fast in-memory and iterative processing; and Apache Flink and the Dataflow Model (2015) unified batch and streaming with strong correctness guarantees.

Debates

Batch versus streaming as the primary processing model: Batch processing is simpler and easier to make exactly-once, because a failed job can be rerun over the same bounded input, but it adds latency. Streaming offers low latency but makes correctness harder when data arrives out of order. Unified engines argue that streaming can subsume batch, though batch remains common for large historical jobs.
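The correctness concern in this debate can be made concrete with event-time windows and a watermark. The sketch below is a simplified illustration, not the Flink or Beam API: WINDOW, MAX_DELAY, and the function names are assumptions, the watermark is a simple heuristic (largest event time seen minus an assumed delay bound), and events that arrive after their window has been emitted are dropped. Checkpointing and exactly-once delivery are not modeled.

```python
from collections import defaultdict

WINDOW = 10     # tumbling window length in event-time units (assumed)
MAX_DELAY = 5   # assumed bound on how late an event may arrive


def window_start(event_time):
    return event_time - event_time % WINDOW


def batch_counts(events):
    """Batch view: all data is available, so arrival order is irrelevant."""
    counts = defaultdict(int)
    for event_time, _ in events:
        counts[window_start(event_time)] += 1
    return dict(counts)


def stream_counts(events):
    """Streaming view: process events in arrival order and emit a window
    once the watermark passes the end of that window."""
    open_windows = defaultdict(int)
    emitted = {}
    max_seen = float("-inf")
    for event_time, _ in events:
        start = window_start(event_time)
        if start in emitted:
            continue  # too late: window already finalized, event dropped
        open_windows[start] += 1
        max_seen = max(max_seen, event_time)
        watermark = max_seen - MAX_DELAY
        for s in sorted(open_windows):
            if s + WINDOW <= watermark:
                emitted[s] = open_windows.pop(s)
    emitted.update(open_windows)  # end of input: flush remaining windows
    return emitted


# Events as (event_time, payload), listed in arrival order, which is not
# event-time order.
arrivals = [(1, "a"), (12, "b"), (3, "c"), (14, "d"), (9, "e"), (25, "f"), (21, "g")]
on_time_result = stream_counts(arrivals)

# One more event that arrives later than MAX_DELAY allows.
late_arrivals = arrivals + [(18, "x")]
late_result = stream_counts(late_arrivals)
```

When every event arrives within the assumed delay bound, the streaming result equals the batch result despite the disorder. The late event at time 18 shows the trade-off: a tighter watermark emits results sooner but risks finalizing a window before all of its data has arrived.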

Key figures

Jeffrey Dean
Sanjay Ghemawat
Matei Zaharia
Tyler Akidau

Seminal works

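Dean, J., and Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM.
Zaharia, M., et al. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI.
Akidau, T., et al. (2015). The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. VLDB.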

Frequently asked questions

Why did Spark and similar systems supersede plain MapReduce for many workloads? MapReduce materializes the output of each job to a distributed file system, so an iterative algorithm that chains many jobs pays disk and replication costs on every iteration. In-memory abstractions like resilient distributed datasets keep working data in memory across steps and use lineage to recompute only the lost partitions after a failure, which gives large speedups for iterative and interactive analytics.
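The recovery idea in this answer, recomputing only lost partitions from lineage, can be sketched with a toy class. ToyRDD is hypothetical and is not Spark's API: it runs in one process, supports only a map transformation (a narrow dependency, where each output partition depends on exactly one parent partition), and simulates failure by evicting a cached partition.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD; not Spark's API."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: partition index -> list of records
        self.parent = parent
        self._cache = {}          # in-memory partitions
        self.computations = 0     # how many partitions were (re)computed

    @classmethod
    def from_partitions(cls, partitions):
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        # The child stores how to derive each partition from its parent,
        # not the derived data itself.
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
            parent=self,
        )

    def partition(self, i):
        if i not in self._cache:
            self.computations += 1
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that evicts one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_partitions([[1, 2], [3, 4], [5, 6]])
squared = source.map(lambda x: x * x)

first = squared.collect()    # computes all three partitions
squared.lose_partition(1)    # simulated failure
second = squared.collect()   # recomputes only partition 1 from lineage
```

After the simulated failure, only one extra partition is computed and the result is unchanged, because the transformation is deterministic. No replica of the derived data was ever written to disk, which is where the savings over per-job materialization come from in this simplified picture.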