Deep Learning

Deep learning trains neural networks with many layers to learn hierarchical representations of data, achieving state-of-the-art results in vision, speech, and language.

Definition

Deep learning is the branch of machine learning that uses neural networks with multiple layers of nonlinear processing to learn representations of data at increasing levels of abstraction, with parameters fit end to end by gradient descent on a loss function.
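As a rough illustration of this definition, the sketch below stacks two layers (an affine map followed by tanh, then an affine map to a scalar) and fits all of their parameters jointly by gradient descent on a squared-error loss. It uses only the Python standard library. The XOR data, the hidden width `HIDDEN`, the learning rate, the step count, and every function name are assumptions chosen for illustration, not a reference implementation.

```python
import math
import random

# Toy data: XOR, which no single linear layer can fit.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
HIDDEN = 4  # hypothetical hidden width, chosen only for this sketch


def init_params(seed=0):
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(HIDDEN)],
        "b1": [0.0] * HIDDEN,
        "w2": [rng.uniform(-1, 1) for _ in range(HIDDEN)],
        "b2": 0.0,
    }


def forward(params, x):
    """Layer 1: affine map + tanh. Layer 2: affine map to a scalar."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    y_hat = sum(w * hj for w, hj in zip(params["w2"], h)) + params["b2"]
    return h, y_hat


def loss(params):
    """Mean of 0.5 * squared error over the data set."""
    return sum(0.5 * (forward(params, x)[1] - y) ** 2 for x, y in DATA) / len(DATA)


def gradients(params):
    """Backpropagation: the chain rule applied from the loss back to each parameter."""
    g = {"W1": [[0.0, 0.0] for _ in range(HIDDEN)], "b1": [0.0] * HIDDEN,
         "w2": [0.0] * HIDDEN, "b2": 0.0}
    n = len(DATA)
    for x, y in DATA:
        h, y_hat = forward(params, x)
        d_out = (y_hat - y) / n  # dL/dy_hat for this example
        g["b2"] += d_out
        for j in range(HIDDEN):
            g["w2"][j] += d_out * h[j]
            # Push the error back through layer 2 and the tanh nonlinearity.
            d_pre = d_out * params["w2"][j] * (1.0 - h[j] ** 2)
            g["b1"][j] += d_pre
            for i in range(2):
                g["W1"][j][i] += d_pre * x[i]
    return g


def train(params, lr=0.1, steps=5000):
    """Plain full-batch gradient descent on all parameters at once."""
    for _ in range(steps):
        g = gradients(params)
        for j in range(HIDDEN):
            params["w2"][j] -= lr * g["w2"][j]
            params["b1"][j] -= lr * g["b1"][j]
            for i in range(2):
                params["W1"][j][i] -= lr * g["W1"][j][i]
        params["b2"] -= lr * g["b2"]
    return params


params = init_params()
initial_loss = loss(params)
final_loss = loss(train(params))
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Both layers are updated from the same loss in every step, which is what "end to end" means here: the hidden features are not designed by hand but adjusted together with the output weights.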

Scope

This area covers multilayer neural networks and the techniques that make them trainable at scale: network architectures from feedforward to convolutional and recurrent, the backpropagation algorithm and gradient-based optimization, regularization methods such as dropout, and deep generative models. It addresses why depth enables learning of composed features and what challenges arise in training very deep models.
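Of the regularization methods mentioned above, dropout is simple enough to sketch in a few lines. The following is a minimal illustration of the common "inverted" variant in plain Python. The function name `dropout`, its parameters, and the sample activations are assumptions made for this example and do not come from any particular library.

```python
import random


def dropout(activations, p_drop, training, rng):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop), so the expected value of each
    output matches its input. At evaluation time the input passes through.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed only so the demo is repeatable
hidden = [0.5, -1.2, 2.0, 0.3, 0.9, -0.4]  # made-up hidden-layer activations
print("train:", dropout(hidden, p_drop=0.5, training=True, rng=rng))
print("eval: ", dropout(hidden, p_drop=0.5, training=False, rng=rng))
```

Scaling the surviving units during training, rather than scaling everything at evaluation time, is a design choice that leaves the evaluation path untouched.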

Core questions

Why do many layers enable learning of hierarchical features?
How is gradient-based training made to work for deep networks?
Which architectures suit images, sequences, and other data types?
How do regularization and optimization choices affect generalization?

Key theories

Hierarchical representation learning: Stacking layers lets a network compose simple features into increasingly abstract ones, so that early layers detect edges or sounds and later layers detect objects or words, learned automatically from data.
End-to-end training by backpropagation: The whole network is optimized jointly by propagating error gradients backward through its layers, allowing feature extraction and prediction to be learned together rather than designed by hand.
Depth and expressive efficiency: Deep networks can represent certain functions far more compactly than shallow ones. Together with large datasets and computation, this efficiency underlies their empirical success.
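The claim about depth and expressive efficiency can be made concrete with a standard toy construction, sketched here in plain Python. A "tent" function built from two ReLU units is composed with itself, and the number of linear pieces doubles with each added layer. By contrast, a single hidden layer of m ReLU units on a scalar input produces at most m + 1 linear pieces, so matching a depth-k sawtooth with one hidden layer would take on the order of 2^k units instead of 2k. The function names and the grid-based piece counter are assumptions made for this example.

```python
def relu(z):
    return max(0.0, z)


def tent_layer(x):
    """One hidden layer with two ReLU units; on [0, 1] it rises to 1 at 0.5, then falls to 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, depth):
    """Compose the same two-unit layer `depth` times."""
    for _ in range(depth):
        x = tent_layer(x)
    return x


def count_linear_pieces(depth):
    """Count linear pieces on [0, 1] by counting changes in slope sign.

    This works for the sawtooth because neighbouring pieces always alternate
    between rising and falling. The dyadic grid is fine enough that every
    breakpoint falls exactly on a grid point.
    """
    n = 2 ** (depth + 3)
    ys = [deep_sawtooth(i / n, depth) for i in range(n + 1)]
    rising = [b > a for a, b in zip(ys, ys[1:])]
    return 1 + sum(1 for s, t in zip(rising, rising[1:]) if s != t)


# Columns: depth, total ReLU units used, linear pieces produced.
for depth in range(1, 6):
    print(depth, 2 * depth, count_linear_pieces(depth))
```

The example shows only that such compactly represented functions exist. It says nothing about whether gradient descent finds them, which is a separate question.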

Practical relevance

Deep learning has driven breakthroughs in image and speech recognition, machine translation, and large language models, and underpins much of contemporary artificial intelligence; its reliance on large datasets and substantial computation, and the opacity of the resulting models, are central practical and ethical considerations in its deployment.

History

Neural networks date to the perceptron of the late 1950s, and backpropagation was popularized in 1986, but deep networks remained hard to train until the mid-2000s. Advances in initialization and activation functions, together with large labeled datasets and graphics-processor computation, enabled the deep-learning revolution from around 2012, reshaping computer vision, speech, and natural language processing.

Debates

Scale versus new ideas: Much recent progress has come from training larger models on more data and computation, prompting debate over how far scaling alone can go versus the need for new architectural or algorithmic ideas.

Key figures

Geoffrey Hinton
Yann LeCun
Yoshua Bengio
Juergen Schmidhuber

Seminal works

goodfellow2016
lecun2015
bengio2013

Frequently asked questions

What makes learning deep? Depth refers to the number of successive layers of nonlinear transformation between input and output. Each layer builds on the features of the previous one, so a deep network learns a hierarchy of representations rather than a single direct mapping.
Why did deep learning take off only recently? The core ideas existed for decades, but training deep networks required large labeled datasets, fast parallel hardware such as graphics processors, and techniques such as better initialization and activation functions. These came together around 2012, enabling dramatic gains on perception tasks.