Dimensionality Reduction

Dimensionality reduction represents high-dimensional data with a small number of coordinates that preserve its most important structure, aiding visualization, compression, and downstream learning.

Definition

Dimensionality reduction maps data from a high-dimensional space to a lower-dimensional one while retaining as much relevant information as possible, either by linear projection onto directions of maximal variance or by nonlinear embeddings that respect the data's underlying manifold.

Scope

This topic covers linear methods such as principal component analysis, which finds orthogonal directions of greatest variance, and factor analysis, which explains the correlations among features through a small number of latent factors. It also covers nonlinear manifold-learning and embedding methods that uncover curved low-dimensional structure. It addresses the curse of dimensionality, reconstruction error, and the trade-off between preserving global geometry and local neighborhoods.

Core questions

How can many correlated features be summarized by a few?
What does principal component analysis optimize?
How do nonlinear methods recover curved manifolds?
What information is lost and how is that loss measured?
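
For the linear case, the second and fourth questions have a compact standard answer, sketched below. The notation is an assumption introduced here and does not come from the text above: centered data points $x_1, \dots, x_n \in \mathbb{R}^d$, sample covariance $S$ with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$, and a matrix $W$ whose $k$ orthonormal columns span the retained subspace.

```latex
% Assumed notation: x_i are centered, S = (1/n) \sum_i x_i x_i^\top,
% and \lambda_1 \ge \dots \ge \lambda_d are the eigenvalues of S.
\begin{align}
  \min_{W \in \mathbb{R}^{d \times k},\; W^\top W = I_k}
    \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert x_i - W W^\top x_i \bigr\rVert^2
    &= \sum_{j=k+1}^{d} \lambda_j ,
  \\
  \max_{W \in \mathbb{R}^{d \times k},\; W^\top W = I_k}
    \operatorname{tr}\bigl( W^\top S W \bigr)
    &= \sum_{j=1}^{k} \lambda_j .
\end{align}
% Both optima are attained when the columns of W are the top-k eigenvectors of S.
% Fraction of variance retained by k components:
\[
  \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}
\]
```

Read this way, minimizing reconstruction error and maximizing retained variance are the same problem, and the information lost by keeping $k$ components is the sum of the discarded eigenvalues.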

Key theories

Principal component analysis: Principal component analysis projects data onto the orthogonal directions of greatest variance, giving the best linear low-dimensional approximation in a least-squares sense and revealing dominant patterns of variation.
Probabilistic latent linear models: Probabilistic principal component analysis and factor analysis frame dimensionality reduction as a latent-variable model, providing a generative interpretation and a principled way to handle noise and missing data.
Manifold learning: Nonlinear methods assume data lie near a low-dimensional manifold and build embeddings that preserve local neighborhood relationships, capturing structure that linear projections cannot.
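
The contrast between the first and third entries can be illustrated with a toy example. The sketch below assumes synthetic points on a unit circle, a one-dimensional curve embedded in two dimensions. The nonlinear coordinate used here (the angle) is hand-picked from knowledge of the curve and is not a manifold-learning algorithm. A real method would have to estimate such a coordinate from neighborhood relationships alone. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 500 points on the unit circle.
theta = rng.uniform(0.0, 2.0 * np.pi, size=500)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Linear 1-D reduction: project onto the first principal component.
mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                                   # unit vector of greatest variance
X_lin = mean + np.outer(Xc @ pc1, pc1)        # reconstruction from 1 linear coordinate
err_linear = np.mean(np.sum((X - X_lin) ** 2, axis=1))

# Nonlinear 1-D coordinate: the angle along the circle (hand-picked, not learned).
angle = np.arctan2(X[:, 1], X[:, 0])
X_nonlin = np.column_stack([np.cos(angle), np.sin(angle)])
err_nonlinear = np.mean(np.sum((X - X_nonlin) ** 2, axis=1))

print(f"mean squared reconstruction error, 1 linear coordinate:    {err_linear:.3f}")
print(f"mean squared reconstruction error, 1 nonlinear coordinate: {err_nonlinear:.2e}")
```

One linear coordinate discards roughly half of the variance of the circle, while one coordinate that follows the curve reconstructs every point up to floating-point error. This is the kind of structure that linear projections cannot capture.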

Practical relevance

Dimensionality reduction is used to visualize complex datasets, to compress and denoise signals, and to produce compact features that make downstream learning faster and less prone to overfitting; it directly addresses the curse of dimensionality, in which distances and densities become uninformative as the number of features grows.
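
The claim that distances become uninformative can be checked with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast, (farthest distance - nearest distance) / nearest distance, from a random query point. The function name `relative_contrast` and the sample sizes are illustrative choices, not part of any library.

```python
import numpy as np

def relative_contrast(n_points, dim, rng):
    """(max distance - min distance) / min distance from a random query point."""
    X = rng.uniform(size=(n_points, dim))     # uniform samples in the unit hypercube
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(X - query, axis=1) # Euclidean distances to the query
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrasts.items():
    print(f"dim={d:5d}  relative contrast={c:.3f}")
```

As the dimension grows, the contrast shrinks toward zero: the nearest and farthest points end up at nearly the same distance, so a nearest-neighbor query carries little information under this data model.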

History

Principal component analysis was introduced by Karl Pearson in 1901 and developed independently by Harold Hotelling in 1933. Factor analysis emerged in psychometrics. From the early 2000s, nonlinear manifold-learning and neighbor-embedding methods extended dimensionality reduction to data with curved low-dimensional structure and became standard tools for high-dimensional visualization.

Key figures

Karl Pearson
Harold Hotelling
Trevor Hastie

Seminal works

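Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. [hastie2009]
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. [bishop2006]
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [murphy2012]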

Frequently asked questions

What does principal component analysis actually compute?

It finds new axes, the principal components, that are orthogonal directions ordered by how much variance of the data they capture. Keeping the top few components gives the best linear low-dimensional approximation of the data in a least-squares sense.

Why reduce dimensions instead of using all features?

In high dimensions data become sparse and distances less meaningful, models overfit more easily, and computation slows. Reducing to a few informative coordinates can improve generalization, speed, and the ability to visualize and interpret the data.
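
The first answer can be checked numerically. The following is a minimal sketch of principal component analysis via the singular value decomposition of the centered data, using only NumPy. The helper name `pca_svd` and the synthetic dataset are assumptions made for illustration, and this is one common way to compute the components rather than the only one.

```python
import numpy as np

def pca_svd(X, k):
    """Top-k PCA of X (rows are samples) via SVD of the centered data."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                # shape (k, d)
    scores = Xc @ components.T         # low-dimensional coordinates, shape (n, k)
    return mean, components, scores, s

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 5 observed features driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.5])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

k = 2
mean, components, scores, s = pca_svd(X, k)

X_hat = mean + scores @ components     # best rank-k linear reconstruction
explained = s**2 / np.sum(s**2)        # share of variance per component
recon_error = np.sum((X - X_hat) ** 2) # total squared reconstruction error
discarded = np.sum(s[k:] ** 2)         # squared singular values that were dropped

print("explained variance ratio:", np.round(explained, 4))
print("reconstruction error equals discarded variance:", np.isclose(recon_error, discarded))
```

The components come out orthonormal and ordered by variance, and the squared reconstruction error of the rank-k approximation matches the sum of the discarded squared singular values, which is the least-squares optimality the answer refers to.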