Multi-View Geometry and 3D Reconstruction

Multi-view geometry studies the relationships among images of the same scene taken from different viewpoints, and 3D reconstruction uses these relationships to recover scene structure and camera positions.

Definition

Multi-view geometry is the study of geometric constraints relating multiple images of a scene, and 3D reconstruction is the recovery of scene structure and camera poses consistent with those images.

Scope

This topic covers epipolar geometry and the fundamental and essential matrices, two-view and multi-view stereo for depth estimation, triangulation, structure from motion that jointly recovers cameras and points, and bundle adjustment as the nonlinear refinement of the full reconstruction.

Core questions

What constraints relate the same scene point seen in two images?
How is depth recovered from stereo correspondence?
How are camera poses and scene structure recovered simultaneously?
How is a large reconstruction refined to minimize reprojection error?

Key concepts

Epipolar geometry
Fundamental and essential matrices
Stereo correspondence
Triangulation
Structure from motion
Bundle adjustment

Key theories

Epipolar geometry: For two views, a point x in one image constrains its match x' to a single line in the other image, the epipolar line Fx, where F is the 3x3 rank-2 fundamental matrix and x'^T F x = 0. This reduces correspondence search from two dimensions to one and underlies stereo and motion estimation.
Bundle adjustment: Reconstruction is refined by jointly optimizing all camera parameters and 3D points to minimize the total reprojection error, a large sparse nonlinear least-squares problem at the core of structure from motion.
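
The epipolar constraint can be checked numerically. The following is a minimal sketch, not a reference implementation. It assumes a second camera related to the first by X2 = R X1 + t and works in normalized image coordinates (intrinsics already removed), so the constraint is expressed with the essential matrix E = [t]x R; with an intrinsic matrix K for both cameras the fundamental matrix would be F = K^-T E K^-1. The pose, the scene point, and all function names are hypothetical values chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def matvec(M, v):
    return tuple(dot(row, v) for row in M)

def normalized_image_point(R, t, X):
    """Project X into a camera with X_cam = R X + t; return (x/z, y/z, 1)."""
    cam = [r + ti for r, ti in zip(matvec(R, X), t)]
    return (cam[0] / cam[2], cam[1] / cam[2], 1.0)

def epipolar_line(R, t, x1):
    """Line l2 = E x1 with E = [t]x R, computed as t x (R x1)."""
    return cross(t, matvec(R, x1))

def point_line_distance(x, line):
    """Distance from homogeneous point (x, y, 1) to the line a*x + b*y + c = 0."""
    return abs(dot(x, line)) / math.hypot(line[0], line[1])

# Hypothetical two-view setup: camera 1 at the origin, camera 2 yawed by
# 5 degrees and displaced along the x axis.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
theta = math.radians(5.0)
R = ((math.cos(theta), 0.0, math.sin(theta)),
     (0.0, 1.0, 0.0),
     (-math.sin(theta), 0.0, math.cos(theta)))
t = (-0.5, 0.0, 0.0)

X = (0.3, -0.2, 4.0)                                   # scene point
x1 = normalized_image_point(IDENTITY, (0.0, 0.0, 0.0), X)
x2 = normalized_image_point(R, t, X)

line2 = epipolar_line(R, t, x1)
x_wrong = (x2[0], x2[1] + 0.05, 1.0)                   # a deliberately bad match

print("true match, distance to epipolar line: ", point_line_distance(x2, line2))
print("wrong match, distance to epipolar line:", point_line_distance(x_wrong, line2))
```

The true match lies on the epipolar line up to floating-point error, while the perturbed point does not, which is why a matcher only needs to search along that line.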

Practical relevance

Multi-view reconstruction enables 3D mapping and photogrammetry, visual simultaneous localization and mapping for robots and drones, augmented reality, cultural-heritage digitization, and the generation of 3D models from photo collections.

History

The projective formulation of multi-view geometry built on photogrammetry and was consolidated in the 1990s. Bundle adjustment, itself a long-standing photogrammetric technique, was given a modern synthesis for computer vision in 2000, and large-scale structure-from-motion systems later reconstructed cities from internet photo collections.

Key figures

Richard Hartley
Andrew Zisserman
Bill Triggs

Seminal works

hartley2004
triggs2000

Frequently asked questions

How can 3D be recovered from flat images?: A point seen from two or more known viewpoints can be triangulated by intersecting its viewing rays. When the viewpoints are not known in advance, matching many points across views constrains both the scene structure and the camera poses enough to reconstruct them, with calibrated cameras, up to an unknown overall scale.
What is structure from motion?: It is the process of taking a set of overlapping images, finding matching features, and solving simultaneously for where each camera was and where the 3D points are, producing a sparse 3D model and camera trajectory.
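
Triangulation and reprojection error can be illustrated with a short sketch. It uses midpoint triangulation, a simple stand-in chosen for brevity rather than a method named in this entry: the 3D point is taken as the midpoint of the shortest segment between the two viewing rays. The pinhole intrinsics, camera poses, scene point, and function names are all hypothetical. The reprojection error computed at the end is the per-observation quantity whose sum of squares bundle adjustment minimizes over all cameras and points.

```python
import math

F_PX, CX, CY = 800.0, 320.0, 240.0   # hypothetical pinhole intrinsics (pixels)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return tuple(dot(row, v) for row in M)

def transpose(M):
    return tuple(zip(*M))

def project(R, t, X):
    """Pixel coordinates of world point X for a camera with X_cam = R X + t."""
    x, y, z = (r + ti for r, ti in zip(matvec(R, X), t))
    return (F_PX * x / z + CX, F_PX * y / z + CY)

def ray(R, t, pixel):
    """World-frame origin and direction of the viewing ray through a pixel."""
    Rt = transpose(R)
    center = tuple(-c for c in matvec(Rt, t))            # C = -R^T t
    d_cam = ((pixel[0] - CX) / F_PX, (pixel[1] - CY) / F_PX, 1.0)
    return center, matvec(Rt, d_cam)

def triangulate_midpoint(cam1, px1, cam2, px2):
    """Midpoint of the shortest segment joining the two viewing rays."""
    c1, d1 = ray(*cam1, px1)
    c2, d2 = ray(*cam2, px2)
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                                # zero only for parallel rays
    s = (b * e - c * d) / denom
    u = (a * e - b * d) / denom
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + u * di for ci, di in zip(c2, d2))
    return tuple((p + q) / 2.0 for p, q in zip(p1, p2))

def reprojection_error(cam, X, pixel):
    """Pixel distance between an observation and the projection of X."""
    u, v = project(*cam, X)
    return math.hypot(u - pixel[0], v - pixel[1])

# Hypothetical known viewpoints: camera 1 at the origin, camera 2 yawed by
# 5 degrees and displaced along the x axis.
theta = math.radians(5.0)
cam1 = (((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)), (0.0, 0.0, 0.0))
cam2 = (((math.cos(theta), 0.0, math.sin(theta)),
         (0.0, 1.0, 0.0),
         (-math.sin(theta), 0.0, math.cos(theta))), (-0.5, 0.0, 0.0))

X_true = (0.3, -0.2, 4.0)
px1 = project(*cam1, X_true)
px2 = project(*cam2, X_true)

# Noise-free observations: the rays intersect and the point is recovered.
X_est = triangulate_midpoint(cam1, px1, cam2, px2)

# Perturb one observation by half a pixel: the rays no longer intersect,
# and the estimate has a small nonzero reprojection error in both views.
px2_noisy = (px2[0] + 0.5, px2[1] - 0.5)
X_noisy = triangulate_midpoint(cam1, px1, cam2, px2_noisy)
err1 = reprojection_error(cam1, X_noisy, px1)
err2 = reprojection_error(cam2, X_noisy, px2_noisy)

print("estimate from exact observations:", X_est)
print("reprojection errors with noise (px):", err1, err2)
```

With real, noisy correspondences the viewing rays rarely intersect, so a linear or midpoint estimate like this one is typically used only as an initial value before nonlinear refinement of the reprojection error.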