Object Recognition and Detection
Object recognition determines what is present in an image, and object detection additionally localizes each instance with a bounding box or region.
Definition
Object recognition is the assignment of category labels to images or regions, and object detection is the joint task of localizing and labeling each object instance in an image.
Scope
This topic covers image classification, sliding-window and region-proposal detection, the classic boosted-cascade face detector, and the convolutional neural networks that now dominate recognition, along with the role of large labeled datasets and benchmarks in driving progress.
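As a minimal sketch of sliding-window detection, the snippet below scans a fixed-size window across an image and keeps every window that a classifier accepts, which turns a classifier into a detector that outputs labeled boxes. The names `Detection`, `sliding_window_detect`, and `toy_classifier` are hypothetical and chosen for illustration. The brightness test stands in for a real trained classifier, and a practical detector would also scan multiple scales and merge overlapping boxes.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

Image = List[List[float]]


class Detection(NamedTuple):
    """One detected instance: a bounding box plus a label and a score."""
    x: int
    y: int
    w: int
    h: int
    label: str
    score: float


# A classifier returns (label, score) for a patch, or None for "background".
Classifier = Callable[[Image], Optional[Tuple[str, float]]]


def sliding_window_detect(img: Image, win_w: int, win_h: int,
                          stride: int, classify: Classifier) -> List[Detection]:
    """Run `classify` on every window position and collect the accepted ones."""
    detections = []
    for y in range(0, len(img) - win_h + 1, stride):
        for x in range(0, len(img[0]) - win_w + 1, stride):
            patch = [row[x:x + win_w] for row in img[y:y + win_h]]
            result = classify(patch)
            if result is not None:
                label, score = result
                detections.append(Detection(x, y, win_w, win_h, label, score))
    return detections


def toy_classifier(patch: Image) -> Optional[Tuple[str, float]]:
    """Hypothetical stand-in for a trained model: accept mostly bright patches."""
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return ("bright", mean) if mean > 0.5 else None
```

Region-proposal methods keep the same classify-each-candidate structure but replace the exhaustive double loop with a much smaller set of candidate boxes.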
Core questions
- How is the category of an object in an image determined?
- How are objects localized as well as classified?
- What features and models generalize across viewpoint and appearance?
- Why did learned representations overtake hand-designed features?
Key concepts
- Image classification
- Bounding-box detection
- Region proposals
- Boosted cascades
- Convolutional neural networks
- Benchmark datasets
Key theories
- Boosted cascade detection
- The Viola-Jones face detector achieved real-time detection by combining simple rectangular (Haar-like) features with boosted classifiers arranged in a cascade, whose early stages cheaply reject most non-object regions so that later, costlier stages run on only a few candidates.
- Deep convolutional recognition
- Convolutional neural networks trained on large labeled datasets learn hierarchical visual features end to end, sharply improving recognition accuracy and establishing learned representations as the dominant approach.
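The boosted-cascade idea can be sketched in a few lines. The snippet below is a simplified illustration and not the original Viola-Jones implementation: it builds an integral image so that any rectangle sum costs four lookups, defines one two-rectangle feature, and evaluates stages in order, stopping at the first rejection. The helper names, the tuple layout of weak classifiers, and all thresholds and weights are assumptions. In a real detector those values come from boosting on labeled training data.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]


def integral_image(img: Image) -> Image:
    """Summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of all pixels above and to the left of (x, y).
    """
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, x: int, y: int, w: int, h: int) -> float:
    """Sum of the w-by-h rectangle with top-left corner (x, y), in four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii: Image, x: int, y: int, w: int, h: int) -> float:
    """Left half minus right half: responds to left/right brightness contrast."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Hypothetical weak-classifier layout: (x, y, w, h, feature_threshold, weight).
Weak = Tuple[int, int, int, int, float, float]
Stage = Callable[[Image], bool]


def make_stage(weak: Sequence[Weak], stage_threshold: float) -> Stage:
    """A stage passes if the weighted vote of its weak classifiers is high enough."""
    def stage(ii: Image) -> bool:
        score = sum(alpha for (x, y, w, h, thr, alpha) in weak
                    if two_rect_feature(ii, x, y, w, h) > thr)
        return score >= stage_threshold
    return stage


def cascade_accepts(ii: Image, stages: Sequence[Stage]) -> bool:
    """Evaluate stages in order; reject as soon as any stage fails."""
    for stage in stages:
        if not stage(ii):
            return False  # early rejection: later stages never run
    return True
```

Because most image windows contain no object, the early exit in `cascade_accepts` is where most of the speed comes from: cheap first stages discard the bulk of windows.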
Practical relevance
Recognition and detection underpin face recognition, perception for autonomous vehicles and robots, computer-aided medical image diagnosis, content moderation, image search, retail and surveillance analytics, and augmented reality.
History
Detection progressed from boosted cascades over hand-crafted features, introduced around 2001, to part-based models that still relied on engineered features. The 2012 success of deep convolutional networks on ImageNet then triggered a rapid shift to learned representations across both recognition and detection.
Debates
- Hand-crafted features versus learned representations
- For decades recognition relied on engineered features such as gradient histograms. Deep learning replaced these with features learned from data, which raised questions about interpretability, data and compute requirements, and robustness that are still open.
Key figures
- Paul Viola
- Michael Jones
- Geoffrey Hinton
Related topics
Seminal works
- viola2001
- krizhevsky2012
Frequently asked questions
- What is the difference between recognition and detection?
- Recognition says what is in an image, for example that it contains a cat. Detection also says where: it draws a box around each cat, labels it, and can find several instances in the same image.
- Why did deep learning improve recognition so much?
- Convolutional networks learn the relevant visual features directly from large labeled datasets instead of relying on hand-designed ones, capturing patterns that are hard to specify manually and scaling with data and compute.
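To make the building blocks of a convolutional network concrete, here is a minimal pure-Python sketch of one convolution, ReLU, and max-pooling step. It is an illustration under stated assumptions, not a training-ready implementation: the function names are hypothetical, the convolution is the unflipped (cross-correlation) form common in deep learning, and any kernel passed in is a hand-set stand-in for weights that a real network would learn from labeled data by gradient descent.

```python
from typing import List

FeatureMap = List[List[float]]


def conv2d(img: FeatureMap, kernel: FeatureMap) -> FeatureMap:
    """'Valid' 2D convolution (no padding, stride 1, kernel not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(fm: FeatureMap) -> FeatureMap:
    """Elementwise non-linearity: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fm]


def max_pool2x2(fm: FeatureMap) -> FeatureMap:
    """Keep the strongest response in each non-overlapping 2x2 block."""
    return [[max(fm[y][x], fm[y][x + 1], fm[y + 1][x], fm[y + 1][x + 1])
             for x in range(0, len(fm[0]) - 1, 2)]
            for y in range(0, len(fm) - 1, 2)]


def conv_block(img: FeatureMap, kernel: FeatureMap) -> FeatureMap:
    """One layer of the hierarchy: filter, rectify, then downsample."""
    return max_pool2x2(relu(conv2d(img, kernel)))
```

Stacking several such blocks, each with many kernels, is what produces the hierarchy of features: early layers respond to edges, and later layers combine those responses into larger patterns. The key difference from hand-designed features is that the kernel values are fitted to the training data rather than chosen by an engineer.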