Object Detection
Overview
Object detection is one of the core tasks in computer vision. The goal is to simultaneously perform localization and classification — that is, to find all objects of interest in an image and output a bounding box and class label for each one.
Unlike image classification, which only outputs a single label, object detection must answer two questions at once:
- What objects are present in the image?
- Where is each object located?
Evaluation Metrics
IoU (Intersection over Union)
IoU is the standard metric for measuring the overlap between a predicted box and a ground-truth box:

\[
\text{IoU} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}
\]

where \(B_{\text{pred}}\) is the predicted bounding box and \(B_{\text{gt}}\) is the ground-truth bounding box. IoU lies in \([0, 1]\), and a detection is typically considered successful when \(\text{IoU} \geq 0.5\).
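A minimal Python sketch of this computation, assuming boxes in \((x_1, y_1, x_2, y_2)\) corner format (the function name and box format are illustrative):

```python
# IoU of two axis-aligned boxes in (x1, y1, x2, y2) corner format.
def iou(box_a, box_b):
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two unit squares offset by half a side: IoU = 0.5 / 1.5 ≈ 0.333
print(iou((0, 0, 1, 1), (0.5, 0, 1.5, 1)))
```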
Precision and Recall
- Precision: Among all samples predicted as positive, the proportion that are truly correct: \(\text{Precision} = \frac{TP}{TP + FP}\)
- Recall: Among all truly positive samples, the proportion that are correctly predicted: \(\text{Recall} = \frac{TP}{TP + FN}\)
AP and mAP
AP (Average Precision) is the area under the Precision-Recall curve for a single class:

\[
\text{AP} = \int_0^1 p(r) \, dr
\]
mAP (mean Average Precision) is the average AP across all classes and serves as the most commonly used comprehensive evaluation metric for object detection:

\[
\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c
\]
Different benchmarks define mAP slightly differently: PASCAL VOC uses \(\text{IoU} = 0.5\) (i.e., mAP@0.5), while COCO averages over multiple IoU thresholds (mAP@[0.5:0.95]).
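As an illustration, here is a sketch of computing AP from sampled precision-recall points with all-point interpolation (the style used by VOC 2010+ and COCO evaluators); mAP is then simply the mean of this value over classes. Array names are illustrative:

```python
import numpy as np

def average_precision(recall, precision):
    # Pad the curve so it starts at recall 0 and ends at recall 1
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))

    # Interpolation: make precision monotonically non-increasing
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum rectangular areas over the points where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```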
Two-Stage Detectors
The core idea behind two-stage detectors is: first generate region proposals, then classify and refine each proposal.
R-CNN (2014)
R-CNN (Regions with CNN features) was the pioneering work that introduced deep learning to object detection:
- Selective Search: Generate approximately 2,000 candidate regions on the input image
- Feature Extraction: Resize each candidate region to a fixed size and pass it through a CNN (e.g., AlexNet) to extract features
- Classification: Use an SVM to classify the features of each candidate region
- Bounding Box Regression: Use a linear regressor to refine bounding box locations
Drawback: Each candidate region requires a separate forward pass through the CNN, making it extremely slow (~47 seconds per image).
Fast R-CNN (2015)
The key improvement in Fast R-CNN is shared convolutional computation:
- Whole-image convolution: Run the CNN once on the entire image to obtain a full feature map
- RoI Pooling: Project each candidate region onto the feature map and extract a fixed-length feature vector via the RoI Pooling layer
- Multi-task loss: Jointly train the classification head and regression head using a combined loss function

\[
L = L_{\text{cls}} + \lambda L_{\text{loc}}
\]
where \(L_{\text{cls}}\) is the classification loss (cross-entropy) and \(L_{\text{loc}}\) is the bounding box regression loss (Smooth L1).
Speedup: Training is ~9x faster than R-CNN; inference is ~213x faster.
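As a concrete illustration of the RoI Pooling step, here is a minimal sketch using torchvision's `roi_pool` operator; the feature map shape and the stride-16 assumption are illustrative:

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)          # (N, C, H, W) feature map
# One RoI in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 64.0, 64.0, 320.0, 256.0]])

# Project image-space RoIs onto the feature map (spatial_scale = 1/16 for
# a stride-16 backbone) and pool each one to a fixed 7x7 grid.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```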
Faster R-CNN (2015)
Faster R-CNN replaces Selective Search with a Region Proposal Network (RPN), enabling end-to-end training:
How the RPN works:
- At each spatial location on the feature map, place \(k\) predefined anchors (combinations of different scales and aspect ratios, typically \(k=9\))
- For each anchor, predict:
  - A foreground/background binary classification score (whether it contains an object)
  - Bounding box offsets (4 values: \(\Delta x, \Delta y, \Delta w, \Delta h\))
- Select the top-scoring proposals and pass them to the downstream classification and regression heads
Anchor regression: Given an anchor \((x_a, y_a, w_a, h_a)\) and predicted offsets \((t_x, t_y, t_w, t_h)\):

\[
x = x_a + t_x w_a, \quad y = y_a + t_y h_a, \quad w = w_a e^{t_w}, \quad h = h_a e^{t_h}
\]
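A minimal sketch of this decoding step, assuming anchors and offsets are stored as \((N, 4)\) NumPy arrays of centers and sizes:

```python
import numpy as np

def decode_anchors(anchors, deltas):
    xa, ya, wa, ha = anchors.T          # anchor centers and sizes
    tx, ty, tw, th = deltas.T           # predicted offsets
    x = xa + tx * wa                    # shift center by a fraction of anchor size
    y = ya + ty * ha
    w = wa * np.exp(tw)                 # scale width/height exponentially
    h = ha * np.exp(th)
    return np.stack([x, y, w, h], axis=1)
```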
Faster R-CNN achieved near real-time speed (~5 fps) on PASCAL VOC while maintaining high accuracy.
One-Stage Detectors
One-stage detectors skip the region proposal step and directly predict bounding boxes and class labels from the image, making them significantly faster.
YOLO v1 (2016)
YOLO (You Only Look Once) is the seminal one-stage detector, built on a remarkably simple core idea:
Core mechanism:
- Grid division: Divide the input image into an \(S \times S\) grid. Each grid cell is responsible for detecting objects whose center falls within that cell
- Each grid cell predicts:
  - \(B\) bounding boxes, each containing 5 values: \((x, y, w, h, \text{confidence})\)
  - \(C\) class conditional probabilities \(P(\text{Class}_i \mid \text{Object})\)
- Output tensor: \(S \times S \times (B \times 5 + C)\); e.g., with \(S = 7\), \(B = 2\), \(C = 20\) on PASCAL VOC, the output is \(7 \times 7 \times 30\)
The confidence score is defined as:

\[
\text{confidence} = P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}
\]
The final class-specific confidence for each bounding box is:

\[
P(\text{Class}_i \mid \text{Object}) \times P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}} = P(\text{Class}_i) \times \text{IoU}_{\text{pred}}^{\text{truth}}
\]
Loss function (multi-task loss):

\[
\begin{aligned}
L ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]

where \(\mathbb{1}_{ij}^{\text{obj}}\) indicates that the \(j\)-th box predictor in cell \(i\) is responsible for an object, with \(\lambda_{\text{coord}} = 5\) and \(\lambda_{\text{noobj}} = 0.5\) in the paper.
The square root on width and height is used to mitigate scale imbalance between large and small objects.
Pros and cons:
- Pros: Extremely fast (45 fps); global reasoning reduces false positives on backgrounds
- Cons: Poor performance on small and densely packed objects; each grid cell can only predict a limited number of objects
YOLO v2 / YOLO9000 (2017)
YOLOv2 introduced several improvements over v1:
- Batch Normalization: Added BN after every convolutional layer, improving convergence and accuracy
- High-resolution classifier: Fine-tuned the classification network at \(448 \times 448\) before using it for detection
- Anchor Boxes: Adopted predefined anchor boxes (inspired by Faster R-CNN) instead of direct bounding box prediction
- Dimension Clusters: Used K-Means clustering on the bounding boxes in the training set to find optimal anchor shapes (see the sketch after this list)
- Multi-scale training: Randomly switched the input resolution during training (from \(320 \times 320\) to \(608 \times 608\), in multiples of 32)
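A sketch of the dimension-clustering step under the usual formulation (distance \(d = 1 - \text{IoU}\), comparing boxes by width and height only, as if centered at the origin); function names and \(k = 5\) are illustrative:

```python
import numpy as np

def iou_wh(boxes, centroids):
    # boxes: (N, 2) w/h pairs, centroids: (k, 2); broadcast to (N, k)
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Minimum distance d = 1 - IoU is the same as maximum IoU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids
```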
YOLO v3 (2018)
The core improvement in YOLOv3 is multi-scale prediction:
- Detection is performed on feature maps at 3 different scales (similar to FPN)
- Coarse (low-resolution) feature maps detect large objects; fine (high-resolution) feature maps detect small objects
- Uses Darknet-53 as the backbone network (incorporating residual connections)
- Multi-label classification: Replaces Softmax with independent logistic classifiers to support multi-label prediction
YOLO v5 (2020)
YOLOv5, released by Ultralytics, was never accompanied by a formal paper but gained widespread adoption due to its engineering optimizations:
- Implemented in PyTorch (previous versions were based on Darknet/C)
- Auto-anchor computation: Automatically optimizes anchors for the dataset
- Data augmentation: Mosaic augmentation (4-image stitching), MixUp, random affine transforms
- Model scaling: Offers five model sizes — n/s/m/l/x
- Training strategies: Cosine annealing learning rate, warmup, EMA
YOLO v8 (2023)
YOLOv8 is a newer architecture from Ultralytics, featuring major improvements:
- Anchor-Free detection head: Eliminates anchor boxes, directly predicting object centers and dimensions
- Decoupled Head: Classification and regression use separate convolutional branches
- C2f module: An improved CSP bottleneck structure for more efficient feature extraction
- Distribution Focal Loss: Used for bounding box regression
- Unified framework: Supports detection, segmentation, classification, pose estimation, and more within a single framework
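For reference, a minimal sketch of running a pretrained YOLOv8 model through the ultralytics Python package (`pip install ultralytics`); the image path is a placeholder:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # nano variant; downloads weights if needed
results = model("path/to/image.jpg")  # run inference on one image

for r in results:
    # Each result holds boxes in xyxy format plus class ids and confidences
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```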
SSD (Single Shot MultiBox Detector, 2016)
SSD is another important one-stage detector:
Core idea: Perform detection simultaneously on feature maps at multiple resolutions, with each feature map responsible for detecting objects at different scales.
Architecture highlights:
- Base network: Uses VGG-16 as the feature extractor, with its fully connected layers converted into convolutional layers
- Multi-scale feature maps: Adds multiple progressively smaller convolutional layers after the base network (e.g., \(38 \times 38\), \(19 \times 19\), \(10 \times 10\), \(5 \times 5\), \(3 \times 3\), \(1 \times 1\))
- Default boxes: Places default boxes of different aspect ratios at each feature map location (similar to anchors)
- Prediction: Predicts class scores and bounding box offsets for each default box
Loss function:

\[
L = \frac{1}{N} \left( L_{\text{conf}} + \alpha L_{\text{loc}} \right)
\]
where \(L_{\text{conf}}\) is the confidence loss, \(L_{\text{loc}}\) is the localization loss, and \(N\) is the number of matched default boxes.
SSD uses Hard Negative Mining: negative samples are sorted by confidence loss and only the highest-loss ones are kept, maintaining a positive-to-negative ratio of approximately \(1:3\) (a sketch of this selection follows).
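A sketch of this selection step for one image's per-box confidence losses; the tensor names are illustrative, and the \(3:1\) default follows the description above:

```python
import torch

def hard_negative_mask(conf_loss, positive_mask, neg_pos_ratio=3):
    # conf_loss: (M,) per-box confidence loss; positive_mask: (M,) bool
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~positive_mask).sum()))

    # Rank negatives by loss, highest first; exclude positives from the ranking
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = -float("inf")
    _, idx = neg_loss.sort(descending=True)

    negative_mask = torch.zeros_like(positive_mask)
    negative_mask[idx[:num_neg]] = True
    # Train on all positives plus the hardest negatives only
    return positive_mask | negative_mask
```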
Non-Maximum Suppression (NMS)
NMS is a critical post-processing step in object detection for removing redundant detections:
Algorithm:
1. Sort all detection boxes by confidence in descending order
2. Select the box with the highest confidence and add it to the final results
3. Compute the IoU between this box and all remaining boxes
4. Remove all boxes with IoU greater than a threshold \(\tau\) (typically \(\tau = 0.5\))
5. Repeat steps 2–4 until all boxes have been processed
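A minimal NumPy sketch of the greedy procedure above, assuming boxes in \((x_1, y_1, x_2, y_2)\) format (variable names are illustrative):

```python
import numpy as np

def nms(boxes, scores, tau=0.5):
    order = scores.argsort()[::-1]          # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                      # step 2: keep the best box

        # step 3: IoU of the best box against all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)

        # step 4: drop boxes overlapping the kept one by more than tau
        order = order[1:][iou <= tau]
    return keep
```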
Soft-NMS: Standard NMS hard-deletes overlapping boxes, which may incorrectly remove valid detections in crowded scenes. Soft-NMS instead reduces the confidence of overlapping boxes; in the Gaussian variant:

\[
s_i \leftarrow s_i \, e^{-\frac{\text{IoU}(M, b_i)^2}{\sigma}}
\]
where \(M\) is the current highest-scoring box, \(b_i\) is another box, and \(\sigma\) controls the decay rate.
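Under the Gaussian formulation above, the rescoring step can be sketched as:

```python
import numpy as np

def soft_nms_rescore(scores, ious, sigma=0.5):
    # scores: (N,) confidences of the remaining boxes
    # ious:   (N,) IoU of each remaining box with the current best box M
    return scores * np.exp(-(ious ** 2) / sigma)
```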
Focal Loss and Class Imbalance
Background
One-stage detectors with dense sampling produce an overwhelming number of negative samples (background regions), with positive-to-negative ratios reaching \(1:1000\). The multitude of easy-to-classify negatives dominates the loss function, drowning out useful learning signals.
Focal Loss (2017, RetinaNet)
Focal Loss adds a modulating factor to the standard cross-entropy loss, down-weighting the contribution of easily classified samples:

\[
\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)
\]
where:
- \(p_t\) is the model's predicted probability for the true class
- \((1 - p_t)^{\gamma}\) is the modulating factor, with focusing parameter \(\gamma \geq 0\)
- \(\alpha_t\) is the class-balancing weight
When \(\gamma = 0\), this reduces to standard cross-entropy. When \(\gamma = 2\) (a commonly used value):
- For easily classified samples (\(p_t = 0.9\)): weight is \((1-0.9)^2 = 0.01\), dramatically reducing the loss
- For hard samples (\(p_t = 0.1\)): weight is \((1-0.1)^2 = 0.81\), keeping the loss nearly unchanged
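A sketch of the binary form of this loss in PyTorch, matching the formula above; \(\gamma = 2\) and \(\alpha = 0.25\) are the commonly cited defaults, and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets are 0/1 floats of the same shape as logits
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Modulating factor (1 - p_t)^gamma down-weights easy samples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```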
RetinaNet combines Focal Loss with an FPN (Feature Pyramid Network) backbone, making a one-stage detector surpass two-stage detectors in accuracy for the first time.
Anchor-Free Methods
Anchor-free methods eliminate the reliance on predefined anchor boxes, avoiding the hyperparameter tuning that anchor design entails.
CenterNet (2019)
CenterNet reformulates object detection as a keypoint detection problem:
- Uses a heatmap to predict the center point of each object
- Regresses object width and height at each center point
- Requires no NMS post-processing (or uses simple peak extraction)
The loss function uses a modified Focal Loss for heatmaps and an L1 loss for size regression.
FCOS (Fully Convolutional One-Stage Object Detection, 2019)
FCOS proposes a fully pixel-level prediction approach to detection:
- For each location on the feature map, predicts distances to the four sides of the bounding box \((l, t, r, b)\)
- Introduces a center-ness branch to suppress low-quality predictions far from the object center (see the sketch after this list):

\[
\text{centerness} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}
\]
- Combines with FPN for multi-scale prediction, with different levels handling different object scales
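A minimal PyTorch sketch of computing this center-ness target from per-location \((l, t, r, b)\) distances (the \((N, 4)\) tensor layout is an illustrative assumption):

```python
import torch

def centerness(ltrb):
    # ltrb: (N, 4) distances from each location to the four box sides
    l, t, r, b = ltrb.unbind(dim=1)
    # Close to 1 near the box center, approaching 0 toward the edges
    return torch.sqrt((torch.minimum(l, r) / torch.maximum(l, r)) *
                      (torch.minimum(t, b) / torch.maximum(t, b)))
```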
Summary and Comparison
| Method | Type | Speed | Accuracy | Key Features |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | Medium | High | RPN + RoI Pooling |
| YOLO v1 | One-stage | Fast | Medium | Global reasoning, simple |
| YOLO v8 | One-stage | Fast | High | Anchor-free, multi-task |
| SSD | One-stage | Fast | Medium | Multi-scale feature maps |
| RetinaNet | One-stage | Medium | High | Focal Loss + FPN |
| FCOS | One-stage | Fast | High | Anchor-free, per-pixel |
Key trends in object detection:
- Two-stage to one-stage: The speed-accuracy trade-off continues to improve
- Anchor-based to anchor-free: Fewer hyperparameters, simpler design
- Multi-task unification: Detection, segmentation, pose estimation, and more within a single framework
- Transformer-based approaches: Works like DETR (Detection Transformer) bring attention mechanisms to detection, eliminating hand-crafted post-processing steps like NMS