Object Detection
Overview
Object detection is one of the core tasks in computer vision. The goal is to simultaneously perform localization and classification — that is, to find all objects of interest in an image and output a bounding box and class label for each one.
Unlike image classification, which only outputs a single label, object detection must answer two questions at once:
- What objects are present in the image?
- Where is each object located?
Evaluation Metrics
IoU (Intersection over Union)
IoU is the standard metric for measuring the overlap between a predicted box and a ground-truth box:

\[
\text{IoU} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}
\]

where \(B_{\text{pred}}\) is the predicted bounding box and \(B_{\text{gt}}\) is the ground-truth bounding box. IoU lies in \([0, 1]\), and a detection is typically considered successful when \(\text{IoU} \geq 0.5\).
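A minimal Python sketch of this computation, assuming boxes in \((x_1, y_1, x_2, y_2)\) corner format (the function name and box format are illustrative):

```python
# IoU of two axis-aligned boxes in (x1, y1, x2, y2) corner format.
def iou(box_a, box_b):
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area is zero if the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two unit squares offset by half a side: IoU = 0.5 / 1.5 ≈ 0.333
print(iou((0, 0, 1, 1), (0.5, 0, 1.5, 1)))
```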
Precision and Recall
- Precision: Among all samples predicted as positive, the proportion that are truly correct: \(\text{Precision} = \frac{TP}{TP + FP}\)
- Recall: Among all truly positive samples, the proportion that are correctly predicted: \(\text{Recall} = \frac{TP}{TP + FN}\)
AP and mAP
AP (Average Precision) is the area under the Precision-Recall curve for a single class:

\[
\text{AP} = \int_0^1 p(r) \, dr
\]
mAP (mean Average Precision) is the average AP across all classes and serves as the most commonly used comprehensive evaluation metric for object detection:

\[
\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c
\]
Different benchmarks define mAP slightly differently: PASCAL VOC uses \(\text{IoU} = 0.5\) (i.e., mAP@0.5), while COCO averages over multiple IoU thresholds (mAP@[0.5:0.95]).
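As an illustration, here is a sketch of computing AP from sampled precision-recall points with all-point interpolation (the style used by VOC 2010+ and COCO evaluators); mAP is then simply the mean of this value over classes. Array names are illustrative:

```python
import numpy as np

def average_precision(recall, precision):
    # Pad the curve so it starts at recall 0 and ends at recall 1
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))

    # Interpolation: make precision monotonically non-increasing
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum rectangular areas over the points where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```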
Two-Stage Detectors
The core idea behind two-stage detectors is: first generate region proposals, then classify and refine each proposal.
R-CNN (2014)
R-CNN (Regions with CNN features) was the pioneering work that introduced deep learning to object detection:
- Selective Search: Generate approximately 2,000 candidate regions on the input image
- Feature Extraction: Resize each candidate region to a fixed size and pass it through a CNN (e.g., AlexNet) to extract features
- Classification: Use an SVM to classify the features of each candidate region
- Bounding Box Regression: Use a linear regressor to refine bounding box locations
Drawback: Each candidate region requires a separate forward pass through the CNN, making it extremely slow (~47 seconds per image).
Fast R-CNN (2015)
The key improvement in Fast R-CNN is shared convolutional computation:
- Whole-image convolution: Run the CNN once on the entire image to obtain a full feature map
- RoI Pooling: Project each candidate region onto the feature map and extract a fixed-length feature vector via the RoI Pooling layer
- Multi-task loss: Jointly train the classification head and regression head using a combined loss function

\[
L = L_{\text{cls}} + \lambda L_{\text{loc}}
\]
where \(L_{\text{cls}}\) is the classification loss (cross-entropy) and \(L_{\text{loc}}\) is the bounding box regression loss (Smooth L1).
Speedup: Training is ~9x faster than R-CNN; inference is ~213x faster.
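As a concrete illustration of the RoI Pooling step, here is a minimal sketch using torchvision's `roi_pool` operator; the feature map shape and the stride-16 assumption are illustrative:

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)          # (N, C, H, W) feature map
# One RoI in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 64.0, 64.0, 320.0, 256.0]])

# Project image-space RoIs onto the feature map (spatial_scale = 1/16 for
# a stride-16 backbone) and pool each one to a fixed 7x7 grid.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```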
Faster R-CNN (2015)
Faster R-CNN replaces Selective Search with a Region Proposal Network (RPN), enabling end-to-end training:
How the RPN works:
- At each spatial location on the feature map, place \(k\) predefined anchors (combinations of different scales and aspect ratios, typically \(k=9\))
- For each anchor, predict:
  - A foreground/background binary classification score (whether it contains an object)
  - Bounding box offsets (4 values: \(\Delta x, \Delta y, \Delta w, \Delta h\))
- Select the top-scoring proposals and pass them to the downstream classification and regression heads
Anchor regression: Given an anchor \((x_a, y_a, w_a, h_a)\) and predicted offsets \((t_x, t_y, t_w, t_h)\):

\[
x = x_a + t_x w_a, \quad y = y_a + t_y h_a, \quad w = w_a e^{t_w}, \quad h = h_a e^{t_h}
\]
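A minimal sketch of this decoding step, assuming anchors and offsets are stored as \((N, 4)\) NumPy arrays of centers and sizes:

```python
import numpy as np

def decode_anchors(anchors, deltas):
    xa, ya, wa, ha = anchors.T          # anchor centers and sizes
    tx, ty, tw, th = deltas.T           # predicted offsets
    x = xa + tx * wa                    # shift center by a fraction of anchor size
    y = ya + ty * ha
    w = wa * np.exp(tw)                 # scale width/height exponentially
    h = ha * np.exp(th)
    return np.stack([x, y, w, h], axis=1)
```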
Faster R-CNN achieved near real-time speed (~5 fps) on PASCAL VOC while maintaining high accuracy.
One-Stage Detectors
One-stage detectors skip the region proposal step and directly predict bounding boxes and class labels from the image, making them significantly faster.
YOLO v1 (2016)
YOLO (You Only Look Once) is the seminal one-stage detector, built on a remarkably simple core idea:
Core mechanism:
- Grid division: Divide the input image into an \(S \times S\) grid. Each grid cell is responsible for detecting objects whose center falls within that cell
- Each grid cell predicts:
  - \(B\) bounding boxes, each containing 5 values: \((x, y, w, h, \text{confidence})\)
  - \(C\) class conditional probabilities \(P(\text{Class}_i \mid \text{Object})\)
- Output tensor: \(S \times S \times (B \times 5 + C)\); e.g., with \(S = 7\), \(B = 2\), \(C = 20\) on PASCAL VOC, the output is \(7 \times 7 \times 30\)
The confidence score is defined as:

\[
\text{confidence} = P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}}
\]
The final class-specific confidence for each bounding box is:

\[
P(\text{Class}_i \mid \text{Object}) \times P(\text{Object}) \times \text{IoU}_{\text{pred}}^{\text{truth}} = P(\text{Class}_i) \times \text{IoU}_{\text{pred}}^{\text{truth}}
\]
Loss function (multi-task loss):

\[
\begin{aligned}
L ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
\]

where \(\mathbb{1}_{ij}^{\text{obj}}\) indicates that the \(j\)-th box predictor in cell \(i\) is responsible for an object, with \(\lambda_{\text{coord}} = 5\) and \(\lambda_{\text{noobj}} = 0.5\) in the paper.
The square root on width and height is used to mitigate scale imbalance between large and small objects.
Pros and cons:
- Pros: Extremely fast (45 fps); global reasoning reduces false positives on backgrounds
- Cons: Poor performance on small and densely packed objects; each grid cell can only predict a limited number of objects
YOLO v2 / YOLO9000 (2017)
YOLOv2 introduced several improvements over v1:
- Batch Normalization: Added BN after every convolutional layer, improving convergence and accuracy
- High-resolution classifier: Fine-tuned the classification network at \(448 \times 448\) before using it for detection
- Anchor Boxes: Adopted predefined anchor boxes (inspired by Faster R-CNN) instead of direct bounding box prediction
- Dimension Clusters: Used K-Means clustering on the bounding boxes in the training set to find optimal anchor shapes (see the sketch after this list)
- Multi-scale training: Randomly switched the input resolution during training (from \(320 \times 320\) to \(608 \times 608\), in multiples of 32)
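A sketch of the dimension-clustering step under the usual formulation (distance \(d = 1 - \text{IoU}\), comparing boxes by width and height only, as if centered at the origin); function names and \(k = 5\) are illustrative:

```python
import numpy as np

def iou_wh(boxes, centroids):
    # boxes: (N, 2) w/h pairs, centroids: (k, 2); broadcast to (N, k)
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Minimum distance d = 1 - IoU is the same as maximum IoU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids
```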
YOLO v3 (2018)
The core improvement in YOLOv3 is multi-scale prediction:
- Detection is performed on feature maps at 3 different scales (similar to FPN)
- Coarse (low-resolution) feature maps detect large objects; fine (high-resolution) feature maps detect small objects
- Uses Darknet-53 as the backbone network (incorporating residual connections)
- Multi-label classification: Replaces Softmax with independent logistic classifiers to support multi-label prediction
YOLO v5 (2020)
YOLOv5, released by Ultralytics, was never accompanied by a formal paper but gained widespread adoption due to its engineering optimizations:
- Implemented in PyTorch (previous versions were based on Darknet/C)
- Auto-anchor computation: Automatically optimizes anchors for the dataset
- Data augmentation: Mosaic augmentation (4-image stitching), MixUp, random affine transforms
- Model scaling: Offers five model sizes — n/s/m/l/x
- Training strategies: Cosine annealing learning rate, warmup, EMA
YOLO v8 (2023)
YOLOv8 is a newer architecture from Ultralytics, featuring major improvements:
- Anchor-Free detection head: Eliminates anchor boxes, directly predicting object centers and dimensions
- Decoupled Head: Classification and regression use separate convolutional branches
- C2f module: An improved CSP bottleneck structure for more efficient feature extraction
- Distribution Focal Loss: Used for bounding box regression
- Unified framework: Supports detection, segmentation, classification, pose estimation, and more within a single framework
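For reference, a minimal sketch of running a pretrained YOLOv8 model through the ultralytics Python package (`pip install ultralytics`); the image path is a placeholder:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # nano variant; downloads weights if needed
results = model("path/to/image.jpg")  # run inference on one image

for r in results:
    # Each result holds boxes in xyxy format plus class ids and confidences
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```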
SSD (Single Shot MultiBox Detector, 2016)
SSD is another important one-stage detector:
Core idea: Perform detection simultaneously on feature maps at multiple resolutions, with each feature map responsible for detecting objects at different scales.
Architecture highlights:
- Base network: Uses VGG-16 as the feature extractor, with its fully connected layers converted into convolutional layers
- Multi-scale feature maps: Adds multiple progressively smaller convolutional layers after the base network (e.g., \(38 \times 38\), \(19 \times 19\), \(10 \times 10\), \(5 \times 5\), \(3 \times 3\), \(1 \times 1\))
- Default boxes: Places default boxes of different aspect ratios at each feature map location (similar to anchors)
- Prediction: Predicts class scores and bounding box offsets for each default box
Loss function:

\[
L = \frac{1}{N} \left( L_{\text{conf}} + \alpha L_{\text{loc}} \right)
\]
where \(L_{\text{conf}}\) is the confidence loss, \(L_{\text{loc}}\) is the localization loss, and \(N\) is the number of matched default boxes.
SSD uses Hard Negative Mining: negative samples are sorted by confidence loss and only the highest-loss ones are kept, maintaining a positive-to-negative ratio of approximately \(1:3\) (a sketch of this selection follows).
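A sketch of this selection step for one image's per-box confidence losses; the tensor names are illustrative, and the \(3:1\) default follows the description above:

```python
import torch

def hard_negative_mask(conf_loss, positive_mask, neg_pos_ratio=3):
    # conf_loss: (M,) per-box confidence loss; positive_mask: (M,) bool
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~positive_mask).sum()))

    # Rank negatives by loss, highest first; exclude positives from the ranking
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = -float("inf")
    _, idx = neg_loss.sort(descending=True)

    negative_mask = torch.zeros_like(positive_mask)
    negative_mask[idx[:num_neg]] = True
    # Train on all positives plus the hardest negatives only
    return positive_mask | negative_mask
```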
Non-Maximum Suppression (NMS)
NMS is a critical post-processing step in object detection for removing redundant detections:
Algorithm:
1. Sort all detection boxes by confidence in descending order
2. Select the box with the highest confidence and add it to the final results
3. Compute the IoU between this box and all remaining boxes
4. Remove all boxes with IoU greater than a threshold \(\tau\) (typically \(\tau = 0.5\))
5. Repeat steps 2–4 until all boxes have been processed
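A minimal NumPy sketch of the greedy procedure above, assuming boxes in \((x_1, y_1, x_2, y_2)\) format (variable names are illustrative):

```python
import numpy as np

def nms(boxes, scores, tau=0.5):
    order = scores.argsort()[::-1]          # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                      # step 2: keep the best box

        # step 3: IoU of the best box against all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)

        # step 4: drop boxes overlapping the kept one by more than tau
        order = order[1:][iou <= tau]
    return keep
```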
Soft-NMS: Standard NMS hard-deletes overlapping boxes, which may incorrectly remove valid detections in crowded scenes. Soft-NMS instead reduces the confidence of overlapping boxes; in the Gaussian variant:

\[
s_i \leftarrow s_i \, e^{-\frac{\text{IoU}(M, b_i)^2}{\sigma}}
\]
where \(M\) is the current highest-scoring box, \(b_i\) is another box, and \(\sigma\) controls the decay rate.
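Under the Gaussian formulation above, the rescoring step can be sketched as:

```python
import numpy as np

def soft_nms_rescore(scores, ious, sigma=0.5):
    # scores: (N,) confidences of the remaining boxes
    # ious:   (N,) IoU of each remaining box with the current best box M
    return scores * np.exp(-(ious ** 2) / sigma)
```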
Focal Loss and Class Imbalance
Background
One-stage detectors with dense sampling produce an overwhelming number of negative samples (background regions), with positive-to-negative ratios reaching \(1:1000\). The multitude of easy-to-classify negatives dominates the loss function, drowning out useful learning signals.
Focal Loss (2017, RetinaNet)
Focal Loss adds a modulating factor to the standard cross-entropy loss, down-weighting the contribution of easily classified samples:

\[
\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)
\]
where:
- \(p_t\) is the model's predicted probability for the true class
- \((1 - p_t)^{\gamma}\) is the modulating factor, with focusing parameter \(\gamma \geq 0\)
- \(\alpha_t\) is the class-balancing weight
When \(\gamma = 0\), this reduces to standard cross-entropy. When \(\gamma = 2\) (a commonly used value):
- For easily classified samples (\(p_t = 0.9\)): weight is \((1-0.9)^2 = 0.01\), dramatically reducing the loss
- For hard samples (\(p_t = 0.1\)): weight is \((1-0.1)^2 = 0.81\), keeping the loss nearly unchanged
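A sketch of the binary form of this loss in PyTorch, matching the formula above; \(\gamma = 2\) and \(\alpha = 0.25\) are the commonly cited defaults, and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets are 0/1 floats of the same shape as logits
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Modulating factor (1 - p_t)^gamma down-weights easy samples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```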
RetinaNet combines Focal Loss with an FPN (Feature Pyramid Network) backbone, making a one-stage detector surpass two-stage detectors in accuracy for the first time.
Anchor-Free Methods
Anchor-free methods eliminate the reliance on predefined anchor boxes, avoiding the hyperparameter tuning that anchor design entails.
CenterNet (2019)
CenterNet reformulates object detection as a keypoint detection problem:
- Uses a heatmap to predict the center point of each object
- Regresses object width and height at each center point
- Requires no NMS post-processing (or uses simple peak extraction)
The loss function uses a modified Focal Loss for heatmaps and an L1 loss for size regression.
FCOS (Fully Convolutional One-Stage Object Detection, 2019)
FCOS proposes a fully pixel-level prediction approach to detection:
- For each location on the feature map, predicts distances to the four sides of the bounding box \((l, t, r, b)\)
- Introduces a center-ness branch to suppress low-quality predictions far from the object center (see the sketch after this list):

\[
\text{centerness} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}
\]
- Combines with FPN for multi-scale prediction, with different levels handling different object scales
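A minimal PyTorch sketch of computing this center-ness target from per-location \((l, t, r, b)\) distances (the \((N, 4)\) tensor layout is an illustrative assumption):

```python
import torch

def centerness(ltrb):
    # ltrb: (N, 4) distances from each location to the four box sides
    l, t, r, b = ltrb.unbind(dim=1)
    # Close to 1 near the box center, approaching 0 toward the edges
    return torch.sqrt((torch.minimum(l, r) / torch.maximum(l, r)) *
                      (torch.minimum(t, b) / torch.maximum(t, b)))
```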
Summary and Comparison
| Method | Type | Speed | Accuracy | Key Features |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | Medium | High | RPN + RoI Pooling |
| YOLO v1 | One-stage | Fast | Medium | Global reasoning, simple |
| YOLO v8 | One-stage | Fast | High | Anchor-free, multi-task |
| SSD | One-stage | Fast | Medium | Multi-scale feature maps |
| RetinaNet | One-stage | Medium | High | Focal Loss + FPN |
| FCOS | One-stage | Fast | High | Anchor-free, per-pixel |
Key trends in object detection:
- Two-stage to one-stage: The speed-accuracy trade-off continues to improve
- Anchor-based to anchor-free: Fewer hyperparameters, simpler design
- Multi-task unification: Detection, segmentation, pose estimation, and more within a single framework
- Transformer-based approaches: Works like DETR (Detection Transformer) bring attention mechanisms to detection, eliminating hand-crafted post-processing steps like NMS