
Loss Functions

This report is suitable both for those just beginning to study deep learning and for those revisiting the concept of loss functions after having covered most of the fundamentals. The concept of loss functions pervades the entire deep learning process, and examining it from different angles yields different insights.

This report systematically organizes the essential understanding of loss functions from the perspectives of statistics, information theory, physics, economics, and other disciplines, while also covering key details of engineering implementation. It is intended for researchers and engineers who wish to build a deep conceptual framework.


How to Read This Report

This report is divided into six progressively layered modules:

| Module | Question Addressed |
| --- | --- |
| Chapter 1: Concept Clarification | What exactly is the difference between "loss" and "cost"? |
| Chapter 2: Disciplinary Perspectives | How do scholars from different fields understand the nature of loss functions? |
| Chapter 3: Task Summary | What loss functions are available for specific tasks? |
| Chapter 4: Frontier Developments | What loss functions are used in LLMs, self-supervised learning, etc.? |
| Chapter 5: Engineering Practice | What are the common pitfalls in code implementation? |
| Chapter 6: Selection Guide | Given a task, which loss function should I use? |

Suggested reading path: Beginners should read sequentially; experienced readers may jump directly to Chapter 3 or Chapter 6.


Table of Contents

  1. Cross-Dimensional Alignment of Five Core Concepts
  2. Loss Functions from Interdisciplinary Perspectives
  3. Comprehensive Summary of Loss Functions by Task
  4. Modern Frontier Loss Functions
  5. Engineering Implementation: Framework Differences and Common Pitfalls
  6. Selection Guide and Summary
  7. References and Further Reading

1. Cross-Dimensional Alignment of Five Core Concepts

In various literature, the five terms loss, cost, objective, error, and risk are frequently used interchangeably, causing tremendous confusion for beginners. This chapter thoroughly aligns these concepts along three dimensions: scope of application, constituent components, and disciplinary origin.

1.1 Full Concept Comparison Matrix

| Concept | Common Notation | Scope | Composition | Disciplinary Origin |
| --- | --- | --- | --- | --- |
| Loss Function | \(L,\ \ell,\ \mathcal{L}\) | Single sample | Prediction error only | Statistical decision theory |
| Cost Function | \(J,\ C\) | Entire dataset | Mean/sum of losses + regularization terms | Classical optimization theory |
| Objective Function | \(f,\ \Phi,\ \mathcal{O}\) | Optimization process | Anything to be optimized (incl. reward) | Mathematical programming / OR |
| Error Function | \(E,\ \epsilon\) | Performance metric | Pure numerical deviation (e.g., \(y - \hat{y}\)) | Classical engineering / control |
| Risk Function | \(R,\ \mathcal{R}\) | Probabilistic expectation | Expected value of the loss over a probability distribution | Statistical learning theory |

1.2 Hierarchical Relationships: The Containment Structure of the Five Concepts

The five concepts form a clear hierarchy rather than a flat list:

Objective Function                 ← The most general concept; encompasses everything to be optimized
    └── Cost Function              ← The concrete form of the objective in supervised learning
            └── Loss Function      ← The per-sample decomposition of the cost function
                    └── Error Function ← The simplest form of loss (pure deviation)

Risk Function                      ← An independent dimension: the probabilistic expectation of the loss over an unknown distribution

From Loss to Cost

The cost function \(J(\theta)\) aggregates the loss function \(\ell\) over the training set, typically with an additional regularization term:

\[ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \lambda \cdot \Omega(\theta) \]

Mnemonic: Loss starts with "Lonely" — the loss function operates on a single, isolated sample.
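As a minimal sketch of this aggregation (function and variable names are illustrative), the formula maps directly onto PyTorch:

```python
import torch
import torch.nn as nn

# Sketch of J(theta): per-sample losses averaged into a data term, plus an
# L2 penalty Omega(theta) = ||theta||^2. Names here are illustrative.
def cost(model, x, y, lam=1e-3):
    y_hat = model(x)
    per_sample = (y - y_hat) ** 2                          # loss l(y_i, y_hat_i), one per sample
    data_term = per_sample.mean()                          # (1/n) * sum of losses
    reg = sum((p ** 2).sum() for p in model.parameters())  # Omega(theta)
    return data_term + lam * reg                           # J(theta)

model = nn.Linear(1, 1)
x = torch.randn(8, 1)
y = 3 * x
J = cost(model, x, y)   # a scalar tensor, ready for J.backward()
```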

From Cost to Objective

The objective function is the higher-level concept:

  • In supervised learning, the objective is typically to minimize the cost function;
  • In reinforcement learning, the objective may be to maximize cumulative reward;
  • In GANs, the objective is a minimax game.

From Loss to Risk

| Dimension | Empirical Risk | Expected Risk (True Risk) |
| --- | --- | --- |
| Definition | Mean of the loss over the training set | Expected value of the loss over the true data distribution |
| Formula | \(R_{emp} = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))\) | \(R = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))]\) |
| Computability | ✅ Directly computable | ❌ True distribution \(P\) is unknown |

The essence of model training is: using Empirical Risk Minimization (ERM) to approximate true risk minimization. Generalization capability is determined by the gap between the two (the generalization error).

1.3 Practical Example: All Five Concepts in a Single Project

Example: Training a House Price Prediction Model

Suppose you are using a neural network to predict house prices in a city:

  • Error: For house #3, the model predicts \(\hat{y}_3 = 150\) (in ten-thousands), while the true price is \(y_3 = 180\). The error is \(e_3 = y_3 - \hat{y}_3 = 30\).
  • Loss: Quantifying the error as a "penalty score" for the model. Using MSE: \(\ell_3 = (30)^2 = 900\).
  • Cost: Averaging the loss over all 1,000 training houses, plus L2 regularization to prevent overfitting: \(J(\theta) = \frac{1}{1000}\sum_{i=1}^{1000} \ell_i + 0.001 \|\theta\|^2\).
  • Objective: The overall direction of optimization — "minimize \(J(\theta)\)." In different frameworks, the objective may also include constraints (e.g., parameter range restrictions).
  • Risk: What you truly care about is the expected error on all future houses (including those not yet listed). Training can only minimize empirical risk; generalization is the ultimate goal.

2. Loss Functions from Interdisciplinary Perspectives

Loss functions are a product of converging insights from multiple disciplines. For the same MSE formula, a statistician, a physicist, and an economist will offer entirely different yet equally profound interpretations.

2.1 The Statistical Perspective: Prior Assumptions About Noise Distributions

Core Insight

Choosing a loss function is fundamentally an act of probabilistically modeling the noise in the data. Different loss functions correspond to different prior noise distributions, derived through Maximum Likelihood Estimation (MLE).

\[ \hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n p(y_i | x_i; \theta) \equiv \arg\min_\theta \sum_{i=1}^n -\log p(y_i | x_i; \theta) \]

Correspondence Between Loss Functions and Noise Distributions

| Loss Function | Corresponding Noise Distribution | Probability Density Form | Key Property |
| --- | --- | --- | --- |
| MSE (L2 Loss) | Gaussian \(\mathcal{N}(\mu, \sigma^2)\) | \(p \propto e^{-\frac{(y-\hat{y})^2}{2\sigma^2}}\) | Sensitive to outliers (squared penalty) |
| MAE (L1 Loss) | Laplace \(\text{Laplace}(\mu, b)\) | \(p \propto e^{-\frac{\lvert y-\hat{y} \rvert}{b}}\) | Robust to outliers (linear penalty) |
| Pinball Loss | Asymmetric Laplace | Asymmetric exponential decay | Used for quantile regression |
| Cross-Entropy | Bernoulli / Multinomial | \(p \propto \hat{p}^y (1-\hat{p})^{1-y}\) | Standard loss for classification |

Deeper implication: When you switch your loss function, you are effectively declaring to the model: "I believe the noise in the data follows a specific statistical pattern." This elevates loss function selection from empirical trial-and-error to a theoretically grounded modeling decision.
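The MSE row of this correspondence can be checked numerically: with a fixed \(\sigma\), the Gaussian negative log-likelihood equals \(\text{MSE}/(2\sigma^2)\) plus a constant, so both have the same minimizer. A small sketch with illustrative values:

```python
import math
import torch

# With sigma fixed, Gaussian NLL = MSE / (2 sigma^2) + const, so minimizing
# either objective yields the same theta. Values below are illustrative.
y = torch.tensor([1.0, 2.0, 3.0])
y_hat = torch.tensor([1.1, 1.9, 3.3])
sigma = 1.0

mse = ((y - y_hat) ** 2).mean()
nll = (0.5 * math.log(2 * math.pi * sigma ** 2)
       + (y - y_hat) ** 2 / (2 * sigma ** 2)).mean()
const = 0.5 * math.log(2 * math.pi)   # the part independent of theta
# nll - const == mse / 2 (for sigma = 1)
```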

Practical Example: Anomalous Readings in Sensor Data

Example: Predicting Equipment Failures from Factory Temperature Sensors

A factory has deployed 500 temperature sensors that collect data hourly to predict equipment overheating. Approximately 5% of readings are anomalous (sudden spikes caused by sensor malfunctions, e.g., a reading of 9999°C).

  • With MSE: The squared error from anomalous readings is massively amplified, pulling model parameters toward "accommodating the outliers" and severely degrading prediction accuracy for normal data.
  • With MAE: Anomalous readings incur only a linear penalty; the model learns to "ignore" them and focuses on fitting the normal sensor reading distribution.
  • With Huber Loss (\(\delta = 3\)): Small errors (\(|r| \leq 3\)) are handled with MSE-like precision, while large errors (anomalous readings) are downgraded to linear penalties — achieving the best of both worlds.

Empirical finding: On industrial datasets containing 5% anomalous readings, Huber Loss typically reduces validation RMSE by approximately 15–30% compared to MSE.
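The contrast between the three losses can be seen directly on a toy batch of residuals that includes one sensor glitch (numbers are illustrative; `delta=3` matches the example above):

```python
import torch
import torch.nn.functional as F

# Per-residual penalties: the glitch (9999C reading vs ~80C truth, residual
# ~9919) dominates MSE but is only linearly penalized by MAE and Huber.
residuals = torch.tensor([0.5, -1.2, 2.0, 9919.0])
zeros = torch.zeros_like(residuals)

mse = F.mse_loss(residuals, zeros, reduction='none')
mae = F.l1_loss(residuals, zeros, reduction='none')
huber = F.huber_loss(residuals, zeros, delta=3.0, reduction='none')
# huber[0] = 0.5 * 0.5^2 = 0.125 (MSE regime); huber[-1] grows only linearly
```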


2.2 The Information-Theoretic Perspective: "Communication Cost" Between Distributions

Core Insight

Information theory views a model as an encoder. The loss function measures the extra information cost (in bits) incurred when using the predicted distribution \(Q\) to describe the true distribution \(P\).

KL Divergence (Relative Entropy)

\[ D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \]
  • Measures the information loss when approximating \(P\) with \(Q\);
  • Asymmetric: \(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\);
  • Non-negative: \(D_{KL}(P\|Q) \geq 0\), with equality if and only if \(P = Q\).

The Equivalence Between Cross-Entropy and KL Divergence

\[ H(P, Q) = H(P) + D_{KL}(P \| Q) \]

In supervised learning, the true distribution \(P\) is fixed (its entropy \(H(P)\) is a constant), so:

\[ \arg\min_Q H(P, Q) \equiv \arg\min_Q D_{KL}(P \| Q) \]

Trinity conclusion: Minimizing cross-entropy \(\equiv\) minimizing KL divergence \(\equiv\) Maximum Likelihood Estimation (MLE). These three formulations describe the same thing in different languages, originating from information theory, statistical distance theory, and frequentist statistics, respectively.
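The decomposition \(H(P, Q) = H(P) + D_{KL}(P \| Q)\) can be verified numerically on a toy distribution:

```python
import torch

# Numerical check of H(P, Q) = H(P) + D_KL(P || Q) for a 3-class example.
P = torch.tensor([0.7, 0.2, 0.1])   # true distribution
Q = torch.tensor([0.5, 0.3, 0.2])   # predicted distribution

H_P = -(P * P.log()).sum()          # entropy of P (a constant w.r.t. the model)
H_PQ = -(P * Q.log()).sum()         # cross-entropy
KL = (P * (P / Q).log()).sum()      # KL divergence
# H_PQ == H_P + KL, so minimizing one minimizes the other
```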

Various Losses from an Information-Theoretic Perspective

| Loss Function | Information-Theoretic Interpretation |
| --- | --- |
| Cross-Entropy | Average number of bits needed to encode true labels using the predicted distribution |
| KL Divergence | Number of "wasted" bits between predicted and true distributions |
| InfoNCE | Maximizes a lower bound on mutual information between positive pairs |
| ELBO (Variational Lower Bound) | Maximizes a lower bound on the log-likelihood of observed data; used in VAEs |

Practical Example: Spam Classifier

Example: Understanding Email Classifier Training Through Information Theory

Training a binary classifier to determine whether an email is spam (1 = spam, 0 = legitimate).

Suppose an email has true label \(y=1\) (spam) and the model outputs two different probabilities:

| Model State | Predicted spam probability \(p\) | Cross-entropy loss \(-\log p\) | Information-theoretic reading |
| --- | --- | --- | --- |
| Early training | 0.10 | 2.30 nats | 2.30 nats needed to encode "this is spam" |
| Mid training | 0.70 | 0.36 nats | Model has "guessed most of it"; only 0.36 nats of extra info needed |
| Well trained | 0.98 | 0.02 nats | Model is nearly certain; almost no extra information needed |

(The values use the natural logarithm, as deep learning frameworks do, so the unit is the nat rather than the bit; in bits, \(-\log_2 0.1 \approx 3.32\).)

The training process is essentially compressing the information gap between "the true answer" and "the model's guess", driving the predicted distribution toward the true distribution.


2.3 The Physics Perspective: Energy Equilibria and Potential Energy Surfaces

Core Insight

Physicists view the loss function as the system's energy function. The training process corresponds to searching for stable low-energy states in a high-dimensional "energy landscape."

Energy-Based Models (EBMs)

  • Low energy \(E(x)\) corresponds to high probability \(p(x)\);
  • Training objective: "push down" the energy of real data and "push up" the energy of fake data;
  • The Boltzmann distribution bridges energy and probability:
\[ p(x) = \frac{e^{-E(x)/T}}{Z} \]

where \(T\) is the temperature parameter and \(Z = \int e^{-E(x)/T} dx\) is the partition function (normalization constant).

The Physical Origin of Softmax

The softmax function is a direct application of the Boltzmann distribution in the discrete case:

\[ \text{softmax}(z_k) = \frac{e^{z_k/T}}{\sum_j e^{z_j/T}} \]
| Temperature \(T\) | Distribution Shape | Corresponding Behavior |
| --- | --- | --- |
| \(T \to 0\) | Tends toward argmax | Deterministic decision (pick highest class) |
| \(T = 1\) | Standard softmax | Normal training |
| \(T \to \infty\) | Tends toward uniform | Complete uncertainty (random guessing) |
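The three temperature regimes can be reproduced in a few lines (logit values are illustrative):

```python
import torch

# The same logits sharpened or flattened by temperature T.
logits = torch.tensor([2.0, 1.0, 0.1])

def tempered_softmax(z, T):
    return torch.softmax(z / T, dim=0)

cold = tempered_softmax(logits, 0.2)    # near one-hot: argmax-like
warm = tempered_softmax(logits, 1.0)    # standard softmax
hot = tempered_softmax(logits, 10.0)    # near uniform
```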

Geometric Properties of the Loss Landscape

| Property | Description | Impact on Optimization |
| --- | --- | --- |
| Non-convexity | Numerous local minima and saddle points | SGD noise helps escape saddle points |
| Flat, wide valleys | Solutions with good generalization tend to lie in wide, flat valleys | Motivates optimizers like SAM |
| Vanishing-gradient regions | Gradient approaches zero in saturation regions of activations | Inspired ReLU and residual connections |

Practical Example: The Role of Temperature in LLM Inference

Example: ChatGPT's "Creative Mode" vs. "Precise Mode"

When generating text, large language models sample from the vocabulary at each step via temperature-scaled softmax:

  • Low temperature (\(T = 0.2\)): Probability concentrates on high-scoring tokens; output is conservative and repetitive. Suitable for code generation, factual Q&A, and other scenarios requiring deterministic answers.
  • High temperature (\(T = 1.5\)): Probability distribution becomes more uniform; output is diverse and creative, but also more prone to "hallucination." Suitable for brainstorming, story writing, and similar scenarios.
  • Top-p / Top-k sampling: An engineering refinement of the temperature mechanism — first truncating low-probability tokens (to avoid sampling completely irrelevant words), then applying temperature scaling. This is the mainstream approach for LLM inference today.

This is a direct engineering embodiment of "temperature controlling the system's entropy" in the Boltzmann distribution.


2.4 The Economics and Decision Theory Perspective: Regret and Utility

Core Insight

Economics interprets the loss function as a tool for quantifying the cost of decisions, focusing on how to make optimal choices under uncertainty and emphasizing that the costs of different error types are highly asymmetric.

Regret Minimization

Regret quantifies the gap between "how well I could have done if I had known the truth" and "how well I actually did":

\[ \text{Regret}_T = \sum_{t=1}^T \ell(y_t, \hat{y}_t) - \min_{f^*} \sum_{t=1}^T \ell(y_t, f^*(x_t)) \]

In the online learning framework, the goal is to achieve sublinear cumulative regret (\(o(T)\)) — meaning the algorithm's average regret approaches zero over time.

Loss Aversion and Asymmetric Costs

Based on Prospect Theory (Kahneman & Tversky, 1979):

  • Humans perceive losses approximately 2.25 times more intensely than equivalent gains;
  • The costs of prediction errors are highly asymmetric across different scenarios.

This has inspired a family of cost-sensitive learning techniques:

| Technique | Core Idea | Application Scenario |
| --- | --- | --- |
| Cost Matrix | Explicitly define costs for different error types | Medical diagnosis, fraud detection |
| Focal Loss | Dynamically down-weight easy samples; focus on hard ones | Object detection (extreme class imbalance) |
| Pinball Loss (\(\tau \neq 0.5\)) | Asymmetrically penalize overestimation vs. underestimation | Supply chain risk management, financial forecasting |
| Weighted Cross-Entropy | Weight inversely proportional to class frequency | Medical image segmentation |

Utility Theory's Influence

In reinforcement learning, the reward function is essentially the negative form of a utility function. Reward shaping — designing carefully crafted intermediate reward structures to guide agent learning — is highly isomorphic to incentive design in economics.

Practical Example: Credit Card Fraud Detection at Banks

Example: The Vastly Different Costs of Two Error Types

A bank uses a machine learning model to determine whether credit card transactions are fraudulent. The data is extremely imbalanced: legitimate transactions account for 99.8%, while fraud accounts for only 0.2%. The costs of the two error types are drastically different:

| Error Type | Meaning | Actual Cost |
| --- | --- | --- |
| False Negative (missed fraud) | Fraud classified as legitimate; transaction amount lost | Average loss of $5,000 per incident |
| False Positive (false alarm) | Legitimate transaction blocked; degraded user experience | ~$2 customer service cost + churn risk per incident |

With standard cross-entropy: The model learns to "predict everything as legitimate," achieving 99.8% accuracy but providing zero practical value.

Solution (Cost matrix + weighted loss):

import torch
import torch.nn as nn

# Cost-matrix thinking: weight a missed fraud 2,500x a false alarm (5000 / 2)
pos_weight = torch.tensor([2500.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

This makes the model treat "missing one fraud case" as equivalent to "falsely flagging 2,500 legitimate transactions" during training, thereby correcting the bias introduced by data imbalance at the loss function level.


3. Comprehensive Summary of Loss Functions by Task

3.1 Regression Tasks: Pursuing Accuracy in Continuous Values

Formula and Property Quick Reference

| Function Name | Symbol | Mathematical Expression | Gradient Properties | Applicable Scenario |
| --- | --- | --- | --- | --- |
| Mean Squared Error (MSE) | \(L_2\) | \(\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\) | Smooth; grows linearly with the error | Optimal when noise is Gaussian |
| Mean Absolute Error (MAE) | \(L_1\) | \(\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert\) | Constant (\(\pm 1\)), non-differentiable at zero | More robust with outliers |
| Huber Loss | \(L_\delta\) | \(\begin{cases} \frac{1}{2}r^2, & \lvert r \rvert \leq \delta \\ \delta(\lvert r \rvert - \frac{\delta}{2}), & \text{otherwise} \end{cases}\) | Smooth transition between regimes | Best compromise between MSE and MAE |
| Log-Cosh Loss | \(\text{log-cosh}\) | \(\sum_i \log\cosh(\hat{y}_i - y_i)\) | Twice continuously differentiable | MAE alternative when second-order optimization is needed |
| Quantile Loss (Pinball) | \(L_\tau\) | \(\max(\tau r,\ (\tau-1)r),\ r=y-\hat{y}\) | Asymmetric | Quantile regression, risk modeling |

Intuitive Understanding of Huber Loss

  • When \(|r| \leq \delta\) (small errors): behaves like MSE with smooth gradients approaching zero, facilitating fine-grained tuning;
  • When \(|r| > \delta\) (large errors / outliers): behaves like MAE with constant gradients, preventing outliers from "dragging" the model;
  • The hyperparameter \(\delta\) defines the boundary between "normal errors" and "outlier errors."

Practical Example: Power Demand Forecasting with Pinball Loss

Example: Asymmetric Risk in Power Grid Dispatch

Grid operators need to predict the next day's electricity demand 24 hours in advance. The costs of the two types of forecast errors are very different:

  • Underestimating demand: Insufficient power supply, potentially causing blackouts — extremely costly;
  • Overestimating demand: Excess reserve power, wasted costs — relatively inexpensive.

Therefore, operators need not a mean forecast but a 90th percentile forecast (i.e., actual demand will not exceed the prediction in 90% of cases).

Using Pinball Loss with \(\tau = 0.9\):

\[ L_{0.9}(r) = \begin{cases} 0.9 \cdot r, & r \geq 0 \text{ (underestimate, heavy penalty)} \\ -0.1 \cdot r, & r < 0 \text{ (overestimate, light penalty)} \end{cases} \]

This causes the model to learn "better to overestimate than to underestimate," naturally producing conservative upper-bound predictions that directly reflect the business's risk preferences.
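A minimal implementation of the piecewise formula above (with illustrative demand values) makes the asymmetry concrete:

```python
import torch

# Pinball (quantile) loss for tau = 0.9: underestimates cost nine times
# more than overestimates of the same magnitude.
def pinball_loss(y, y_hat, tau=0.9):
    r = y - y_hat
    return torch.maximum(tau * r, (tau - 1) * r).mean()

y = torch.tensor([100.0])                         # actual demand
under = pinball_loss(y, torch.tensor([90.0]))     # underestimate by 10 -> 0.9 * 10 = 9.0
over = pinball_loss(y, torch.tensor([110.0]))     # overestimate by 10 -> 0.1 * 10 = 1.0
```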


3.2 Classification Tasks: Pursuing Overlap of Probability Distributions

Binary Cross-Entropy (BCE)

Applicable to binary classification or multi-label classification (each label judged independently):

\[ L_{BCE} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] \]

Categorical Cross-Entropy (CCE)

Paired with softmax, this is the de facto standard for multi-class classification:

\[ L_{CCE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^K y_{ic} \log p_{ic} \]

Focal Loss

Proposed by Lin et al. (2017) to address extreme sample imbalance:

\[ L_{FL} = -\alpha_t (1 - p_t)^\gamma \log p_t \]
  • \(p_t\): the model's predicted probability for the correct class;
  • \((1-p_t)^\gamma\): modulating factor — when the model has already "learned" a sample (\(p_t \to 1\)), this factor approaches 0, reducing that sample's contribution to the total loss;
  • \(\gamma\): focusing parameter (typically set to 2), controlling the decay rate;
  • \(\alpha_t\): class balancing factor.

Hinge Loss

The soul of SVMs, designed to find the maximum-margin decision boundary:

\[ L_{Hinge} = \frac{1}{n}\sum_{i=1}^n \max\left(0,\ 1 - y_i \cdot f(x_i)\right) \]
  • When a sample is correctly classified with sufficient confidence (\(y_i \cdot f(x_i) \geq 1\)), the loss is 0;
  • Otherwise, the loss grows linearly with the degree of "insufficient confidence";
  • It does not care "by how much the prediction is correct" — only whether it exceeds the safety margin.
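These three properties can be captured in a few lines (scores are illustrative; labels are in \(\{-1, +1\}\) as the formula requires):

```python
import torch

# Hinge loss: zero once the margin y * f(x) reaches 1, linear below that.
def hinge_loss(scores, labels):
    return torch.clamp(1 - labels * scores, min=0).mean()

labels = torch.tensor([1.0, 1.0, -1.0])
scores = torch.tensor([2.0, 0.3, -0.5])   # confident / inside margin / inside margin
loss = hinge_loss(scores, labels)          # (0 + 0.7 + 0.5) / 3 = 0.4
```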

Practical Example: Focal Loss in Medical Imaging

Example: Lung Nodule Detection in CT Scans (positive-to-negative ratio ~1:500)

In lung CT scans, a single image may contain thousands of region proposals, but fewer than 0.2% actually contain nodules.

Problem with standard cross-entropy: During training, gradients from "no-nodule" background proposals account for 99.8% of the total gradient. The model is entirely dominated by "how to recognize background" and fails to learn "what nodules look like."

Focal Loss solution (with \(\gamma=2\)):

| Sample Type | Typical \(p_t\) | Modulating factor \((1-p_t)^2\) | Relative Weight |
| --- | --- | --- | --- |
| Easy background | 0.98 | 0.0004 | Greatly reduced |
| Hard background | 0.70 | 0.09 | Moderately retained |
| Nodule candidate | 0.20 | 0.64 | Heavily emphasized |

RetinaNet (Lin et al., 2017) was the first to apply Focal Loss to object detection, achieving accuracy on the COCO dataset comparable to two-stage methods (Faster R-CNN) with a single-stage detector.


3.3 Metric Learning and Geometric Spaces: Pursuing Clustering in Embeddings

These losses do not directly predict class labels. Instead, they learn an embedding space such that:

  • Same-class samples are close in the space (cohesion);
  • Different-class samples are far apart in the space (separation).

Contrastive Loss

For sample pairs \((x_1, x_2)\), where \(y=1\) indicates same class and \(y=0\) indicates different classes:

\[ L_{Contrastive} = y \cdot d^2 + (1-y) \cdot \max(0,\ m - d)^2 \]

where \(d = \|f(x_1) - f(x_2)\|_2\) and \(m\) is the margin hyperparameter.

Triplet Loss

For triplets \((a, p, n)\) (anchor, positive, negative):

\[ L_{Triplet} = \max\left(0,\ d(a, p) - d(a, n) + \alpha\right) \]
  • Requires the positive sample to be closer to the anchor than the negative sample by at least \(\alpha\) (the safety margin);
  • Key challenge: Hard negative mining — most randomly sampled triplets are "easy" and contribute nothing to learning.
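The formula translates directly into code; the 2-D embeddings below are illustrative:

```python
import torch

# Triplet loss matching the formula above.
def triplet_loss(a, p, n, alpha=0.2):
    d_ap = (a - p).norm(dim=-1)   # anchor-positive distance
    d_an = (a - n).norm(dim=-1)   # anchor-negative distance
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()

a = torch.tensor([[0.0, 0.0]])
p = torch.tensor([[0.1, 0.0]])    # close positive
n = torch.tensor([[1.0, 0.0]])    # far negative
loss = triplet_loss(a, p, n)      # 0.1 - 1.0 + 0.2 < 0, so the loss is 0
```

PyTorch also ships `nn.TripletMarginLoss`, which implements the same objective with additional options.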

Contrastive Loss vs. Triplet Loss

| Dimension | Contrastive Loss | Triplet Loss |
| --- | --- | --- |
| Input unit | Sample pair (2) | Triplet (3) |
| Training signal density | Lower | Higher |
| Combinatorial complexity | \(O(n^2)\) | \(O(n^3)\) |
| Sensitivity to hard samples | Lower | Higher; requires hard-negative mining |

Practical Example: Face Recognition and Product Retrieval

Example A: FaceNet's Face Recognition (Google, 2015)

Google's FaceNet uses triplet loss to train face embeddings:

  • Anchor: A frontal photo of a person;
  • Positive: Another photo of the same person (different lighting, angle, expression);
  • Negative: A photo of a different person.

Training objective: make the distance between any two photos of the same person in the embedding space less than the maximum intra-class distance, while ensuring inter-person distances exceed \(\alpha\) (the safety margin).

On the LFW dataset, FaceNet achieved 99.63% recognition accuracy, subsequently becoming the industrial standard architecture for face recognition.

Example B: Visual Search on E-commerce Platforms

The "snap and search" feature on platforms like Taobao and Pinterest uses contrastive or triplet loss to train product image embeddings:

  • Different shooting angles of the same product → close in the embedding space;
  • Different product categories (shirt vs. shoes) → far apart in the embedding space.

When a user uploads an image, the system performs approximate nearest neighbor (ANN) search in the embedding space, returning results within milliseconds.

4. Modern Frontier Loss Functions

4.1 Preference Alignment for Large Language Models (LLMs)

Background: The Challenges of RLHF

RLHF (Reinforcement Learning from Human Feedback) requires training a separate reward model, then optimizing the language model with the PPO algorithm. The pipeline is complex, hyperparameter-sensitive, and demands enormous GPU memory.

DPO (Direct Preference Optimization)

Proposed by Rafailov et al. (2023), DPO aligns the language model directly from preference data without an explicit reward model:

\[ L_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right] \]
  • \(\pi_\theta\): the policy model being trained;
  • \(\pi_{ref}\): the reference model (typically the SFT model);
  • \(y_w\) (Winner): the human-preferred response; \(y_l\) (Loser): the human-rejected response;
  • \(\beta\): KL divergence penalty coefficient, controlling how far the model may deviate from the reference;
  • Core idea: increase the log-probability ratio of "good responses" relative to the reference model while decreasing that of "bad responses."
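A compact sketch of this loss, operating on precomputed sequence log-probabilities (an assumption made here for brevity; in practice each value is the sum of token log-probs of a response under the corresponding model):

```python
import torch
import torch.nn.functional as F

# DPO loss over sequence log-probs; beta scales the implicit KL penalty.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    ratio_w = logp_w - ref_logp_w            # log [pi_theta / pi_ref] for y_w
    ratio_l = logp_l - ref_logp_l            # log [pi_theta / pi_ref] for y_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# The policy already favors y_w more than the reference does,
# so the loss falls below the indifference value log 2.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-7.0]))
```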

RLHF vs. DPO Training Pipeline Comparison

| Dimension | RLHF | DPO |
| --- | --- | --- |
| Training stages | SFT → Reward model training → PPO | SFT → Direct preference optimization |
| Reward model | Requires separate training | Implicitly contained in the loss |
| Training stability | Lower (PPO is hyperparameter-sensitive) | Higher (supervised learning form) |
| Memory footprint | High (multiple models loaded simultaneously) | Lower |

Practical Example: Training a Dialogue Assistant for Safety with DPO

Example: Teaching a Model to Refuse Harmful Requests

Suppose you are performing safety alignment on a base LLM:

For request \(x\): "Tell me how to make a bomb", prepare a preference data pair:

  • \(y_w\) (chosen good response): "This is a dangerous request. I cannot provide such information."
  • \(y_l\) (rejected bad response): "The steps to make a bomb are..."

The DPO loss simultaneously does two things:

  1. Boosts the generation probability of \(y_w\) relative to the reference model (encouraging the model to refuse);
  2. Suppresses the generation probability of \(y_l\) relative to the reference model (penalizing harmful outputs).

The \(\beta\) parameter controls "how far the model deviates from the reference": smaller \(\beta\) yields more aggressive alignment but may harm general capabilities; larger \(\beta\) yields more conservative alignment with weaker safety boundaries. Open-source models such as Meta's Llama-2 and Mistral have adopted DPO or DPO-like approaches for alignment.


4.2 Self-Supervised Learning (SSL)

InfoNCE Loss (NT-Xent Loss)

The core loss of modern self-supervised learning (SimCLR, MoCo, etc.), which casts representation learning as a contrastive classification problem:

\[ L_{InfoNCE} = -\mathbb{E}\left[\log \frac{e^{f(x)^\top f(x^+)/\tau}}{\sum_{x^- \in \mathcal{N}} e^{f(x)^\top f(x^-)/\tau}}\right] \]
  • \(x^+\): the positive sample for \(x\) (a different augmented view of the same image);
  • \(\mathcal{N}\): the set of negative samples (all other samples in the batch);
  • \(\tau\): the temperature parameter, controlling the "sharpness" of the distribution;
  • Essence: perform \((2N-1)\)-way classification within a large batch to identify the unique positive pair.

Information-theoretic interpretation: Minimizing InfoNCE is equivalent to maximizing a lower bound on mutual information:

\[ I(f(x); f(x^+)) \geq \log(N) - L_{InfoNCE} \]
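Because the positives sit on the diagonal of the similarity matrix, InfoNCE reduces to cross-entropy with labels \(0, \dots, N-1\). A sketch with deterministic toy embeddings (orthogonal anchors whose "augmented views" are identical, so the loss is near zero):

```python
import torch
import torch.nn.functional as F

# InfoNCE as cross-entropy over an (N, N) similarity matrix.
def info_nce(z1, z2, tau=0.1):
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                 # (N, N) similarity logits
    labels = torch.arange(z1.size(0))       # positive pair = same index
    return F.cross_entropy(sim, labels)

z1 = torch.eye(4, 8)          # four orthogonal "anchor" embeddings
z2 = z1.clone()               # identical views -> positives dominate
loss = info_nce(z1, z2)       # near zero for this toy input
```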

Practical Example: CLIP's Image-Text Alignment

Example: OpenAI CLIP Uses InfoNCE to Align Images and Text

CLIP (Contrastive Language-Image Pre-Training) was trained on 400M image-text pairs using a symmetric variant of InfoNCE:

Given a batch (e.g., 512 images + their corresponding 512 text descriptions):

  • Positive pairs: Image \(i\) with its corresponding text description \(i\) (512 pairs total);
  • Negative pairs: Image \(i\) with all other 511 non-matching texts (\(512 \times 511\) pairs total).

Training objective: maximize cosine similarity for matching image-text pairs and minimize it for non-matching pairs in the embedding space.

Capabilities after training:

  • Zero-shot image classification: classify images directly using text descriptions without any fine-tuning;
  • Image-to-text and text-to-image retrieval;
  • Serving as the visual encoder for multimodal large models (e.g., GPT-4V, LLaVA).

4.3 Loss Functions for Generative Models

VAE (Variational Autoencoder): ELBO

\[ \mathcal{L}_{ELBO} = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction loss}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{Regularization term}} \]
  • Reconstruction loss: ensures the decoder can reconstruct the original input from the latent variable;
  • KL term: pulls the approximate posterior \(q_\phi\) toward the prior \(p(z)\) (typically standard normal), making the latent space structured and smooth to support interpolation and generation.
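A negative-ELBO sketch with a Gaussian encoder and unit-normal prior, using MSE as the reconstruction term (a Gaussian-decoder assumption) and the standard closed form of the KL term:

```python
import torch
import torch.nn.functional as F

# Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I)).
def negative_elbo(x, x_recon, mu, logvar):
    recon = F.mse_loss(x_recon, x, reduction='sum')               # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed-form KL
    return recon + kl

x = torch.randn(2, 4)
mu = torch.zeros(2, 3)
logvar = torch.zeros(2, 3)              # q(z|x) = N(0, I)  ->  KL = 0
loss = negative_elbo(x, x, mu, logvar)  # perfect reconstruction -> loss 0
```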

GAN (Generative Adversarial Network): Minimax Objective

\[ \min_G \max_D\ \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]
  • The discriminator \(D\) maximizes its ability to distinguish real from fake samples;
  • The generator \(G\) minimizes its probability of being detected by the discriminator;
  • At Nash equilibrium: the generated distribution \(\equiv\) the real distribution, and \(D(x) \equiv 0.5\).

The Fundamental Difference Between VAE and GAN

VAE pursues likelihood maximization (the generated distribution should cover the real data as much as possible), resulting in samples that tend to be "blurry but diverse." GAN pursues discriminator deception (generated samples should be indistinguishable to the discriminator), resulting in samples that tend to be "sharp but with limited diversity" (the mode collapse problem). Diffusion models, through their step-by-step denoising approach, have found a better balance between the two.


5. Engineering Implementation: Framework Differences and Common Pitfalls

5.1 Logits vs. Probabilities: The Most Common Fatal Error

This is the most common silent error when translating formulas into code — the model runs, but training is effectively broken.

| Framework | Expected Input | Internal Processing | Consequence of Incorrect Input |
| --- | --- | --- | --- |
| PyTorch `nn.CrossEntropyLoss` | Logits (raw scores) | Internally combines LogSoftmax + NLLLoss | Passing probabilities applies softmax twice: training still runs, but gradients are silently weakened |
| TF/Keras `CategoricalCrossentropy` | Probabilities (after softmax) | Directly computes \(-\sum y \log p\) | If passing logits, must explicitly set `from_logits=True` |

Why Does PyTorch Accept Logits by Default?

It uses the LogSumExp trick to avoid numerical overflow:

\[ \log\sum_j e^{z_j} = z_{max} + \log\sum_j e^{z_j - z_{max}} \]

If you first compute softmax (where large exponents may overflow to inf) and then take the log, numerical stability is far worse than operating directly on logits.

Correct vs. Incorrect Usage

import torch.nn as nn
import tensorflow as tf

# ✅ PyTorch correct usage: pass logits, do NOT manually apply softmax
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)   # logits: (N, C), labels: (N,)

# ✅ TensorFlow correct usage (when passing logits)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss = loss_fn(y_true, logits)

# ❌ Dangerous usage: passing logits to a loss that expects probabilities
loss_fn = tf.keras.losses.CategoricalCrossentropy()  # from_logits=False (default)
loss = loss_fn(y_true, logits)  # Silently computes incorrect results; extremely weak gradients; model fails to converge

How to detect this bug?

The most telling symptom is: training loss decreases slowly but validation loss barely moves, and the model's predicted probability distribution remains close to uniform (all class probabilities approximately \(1/K\)). When you encounter this, check the loss function's input type first.


5.2 Reduction Strategies

The \(\sum\) or \(\frac{1}{n}\) in formulas is controlled in frameworks via the reduction parameter:

| `reduction` | Meaning | Return Shape | Typical Use Case |
| --- | --- | --- | --- |
| `'mean'` (default) | Mean of losses across all samples | Scalar | Standard training |
| `'sum'` | Sum of losses across all samples | Scalar | Certain theoretical derivations |
| `'none'` | Independent loss for each sample | `(N,)` vector | Sample weighting, contrastive learning, debugging |

Important Application of reduction='none'

import torch
import torch.nn as nn

# Adaptive sample-difficulty-based weighted training (manual Focal Loss implementation)
gamma = 2.0  # focusing parameter
criterion = nn.CrossEntropyLoss(reduction='none')
per_sample_loss = criterion(logits, labels)   # shape: (N,)

# Compute the Focal Loss modulating factor
probs = torch.softmax(logits, dim=1)
pt = probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # probability of the true class
focal_weight = (1 - pt) ** gamma
weighted_loss = (focal_weight * per_sample_loss).mean()
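To see what the modulating factor actually does, evaluate \((1 - p_t)^\gamma\) at a few confidence levels (plain arithmetic, \(\gamma = 2\)):

```python
def focal_weight(pt, gamma=2.0):
    """Down-weight easy examples: weight -> 0 as the model grows confident."""
    return (1.0 - pt) ** gamma

easy = focal_weight(0.9)   # ≈ 0.01 — well-classified, nearly ignored
mid  = focal_weight(0.5)   # 0.25
hard = focal_weight(0.1)   # ≈ 0.81 — misclassified, dominates the batch loss
```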

5.3 Numerical Stability Best Practices

| Scenario | Dangerous Implementation (avoid) | Safe Implementation (recommended) |
| --- | --- | --- |
| Cross-Entropy | `-(y * torch.log(p)).sum()` | `nn.CrossEntropyLoss()(logits, y)` |
| KL Divergence | `(p * (p/q).log()).sum()` | `nn.KLDivLoss(reduction='batchmean')(q.log(), p)` |
| BCE | `-(y*p.log() + (1-y)*(1-p).log()).mean()` | `nn.BCEWithLogitsLoss()(logits, y)` |
| Log-Sum-Exp | `(x.exp()).sum().log()` | `torch.logsumexp(x, dim=-1)` |

Why is torch.log(p) dangerous when \(p \approx 0\)?

When the model's predicted probability for a class is very low (e.g., \(p = 10^{-8}\)), the loss value \(-\log p \approx 18.4\) is still finite, but its gradient \(-1/p = -10^{8}\) is enormous, causing gradient explosion; and if \(p\) underflows to exactly 0, log(p) returns -inf and the backward pass produces NaN. The framework's built-in CrossEntropyLoss circumvents both problems via the log-sum-exp trick, so never implement cross-entropy manually.
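The blow-up is in the gradient, not the loss value: for \(\ell(p) = -\log p\), \(\partial\ell/\partial p = -1/p\), so a confidently wrong prediction produces an enormous update (a quick check in plain Python):

```python
import math

def nll(p):
    """Negative log-likelihood of the true class."""
    return -math.log(p)

def nll_grad(p):
    """d/dp of -log(p): explodes as p -> 0, and p == 0.0 raises outright."""
    return -1.0 / p

loss_val = nll(1e-8)       # ≈ 18.42 — large but finite
grad_val = nll_grad(1e-8)  # ≈ -1e8  — a hundred-million-fold gradient
```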


6. Selection Guide and Summary

6.1 Decision Tree: How to Choose a Loss Function

What is your task type?
│
├── Regression (continuous value prediction)
│   ├── Clean data, Gaussian noise              →  MSE (L2 Loss)
│   ├── Data contains many outliers              →  MAE (L1) or Huber Loss
│   ├── Need to predict a specific percentile (e.g., P90)  →  Pinball Loss (Quantile)
│   └── Need second-order continuous differentiability      →  Log-Cosh Loss
│
├── Classification (discrete label prediction)
│   ├── Binary / multi-label                     →  BCE (Binary Cross-Entropy)
│   ├── Multi-class (balanced dataset)           →  Categorical Cross-Entropy
│   ├── Multi-class (severe class imbalance)     →  Focal Loss
│   └── Need maximum-margin decision boundary (SVM) →  Hinge Loss
│
├── Metric Learning (embedding / retrieval)
│   ├── Pair-based                               →  Contrastive Loss
│   ├── Triplet-based                            →  Triplet Loss
│   └── Large-batch contrastive learning         →  InfoNCE (NT-Xent)
│
├── Generative Models
│   ├── Variational Autoencoder (VAE)            →  ELBO (Reconstruction loss + KL divergence)
│   └── Generative Adversarial Network (GAN)     →  Adversarial Loss (Minimax)
│
└── LLM Alignment
    ├── Sufficient compute, strong alignment needed  →  PPO + RLHF
    └── Limited resources, learn directly from preferences  →  DPO Loss
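The tree above can be collapsed into a small lookup helper — a hypothetical sketch (the keys and names are ours), useful as a starting template rather than a definitive mapping:

```python
# (task, situation) -> recommended loss, mirroring the decision tree above
LOSS_GUIDE = {
    ("regression", "gaussian_noise"): "MSE",
    ("regression", "outliers"): "Huber",
    ("regression", "quantile"): "Pinball",
    ("classification", "balanced"): "CategoricalCrossEntropy",
    ("classification", "imbalanced"): "FocalLoss",
    ("metric_learning", "large_batch"): "InfoNCE",
    ("generative", "vae"): "ELBO",
    ("llm_alignment", "limited_resources"): "DPO",
}

def recommend_loss(task, situation):
    """Return the tree's recommendation, or a fallback when no branch matches."""
    return LOSS_GUIDE.get((task, situation), "no single default — walk the tree above")
```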

6.2 Core Philosophy Summary

| Goal | Recommended Loss | Disciplinary Basis | Typical Applications |
| --- | --- | --- | --- |
| Accuracy (statistical optimality) | MSE or Cross-Entropy | MLE + Gaussian/multinomial noise assumption | House price prediction, image classification |
| Robustness (outlier resistance) | MAE or Huber | Laplace distribution prior | Industrial sensors, financial return prediction |
| Distributional properties (generative) | KL divergence, ELBO | Information theory, variational inference | VAE, diffusion models |
| Business decisions (asymmetric costs) | Pinball, Focal, weighted CE | Decision theory, cost-sensitive learning | Fraud detection, power dispatch, medical diagnosis |
| Spatial geometry (representation learning) | Triplet, InfoNCE | Metric learning, mutual information maximization | Face recognition, image-text retrieval, CLIP |
| Preference alignment (LLMs) | DPO, PPO | Reinforcement learning, human preference modeling | ChatGPT, Llama-2 alignment |

6.3 The Essence in One Sentence

A loss function transforms an otherwise imperceptible "learning process" into a clear, differentiable "valley."

It is not merely a directional guide for optimization — it is the primary interface through which researchers communicate their worldview (noise assumptions), value system (cost weights), and objectives (optimization direction) to the model.

Choosing a loss function is choosing your complete philosophy of what "error" means.


7. References and Further Reading

Foundational Theory

  • Statistical Learning Theory: Vapnik, V. (1998). Statistical Learning Theory. Wiley.
  • Information Theory: Cover, T. & Thomas, J. (2006). Elements of Information Theory. Wiley.
  • Prospect Theory: Kahneman, D. & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica.

Classic Papers

  • Focal Loss: Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV.
  • FaceNet: Schroff, F. et al. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR.
  • SimCLR / InfoNCE: Chen, T. et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML.
  • CLIP: Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
  • Energy-Based Models: LeCun, Y. et al. (2006). A Tutorial on Energy-Based Learning. MIT Press.

Frontier Alignment Research

  • DPO: Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
  • RLHF: Christiano, P. et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS.
  • InstructGPT: Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
