
Loss Functions

This report is suitable both for those just beginning to study deep learning and for those revisiting the concept of loss functions after having covered most of the fundamentals. The concept of loss functions pervades the entire deep learning process, and examining it from different angles yields different insights.

This report systematically organizes the essential understanding of loss functions from the perspectives of statistics, information theory, physics, economics, and other disciplines, while also covering key details of engineering implementation. It is intended for researchers and engineers who wish to build a deep conceptual framework.


How to Read This Report

This report is divided into six progressively layered modules:

| Module | Question Addressed |
| --- | --- |
| Chapter 1: Concept Clarification | What exactly is the difference between "loss" and "cost"? |
| Chapter 2: Disciplinary Perspectives | How do scholars from different fields understand the nature of loss functions? |
| Chapter 3: Task Summary | What loss functions are available for specific tasks? |
| Chapter 4: Frontier Developments | What loss functions are used in LLMs, self-supervised learning, etc.? |
| Chapter 5: Engineering Practice | What are the common pitfalls in code implementation? |
| Chapter 6: Selection Guide | Given a task, which loss function should I use? |

Suggested reading path: Beginners should read sequentially; experienced readers may jump directly to Chapter 3 or Chapter 6.


Table of Contents

  1. Cross-Dimensional Alignment of Five Core Concepts
  2. Loss Functions from Interdisciplinary Perspectives
  3. Comprehensive Summary of Loss Functions by Task
  4. Modern Frontier Loss Functions
  5. Engineering Implementation: Framework Differences and Common Pitfalls
  6. Selection Guide and Summary
  7. References and Further Reading

1. Cross-Dimensional Alignment of Five Core Concepts

In various literature, the five terms loss, cost, objective, error, and risk are frequently used interchangeably, causing tremendous confusion for beginners. This chapter thoroughly aligns these concepts along three dimensions: scope of application, constituent components, and disciplinary origin.

1.1 Full Concept Comparison Matrix

| Concept | Common Notation | Scope | Composition | Disciplinary Origin |
| --- | --- | --- | --- | --- |
| Loss Function | \(L,\ \ell,\ \mathcal{L}\) | Single sample | Prediction error only | Statistical decision theory |
| Cost Function | \(J,\ C\) | Entire dataset | Mean/sum of losses + regularization terms | Classical optimization theory |
| Objective Function | \(f,\ \Phi,\ \mathcal{O}\) | Optimization process | Anything to be optimized (incl. reward) | Mathematical programming / OR |
| Error Function | \(E,\ \epsilon\) | Performance metric | Pure numerical deviation (e.g., \(y - \hat{y}\)) | Classical engineering / control |
| Risk Function | \(R,\ \mathcal{R}\) | Probabilistic expectation | Expected value of the loss over a probability distribution | Statistical learning theory |

1.2 Hierarchical Relationships: The Containment Structure of the Five Concepts

The five concepts form a clear hierarchy rather than a flat list:

Objective Function                 ← The most general concept; encompasses everything to be optimized
    └── Cost Function              ← The concrete form of the objective in supervised learning
            └── Loss Function      ← The per-sample decomposition of the cost function
                    └── Error Function ← The simplest form of loss (pure deviation)

Risk Function                      ← An independent dimension: the probabilistic expectation of the loss over an unknown distribution

From Loss to Cost

The cost function \(J(\theta)\) aggregates the loss function \(\ell\) over the training set, typically with an additional regularization term:

\[ J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \lambda \cdot \Omega(\theta) \]

Mnemonic: Loss starts with "Lonely" — the loss function operates on a single, isolated sample.
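As a minimal sketch of this aggregation (function and variable names are illustrative), the formula maps directly onto PyTorch:

```python
import torch
import torch.nn as nn

# Sketch of J(theta): per-sample losses averaged into a data term, plus an
# L2 penalty Omega(theta) = ||theta||^2. Names here are illustrative.
def cost(model, x, y, lam=1e-3):
    y_hat = model(x)
    per_sample = (y - y_hat) ** 2                          # loss l(y_i, y_hat_i), one per sample
    data_term = per_sample.mean()                          # (1/n) * sum of losses
    reg = sum((p ** 2).sum() for p in model.parameters())  # Omega(theta)
    return data_term + lam * reg                           # J(theta)

model = nn.Linear(1, 1)
x = torch.randn(8, 1)
y = 3 * x
J = cost(model, x, y)   # a scalar tensor, ready for J.backward()
```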

From Cost to Objective

The objective function is the higher-level concept:

  • In supervised learning, the objective is typically to minimize the cost function;
  • In reinforcement learning, the objective may be to maximize cumulative reward;
  • In GANs, the objective is a minimax game.

From Loss to Risk

| Dimension | Empirical Risk | Expected Risk (True Risk) |
| --- | --- | --- |
| Definition | Mean of the loss over the training set | Expected value of the loss over the true data distribution |
| Formula | \(R_{emp} = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))\) | \(R = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))]\) |
| Computability | ✅ Directly computable | ❌ True distribution \(P\) is unknown |

The essence of model training is: using Empirical Risk Minimization (ERM) to approximate true risk minimization. Generalization capability is determined by the gap between the two (the generalization error).

1.3 Practical Example: All Five Concepts in a Single Project

Example: Training a House Price Prediction Model

Suppose you are using a neural network to predict house prices in a city:

  • Error: For house #3, the model predicts \(\hat{y}_3 = 150\) (in ten-thousands), while the true price is \(y_3 = 180\). The error is \(e_3 = y_3 - \hat{y}_3 = 30\).
  • Loss: Quantifying the error as a "penalty score" for the model. Using MSE: \(\ell_3 = (30)^2 = 900\).
  • Cost: Averaging the loss over all 1,000 training houses, plus L2 regularization to prevent overfitting: \(J(\theta) = \frac{1}{1000}\sum_{i=1}^{1000} \ell_i + 0.001 \|\theta\|^2\).
  • Objective: The overall direction of optimization — "minimize \(J(\theta)\)." In different frameworks, the objective may also include constraints (e.g., parameter range restrictions).
  • Risk: What you truly care about is the expected error on all future houses (including those not yet listed). Training can only minimize empirical risk; generalization is the ultimate goal.

2. Loss Functions from Interdisciplinary Perspectives

Loss functions are a product of converging insights from multiple disciplines. For the same MSE formula, a statistician, a physicist, and an economist will offer entirely different yet equally profound interpretations.

2.1 The Statistical Perspective: Prior Assumptions About Noise Distributions

Core Insight

Choosing a loss function is fundamentally an act of probabilistically modeling the noise in the data. Different loss functions correspond to different prior noise distributions, derived through Maximum Likelihood Estimation (MLE).

\[ \hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n p(y_i | x_i; \theta) \equiv \arg\min_\theta \sum_{i=1}^n -\log p(y_i | x_i; \theta) \]

Correspondence Between Loss Functions and Noise Distributions

| Loss Function | Corresponding Noise Distribution | Probability Density Form | Key Property |
| --- | --- | --- | --- |
| MSE (L2 Loss) | Gaussian \(\mathcal{N}(\mu, \sigma^2)\) | \(p \propto e^{-\frac{(y-\hat{y})^2}{2\sigma^2}}\) | Sensitive to outliers (squared penalty) |
| MAE (L1 Loss) | Laplace \(\text{Laplace}(\mu, b)\) | \(p \propto e^{-\frac{\lvert y-\hat{y} \rvert}{b}}\) | Robust to outliers (linear penalty) |
| Pinball Loss | Asymmetric Laplace | Asymmetric exponential decay | Used for quantile regression |
| Cross-Entropy | Bernoulli / Multinomial | \(p \propto \hat{p}^y (1-\hat{p})^{1-y}\) | Standard loss for classification |

Deeper implication: When you switch your loss function, you are effectively declaring to the model: "I believe the noise in the data follows a specific statistical pattern." This elevates loss function selection from empirical trial-and-error to a theoretically grounded modeling decision.
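The MSE row of this correspondence can be checked numerically: with a fixed \(\sigma\), the Gaussian negative log-likelihood equals \(\text{MSE}/(2\sigma^2)\) plus a constant, so both have the same minimizer. A small sketch with illustrative values:

```python
import math
import torch

# With sigma fixed, Gaussian NLL = MSE / (2 sigma^2) + const, so minimizing
# either objective yields the same theta. Values below are illustrative.
y = torch.tensor([1.0, 2.0, 3.0])
y_hat = torch.tensor([1.1, 1.9, 3.3])
sigma = 1.0

mse = ((y - y_hat) ** 2).mean()
nll = (0.5 * math.log(2 * math.pi * sigma ** 2)
       + (y - y_hat) ** 2 / (2 * sigma ** 2)).mean()
const = 0.5 * math.log(2 * math.pi)   # the part independent of theta
# nll - const == mse / 2 (for sigma = 1)
```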

Practical Example: Anomalous Readings in Sensor Data

Example: Predicting Equipment Failures from Factory Temperature Sensors

A factory has deployed 500 temperature sensors that collect data hourly to predict equipment overheating. Approximately 5% of readings are anomalous (sudden spikes caused by sensor malfunctions, e.g., a reading of 9999°C).

  • With MSE: The squared error from anomalous readings is massively amplified, pulling model parameters toward "accommodating the outliers" and severely degrading prediction accuracy for normal data.
  • With MAE: Anomalous readings incur only a linear penalty; the model learns to "ignore" them and focuses on fitting the normal sensor reading distribution.
  • With Huber Loss (\(\delta = 3\)): Small errors (\(|r| \leq 3\)) are handled with MSE-like precision, while large errors (anomalous readings) are downgraded to linear penalties — achieving the best of both worlds.

Empirical finding: On industrial datasets containing 5% anomalous readings, Huber Loss typically reduces validation RMSE by approximately 15–30% compared to MSE.
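The contrast between the three losses can be seen directly on a toy batch of residuals that includes one sensor glitch (numbers are illustrative; `delta=3` matches the example above):

```python
import torch
import torch.nn.functional as F

# Per-residual penalties: the glitch (9999C reading vs ~80C truth, residual
# ~9919) dominates MSE but is only linearly penalized by MAE and Huber.
residuals = torch.tensor([0.5, -1.2, 2.0, 9919.0])
zeros = torch.zeros_like(residuals)

mse = F.mse_loss(residuals, zeros, reduction='none')
mae = F.l1_loss(residuals, zeros, reduction='none')
huber = F.huber_loss(residuals, zeros, delta=3.0, reduction='none')
# huber[0] = 0.5 * 0.5^2 = 0.125 (MSE regime); huber[-1] grows only linearly
```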


2.2 The Information-Theoretic Perspective: "Communication Cost" Between Distributions

Core Insight

Information theory views a model as an encoder. The loss function measures the extra information cost (in bits) incurred when using the predicted distribution \(Q\) to describe the true distribution \(P\).

KL Divergence (Relative Entropy)

\[ D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \]
  • Measures the information loss when approximating \(P\) with \(Q\);
  • Asymmetric: \(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\);
  • Non-negative: \(D_{KL}(P\|Q) \geq 0\), with equality if and only if \(P = Q\).

The Equivalence Between Cross-Entropy and KL Divergence

\[ H(P, Q) = H(P) + D_{KL}(P \| Q) \]

In supervised learning, the true distribution \(P\) is fixed (its entropy \(H(P)\) is a constant), so:

\[ \arg\min_Q H(P, Q) \equiv \arg\min_Q D_{KL}(P \| Q) \]

Trinity conclusion: Minimizing cross-entropy \(\equiv\) minimizing KL divergence \(\equiv\) Maximum Likelihood Estimation (MLE). These three formulations describe the same thing in different languages, originating from information theory, statistical distance theory, and frequentist statistics, respectively.
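The decomposition \(H(P, Q) = H(P) + D_{KL}(P \| Q)\) can be verified numerically on a toy distribution:

```python
import torch

# Numerical check of H(P, Q) = H(P) + D_KL(P || Q) for a 3-class example.
P = torch.tensor([0.7, 0.2, 0.1])   # true distribution
Q = torch.tensor([0.5, 0.3, 0.2])   # predicted distribution

H_P = -(P * P.log()).sum()          # entropy of P (a constant w.r.t. the model)
H_PQ = -(P * Q.log()).sum()         # cross-entropy
KL = (P * (P / Q).log()).sum()      # KL divergence
# H_PQ == H_P + KL, so minimizing one minimizes the other
```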

Various Losses from an Information-Theoretic Perspective

| Loss Function | Information-Theoretic Interpretation |
| --- | --- |
| Cross-Entropy | Average number of bits needed to encode true labels using the predicted distribution |
| KL Divergence | Number of "wasted" bits between predicted and true distributions |
| InfoNCE | Maximizes a lower bound on mutual information between positive pairs |
| ELBO (Variational Lower Bound) | Maximizes a lower bound on the log-likelihood of observed data; used in VAEs |

Practical Example: Spam Classifier

Example: Understanding Email Classifier Training Through Information Theory

Training a binary classifier to determine whether an email is spam (1 = spam, 0 = legitimate).

Suppose an email has true label \(y=1\) (spam) and the model outputs two different probabilities:

| Model State | Predicted spam probability \(p\) | Cross-entropy loss \(-\log p\) | Information-theoretic reading |
| --- | --- | --- | --- |
| Early training | 0.10 | 2.30 nats | 2.30 nats needed to encode "this is spam" |
| Mid training | 0.70 | 0.36 nats | Model has "guessed most of it"; only 0.36 nats of extra info needed |
| Well trained | 0.98 | 0.02 nats | Model is nearly certain; almost no extra information needed |

(The values use the natural logarithm, as deep learning frameworks do, so the unit is the nat rather than the bit; in bits, \(-\log_2 0.1 \approx 3.32\).)

The training process is essentially compressing the information gap between "the true answer" and "the model's guess", driving the predicted distribution toward the true distribution.


2.3 The Physics Perspective: Energy Equilibria and Potential Energy Surfaces

Core Insight

Physicists view the loss function as the system's energy function. The training process corresponds to searching for stable low-energy states in a high-dimensional "energy landscape."

Energy-Based Models (EBMs)

  • Low energy \(E(x)\) corresponds to high probability \(p(x)\);
  • Training objective: "push down" the energy of real data and "push up" the energy of fake data;
  • The Boltzmann distribution bridges energy and probability:
\[ p(x) = \frac{e^{-E(x)/T}}{Z} \]

where \(T\) is the temperature parameter and \(Z = \int e^{-E(x)/T} dx\) is the partition function (normalization constant).

The Physical Origin of Softmax

The softmax function is a direct application of the Boltzmann distribution in the discrete case:

\[ \text{softmax}(z_k) = \frac{e^{z_k/T}}{\sum_j e^{z_j/T}} \]
| Temperature \(T\) | Distribution Shape | Corresponding Behavior |
| --- | --- | --- |
| \(T \to 0\) | Tends toward argmax | Deterministic decision (pick highest class) |
| \(T = 1\) | Standard softmax | Normal training |
| \(T \to \infty\) | Tends toward uniform | Complete uncertainty (random guessing) |
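The three temperature regimes can be reproduced in a few lines (logit values are illustrative):

```python
import torch

# The same logits sharpened or flattened by temperature T.
logits = torch.tensor([2.0, 1.0, 0.1])

def tempered_softmax(z, T):
    return torch.softmax(z / T, dim=0)

cold = tempered_softmax(logits, 0.2)    # near one-hot: argmax-like
warm = tempered_softmax(logits, 1.0)    # standard softmax
hot = tempered_softmax(logits, 10.0)    # near uniform
```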

Geometric Properties of the Loss Landscape

| Property | Description | Impact on Optimization |
| --- | --- | --- |
| Non-convexity | Numerous local minima and saddle points | SGD noise helps escape saddle points |
| Flat, wide valleys | Solutions with good generalization tend to lie in wide, flat valleys | Motivates optimizers like SAM |
| Vanishing-gradient regions | Gradient approaches zero in saturation regions of activations | Inspired ReLU and residual connections |

Practical Example: The Role of Temperature in LLM Inference

Example: ChatGPT's "Creative Mode" vs. "Precise Mode"

When generating text, large language models sample from the vocabulary at each step via temperature-scaled softmax:

  • Low temperature (\(T = 0.2\)): Probability concentrates on high-scoring tokens; output is conservative and repetitive. Suitable for code generation, factual Q&A, and other scenarios requiring deterministic answers.
  • High temperature (\(T = 1.5\)): Probability distribution becomes more uniform; output is diverse and creative, but also more prone to "hallucination." Suitable for brainstorming, story writing, and similar scenarios.
  • Top-p / Top-k sampling: An engineering refinement of the temperature mechanism — first truncating low-probability tokens (to avoid sampling completely irrelevant words), then applying temperature scaling. This is the mainstream approach for LLM inference today.

This is a direct engineering embodiment of "temperature controlling the system's entropy" in the Boltzmann distribution.


2.4 The Economics and Decision Theory Perspective: Regret and Utility

Core Insight

Economics interprets the loss function as a tool for quantifying the cost of decisions, focusing on how to make optimal choices under uncertainty and emphasizing that the costs of different error types are highly asymmetric.

Regret Minimization

Regret quantifies the gap between "how well I could have done if I had known the truth" and "how well I actually did":

\[ \text{Regret}_T = \sum_{t=1}^T \ell(y_t, \hat{y}_t) - \min_{f^*} \sum_{t=1}^T \ell(y_t, f^*(x_t)) \]

In the online learning framework, the goal is to achieve sublinear cumulative regret (\(o(T)\)) — meaning the algorithm's average regret approaches zero over time.

Loss Aversion and Asymmetric Costs

Based on Prospect Theory (Kahneman & Tversky, 1979):

  • Humans perceive losses approximately 2.25 times more intensely than equivalent gains;
  • The costs of prediction errors are highly asymmetric across different scenarios.

This has inspired a family of cost-sensitive learning techniques:

| Technique | Core Idea | Application Scenario |
| --- | --- | --- |
| Cost Matrix | Explicitly define costs for different error types | Medical diagnosis, fraud detection |
| Focal Loss | Dynamically down-weight easy samples; focus on hard ones | Object detection (extreme class imbalance) |
| Pinball Loss (\(\tau \neq 0.5\)) | Asymmetrically penalize overestimation vs. underestimation | Supply chain risk management, financial forecasting |
| Weighted Cross-Entropy | Weight inversely proportional to class frequency | Medical image segmentation |

Utility Theory's Influence

In reinforcement learning, the reward function is essentially the negative form of a utility function. Reward shaping — designing carefully crafted intermediate reward structures to guide agent learning — is highly isomorphic to incentive design in economics.

Practical Example: Credit Card Fraud Detection at Banks

Example: The Vastly Different Costs of Two Error Types

A bank uses a machine learning model to determine whether credit card transactions are fraudulent. The data is extremely imbalanced: legitimate transactions account for 99.8%, while fraud accounts for only 0.2%. The costs of the two error types are drastically different:

| Error Type | Meaning | Actual Cost |
| --- | --- | --- |
| False Negative (missed fraud) | Fraud classified as legitimate; transaction amount lost | Average loss of $5,000 per incident |
| False Positive (false alarm) | Legitimate transaction blocked; degraded user experience | ~$2 customer service cost + churn risk per incident |

With standard cross-entropy: The model learns to "predict everything as legitimate," achieving 99.8% accuracy but providing zero practical value.

Solution (Cost matrix + weighted loss):

import torch
import torch.nn as nn

# Cost-matrix thinking: weight a missed fraud 2,500x a false alarm (5000 / 2)
pos_weight = torch.tensor([2500.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

This makes the model treat "missing one fraud case" as equivalent to "falsely flagging 2,500 legitimate transactions" during training, thereby correcting the bias introduced by data imbalance at the loss function level.


3. Comprehensive Summary of Loss Functions by Task

3.1 Regression Tasks: Pursuing Accuracy in Continuous Values

Formula and Property Quick Reference

| Function Name | Symbol | Mathematical Expression | Gradient Properties | Applicable Scenario |
| --- | --- | --- | --- | --- |
| Mean Squared Error (MSE) | \(L_2\) | \(\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\) | Smooth; grows linearly with the error | Optimal when noise is Gaussian |
| Mean Absolute Error (MAE) | \(L_1\) | \(\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert\) | Constant (\(\pm 1\)), non-differentiable at zero | More robust with outliers |
| Huber Loss | \(L_\delta\) | \(\begin{cases} \frac{1}{2}r^2, & \lvert r \rvert \leq \delta \\ \delta(\lvert r \rvert - \frac{\delta}{2}), & \text{otherwise} \end{cases}\) | Smooth transition between regimes | Best compromise between MSE and MAE |
| Log-Cosh Loss | \(\text{log-cosh}\) | \(\sum_i \log\cosh(\hat{y}_i - y_i)\) | Twice continuously differentiable | MAE alternative when second-order optimization is needed |
| Quantile Loss (Pinball) | \(L_\tau\) | \(\max(\tau r,\ (\tau-1)r),\ r=y-\hat{y}\) | Asymmetric | Quantile regression, risk modeling |

Intuitive Understanding of Huber Loss

  • When \(|r| \leq \delta\) (small errors): behaves like MSE with smooth gradients approaching zero, facilitating fine-grained tuning;
  • When \(|r| > \delta\) (large errors / outliers): behaves like MAE with constant gradients, preventing outliers from "dragging" the model;
  • The hyperparameter \(\delta\) defines the boundary between "normal errors" and "outlier errors."

Practical Example: Power Demand Forecasting with Pinball Loss

Example: Asymmetric Risk in Power Grid Dispatch

Grid operators need to predict the next day's electricity demand 24 hours in advance. The costs of the two types of forecast errors are very different:

  • Underestimating demand: Insufficient power supply, potentially causing blackouts — extremely costly;
  • Overestimating demand: Excess reserve power, wasted costs — relatively inexpensive.

Therefore, operators need not a mean forecast but a 90th percentile forecast (i.e., actual demand will not exceed the prediction in 90% of cases).

Using Pinball Loss with \(\tau = 0.9\):

\[ L_{0.9}(r) = \begin{cases} 0.9 \cdot r, & r \geq 0 \text{ (underestimate, heavy penalty)} \\ -0.1 \cdot r, & r < 0 \text{ (overestimate, light penalty)} \end{cases} \]

This causes the model to learn "better to overestimate than to underestimate," naturally producing conservative upper-bound predictions that directly reflect the business's risk preferences.
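A minimal implementation of the piecewise formula above (with illustrative demand values) makes the asymmetry concrete:

```python
import torch

# Pinball (quantile) loss for tau = 0.9: underestimates cost nine times
# more than overestimates of the same magnitude.
def pinball_loss(y, y_hat, tau=0.9):
    r = y - y_hat
    return torch.maximum(tau * r, (tau - 1) * r).mean()

y = torch.tensor([100.0])                         # actual demand
under = pinball_loss(y, torch.tensor([90.0]))     # underestimate by 10 -> 0.9 * 10 = 9.0
over = pinball_loss(y, torch.tensor([110.0]))     # overestimate by 10 -> 0.1 * 10 = 1.0
```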


3.2 Classification Tasks: Pursuing Overlap of Probability Distributions

Binary Cross-Entropy (BCE)

Applicable to binary classification or multi-label classification (each label judged independently):

\[ L_{BCE} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] \]

Categorical Cross-Entropy (CCE)

Paired with softmax, this is the de facto standard for multi-class classification:

\[ L_{CCE} = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^K y_{ic} \log p_{ic} \]

Focal Loss

Proposed by Lin et al. (2017) to address extreme sample imbalance:

\[ L_{FL} = -\alpha_t (1 - p_t)^\gamma \log p_t \]
  • \(p_t\): the model's predicted probability for the correct class;
  • \((1-p_t)^\gamma\): modulating factor — when the model has already "learned" a sample (\(p_t \to 1\)), this factor approaches 0, reducing that sample's contribution to the total loss;
  • \(\gamma\): focusing parameter (typically set to 2), controlling the decay rate;
  • \(\alpha_t\): class balancing factor.

Hinge Loss

The soul of SVMs, designed to find the maximum-margin decision boundary:

\[ L_{Hinge} = \frac{1}{n}\sum_{i=1}^n \max\left(0,\ 1 - y_i \cdot f(x_i)\right) \]
  • When a sample is correctly classified with sufficient confidence (\(y_i \cdot f(x_i) \geq 1\)), the loss is 0;
  • Otherwise, the loss grows linearly with the degree of "insufficient confidence";
  • It does not care "by how much the prediction is correct" — only whether it exceeds the safety margin.
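These three properties can be captured in a few lines (scores are illustrative; labels are in \(\{-1, +1\}\) as the formula requires):

```python
import torch

# Hinge loss: zero once the margin y * f(x) reaches 1, linear below that.
def hinge_loss(scores, labels):
    return torch.clamp(1 - labels * scores, min=0).mean()

labels = torch.tensor([1.0, 1.0, -1.0])
scores = torch.tensor([2.0, 0.3, -0.5])   # confident / inside margin / inside margin
loss = hinge_loss(scores, labels)          # (0 + 0.7 + 0.5) / 3 = 0.4
```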

Practical Example: Focal Loss in Medical Imaging

Example: Lung Nodule Detection in CT Scans (positive-to-negative ratio ~1:500)

In lung CT scans, a single image may contain thousands of region proposals, but fewer than 0.2% actually contain nodules.

Problem with standard cross-entropy: During training, gradients from "no-nodule" background proposals account for 99.8% of the total gradient. The model is entirely dominated by "how to recognize background" and fails to learn "what nodules look like."

Focal Loss solution (with \(\gamma=2\)):

| Sample Type | Typical \(p_t\) | Modulating factor \((1-p_t)^2\) | Relative Weight |
| --- | --- | --- | --- |
| Easy background | 0.98 | 0.0004 | Greatly reduced |
| Hard background | 0.70 | 0.09 | Moderately retained |
| Nodule candidate | 0.20 | 0.64 | Heavily emphasized |

RetinaNet (Lin et al., 2017) was the first to apply Focal Loss to object detection, achieving accuracy on the COCO dataset comparable to two-stage methods (Faster R-CNN) with a single-stage detector.


3.3 Metric Learning and Geometric Spaces: Pursuing Clustering in Embeddings

These losses do not directly predict class labels. Instead, they learn an embedding space such that:

  • Same-class samples are close in the space (cohesion);
  • Different-class samples are far apart in the space (separation).

Contrastive Loss

For sample pairs \((x_1, x_2)\), where \(y=1\) indicates same class and \(y=0\) indicates different classes:

\[ L_{Contrastive} = y \cdot d^2 + (1-y) \cdot \max(0,\ m - d)^2 \]

where \(d = \|f(x_1) - f(x_2)\|_2\) and \(m\) is the margin hyperparameter.

Triplet Loss

For triplets \((a, p, n)\) (anchor, positive, negative):

\[ L_{Triplet} = \max\left(0,\ d(a, p) - d(a, n) + \alpha\right) \]
  • Requires the positive sample to be closer to the anchor than the negative sample by at least \(\alpha\) (the safety margin);
  • Key challenge: Hard negative mining — most randomly sampled triplets are "easy" and contribute nothing to learning.
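The formula translates directly into code; the 2-D embeddings below are illustrative:

```python
import torch

# Triplet loss matching the formula above.
def triplet_loss(a, p, n, alpha=0.2):
    d_ap = (a - p).norm(dim=-1)   # anchor-positive distance
    d_an = (a - n).norm(dim=-1)   # anchor-negative distance
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()

a = torch.tensor([[0.0, 0.0]])
p = torch.tensor([[0.1, 0.0]])    # close positive
n = torch.tensor([[1.0, 0.0]])    # far negative
loss = triplet_loss(a, p, n)      # 0.1 - 1.0 + 0.2 < 0, so the loss is 0
```

PyTorch also ships `nn.TripletMarginLoss`, which implements the same objective with additional options.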

Contrastive Loss vs. Triplet Loss

| Dimension | Contrastive Loss | Triplet Loss |
| --- | --- | --- |
| Input unit | Sample pair (2) | Triplet (3) |
| Training signal density | Lower | Higher |
| Combinatorial complexity | \(O(n^2)\) | \(O(n^3)\) |
| Sensitivity to hard samples | Lower | Higher; requires hard-negative mining |

Practical Example: Face Recognition and Product Retrieval

Example A: FaceNet's Face Recognition (Google, 2015)

Google's FaceNet uses triplet loss to train face embeddings:

  • Anchor: A frontal photo of a person;
  • Positive: Another photo of the same person (different lighting, angle, expression);
  • Negative: A photo of a different person.

Training objective: make the distance between any two photos of the same person in the embedding space less than the maximum intra-class distance, while ensuring inter-person distances exceed \(\alpha\) (the safety margin).

On the LFW dataset, FaceNet achieved 99.63% recognition accuracy, subsequently becoming the industrial standard architecture for face recognition.

Example B: Visual Search on E-commerce Platforms

The "snap and search" feature on platforms like Taobao and Pinterest uses contrastive or triplet loss to train product image embeddings:

  • Different shooting angles of the same product → close in the embedding space;
  • Different product categories (shirt vs. shoes) → far apart in the embedding space.

When a user uploads an image, the system performs approximate nearest neighbor (ANN) search in the embedding space, returning results within milliseconds.

4. Modern Frontier Loss Functions

4.1 Preference Alignment for Large Language Models (LLMs)

Background: The Challenges of RLHF

RLHF (Reinforcement Learning from Human Feedback) requires training a separate reward model, then optimizing the language model with the PPO algorithm. The pipeline is complex, hyperparameter-sensitive, and demands enormous GPU memory.

DPO (Direct Preference Optimization)

Proposed by Rafailov et al. (2023), DPO aligns the language model directly from preference data without an explicit reward model:

\[ L_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right] \]
  • \(\pi_\theta\): the policy model being trained;
  • \(\pi_{ref}\): the reference model (typically the SFT model);
  • \(y_w\) (Winner): the human-preferred response; \(y_l\) (Loser): the human-rejected response;
  • \(\beta\): KL divergence penalty coefficient, controlling how far the model may deviate from the reference;
  • Core idea: increase the log-probability ratio of "good responses" relative to the reference model while decreasing that of "bad responses."
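A compact sketch of this loss, operating on precomputed sequence log-probabilities (an assumption made here for brevity; in practice each value is the sum of token log-probs of a response under the corresponding model):

```python
import torch
import torch.nn.functional as F

# DPO loss over sequence log-probs; beta scales the implicit KL penalty.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    ratio_w = logp_w - ref_logp_w            # log [pi_theta / pi_ref] for y_w
    ratio_l = logp_l - ref_logp_l            # log [pi_theta / pi_ref] for y_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# The policy already favors y_w more than the reference does,
# so the loss falls below the indifference value log 2.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-7.0]))
```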

RLHF vs. DPO Training Pipeline Comparison

| Dimension | RLHF | DPO |
| --- | --- | --- |
| Training stages | SFT → Reward model training → PPO | SFT → Direct preference optimization |
| Reward model | Requires separate training | Implicitly contained in the loss |
| Training stability | Lower (PPO is hyperparameter-sensitive) | Higher (supervised learning form) |
| Memory footprint | High (multiple models loaded simultaneously) | Lower |

Practical Example: Training a Dialogue Assistant for Safety with DPO

Example: Teaching a Model to Refuse Harmful Requests

Suppose you are performing safety alignment on a base LLM:

For request \(x\): "Tell me how to make a bomb", prepare a preference data pair:

  • \(y_w\) (chosen good response): "This is a dangerous request. I cannot provide such information."
  • \(y_l\) (rejected bad response): "The steps to make a bomb are..."

The DPO loss simultaneously does two things:

  1. Boosts the generation probability of \(y_w\) relative to the reference model (encouraging the model to refuse);
  2. Suppresses the generation probability of \(y_l\) relative to the reference model (penalizing harmful outputs).

The \(\beta\) parameter controls "how far the model deviates from the reference": smaller \(\beta\) yields more aggressive alignment but may harm general capabilities; larger \(\beta\) yields more conservative alignment with weaker safety boundaries. Open-source models such as Meta's Llama-2 and Mistral have adopted DPO or DPO-like approaches for alignment.


4.2 Self-Supervised Learning (SSL)

InfoNCE Loss (NT-Xent Loss)

The core loss of modern self-supervised learning (SimCLR, MoCo, etc.), which casts representation learning as a contrastive classification problem:

\[ L_{InfoNCE} = -\mathbb{E}\left[\log \frac{e^{f(x)^\top f(x^+)/\tau}}{\sum_{x^- \in \mathcal{N}} e^{f(x)^\top f(x^-)/\tau}}\right] \]
  • \(x^+\): the positive sample for \(x\) (a different augmented view of the same image);
  • \(\mathcal{N}\): the set of negative samples (all other samples in the batch);
  • \(\tau\): the temperature parameter, controlling the "sharpness" of the distribution;
  • Essence: perform \((2N-1)\)-way classification within a large batch to identify the unique positive pair.

Information-theoretic interpretation: Minimizing InfoNCE is equivalent to maximizing a lower bound on mutual information:

\[ I(f(x); f(x^+)) \geq \log(N) - L_{InfoNCE} \]
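Because the positives sit on the diagonal of the similarity matrix, InfoNCE reduces to cross-entropy with labels \(0, \dots, N-1\). A sketch with deterministic toy embeddings (orthogonal anchors whose "augmented views" are identical, so the loss is near zero):

```python
import torch
import torch.nn.functional as F

# InfoNCE as cross-entropy over an (N, N) similarity matrix.
def info_nce(z1, z2, tau=0.1):
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                 # (N, N) similarity logits
    labels = torch.arange(z1.size(0))       # positive pair = same index
    return F.cross_entropy(sim, labels)

z1 = torch.eye(4, 8)          # four orthogonal "anchor" embeddings
z2 = z1.clone()               # identical views -> positives dominate
loss = info_nce(z1, z2)       # near zero for this toy input
```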

Practical Example: CLIP's Image-Text Alignment

Example: OpenAI CLIP Uses InfoNCE to Align Images and Text

CLIP (Contrastive Language-Image Pre-Training) was trained on 400M image-text pairs using a symmetric variant of InfoNCE:

Given a batch (e.g., 512 images + their corresponding 512 text descriptions):

  • Positive pairs: Image \(i\) with its corresponding text description \(i\) (512 pairs total);
  • Negative pairs: Image \(i\) with all other 511 non-matching texts (\(512 \times 511\) pairs total).

Training objective: maximize cosine similarity for matching image-text pairs and minimize it for non-matching pairs in the embedding space.

Capabilities after training:

  • Zero-shot image classification: classify images directly using text descriptions without any fine-tuning;
  • Image-to-text and text-to-image retrieval;
  • Serving as the visual encoder for multimodal large models (e.g., GPT-4V, LLaVA).

4.3 Loss Functions for Generative Models

VAE (Variational Autoencoder): ELBO

\[ \mathcal{L}_{ELBO} = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction loss}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{Regularization term}} \]
  • Reconstruction loss: ensures the decoder can reconstruct the original input from the latent variable;
  • KL term: pulls the approximate posterior \(q_\phi\) toward the prior \(p(z)\) (typically standard normal), making the latent space structured and smooth to support interpolation and generation.
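A negative-ELBO sketch with a Gaussian encoder and unit-normal prior, using MSE as the reconstruction term (a Gaussian-decoder assumption) and the standard closed form of the KL term:

```python
import torch
import torch.nn.functional as F

# Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I)).
def negative_elbo(x, x_recon, mu, logvar):
    recon = F.mse_loss(x_recon, x, reduction='sum')               # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed-form KL
    return recon + kl

x = torch.randn(2, 4)
mu = torch.zeros(2, 3)
logvar = torch.zeros(2, 3)              # q(z|x) = N(0, I)  ->  KL = 0
loss = negative_elbo(x, x, mu, logvar)  # perfect reconstruction -> loss 0
```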

GAN (Generative Adversarial Network): Minimax Objective

\[ \min_G \max_D\ \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]
  • The discriminator \(D\) maximizes its ability to distinguish real from fake samples;
  • The generator \(G\) minimizes its probability of being detected by the discriminator;
  • At Nash equilibrium: the generated distribution \(\equiv\) the real distribution, and \(D(x) \equiv 0.5\).

The Fundamental Difference Between VAE and GAN

VAE pursues likelihood maximization (the generated distribution should cover the real data as much as possible), resulting in samples that tend to be "blurry but diverse." GAN pursues discriminator deception (generated samples should be indistinguishable to the discriminator), resulting in samples that tend to be "sharp but with limited diversity" (the mode collapse problem). Diffusion models, through their step-by-step denoising approach, have found a better balance between the two.


5. Engineering Implementation: Framework Differences and Common Pitfalls

5.1 Logits vs. Probabilities: The Most Common Fatal Error

This is the most common silent error when translating formulas into code — the model runs, but training is effectively broken.

| Framework | Expected Input | Internal Processing | Consequence of Incorrect Input |
| --- | --- | --- | --- |
| PyTorch `nn.CrossEntropyLoss` | Logits (raw scores) | Internally combines LogSoftmax + NLLLoss | Passing probabilities applies softmax twice: training still runs, but gradients are silently weakened |
| TF/Keras `CategoricalCrossentropy` | Probabilities (after softmax) | Directly computes \(-\sum y \log p\) | If passing logits, must explicitly set `from_logits=True` |

Why Does PyTorch Accept Logits by Default?

It uses the LogSumExp trick to avoid numerical overflow:

\[ \log\sum_j e^{z_j} = z_{max} + \log\sum_j e^{z_j - z_{max}} \]

If you first compute softmax (where large exponents may overflow to inf) and then take the log, numerical stability is far worse than operating directly on logits.

Correct vs. Incorrect Usage

import torch.nn as nn
import tensorflow as tf

# ✅ PyTorch correct usage: pass logits, do NOT manually apply softmax
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)   # logits: (N, C), labels: (N,)

# ✅ TensorFlow correct usage (when passing logits)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss = loss_fn(y_true, logits)

# ❌ Dangerous usage: passing logits to a loss that expects probabilities
loss_fn = tf.keras.losses.CategoricalCrossentropy()  # from_logits=False (default)
loss = loss_fn(y_true, logits)  # Silently computes incorrect results; extremely weak gradients; model fails to converge

How to detect this bug?

The most telling symptom is: training loss decreases slowly but validation loss barely moves, and the model's predicted probability distribution remains close to uniform (all class probabilities approximately \(1/K\)). When you encounter this, check the loss function's input type first.


5.2 Reduction Strategies

The \(\sum\) or \(\frac{1}{n}\) in formulas is controlled in frameworks via the reduction parameter:

| `reduction` | Meaning | Return Shape | Typical Use Case |
| --- | --- | --- | --- |
| `'mean'` (default) | Mean of losses across all samples | Scalar | Standard training |
| `'sum'` | Sum of losses across all samples | Scalar | Certain theoretical derivations |
| `'none'` | Independent loss for each sample | `(N,)` vector | Sample weighting, contrastive learning, debugging |

Important Application of reduction='none'

import torch
import torch.nn as nn

# Adaptive sample-difficulty-based weighted training (manual Focal Loss implementation)
gamma = 2.0  # focusing parameter
criterion = nn.CrossEntropyLoss(reduction='none')
per_sample_loss = criterion(logits, labels)   # shape: (N,)

# Compute the Focal Loss modulating factor
probs = torch.softmax(logits, dim=1)
pt = probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # probability of the true class
focal_weight = (1 - pt) ** gamma
weighted_loss = (focal_weight * per_sample_loss).mean()
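To see what the modulating factor actually does, evaluate \((1 - p_t)^\gamma\) at a few confidence levels (plain arithmetic, \(\gamma = 2\)):

```python
def focal_weight(pt, gamma=2.0):
    """Down-weight easy examples: weight -> 0 as the model grows confident."""
    return (1.0 - pt) ** gamma

easy = focal_weight(0.9)   # ≈ 0.01 — well-classified, nearly ignored
mid  = focal_weight(0.5)   # 0.25
hard = focal_weight(0.1)   # ≈ 0.81 — misclassified, dominates the batch loss
```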

5.3 Numerical Stability Best Practices

| Scenario | Dangerous Implementation (avoid) | Safe Implementation (recommended) |
| --- | --- | --- |
| Cross-Entropy | `-(y * torch.log(p)).sum()` | `nn.CrossEntropyLoss()(logits, y)` |
| KL Divergence | `(p * (p/q).log()).sum()` | `nn.KLDivLoss(reduction='batchmean')(q.log(), p)` |
| BCE | `-(y*p.log() + (1-y)*(1-p).log()).mean()` | `nn.BCEWithLogitsLoss()(logits, y)` |
| Log-Sum-Exp | `(x.exp()).sum().log()` | `torch.logsumexp(x, dim=-1)` |

Why is torch.log(p) dangerous when \(p \approx 0\)?

When the model's predicted probability for a class is very low (e.g., \(p = 10^{-8}\)), the loss value \(-\log p \approx 18.4\) is still finite, but its gradient \(-1/p = -10^{8}\) is enormous, causing gradient explosion; and if \(p\) underflows to exactly 0, log(p) returns -inf and the backward pass produces NaN. The framework's built-in CrossEntropyLoss circumvents both problems via the log-sum-exp trick, so never implement cross-entropy manually.
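The blow-up is in the gradient, not the loss value: for \(\ell(p) = -\log p\), \(\partial\ell/\partial p = -1/p\), so a confidently wrong prediction produces an enormous update (a quick check in plain Python):

```python
import math

def nll(p):
    """Negative log-likelihood of the true class."""
    return -math.log(p)

def nll_grad(p):
    """d/dp of -log(p): explodes as p -> 0, and p == 0.0 raises outright."""
    return -1.0 / p

loss_val = nll(1e-8)       # ≈ 18.42 — large but finite
grad_val = nll_grad(1e-8)  # ≈ -1e8  — a hundred-million-fold gradient
```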


6. Selection Guide and Summary

6.1 Decision Tree: How to Choose a Loss Function

What is your task type?
│
├── Regression (continuous value prediction)
│   ├── Clean data, Gaussian noise              →  MSE (L2 Loss)
│   ├── Data contains many outliers              →  MAE (L1) or Huber Loss
│   ├── Need to predict a specific percentile (e.g., P90)  →  Pinball Loss (Quantile)
│   └── Need second-order continuous differentiability      →  Log-Cosh Loss
│
├── Classification (discrete label prediction)
│   ├── Binary / multi-label                     →  BCE (Binary Cross-Entropy)
│   ├── Multi-class (balanced dataset)           →  Categorical Cross-Entropy
│   ├── Multi-class (severe class imbalance)     →  Focal Loss
│   └── Need maximum-margin decision boundary (SVM) →  Hinge Loss
│
├── Metric Learning (embedding / retrieval)
│   ├── Pair-based                               →  Contrastive Loss
│   ├── Triplet-based                            →  Triplet Loss
│   └── Large-batch contrastive learning         →  InfoNCE (NT-Xent)
│
├── Generative Models
│   ├── Variational Autoencoder (VAE)            →  ELBO (Reconstruction loss + KL divergence)
│   └── Generative Adversarial Network (GAN)     →  Adversarial Loss (Minimax)
│
└── LLM Alignment
    ├── Sufficient compute, strong alignment needed  →  PPO + RLHF
    └── Limited resources, learn directly from preferences  →  DPO Loss
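The tree above can be collapsed into a small lookup helper — a hypothetical sketch (the keys and names are ours), useful as a starting template rather than a definitive mapping:

```python
# (task, situation) -> recommended loss, mirroring the decision tree above
LOSS_GUIDE = {
    ("regression", "gaussian_noise"): "MSE",
    ("regression", "outliers"): "Huber",
    ("regression", "quantile"): "Pinball",
    ("classification", "balanced"): "CategoricalCrossEntropy",
    ("classification", "imbalanced"): "FocalLoss",
    ("metric_learning", "large_batch"): "InfoNCE",
    ("generative", "vae"): "ELBO",
    ("llm_alignment", "limited_resources"): "DPO",
}

def recommend_loss(task, situation):
    """Return the tree's recommendation, or a fallback when no branch matches."""
    return LOSS_GUIDE.get((task, situation), "no single default — walk the tree above")
```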

6.2 Core Philosophy Summary

| Goal | Recommended Loss | Disciplinary Basis | Typical Applications |
| --- | --- | --- | --- |
| Accuracy (statistical optimality) | MSE or Cross-Entropy | MLE + Gaussian/multinomial noise assumption | House price prediction, image classification |
| Robustness (outlier resistance) | MAE or Huber | Laplace distribution prior | Industrial sensors, financial return prediction |
| Distributional properties (generative) | KL divergence, ELBO | Information theory, variational inference | VAE, diffusion models |
| Business decisions (asymmetric costs) | Pinball, Focal, weighted CE | Decision theory, cost-sensitive learning | Fraud detection, power dispatch, medical diagnosis |
| Spatial geometry (representation learning) | Triplet, InfoNCE | Metric learning, mutual information maximization | Face recognition, image-text retrieval, CLIP |
| Preference alignment (LLMs) | DPO, PPO | Reinforcement learning, human preference modeling | ChatGPT, Llama-2 alignment |

6.3 The Essence in One Sentence

A loss function transforms an otherwise imperceptible "learning process" into a clear, differentiable "valley."

It is not merely a directional guide for optimization — it is the primary interface through which researchers communicate their worldview (noise assumptions), value system (cost weights), and objectives (optimization direction) to the model.

Choosing a loss function is choosing your complete philosophy of what "error" means.


7. References and Further Reading

Foundational Theory

  • Statistical Learning Theory: Vapnik, V. (1998). Statistical Learning Theory. Wiley.
  • Information Theory: Cover, T. & Thomas, J. (2006). Elements of Information Theory. Wiley.
  • Prospect Theory: Kahneman, D. & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica.

Classic Papers

  • Focal Loss: Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV.
  • FaceNet: Schroff, F. et al. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR.
  • SimCLR / InfoNCE: Chen, T. et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML.
  • CLIP: Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
  • Energy-Based Models: LeCun, Y. et al. (2006). A Tutorial on Energy-Based Learning. MIT Press.

Frontier Alignment Research

  • DPO: Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
  • RLHF: Christiano, P. et al. (2017). Deep Reinforcement Learning from Human Preferences. NeurIPS.
  • InstructGPT: Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
