Information Theory
Entropy
Entropy measures the uncertainty inherent in a probability distribution. When the probability of outcomes is spread out widely, entropy is high; when an outcome is nearly certain, entropy is low.
For a discrete random variable \(X\) with probability distribution \(P(X)\), the entropy \(H(P)\) is defined as:

\[ H(P) = -\sum_x P(x) \log P(x) \]
If the logarithm is base \(2\), the unit is bits; if the base is \(e\), the unit is nats. \(1 \text{ nat} = \frac{1}{\ln 2} \approx 1.44 \text{ bits}\).
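The definition above is easy to verify numerically. Below is a minimal sketch of an `entropy` helper (the function name and `base` parameter are our own choices, not from any particular library), using the convention \(0 \log 0 = 0\):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # drop zero-probability outcomes
    return -np.sum(p * np.log(p)) / np.log(base)

# A fair coin is maximally uncertain: exactly 1 bit.
print(entropy([0.5, 0.5]))                 # 1.0
# A biased coin is less uncertain: ~0.469 bits.
print(entropy([0.9, 0.1]))
# The same fair coin in nats: ln 2 ~ 0.693 nats = 1 bit.
print(entropy([0.5, 0.5], base=np.e))
```

Note how switching the log base only rescales the result, matching the bits/nats conversion above.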
Self-Information
Self-information (also called information content) measures how "surprising" it is to observe a particular event \(x\):

\[ I(x) = -\log P(x) \]
If \(P(x) = 1\) (a certain event), then \(I(x) = 0\) — you already knew it would happen, so no new information is gained. If \(P(x) \to 0\) (an extremely rare event), then \(I(x) \to \infty\) — if it does occur, the surprise is enormous and the amount of information gained is immense.
Entropy is, in fact, the expected value of self-information: \(H(P) = \mathbb{E}_{x \sim P}[I(x)]\).
Cross-Entropy
In machine learning, we typically do not know the true probability distribution \(P\); instead, a model produces a predicted distribution \(Q\). Cross-entropy measures the average level of "surprise" when we use the incorrect distribution \(Q\) to encode data drawn from the true distribution \(P\):

\[ H(P, Q) = -\sum_x P(x) \log Q(x) \]
Cross-entropy reaches its minimum if and only if \(Q = P\), in which case \(H(P, Q) = H(P)\).
In classification tasks:
- \(P\) is the ground-truth label (typically a one-hot vector, e.g., \([0, 1, 0]\)).
- \(Q\) is the predicted probability output by the model's Softmax layer (e.g., \([0.1, 0.8, 0.1]\)).
- Minimizing cross-entropy essentially drives the predicted distribution \(Q\) as close as possible to the true distribution \(P\).
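A minimal sketch of this classification loss (the `cross_entropy` helper and the `eps` guard against \(\log 0\) are our own illustration, not a specific framework's API):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])           # one-hot ground-truth label
q_good = np.array([0.1, 0.8, 0.1])      # confident, correct prediction
q_bad  = np.array([0.6, 0.2, 0.2])      # wrong prediction

print(cross_entropy(p, q_good))         # -ln 0.8 ~ 0.223
print(cross_entropy(p, q_bad))          # -ln 0.2 ~ 1.609
# The minimum is at Q = P; for a one-hot P, H(P, P) = H(P) = 0.
print(cross_entropy(p, p))
```

For a one-hot label, the loss reduces to \(-\log Q(\text{correct class})\), which is why better predictions give a strictly lower loss.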
KL Divergence (Relative Entropy)
KL divergence measures how much a distribution \(Q\) diverges from a distribution \(P\). (Despite the name "relative entropy", it is not a true distance: it is asymmetric and does not satisfy the triangle inequality.)

\[ D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \]
Since the entropy of the true distribution \(P\), namely \(H(P)\), is a constant with respect to the model:

\[ D_{KL}(P \| Q) = H(P, Q) - H(P) \]

Minimizing the cross-entropy \(H(P, Q)\) \(\iff\) minimizing the KL divergence \(D_{KL}(P \| Q)\).
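This decomposition can be checked directly on a small example (the specific distributions below are arbitrary illustrations):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # true distribution
q = np.array([0.4, 0.4, 0.2])    # model distribution

H_p  = -np.sum(p * np.log(p))            # entropy H(P)
H_pq = -np.sum(p * np.log(q))            # cross-entropy H(P, Q)
kl   = np.sum(p * np.log(p / q))         # D_KL(P || Q)

print(np.isclose(kl, H_pq - H_p))        # True: H(P,Q) = H(P) + D_KL(P||Q)
print(kl >= 0)                           # True: KL divergence is non-negative
```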
Conditional Entropy
Conditional entropy \(H(X|Y)\) measures the remaining uncertainty in \(X\) given knowledge of random variable \(Y\):

\[ H(X|Y) = -\sum_{x, y} P(x, y) \log P(x|y) \]
Properties:
- \(H(X|Y) \geq 0\), with equality if and only if \(X\) is a deterministic function of \(Y\)
- \(H(X|Y) \leq H(X)\) — additional knowledge never increases uncertainty ("information never hurts")
- When \(X\) and \(Y\) are independent, \(H(X|Y) = H(X)\)
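A quick numerical check of the "information never hurts" property, using an arbitrary \(2 \times 2\) joint distribution chosen for illustration:

```python
import numpy as np

# Joint distribution P(X, Y): rows index x, columns index y.
pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
py = pxy.sum(axis=0)    # marginal P(Y)
px = pxy.sum(axis=1)    # marginal P(X)

# H(X|Y) = -sum_{x,y} P(x,y) log P(x|y), where P(x|y) = P(x,y) / P(y)
H_x_given_y = -np.sum(pxy * np.log2(pxy / py))
H_x = -np.sum(px * np.log2(px))

print(H_x_given_y)          # ~0.846 bits left in X after observing Y
print(H_x_given_y <= H_x)   # True: conditioning never increases entropy
```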
Joint Entropy
Joint entropy \(H(X, Y)\) measures the total uncertainty of the joint distribution of two random variables:

\[ H(X, Y) = -\sum_{x, y} P(x, y) \log P(x, y) \]
Chain rule:

\[ H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \]
More generally, for \(n\) random variables:

\[ H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) \]
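The chain rule can be verified numerically on a small joint distribution (the numbers are an arbitrary illustration):

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])     # joint P(X, Y)
px = pxy.sum(axis=1)             # marginal P(X)

H_xy = -np.sum(pxy * np.log2(pxy))                        # H(X, Y)
H_x  = -np.sum(px * np.log2(px))                          # H(X)
H_y_given_x = -np.sum(pxy * np.log2(pxy / px[:, None]))   # H(Y|X)

# Chain rule: H(X, Y) = H(X) + H(Y|X)
print(np.isclose(H_xy, H_x + H_y_given_x))                # True
```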
Mutual Information
Mutual information \(I(X; Y)\) measures the amount of information shared between two random variables — how much knowing one reduces the uncertainty about the other:

\[ I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)} \]
Equivalent definition:

\[ I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \]
Properties:
- \(I(X; Y) \geq 0\), with equality if and only if \(X\) and \(Y\) are independent
- \(I(X; Y) = I(Y; X)\) (symmetry)
- \(I(X; X) = H(X)\) (self-information equals entropy)
Relationship to Other Entropy Measures
Mutual information can be expressed as:

\[ I(X; Y) = H(X) + H(Y) - H(X, Y) \]
This formula reveals an intuitive relationship: joint entropy = sum of individual entropies - overlap (mutual information).
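Both routes to \(I(X; Y)\) — the ratio definition and the entropy decomposition — should agree; here is a sketch checking that on an arbitrary illustrative joint distribution:

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])               # joint P(X, Y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Definition: I(X;Y) = sum P(x,y) log [P(x,y) / (P(x) P(y))]
I_def = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
# Via entropies: I(X;Y) = H(X) + H(Y) - H(X, Y)
I_ent = H(px) + H(py) - H(pxy.ravel())

print(np.isclose(I_def, I_ent))   # True
print(I_def >= 0)                 # True; zero iff X and Y are independent
```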
Relationships Among Entropy Measures (Venn Diagram)
The following set-theoretic diagram illustrates the relationships among various information quantities:
```
+------------------------------------------+
|                 H(X, Y)                  |
|                                          |
|   +------------+--------+------------+   |
|   |            |        |            |   |
|   |   H(X|Y)   | I(X;Y) |   H(Y|X)   |   |
|   |            |        |            |   |
|   +------------+--------+------------+   |
|   |<------- H(X) ------>|            |   |
|                |<------- H(Y) ------>|   |
+------------------------------------------+
```
Summary of key identities:
| Relationship | Formula |
|---|---|
| Joint entropy decomposition | \(H(X,Y) = H(X) + H(Y\|X) = H(Y) + H(X\|Y)\) |
| Mutual information definition | \(I(X;Y) = H(X) - H(X\|Y) = H(Y) - H(Y\|X)\) |
| Venn diagram identity | \(H(X,Y) = H(X) + H(Y) - I(X;Y)\) |
| Upper bound | \(I(X;Y) \leq \min\{H(X), H(Y)\}\) |
Information Gain
Information gain is the direct application of mutual information in decision tree learning. When selecting a feature for splitting, information gain measures how much a feature \(A\) reduces the classification uncertainty of dataset \(D\):

\[ \text{Gain}(D, A) = H(D) - H(D|A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v) \]
where \(D_v\) is the subset of data for which feature \(A\) takes value \(v\).
Applications in decision trees:
- ID3 algorithm: Directly uses information gain to select the splitting feature. Its drawback is a bias toward features with many distinct values.
- C4.5 algorithm: Uses the information gain ratio to correct this bias:

\[ \text{GainRatio}(D, A) = \frac{\text{Gain}(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{v} \frac{|D_v|}{|D|} \log \frac{|D_v|}{|D|} \]
where \(H_A(D)\) is called the intrinsic value of feature \(A\).
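A minimal sketch of information gain on a toy dataset (the feature names `outlook` and `wind` are made up for illustration; this is the quantity ID3 maximizes, not any library's implementation):

```python
import numpy as np
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(features, labels):
    """Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    total = entropy_of_labels(labels)
    n = len(labels)
    cond = 0.0
    for v in set(features):
        subset = [y for f, y in zip(features, labels) if f == v]
        cond += len(subset) / n * entropy_of_labels(subset)
    return total - cond

# Toy data: 'outlook' perfectly predicts the label, 'wind' is uninformative.
labels  = ['yes', 'yes', 'no', 'no']
outlook = ['sunny', 'sunny', 'rain', 'rain']
wind    = ['weak', 'strong', 'weak', 'strong']

print(information_gain(outlook, labels))  # 1.0 bit  (perfect split)
print(information_gain(wind, labels))     # 0.0 bits (no reduction)
```

ID3 would pick `outlook` here, since it removes all label uncertainty.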
Information Bottleneck
Core Idea
The information bottleneck theory (Tishby et al., 1999) asks: how can we find a compressed representation \(T\) of input \(X\) that retains as much information about the target \(Y\) as possible while compressing \(X\) as much as possible?
This is fundamentally a trade-off between compression and prediction.
IB-Lagrangian
The information bottleneck optimization objective is written in Lagrangian form:

\[ \min_{p(t|x)} \; \mathcal{L} = I(X; T) - \beta \, I(T; Y) \]
where:
- \(I(X; T)\): How much information \(T\) retains from \(X\) (to be minimized — compression)
- \(I(T; Y)\): How much information \(T\) contains about target \(Y\) (to be maximized — prediction)
- \(\beta > 0\): Lagrange multiplier controlling the compression-prediction balance
When \(\beta \to 0\), extreme compression occurs (\(T\) contains almost no information); when \(\beta \to \infty\), almost no compression occurs (\(T \approx X\)).
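As a rough numerical sketch, the objective can be evaluated for a toy discrete problem with a hypothetical stochastic encoder \(p(t|x)\) (the specific joint distribution and encoder below are arbitrary choices, not a learned solution):

```python
import numpy as np

def mi(pab):
    """Mutual information (nats) from a joint distribution matrix."""
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    mask = pab > 0
    return np.sum(pab[mask] * np.log(pab[mask] / np.outer(pa, pb)[mask]))

# Toy joint p(x, y) and a hypothetical noisy encoder p(t|x).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)
p_t_given_x = np.array([[0.9, 0.1],     # rows index x, columns index t
                        [0.1, 0.9]])

pxt = px[:, None] * p_t_given_x          # joint p(x, t)
# Markov chain T - X - Y: p(t, y) = sum_x p(t|x) p(x, y)
pty = p_t_given_x.T @ pxy

beta = 2.0
L = mi(pxt) - beta * mi(pty)             # IB objective to minimize
print(mi(pxt), mi(pty), L)
```

Note that the data-processing inequality guarantees \(I(T; Y) \leq I(X; Y)\): the compressed representation can never know more about \(Y\) than \(X\) itself does.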
Connection to Deep Learning
Shwartz-Ziv & Tishby (2017) proposed that the training process of deep neural networks can be understood through the information bottleneck framework:
- Fitting phase: \(I(T; Y)\) increases rapidly as the network learns useful features
- Compression phase: \(I(X; T)\) gradually decreases as the network discards noise unrelated to \(Y\)
Although this view remains debated (particularly whether the compression phase universally occurs), it provides an information-theoretic perspective on understanding generalization in deep learning.
Maximum Entropy Principle
The maximum entropy principle (Jaynes, 1957) states that among all probability distributions satisfying known constraints, one should choose the distribution with the highest entropy.
Intuition
The maximum-entropy distribution is the most "conservative" choice — it introduces no additional assumptions beyond the known constraints and is the most honest expression of ignorance.
Mathematical Formulation
Given constraints \(E[f_k(X)] = \alpha_k\) (\(k = 1, \ldots, K\)), the maximum entropy distribution is obtained by solving:

\[ \max_{P} \; H(P) \quad \text{s.t.} \quad \mathbb{E}_P[f_k(X)] = \alpha_k, \quad \sum_x P(x) = 1 \]
Using Lagrange multipliers, the solution takes the form of an exponential family distribution:

\[ P(x) = \frac{1}{Z} \exp\!\left( \sum_{k=1}^{K} \lambda_k f_k(x) \right) \]
where \(Z\) is the normalization constant (partition function) and \(\lambda_k\) are the Lagrange multipliers.
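A minimal sketch for the classic die example (Jaynes): with a single mean constraint, the solution has the form \(p(x) \propto e^{\lambda x}\), and since the mean is monotone in \(\lambda\), a simple bisection suffices (the helper name and the bracket \([-10, 10]\) are our own illustrative choices):

```python
import numpy as np

def maxent_mean(values, target_mean, iters=200):
    """Maximum-entropy distribution on `values` subject to E[X] = target_mean.
    By the exponential-family form, p(x) is proportional to exp(lam * x)."""
    values = np.asarray(values, dtype=float)

    def dist(lam):
        w = np.exp(lam * values)
        return w / w.sum()               # w.sum() plays the role of Z

    lo, hi = -10.0, 10.0                 # E[X] increases monotonically in lam
    for _ in range(iters):
        mid = (lo + hi) / 2
        if dist(mid) @ values < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

# A die whose observed mean is 4.5 (a fair die would give 3.5):
p = maxent_mean(np.arange(1, 7), 4.5)
print(np.round(p, 4))                    # probabilities tilt toward high faces
print(p @ np.arange(1, 7))               # ~4.5
```

With the unconstrained mean 3.5, the same routine recovers the uniform distribution, matching the "no constraints" row of the table below.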
Common Examples
| Constraints | Maximum Entropy Distribution |
|---|---|
| No constraints (finite discrete) | Uniform distribution |
| Given mean and variance | Gaussian distribution |
| Given mean (non-negative) | Exponential distribution |
Applications in Machine Learning
- The maximum entropy classifier (MaxEnt) is essentially softmax regression / logistic regression
- Conditional Random Fields (CRFs) can be viewed as conditional maximum entropy models
- Language models: Maximum entropy language models use n-gram feature constraints to model word distributions