Information Theory
Entropy
Entropy measures the uncertainty inherent in a probability distribution. When the probability of outcomes is spread out widely, entropy is high; when an outcome is nearly certain, entropy is low.
For a discrete random variable \(X\) with probability distribution \(P(X)\), the entropy \(H(P)\) is defined as:

\[ H(P) = -\sum_x P(x) \log P(x) \]
If the logarithm is base \(2\), the unit is bits; if the base is \(e\), the unit is nats. \(1 \text{ nat} = \frac{1}{\ln 2} \approx 1.44 \text{ bits}\).
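The definition above is easy to verify numerically. Below is a minimal sketch of an `entropy` helper (the function name and `base` parameter are our own choices, not from any particular library), using the convention \(0 \log 0 = 0\):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # drop zero-probability outcomes
    return -np.sum(p * np.log(p)) / np.log(base)

# A fair coin is maximally uncertain: exactly 1 bit.
print(entropy([0.5, 0.5]))                 # 1.0
# A biased coin is less uncertain: ~0.469 bits.
print(entropy([0.9, 0.1]))
# The same fair coin in nats: ln 2 ~ 0.693 nats = 1 bit.
print(entropy([0.5, 0.5], base=np.e))
```

Note how switching the log base only rescales the result, matching the bits/nats conversion above.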
Self-Information
Self-information (also called information content) measures how "surprising" it is to observe a particular event \(x\):

\[ I(x) = -\log P(x) \]
If \(P(x) = 1\) (a certain event), then \(I(x) = 0\) — you already knew it would happen, so no new information is gained. If \(P(x) \to 0\) (an extremely rare event), then \(I(x) \to \infty\) — if it does occur, the surprise is enormous and the amount of information gained is immense.
Entropy is, in fact, the expected value of self-information: \(H(P) = \mathbb{E}_{x \sim P}[I(x)]\).
Cross-Entropy
In machine learning, we typically do not know the true probability distribution \(P\); instead, a model produces a predicted distribution \(Q\). Cross-entropy measures the average level of "surprise" when we use the incorrect distribution \(Q\) to encode data drawn from the true distribution \(P\):

\[ H(P, Q) = -\sum_x P(x) \log Q(x) \]
Cross-entropy reaches its minimum if and only if \(Q = P\), in which case \(H(P, Q) = H(P)\).
In classification tasks:
- \(P\) is the ground-truth label (typically a one-hot vector, e.g., \([0, 1, 0]\)).
- \(Q\) is the predicted probability output by the model's Softmax layer (e.g., \([0.1, 0.8, 0.1]\)).
- Minimizing cross-entropy essentially drives the predicted distribution \(Q\) as close as possible to the true distribution \(P\).
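A minimal sketch of this classification loss (the `cross_entropy` helper and the `eps` guard against \(\log 0\) are our own illustration, not a specific framework's API):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])           # one-hot ground-truth label
q_good = np.array([0.1, 0.8, 0.1])      # confident, correct prediction
q_bad  = np.array([0.6, 0.2, 0.2])      # wrong prediction

print(cross_entropy(p, q_good))         # -ln 0.8 ~ 0.223
print(cross_entropy(p, q_bad))          # -ln 0.2 ~ 1.609
# The minimum is at Q = P; for a one-hot P, H(P, P) = H(P) = 0.
print(cross_entropy(p, p))
```

For a one-hot label, the loss reduces to \(-\log Q(\text{correct class})\), which is why better predictions give a strictly lower loss.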
KL Divergence (Relative Entropy)
KL divergence measures how much a distribution \(Q\) diverges from a distribution \(P\). (Despite the name "relative entropy", it is not a true distance: it is asymmetric and does not satisfy the triangle inequality.)

\[ D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \]
Since the entropy of the true distribution \(P\), namely \(H(P)\), is a constant with respect to the model:

\[ D_{KL}(P \| Q) = H(P, Q) - H(P) \]

Minimizing the cross-entropy \(H(P, Q)\) \(\iff\) minimizing the KL divergence \(D_{KL}(P \| Q)\).
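This decomposition can be checked directly on a small example (the specific distributions below are arbitrary illustrations):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # true distribution
q = np.array([0.4, 0.4, 0.2])    # model distribution

H_p  = -np.sum(p * np.log(p))            # entropy H(P)
H_pq = -np.sum(p * np.log(q))            # cross-entropy H(P, Q)
kl   = np.sum(p * np.log(p / q))         # D_KL(P || Q)

print(np.isclose(kl, H_pq - H_p))        # True: H(P,Q) = H(P) + D_KL(P||Q)
print(kl >= 0)                           # True: KL divergence is non-negative
```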
Conditional Entropy
Conditional entropy \(H(X|Y)\) measures the remaining uncertainty in \(X\) given knowledge of random variable \(Y\):

\[ H(X|Y) = -\sum_{x, y} P(x, y) \log P(x|y) \]
Properties:
- \(H(X|Y) \geq 0\), with equality if and only if \(X\) is a deterministic function of \(Y\)
- \(H(X|Y) \leq H(X)\) — additional knowledge never increases uncertainty ("information never hurts")
- When \(X\) and \(Y\) are independent, \(H(X|Y) = H(X)\)
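A quick numerical check of the "information never hurts" property, using an arbitrary \(2 \times 2\) joint distribution chosen for illustration:

```python
import numpy as np

# Joint distribution P(X, Y): rows index x, columns index y.
pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
py = pxy.sum(axis=0)    # marginal P(Y)
px = pxy.sum(axis=1)    # marginal P(X)

# H(X|Y) = -sum_{x,y} P(x,y) log P(x|y), where P(x|y) = P(x,y) / P(y)
H_x_given_y = -np.sum(pxy * np.log2(pxy / py))
H_x = -np.sum(px * np.log2(px))

print(H_x_given_y)          # ~0.846 bits left in X after observing Y
print(H_x_given_y <= H_x)   # True: conditioning never increases entropy
```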
Joint Entropy
Joint entropy \(H(X, Y)\) measures the total uncertainty of the joint distribution of two random variables:

\[ H(X, Y) = -\sum_{x, y} P(x, y) \log P(x, y) \]
Chain rule:

\[ H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \]
More generally, for \(n\) random variables:

\[ H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) \]
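The chain rule can be verified numerically on a small joint distribution (the numbers are an arbitrary illustration):

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])     # joint P(X, Y)
px = pxy.sum(axis=1)             # marginal P(X)

H_xy = -np.sum(pxy * np.log2(pxy))                        # H(X, Y)
H_x  = -np.sum(px * np.log2(px))                          # H(X)
H_y_given_x = -np.sum(pxy * np.log2(pxy / px[:, None]))   # H(Y|X)

# Chain rule: H(X, Y) = H(X) + H(Y|X)
print(np.isclose(H_xy, H_x + H_y_given_x))                # True
```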
Mutual Information
Mutual information \(I(X; Y)\) measures the amount of information shared between two random variables — how much knowing one reduces the uncertainty about the other:

\[ I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)} \]
Equivalent definition:

\[ I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \]
Properties:
- \(I(X; Y) \geq 0\), with equality if and only if \(X\) and \(Y\) are independent
- \(I(X; Y) = I(Y; X)\) (symmetry)
- \(I(X; X) = H(X)\) (self-information equals entropy)
Relationship to Other Entropy Measures
Mutual information can be expressed as:

\[ I(X; Y) = H(X) + H(Y) - H(X, Y) \]
This formula reveals an intuitive relationship: joint entropy = sum of individual entropies - overlap (mutual information).
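Both routes to \(I(X; Y)\) — the ratio definition and the entropy decomposition — should agree; here is a sketch checking that on an arbitrary illustrative joint distribution:

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])               # joint P(X, Y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Definition: I(X;Y) = sum P(x,y) log [P(x,y) / (P(x) P(y))]
I_def = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
# Via entropies: I(X;Y) = H(X) + H(Y) - H(X, Y)
I_ent = H(px) + H(py) - H(pxy.ravel())

print(np.isclose(I_def, I_ent))   # True
print(I_def >= 0)                 # True; zero iff X and Y are independent
```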
Relationships Among Entropy Measures (Venn Diagram)
The following set-theoretic diagram illustrates the relationships among various information quantities:
```
+------------------------------------------+
|                 H(X, Y)                  |
|                                          |
|   +------------+--------+------------+   |
|   |            |        |            |   |
|   |   H(X|Y)   | I(X;Y) |   H(Y|X)   |   |
|   |            |        |            |   |
|   +------------+--------+------------+   |
|   |<------- H(X) ------>|            |   |
|                |<------- H(Y) ------>|   |
+------------------------------------------+
```
Summary of key identities:
| Relationship | Formula |
|---|---|
| Joint entropy decomposition | \(H(X,Y) = H(X) + H(Y\|X) = H(Y) + H(X\|Y)\) |
| Mutual information definition | \(I(X;Y) = H(X) - H(X\|Y) = H(Y) - H(Y\|X)\) |
| Venn diagram identity | \(H(X,Y) = H(X) + H(Y) - I(X;Y)\) |
| Upper bound | \(I(X;Y) \leq \min\{H(X), H(Y)\}\) |
Information Gain
Information gain is the direct application of mutual information in decision tree learning. When selecting a feature for splitting, information gain measures how much a feature \(A\) reduces the classification uncertainty of dataset \(D\):

\[ \text{Gain}(D, A) = H(D) - H(D|A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v) \]
where \(D_v\) is the subset of data for which feature \(A\) takes value \(v\).
Applications in decision trees:
- ID3 algorithm: Directly uses information gain to select the splitting feature. Its drawback is a bias toward features with many distinct values.
- C4.5 algorithm: Uses the information gain ratio to correct this bias:

\[ \text{GainRatio}(D, A) = \frac{\text{Gain}(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{v} \frac{|D_v|}{|D|} \log \frac{|D_v|}{|D|} \]
where \(H_A(D)\) is called the intrinsic value of feature \(A\).
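A minimal sketch of information gain on a toy dataset (the feature names `outlook` and `wind` are made up for illustration; this is the quantity ID3 maximizes, not any library's implementation):

```python
import numpy as np
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(features, labels):
    """Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    total = entropy_of_labels(labels)
    n = len(labels)
    cond = 0.0
    for v in set(features):
        subset = [y for f, y in zip(features, labels) if f == v]
        cond += len(subset) / n * entropy_of_labels(subset)
    return total - cond

# Toy data: 'outlook' perfectly predicts the label, 'wind' is uninformative.
labels  = ['yes', 'yes', 'no', 'no']
outlook = ['sunny', 'sunny', 'rain', 'rain']
wind    = ['weak', 'strong', 'weak', 'strong']

print(information_gain(outlook, labels))  # 1.0 bit  (perfect split)
print(information_gain(wind, labels))     # 0.0 bits (no reduction)
```

ID3 would pick `outlook` here, since it removes all label uncertainty.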
Information Bottleneck
Core Idea
The information bottleneck theory (Tishby et al., 1999) asks: how can we find a compressed representation \(T\) of input \(X\) that retains as much information about the target \(Y\) as possible while compressing \(X\) as much as possible?
This is fundamentally a trade-off between compression and prediction.
IB-Lagrangian
The information bottleneck optimization objective is written in Lagrangian form:

\[ \min_{p(t|x)} \; \mathcal{L} = I(X; T) - \beta \, I(T; Y) \]
where:
- \(I(X; T)\): How much information \(T\) retains from \(X\) (to be minimized — compression)
- \(I(T; Y)\): How much information \(T\) contains about target \(Y\) (to be maximized — prediction)
- \(\beta > 0\): Lagrange multiplier controlling the compression-prediction balance
When \(\beta \to 0\), extreme compression occurs (\(T\) contains almost no information); when \(\beta \to \infty\), almost no compression occurs (\(T \approx X\)).
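As a rough numerical sketch, the objective can be evaluated for a toy discrete problem with a hypothetical stochastic encoder \(p(t|x)\) (the specific joint distribution and encoder below are arbitrary choices, not a learned solution):

```python
import numpy as np

def mi(pab):
    """Mutual information (nats) from a joint distribution matrix."""
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    mask = pab > 0
    return np.sum(pab[mask] * np.log(pab[mask] / np.outer(pa, pb)[mask]))

# Toy joint p(x, y) and a hypothetical noisy encoder p(t|x).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)
p_t_given_x = np.array([[0.9, 0.1],     # rows index x, columns index t
                        [0.1, 0.9]])

pxt = px[:, None] * p_t_given_x          # joint p(x, t)
# Markov chain T - X - Y: p(t, y) = sum_x p(t|x) p(x, y)
pty = p_t_given_x.T @ pxy

beta = 2.0
L = mi(pxt) - beta * mi(pty)             # IB objective to minimize
print(mi(pxt), mi(pty), L)
```

Note that the data-processing inequality guarantees \(I(T; Y) \leq I(X; Y)\): the compressed representation can never know more about \(Y\) than \(X\) itself does.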
Connection to Deep Learning
Shwartz-Ziv & Tishby (2017) proposed that the training process of deep neural networks can be understood through the information bottleneck framework:
- Fitting phase: \(I(T; Y)\) increases rapidly as the network learns useful features
- Compression phase: \(I(X; T)\) gradually decreases as the network discards noise unrelated to \(Y\)
Although this view remains debated (particularly whether the compression phase universally occurs), it provides an information-theoretic perspective on understanding generalization in deep learning.
Maximum Entropy Principle
The maximum entropy principle (Jaynes, 1957) states that among all probability distributions satisfying known constraints, one should choose the distribution with the highest entropy.
Intuition
The maximum-entropy distribution is the most "conservative" choice — it introduces no additional assumptions beyond the known constraints and is the most honest expression of ignorance.
Mathematical Formulation
Given constraints \(E[f_k(X)] = \alpha_k\) (\(k = 1, \ldots, K\)), the maximum entropy distribution is obtained by solving:

\[ \max_{P} \; H(P) \quad \text{s.t.} \quad \mathbb{E}_P[f_k(X)] = \alpha_k, \quad \sum_x P(x) = 1 \]
Using Lagrange multipliers, the solution takes the form of an exponential family distribution:

\[ P(x) = \frac{1}{Z} \exp\!\left( \sum_{k=1}^{K} \lambda_k f_k(x) \right) \]
where \(Z\) is the normalization constant (partition function) and \(\lambda_k\) are the Lagrange multipliers.
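A minimal sketch for the classic die example (Jaynes): with a single mean constraint, the solution has the form \(p(x) \propto e^{\lambda x}\), and since the mean is monotone in \(\lambda\), a simple bisection suffices (the helper name and the bracket \([-10, 10]\) are our own illustrative choices):

```python
import numpy as np

def maxent_mean(values, target_mean, iters=200):
    """Maximum-entropy distribution on `values` subject to E[X] = target_mean.
    By the exponential-family form, p(x) is proportional to exp(lam * x)."""
    values = np.asarray(values, dtype=float)

    def dist(lam):
        w = np.exp(lam * values)
        return w / w.sum()               # w.sum() plays the role of Z

    lo, hi = -10.0, 10.0                 # E[X] increases monotonically in lam
    for _ in range(iters):
        mid = (lo + hi) / 2
        if dist(mid) @ values < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

# A die whose observed mean is 4.5 (a fair die would give 3.5):
p = maxent_mean(np.arange(1, 7), 4.5)
print(np.round(p, 4))                    # probabilities tilt toward high faces
print(p @ np.arange(1, 7))               # ~4.5
```

With the unconstrained mean 3.5, the same routine recovers the uniform distribution, matching the "no constraints" row of the table below.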
Common Examples
| Constraints | Maximum Entropy Distribution |
|---|---|
| No constraints (finite discrete) | Uniform distribution |
| Given mean and variance | Gaussian distribution |
| Given mean (non-negative) | Exponential distribution |
Applications in Machine Learning
- The maximum entropy classifier (MaxEnt) is essentially softmax regression / logistic regression
- Conditional Random Fields (CRFs) can be viewed as conditional maximum entropy models
- Language models: Maximum entropy language models use n-gram feature constraints to model word distributions