Probability Theory
Fundamental Concepts
Probability Space
Core Definition: A triple \((\Omega, \mathcal{F}, P)\) that provides the rigorous mathematical structure for describing a random experiment.
- Sample Space (\(\Omega\)): The set of all possible outcomes of the experiment.
- Event Space (\(\mathcal{F}\)): The collection of all possible events (subsets of the sample space).
- Probability Function (\(P\)): A function that maps events to real numbers in \([0, 1]\).
Probability Axioms (Laws of Probability):
- Non-negativity: \(P(A) \ge 0\) for every event \(A \in \mathcal{F}\).
- Normalization: \(P(\Omega) = 1\).
- Addition Rule for Mutually Exclusive Events: If \(A \cap B = \emptyset\), then: $$ P(A \cup B) = P(A) + P(B) $$
Random Variable
Core Definition: A function that assigns a real numerical value to each outcome in the sample space.
- Notation: \(P(X=a)\) denotes the probability that the random variable \(X\) takes the value \(a\).
- Range: Denoted as \(Val(X)\).
- Types:
  - Discrete: The set of possible values is finite or countably infinite.
  - Continuous: Values are taken from a continuous interval.
Types of Probability Distributions
Core Definition: Mathematical expressions that describe the likelihood of a random variable taking different values.
- Probability Distribution \(P(X)\): Describes the likelihood of a single random variable \(X\) taking each value.
- Joint Distribution \(P(X, Y)\): Describes the likelihood of multiple random variables simultaneously taking specific values.
- Marginal Distribution \(P(X)\): The distribution of a single variable obtained by summing (or integrating) over the other variables in a joint distribution.
- Conditional Distribution \(P(X|Y)\): The probability distribution of variable \(X\) given that variable \(Y\) has been observed.
Definition of Conditional Probability: $$ P(X|Y) = \frac{P(X, Y)}{P(Y)} $$
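The relationships among joint, marginal, and conditional distributions can be made concrete on a tiny table. A minimal sketch with a hypothetical joint distribution over two binary variables (the numbers are made up for illustration):

```python
# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginal P(X): sum the joint over Y.
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# Conditional P(Y | X=1): joint divided by the marginal, P(Y|X) = P(X, Y) / P(X).
p_y_given_x1 = {y: joint[(1, y)] / p_x[1] for y in (0, 1)}

print(p_x)            # {0: 0.5, 1: 0.5}
print(p_y_given_x1)   # {0: 0.2, 1: 0.8}
```

Note that the conditional distribution renormalizes one row of the joint table so that it sums to 1.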
Bernoulli Distribution
The Bernoulli Distribution is the simplest discrete distribution, describing a single random trial with only two possible outcomes (e.g., a coin flip): \(P(X=1) = p\) and \(P(X=0) = 1-p\).
- Expectation: \(E[X] = p\)
- Variance: \(\text{Var}(X) = p(1-p)\)
Over \(n\) independent Bernoulli trials, the number of successes follows a Binomial Distribution: \(X \sim B(n, p)\).
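The moment formulas above can be checked by simulation. A minimal sketch (parameter values chosen arbitrarily): the sample mean of many Bernoulli draws should approach \(p\), and the sample variance should approach \(p(1-p)\).

```python
import random

random.seed(0)
p, n_trials = 0.3, 100_000

# Simulate Bernoulli(p) trials as 0/1 outcomes.
samples = [1 if random.random() < p else 0 for _ in range(n_trials)]

# Empirical mean and variance should be close to p and p(1 - p) = 0.21.
mean = sum(samples) / n_trials
var = sum((x - mean) ** 2 for x in samples) / n_trials
print(round(mean, 2), round(var, 2))
```

The count `sum(samples)` is itself one draw from the Binomial \(B(n, p)\), since it is the number of successes in \(n\) independent trials.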
Gaussian Distribution
The Gaussian Distribution, also known as the Normal Distribution, is the most important continuous probability distribution.
Denoted as \(X \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance.
Standard Normal Distribution: When \(\mu=0, \sigma=1\), we have \(X \sim \mathcal{N}(0, 1)\).
The importance of the Gaussian distribution in machine learning:
- Central Limit Theorem: The sum of a large number of independent random variables approximately follows a normal distribution.
- Maximum Entropy Principle: Given constraints on the mean and variance, the normal distribution has the maximum entropy among all continuous distributions.
- In Bayesian inference, the normal distribution is commonly used as a prior.
Multivariate Gaussian Distribution: \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\boldsymbol{\Sigma}\) is the covariance matrix.
Multinomial Distribution
The Multinomial Distribution is a generalization of the binomial distribution, describing how many times each of \(k\) possible outcomes occurs in \(n\) independent trials: $$ P(X_1 = n_1, \dots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!} \prod_{i=1}^{k} p_i^{n_i} $$ where \(\sum_{i=1}^k n_i = n\) and \(\sum_{i=1}^k p_i = 1\).
In NLP, the multinomial distribution is commonly used to model word frequencies (Bag-of-Words model). When \(n=1\), it reduces to the Categorical Distribution, i.e., the probability distribution output by Softmax.
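The Softmax connection can be shown directly: Softmax turns an arbitrary real-valued score vector into a valid categorical distribution. A minimal sketch (the logit values are arbitrary):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A categorical distribution over 3 classes, one draw per trial (n = 1).
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Each entry is non-negative and the entries sum to 1, which is exactly the constraint \(\sum_i p_i = 1\) on the multinomial parameters.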
Core Rules Connecting Probabilities
Core Definition: The computational logic that establishes conversion relationships among marginal, joint, and conditional probabilities.
Sum Rule
- Purpose: Used to derive marginal distributions from joint distributions (eliminating nuisance variables).
- Discrete Form: $$ P(X) = \sum_{Y} P(X, Y) $$
- Continuous Form: $$ f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy $$
Product Rule / Chain Rule
- Purpose: Used to decompose a joint distribution into the product of a marginal distribution and a conditional distribution.
- Core Formula: $$ P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X) $$
- General Form: $$ P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, \dots, X_{i-1}) $$
Bayes' Theorem
- Core Definition: Describes how prior probability is transformed into posterior probability after observing new data.
- Core Formula: $$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
Law of Total Probability
Core Definition: Computes the total probability of a target event through a weighted average of conditional probabilities over a complete partition of events.
- Condition: The event set \(B_1, B_2, \dots, B_n\) must be mutually exclusive and collectively exhaustive (i.e., their union is the entire sample space \(\Omega\)).
- Core Formula: $$ P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i) $$
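A small numeric example makes the weighted average concrete. The scenario and all numbers below are hypothetical: two machines produce items at different rates and with different defect probabilities, and the machines form a complete partition of "which machine made the item".

```python
# Hypothetical numbers: machine A makes 60% of items with a 1% defect rate,
# machine B makes 40% of items with a 5% defect rate.
p_machine = {"A": 0.6, "B": 0.4}           # partition: P(B_i)
p_defect_given = {"A": 0.01, "B": 0.05}    # conditionals: P(defect | B_i)

# Law of total probability: P(defect) = sum_i P(defect | B_i) P(B_i)
p_defect = sum(p_defect_given[m] * p_machine[m] for m in p_machine)
print(p_defect)  # ≈ 0.026
```

The result lies between the two conditional rates, weighted by how much each machine produces.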
Probability Distributions
Discrete Distribution: Probability Mass Function (PMF)
Core Concept: A discrete distribution is defined by directly assigning a specific probability "mass" to each possible value of the random variable.
- Definition: Probability Mass Function (PMF), denoted as \(P(X=x)\).
- Requirements:
- Non-negativity: \(P(X=x_i) \ge 0\)
- Normalization: The sum of probabilities over all possible values must equal 1.
- Formula: $$ \sum_{i} P(X = x_i) = 1 $$
- Example (Fair Coin Flip): \(P(X=\text{heads}) = P(X=\text{tails}) = 0.5\), and the two probabilities sum to \(1\).
Continuous Distribution: Probability Density Function (PDF)
Core Concept: A continuous distribution cannot assign probability to individual points (the probability of any single point is 0); instead, it is defined by describing the "density" of probability over intervals.
- Definition: Probability Density Function (PDF), denoted as \(f(x)\).
- Requirements:
- Non-negativity: \(f(x) \ge 0\)
- Normalization: The integral over the entire domain must equal 1.
- Formula (Interval Probability): $$ P(a \le X \le b) = \int_{a}^{b} f(x) \, dx $$
- Extension to Joint Distributions: $$ P((X, Y) \in R) = \iint_{R} f(x, y) \, dx \, dy $$
Cumulative Distribution Function (CDF)
Core Concept: Describes the total probability that a random variable falls within the range less than or equal to a specific value \(x\).
- Definition: Cumulative Distribution Function (CDF), denoted as \(F(x)\).
- Relationship between PDF and CDF: $$ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) \, dt $$
- Inverse Derivation: $$ f(x) = F'(x) $$
Expectation / Variance
Expectation
Core Definition: The first moment reflecting the average value of a random variable; its physical interpretation is the "center of mass" of the probability distribution.
- Basic Definition: $$ E[X] = \sum_{x} x \, P(X=x) \quad \text{(discrete)}, \qquad E[X] = \int_{-\infty}^{\infty} x f(x) \, dx \quad \text{(continuous)} $$
- Linearity of Expectation (General Rule): Regardless of whether the variables are independent, the expectation of a sum equals the sum of expectations: $$ E[X + Y] = E[X] + E[Y] $$
- Product of Independent Variables: If \(X \perp Y\), then: $$ E[XY] = E[X]E[Y] $$
Variance
Core Definition: Measures the degree of deviation of a random variable from its expected value, i.e., the dispersion of the distribution.
- Basic Definition: $$ \text{Var}(X) = E\left[(X - E[X])^2\right] $$
- Commonly Used Formula in Machine Learning: $$ \text{Var}(X) = E[X^2] - (E[X])^2 $$
- Linear Transformation Property: $$ \text{Var}(aX + b) = a^2 \text{Var}(X) $$
- Sum of Independent Variables: If \(X \perp Y\), then: $$ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) $$
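These identities can be verified exactly on a small discrete distribution. A minimal sketch using a fair six-sided die (the die is just an illustrative choice); exact rational arithmetic avoids floating-point noise:

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
values = range(1, 7)
p = Fraction(1, 6)

e_x = sum(p * x for x in values)         # E[X]   = 7/2
e_x2 = sum(p * x * x for x in values)    # E[X^2] = 91/6
var_x = e_x2 - e_x ** 2                  # Var(X) = E[X^2] - (E[X])^2 = 35/12

# Check Var(aX + b) = a^2 Var(X) by direct computation.
a, b = 3, 5
mean_ax_b = a * e_x + b
var_ax_b = sum(p * (a * x + b - mean_ax_b) ** 2 for x in values)
print(var_x, var_ax_b)  # 35/12 and 105/4 (= 9 * 35/12)
```

The shift \(b\) drops out entirely, and the scale \(a\) enters squared, matching the linear transformation property.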
Covariance
Core Definition: Measures the direction and strength of the linear relationship between two random variables.
- Core Formula: $$ \text{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y] $$
- Special Case: When \(X=Y\), \(\text{Cov}(X, X) = \text{Var}(X)\).
Independence
Binary Independence
Core Concept: The most fundamental criterion for determining whether two events have a causal or statistical association.
- Core Definition: $$ P(A \cap B) = P(A)P(B) $$
- Equivalent Conditional Probability Criterion: $$ P(A|B) = P(A) \quad (\text{provided } P(B) > 0) $$
- Property: If \(A\) and \(B\) are independent, then their complementary event combinations (e.g., \(\bar{A}\) and \(B\)) are also independent.
Pairwise Independence
Core Concept: In a set of events, any two events satisfy independence, but this does not necessarily extend to the group as a whole.
- Condition: For an event set \(\{A_1, A_2, \dots, A_n\}\), the following holds: $$ P(A_i A_j) = P(A_i)P(A_j) \quad (\forall i \neq j) $$
- Limitation: This does not imply the multiplicative probability relationship for three or more events occurring simultaneously.
Mutual Independence
Core Concept: The strictest form of independence, requiring that the occurrence of any subset of events does not affect the remaining events.
- Core Definition: For any \(k\) events among \(n\) events (\(2 \le k \le n\)), the following holds: $$ P(A_{i_1} A_{i_2} \dots A_{i_k}) = P(A_{i_1})P(A_{i_2}) \dots P(A_{i_k}) $$
- Implication: Mutual independence \(\implies\) pairwise independence (the converse does not hold).
Independence of Random Variables
Core Concept: Extends the notion of event independence to distribution functions of continuous or discrete random variables.
- Definition (Distribution Functions): $$ F(x, y) = F_X(x)F_Y(y) $$
- Definition (Density Functions): $$ f(x, y) = f_X(x)f_Y(y) $$
Conditional Independence
Given an observed third variable \(C\), events \(A\) and \(B\) become independent.
- Core Definition: $$ P(AB|C) = P(A|C)P(B|C) $$
- Equivalent Expression: \(P(A|BC) = P(A|C)\)
Correlation Coefficient
A dimensionless measure of the degree of linear dependence between two random variables.
- Formula: $$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}} $$
- Properties: \(|\rho_{XY}| \le 1\); if \(|\rho_{XY}|=1\), then \(X\) and \(Y\) are linearly dependent; if \(\rho_{XY}=0\), then they are uncorrelated, which is weaker than independence.
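The extreme case \(|\rho_{XY}|=1\) is easy to demonstrate numerically. A minimal sketch with hypothetical paired data where \(Y = 2X\) exactly:

```python
import math

# Toy paired data with a perfect linear relationship: ys = 2 * xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Covariance and variances (1/n convention), then the correlation coefficient.
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
var_x = sum((x - mx) ** 2 for x in xs) / n
var_y = sum((y - my) ** 2 for y in ys) / n
rho = cov / math.sqrt(var_x * var_y)
print(rho)  # 1.0: perfectly linearly dependent
```

Scaling \(Y\) by any positive constant leaves \(\rho\) unchanged, which is why it is called a dimensionless measure.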
Bayes' Theorem
Conditional Probability
The probability that event \(A\) occurs given that event \(B\) has already occurred.
- Condition: \(P(B) > 0\)
- Formula: $$ P(A|B) = \frac{P(AB)}{P(B)} $$
- Multiplication Rule: \(P(AB) = P(B)P(A|B) = P(A)P(B|A)\)
Bayes' Theorem
Describes how to update one's prior belief about a hypothesis after observing new evidence (reasoning from effect to cause).
- Core Formula: $$ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} $$
- Terminology: \(P(A_i)\) is the prior probability, \(P(A_i|B)\) is the posterior probability, and \(P(B|A_i)\) is the likelihood.
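A classic numeric illustration of reasoning from effect to cause is the diagnostic-test example; all numbers below are hypothetical. The denominator is computed with the law of total probability over the partition {disease, healthy}.

```python
# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95   # likelihood of a positive test if diseased
p_pos_given_healthy = 0.05   # false-positive rate

# Evidence P(positive) via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(disease | positive) by Bayes' theorem.
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # ≈ 0.161
```

Despite the accurate-sounding test, the posterior is only about 16%, because the low prior (1% prevalence) dominates: most positives come from the large healthy population.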
Law of Large Numbers
Describes the statistical law that the sample mean converges to the expected value as the number of trials tends to infinity.
- Chebyshev's Law of Large Numbers: For a sequence of independent variables with finite variance, the arithmetic mean converges in probability to the expected value. $$ \lim_{n \to \infty} P\left( \left| \frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n E(X_i) \right| < \epsilon \right) = 1 $$
- Bernoulli's Law of Large Numbers: The frequency of an event converges in probability to its probability \(p\).
Central Limit Theorem
The sum of a large number of independent and identically distributed random variables approaches a normal distribution, regardless of the original distribution.
- i.i.d. Case: Let \(X_1, \dots, X_n\) be independent and identically distributed with mean \(\mu\) and variance \(\sigma^2\).
- Standardized Convergence Formula: $$ \lim_{n \to \infty} P\left( \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \le x \right) = \Phi(x) $$
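The theorem can be observed by simulation: standardized sums of i.i.d. Uniform(0, 1) variables (mean \(\mu = 1/2\), variance \(\sigma^2 = 1/12\)) should behave like a standard normal. A minimal sketch; the choice of the uniform distribution and the sample sizes are arbitrary.

```python
import random

random.seed(42)

# Each replicate: standardize the sum of n Uniform(0, 1) draws.
n, reps = 30, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5

zs = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * n ** 0.5)
      for _ in range(reps)]

# For a standard normal: P(Z <= 0) = 0.5, and about 68.3% of mass lies
# within one standard deviation of the mean.
frac_below_0 = sum(z <= 0 for z in zs) / reps
frac_within_1 = sum(abs(z) <= 1 for z in zs) / reps
print(round(frac_below_0, 2), round(frac_within_1, 2))
```

The same experiment with any other distribution that has finite variance gives essentially the same picture, which is the point of the theorem.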
Multivariate Random Variables
The study of the joint cumulative distribution and individual marginal distributions of multiple random variables defined on the same sample space.
Distributions
- Joint Distribution: \(F(x, y) = P(X \le x, Y \le y)\)
- Marginal Distribution: The single-variable distribution obtained by integrating out the other variables. $$ f_X(x) = \int_{-\infty}^{\infty} f(x, y) dy $$
Covariance Matrix
Arranges the variances and covariances among all dimensions of a multivariate random variable in matrix form, characterizing the dispersion and correlation of the vector.
- Matrix Definition: For a random vector \(\mathbf{X} = [X_1, X_2, \dots, X_n]^T\): $$ \Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])] $$
- Properties: The covariance matrix is symmetric and positive semi-definite.
Maximum Likelihood Estimation (MLE)
Given i.i.d. data \(D = \{x_1, \ldots, x_n\}\), the likelihood function of parameter \(\theta\): $$ L(\theta) = P(D|\theta) = \prod_{i=1}^{n} p(x_i | \theta) $$
Log-likelihood: $$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta) $$
MLE: \(\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)\)
Example: MLE for the Gaussian Distribution: for \(x_i \sim \mathcal{N}(\mu, \sigma^2)\), maximizing \(\ell(\mu, \sigma^2)\) gives $$ \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2 $$
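The Gaussian MLE can be computed and sanity-checked directly: the closed-form estimates are the sample mean and the \(1/n\) (biased) sample variance, and perturbing either one should not increase the log-likelihood. A minimal sketch with hypothetical observations:

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # hypothetical observations

n = len(data)
mu_hat = sum(data) / n                                  # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n   # 1/n sample variance

def log_lik(mu, s2):
    # Gaussian log-likelihood of the data under N(mu, s2).
    return sum(-0.5 * math.log(2 * math.pi * s2) - (x - mu) ** 2 / (2 * s2)
               for x in data)

# The MLE should beat nearby parameter values.
best = log_lik(mu_hat, sigma2_hat)
assert best >= log_lik(mu_hat + 0.1, sigma2_hat)
assert best >= log_lik(mu_hat, sigma2_hat + 0.1)
print(round(mu_hat, 2), round(sigma2_hat, 3))  # 2.0 0.068
```

Note the \(1/n\) divisor: the MLE of the variance is biased; the unbiased estimator divides by \(n-1\) instead.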
KL Divergence and Information-Theoretic Metrics
KL Divergence (relative entropy): $$ D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} $$
Properties: \(D_{KL} \geq 0\) (Gibbs' inequality), asymmetric.
Cross-Entropy: $$ H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{KL}(P \| Q) $$
In deep learning, minimizing cross-entropy is equivalent to minimizing KL divergence (since \(H(P)\) is a constant).
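The decomposition \(H(P, Q) = H(P) + D_{KL}(P \| Q)\) and Gibbs' inequality can both be checked numerically. A minimal sketch with two hypothetical distributions over three symbols:

```python
import math

# Two hypothetical distributions over three symbols.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))          # D_KL(P || Q)
h_p = -sum(p * math.log(p) for p in P)                       # entropy H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q)

# Gibbs' inequality and the entropy decomposition.
print(kl >= 0, abs(cross_entropy - (h_p + kl)) < 1e-9)  # True True
```

Since \(H(P)\) does not depend on \(Q\), minimizing \(H(P, Q)\) over \(Q\) and minimizing \(D_{KL}(P \| Q)\) over \(Q\) pick out the same minimizer, which is the equivalence the text refers to.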
Additional Common Distributions
Beta Distribution
The conjugate prior for the Bernoulli/Binomial distributions, with density \(f(x) \propto x^{\alpha-1}(1-x)^{\beta-1}\) on \([0, 1]\) and mean \(\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}\).
Gamma Distribution
The conjugate prior for the rate parameter of the Poisson distribution. The exponential distribution is a special case: \(\text{Gamma}(1, \lambda)\).
Dirichlet Distribution
The conjugate prior for the Multinomial distribution, used in topic models (LDA).
Fundamentals of Bayesian Inference
Bayes' formula from the parameter estimation perspective: $$ p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)} $$
- Prior \(p(\theta)\): Prior belief about the parameters
- Likelihood \(p(D|\theta)\): The probability of the data given the parameters
- Posterior \(p(\theta|D)\): Updated belief after observing data
- MAP Estimate: \(\hat{\theta}_{MAP} = \arg\max_\theta p(\theta|D)\)
MLE vs MAP: MAP = MLE + regularization (the prior acts as a regularization term).
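The regularization effect is easy to see in the Beta-Bernoulli model, where both estimates have closed forms: with a \(\text{Beta}(\alpha, \beta)\) prior on a coin's bias and observed counts, the posterior is \(\text{Beta}(\alpha + \text{heads}, \beta + \text{tails})\). A minimal sketch; the counts and hyperparameters are hypothetical.

```python
# Hypothetical coin-flip data and prior.
heads, tails = 7, 3
alpha, beta = 2.0, 2.0   # Beta(2, 2) prior pulls estimates toward 0.5

n = heads + tails
mle = heads / n                                        # argmax of likelihood
map_est = (heads + alpha - 1) / (n + alpha + beta - 2) # mode of the posterior

print(mle, map_est)  # 0.7 vs 0.666...: the prior shrinks the estimate
```

As \(n\) grows, the prior's pseudo-counts are swamped by the data and MAP converges to MLE, just as regularization matters less with more data.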
See Information Theory for more on entropy and mutual information. See Bayesian Learning for MCMC and variational inference.