Probability Theory

Fundamental Concepts

Probability Space

Core Definition: A triple \((\Omega, \mathcal{F}, P)\) that provides the rigorous mathematical structure for describing a random experiment.

  • Sample Space (\(\Omega\)): The set of all possible outcomes of the experiment.
  • Event Space (\(\mathcal{F}\)): The collection of all possible events (subsets of the sample space).
  • Probability Function (\(P\)): A function that maps events to real numbers in \([0, 1]\).

Probability Axioms (Laws of Probability):

\[ P(A) \in [0, 1], \quad \forall A \in \mathcal{F} \]
\[ P(\Omega) = 1 \]

Addition Rule for Mutually Exclusive Events: If \(A \cap B = \emptyset\), then:

\[ P(A \cup B) = P(A) + P(B) \]

Random Variable

Core Definition: A mapping function that assigns a real numerical value to each outcome in the sample space.

  • Notation: \(P(X=a)\) denotes the probability that the random variable \(X\) takes the value \(a\).
  • Range: The set of values \(X\) can take, denoted \(Val(X)\).
  • Types:
      • Discrete: The set of possible values is finite or countably infinite.
      • Continuous: Values are taken from a continuous interval.

Types of Probability Distributions

Core Definition: Mathematical expressions that describe the likelihood of a random variable taking different values.

  • Probability Distribution \(P(X)\): Describes the likelihood of a single random variable \(X\) taking each value.
  • Joint Distribution \(P(X, Y)\): Describes the likelihood of multiple random variables simultaneously taking specific values.
  • Marginal Distribution \(P(X)\): The distribution of a single variable obtained by summing (or integrating) over the other variables in a joint distribution.
  • Conditional Distribution \(P(X|Y)\): The probability distribution of variable \(X\) given that variable \(Y\) has been observed.

Definition of Conditional Probability:

\[ P(X=a | Y=b) = \frac{P(X=a, Y=b)}{P(Y=b)} \]
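
This definition can be checked directly against a joint table. A minimal sketch, using a toy joint distribution over two binary variables (the numbers are made up for illustration):

```python
# Toy joint distribution P(X, Y) over two binary variables (made-up numbers).
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

def p_y(b):
    """Marginal P(Y=b), obtained by summing the joint over x."""
    return sum(p for (x, y), p in joint.items() if y == b)

def p_x_given_y(a, b):
    """Conditional P(X=a | Y=b) = P(X=a, Y=b) / P(Y=b)."""
    return joint[(a, b)] / p_y(b)
```

Here \(P(X=1 \mid Y=1) = 0.4 / 0.6 \approx 0.667\): conditioning renormalizes the slice of the joint where \(Y=1\).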

Bernoulli Distribution

The Bernoulli Distribution is the simplest discrete distribution, describing a single random trial with only two possible outcomes (e.g., a coin flip).

\[ P(X=k) = p^k (1-p)^{1-k}, \quad k \in \{0, 1\} \]
  • Expectation: \(E[X] = p\)
  • Variance: \(\text{Var}(X) = p(1-p)\)

Performing \(n\) independent Bernoulli trials, the number of successes follows a Binomial Distribution: \(X \sim B(n, p)\).
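
These formulas are easy to verify numerically; a minimal sketch with an assumed toy parameter \(p = 0.3\):

```python
from math import comb

def bernoulli_pmf(k, p):
    """P(X=k) = p^k (1-p)^(1-k) for k in {0, 1}."""
    return p**k * (1 - p)**(1 - k)

def binomial_pmf(k, n, p):
    """P(X=k): number of successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Expectation and variance of Bernoulli(p), computed from the pmf directly:
p = 0.3
mean = sum(k * bernoulli_pmf(k, p) for k in (0, 1))             # equals p
var = sum((k - mean)**2 * bernoulli_pmf(k, p) for k in (0, 1))  # equals p(1-p)
```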

Gaussian Distribution

The Gaussian Distribution, also known as the Normal Distribution, is the most important continuous probability distribution.

\[ p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

Denoted as \(X \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance.

Standard Normal Distribution: When \(\mu=0, \sigma=1\), we have \(X \sim \mathcal{N}(0, 1)\).

The importance of the Gaussian distribution in machine learning:

  • Central Limit Theorem: The sum of a large number of independent random variables approximately follows a normal distribution.
  • Maximum Entropy Principle: Given constraints on the mean and variance, the normal distribution has the maximum entropy among all continuous distributions.
  • In Bayesian inference, the normal distribution is commonly used as a prior.

Multivariate Gaussian Distribution: \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\boldsymbol{\Sigma}\) is the covariance matrix.

\[ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right) \]
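
The density formula above translates almost verbatim into NumPy; a sketch (using `np.linalg.solve` rather than an explicit matrix inverse):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# With d=1, mu=0, Sigma=[[1]], this recovers the standard normal density.
val = mvn_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```

At \(x=0\) the univariate standard normal density is \(1/\sqrt{2\pi} \approx 0.3989\).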

Multinomial Distribution

The Multinomial Distribution is a generalization of the binomial distribution, describing how many times each of \(k\) possible outcomes occurs in \(n\) independent trials.

\[ P(X_1=n_1, \ldots, X_k=n_k) = \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_k^{n_k} \]

where \(\sum_{i=1}^k n_i = n\) and \(\sum_{i=1}^k p_i = 1\).

In NLP, the multinomial distribution is commonly used to model word frequencies (Bag-of-Words model). When \(n=1\), it reduces to the Categorical Distribution, i.e., the probability distribution output by Softmax.
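
The pmf can be sketched with exact integer arithmetic for the multinomial coefficient (toy inputs):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(X_1=n_1, ..., X_k=n_k) for n = sum(counts) independent trials."""
    n = sum(counts)
    coef = factorial(n)
    for n_i in counts:
        coef //= factorial(n_i)  # exact: each partial quotient is an integer
    p = 1.0
    for n_i, p_i in zip(counts, probs):
        p *= p_i ** n_i
    return coef * p
```

With `counts` summing to 1 this reduces to the categorical case, e.g. `multinomial_pmf([1, 0, 0], probs)` returns `probs[0]`.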

Core Rules Connecting Probabilities

Core Definition: The computational logic that establishes conversion relationships among marginal, joint, and conditional probabilities.

Sum Rule

  • Purpose: Used to derive marginal distributions from joint distributions (eliminating nuisance variables).
  • Discrete Form:
\[ P(X) = \sum_{y \in Val(Y)} P(X, Y) \]
  • Continuous Form:
\[ P(X) = \int_{y \in Val(Y)} P(X, Y) dy \]
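
In the discrete case the sum rule is a single sum over one axis of the joint table; a minimal NumPy sketch with made-up numbers:

```python
import numpy as np

# Joint distribution P(X, Y): 2 values of X (rows), 3 values of Y (columns).
joint = np.array([[0.10, 0.25, 0.15],
                  [0.20, 0.05, 0.25]])

p_x = joint.sum(axis=1)  # sum out y: marginal P(X)
p_y = joint.sum(axis=0)  # sum out x: marginal P(Y)
```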

Product Rule / Chain Rule

  • Purpose: Used to decompose a joint distribution into the product of a marginal distribution and a conditional distribution.
  • Core Formula:
\[ P(X, Y) = P(X)P(Y|X) \]
  • General Form:
\[ P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i | X_1, \dots, X_{i-1}) \]

Bayes' Theorem

  • Core Definition: Describes how prior probability is transformed into posterior probability after observing new data.
  • Core Formula:
\[ P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)} \]

Law of Total Probability

Core Definition: Computes the total probability of a target event through a weighted average of conditional probabilities over a complete partition of events.

  • Condition: The event set \(B_1, B_2, \dots, B_n\) must be mutually exclusive and collectively exhaustive (i.e., their union is the entire sample space \(\Omega\)).
  • Core Formula:
\[ P(A) = \sum_{i=1}^n P(B_i)P(A|B_i) \]
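
A minimal sketch with hypothetical numbers for a three-way partition:

```python
# Partition B_1..B_3 with priors P(B_i), and P(A | B_i) for each (toy numbers).
p_B = [0.5, 0.3, 0.2]
p_A_given_B = [0.01, 0.02, 0.05]

# P(A) = sum_i P(B_i) P(A | B_i)  -- a weighted average over the partition
p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_given_B))
```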

Probability Distributions

Discrete Distribution: Probability Mass Function (PMF)

Core Concept: A discrete distribution is defined by directly assigning a specific probability "mass" to each possible value of the random variable.

  • Definition: Probability Mass Function (PMF), denoted as \(P(X=x)\).
  • Requirements:
  1. Non-negativity: \(P(X=x_i) \ge 0\)
  2. Normalization: The sum of probabilities over all possible values must equal 1.
  • Formula:
\[ \sum_{i} P(X=x_i) = 1 \]
  • Example (Coin Flip):
\[ P(X=x) = 0.5, \quad x \in \{0, 1\} \]

Continuous Distribution: Probability Density Function (PDF)

Core Concept: A continuous distribution cannot assign probability to individual points (the probability of any single point is 0); instead, it is defined by describing the "density" of probability over intervals.

  • Definition: Probability Density Function (PDF), denoted as \(f(x)\).
  • Requirements:
  1. Non-negativity: \(f(x) \ge 0\)
  2. Normalization: The integral over the entire domain must equal 1.
  • Formula (Interval Probability):
\[ P(a \le X \le b) = \int_{a}^{b} f(x) dx \]
  • Extension to Joint Distributions:
\[ P(a \le X \le b, c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} f(x, y) dx dy \]
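
The interval formula can be checked numerically. The sketch below integrates an exponential PDF (rate \(\lambda = 2\), chosen arbitrarily) with the trapezoid rule and compares against the closed form \(P(a \le X \le b) = e^{-\lambda a} - e^{-\lambda b}\):

```python
import math

def f(x, lam=2.0):
    """PDF of an exponential distribution with rate lam (a simple example)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def interval_prob(a, b, n=100_000):
    """P(a <= X <= b) by numerically integrating the PDF (trapezoid rule)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    total += sum(f(a + i * h) for i in range(1, n))
    return total * h

approx = interval_prob(0.0, 1.0)
exact = math.exp(-2.0 * 0.0) - math.exp(-2.0 * 1.0)  # closed form
```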

Cumulative Distribution Function (CDF)

Core Concept: Describes the total probability that a random variable falls within the range less than or equal to a specific value \(x\).

  • Definition: Cumulative Distribution Function (CDF), denoted as \(F(x)\).
  • Relationship between PDF and CDF:
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(u) du \]
  • Inverse Derivation:
\[ f(x) = \frac{d}{dx} F(x) \]

Expectation / Variance

Expectation

Core Definition: The first moment reflecting the average value of a random variable; its physical interpretation is the "center of mass" of the probability distribution.

  • Basic Definition:
\[ E(X) = \sum_{i} p_i x_i \quad \text{or} \quad \int x f(x) dx \]
  • Linearity of Expectation (General Rule): Regardless of whether the variables are independent, the expectation of a sum equals the sum of expectations.
\[ E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) \]
  • Product of Independent Variables: If \(X \perp Y\), then:
\[ E(XY) = E(X)E(Y) \]
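
Both properties can be verified exactly for two independent fair dice by enumerating the joint outcomes (rational arithmetic avoids float noise):

```python
from itertools import product
from fractions import Fraction

# Two independent fair dice; enumerate all 36 joint outcomes exactly.
faces = range(1, 7)
outcomes = list(product(faces, faces))
p = Fraction(1, len(outcomes))  # uniform joint probability 1/36

E_X = sum(Fraction(x, 6) for x in faces)        # E[X] = 7/2
E_sum = sum((x + y) * p for x, y in outcomes)   # E[X + Y]
E_prod = sum((x * y) * p for x, y in outcomes)  # E[XY]
```

`E_sum == E_X + E_X` holds by linearity, and `E_prod == E_X * E_X` holds because the dice are independent.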

Variance

Core Definition: Measures the degree of deviation of a random variable from its expected value, i.e., the dispersion of the distribution.

  • Basic Definition:
\[ Var(X) = E\left[(X - E(X))^2\right] \]
  • Commonly Used Formula in Machine Learning:
\[ Var(X) = E(X^2) - [E(X)]^2 \]
  • Linear Transformation Property:
\[ Var(aX + b) = a^2 Var(X) \]
  • Sum of Independent Variables: If \(X \perp Y\), then:
\[ Var(X \pm Y) = Var(X) + Var(Y) \]
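
A quick exact check of the shortcut formula and the linear-transformation property for a single fair die:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)

def E(g):
    """Expectation of g(X) for a fair six-sided die X."""
    return sum(g(x) * p for x in faces)

# Var(X) = E(X^2) - [E(X)]^2
var_X = E(lambda x: x * x) - E(lambda x: x) ** 2

# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a is squared.
a, b = 3, 5
var_aXb = E(lambda x: (a * x + b) ** 2) - E(lambda x: a * x + b) ** 2
```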

Covariance

Core Definition: Measures the direction and strength of the linear relationship between two random variables.

  • Core Formula:
\[ Cov(X, Y) = E[(X - E(X))(Y - E(Y))] \]
  • Special Case: When \(X=Y\), \(Cov(X, X) = Var(X)\).

Independence

Binary Independence

Core Concept: The most fundamental criterion for determining whether two events are statistically associated (note that independence is a statistical notion and says nothing about causation).

  • Core Definition: $$ P(A \cap B) = P(A)P(B) $$
  • Equivalent Conditional Probability Criterion: $$ P(A|B) = P(A) \quad (\text{provided } P(B) > 0) $$
  • Property: If \(A\) and \(B\) are independent, then their complementary event combinations (e.g., \(\bar{A}\) and \(B\)) are also independent.

Pairwise Independence

Core Concept: In a set of events, any two events satisfy independence, but this does not necessarily extend to the group as a whole.

  • Condition: For an event set \(\{A_1, A_2, \dots, A_n\}\), the following holds: $$ P(A_i A_j) = P(A_i)P(A_j) \quad (\forall i \neq j) $$
  • Limitation: This does not imply the multiplicative probability relationship for three or more events occurring simultaneously.

Mutual Independence

Core Concept: The strictest form of independence, requiring that the occurrence of any subset of events does not affect the remaining events.

  • Core Definition: For any \(k\) events among \(n\) events (\(2 \le k \le n\)), the following holds: $$ P(A_{i_1} A_{i_2} \dots A_{i_k}) = P(A_{i_1})P(A_{i_2}) \dots P(A_{i_k}) $$
  • Implication: Mutual independence \(\implies\) pairwise independence (the converse does not hold).

Independence of Random Variables

Core Concept: Extends the notion of event independence to distribution functions of continuous or discrete random variables.

  • Definition (Distribution Functions): $$ F(x, y) = F_X(x)F_Y(y) $$
  • Definition (Density Functions): $$ f(x, y) = f_X(x)f_Y(y) $$

Conditional Independence

Given that a third variable \(C\) has been observed, events \(A\) and \(B\) become independent.

  • Core Definition: $$ P(AB|C) = P(A|C)P(B|C) $$
  • Equivalent Expression: \(P(A|BC) = P(A|C)\) (provided \(P(BC) > 0\))

Correlation Coefficient

A dimensionless measure of the degree of linear dependence between two random variables.

  • Formula: $$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}} $$
  • Properties: \(|\rho_{XY}| \le 1\); if \(|\rho_{XY}|=1\), then \(X\) and \(Y\) are (almost surely) linearly related; if \(\rho_{XY}=0\), they are uncorrelated, which does not by itself imply independence.
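
A simulation sketch: when \(Y\) is an almost-linear function of \(X\), the sample correlation approaches 1 (the slope 2.0 and noise scale 0.1 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.1, size=10_000)  # nearly linear in x

# Sample covariance and correlation coefficient, per the formulas above.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())
```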

Bayes' Theorem

Conditional Probability

The probability that event \(A\) occurs given that event \(B\) has already occurred.

  • Condition: \(P(B) > 0\)
  • Formula: $$ P(A|B) = \frac{P(AB)}{P(B)} $$
  • Multiplication Rule: \(P(AB) = P(B)P(A|B) = P(A)P(B|A)\)

Bayes' Theorem

Describes how to update one's prior belief about a hypothesis after observing new evidence (reasoning from effect to cause).

  • Core Formula: $$ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} $$
  • Terminology: \(P(A_i)\) is the prior probability, \(P(A_i|B)\) is the posterior probability, and \(P(B|A_i)\) is the likelihood.
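
The classic diagnostic-test example (with hypothetical numbers) shows how a rare prior keeps the posterior low even for an accurate test:

```python
# Hypothetical diagnostic-test numbers, for illustration only.
p_disease = 0.01            # prior P(A_1): disease prevalence
p_pos_given_disease = 0.95  # likelihood P(B | A_1): sensitivity
p_pos_given_healthy = 0.05  # P(B | A_2): false-positive rate

# Denominator via the law of total probability, then Bayes' theorem:
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
posterior = p_pos_given_disease * p_disease / p_pos
```

Despite 95% sensitivity, the posterior is only about 16%, because most positive results come from the much larger healthy population.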

Law of Large Numbers

Describes the statistical law that the sample mean converges to the expected value as the number of trials tends to infinity.

  • Chebyshev's Law of Large Numbers: For a sequence of independent variables with finite variance, the arithmetic mean converges in probability to the expected value. $$ \lim_{n \to \infty} P\left( \left| \frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n E(X_i) \right| < \epsilon \right) = 1 $$
  • Bernoulli's Law of Large Numbers: The frequency of an event converges in probability to its probability \(p\).

Central Limit Theorem

The sum of a large number of independent and identically distributed random variables approaches a normal distribution, regardless of the original distribution.

  • i.i.d. Case: Let \(X_1, \dots, X_n\) be independent and identically distributed with mean \(\mu\) and variance \(\sigma^2\).
  • Standardized Convergence Formula: $$ \lim_{n \to \infty} P\left( \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \le x \right) = \Phi(x) $$
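
A simulation sketch of the standardized-sum convergence, using uniform(0, 1) summands (mean \(1/2\), variance \(1/12\)), far from normal individually:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 500, 5_000

# Sums of n uniform(0,1) variables, one sum per trial.
sums = rng.random((trials, n)).sum(axis=1)

# Standardize exactly as in the formula: (sum - n*mu) / (sigma * sqrt(n)).
z = (sums - n * 0.5) / np.sqrt(n / 12)

# Empirical P(Z <= 0) should be close to Phi(0) = 0.5.
frac = np.mean(z <= 0)
```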

Multivariate Random Variables

The study of the joint cumulative distribution and individual marginal distributions of multiple random variables defined on the same sample space.

Distributions

  • Joint Distribution: \(F(x, y) = P(X \le x, Y \le y)\)
  • Marginal Distribution: The single-variable distribution obtained by integrating out the other variables. $$ f_X(x) = \int_{-\infty}^{\infty} f(x, y) dy $$

Covariance Matrix

Arranges the variances and covariances among all dimensions of a multivariate random variable in matrix form, characterizing the dispersion and correlation of the vector.

  • Matrix Definition: For a random vector \(\mathbf{X} = [X_1, X_2, \dots, X_n]^T\): $$ \Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])] $$
  • Properties: The covariance matrix is symmetric and positive semi-definite.

Maximum Likelihood Estimation (MLE)

Given data \(D = \{x_1, \ldots, x_n\}\), the likelihood function of parameter \(\theta\):

\[L(\theta) = \prod_{i=1}^{n} p(x_i | \theta)\]

Log-likelihood:

\[\ell(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta)\]

MLE: \(\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)\)

Example: MLE for the Gaussian Distribution

\[\hat{\mu} = \frac{1}{n}\sum x_i, \quad \hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \hat{\mu})^2\]
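
A simulation check of these estimators, with true parameters \(\mu = 3\), \(\sigma = 2\) chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=50_000)

# Gaussian MLE: sample mean and the (biased, 1/n) sample variance.
mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)
```

Note that \(\hat{\sigma}^2\) uses the \(1/n\) normalizer, so it is a biased (though consistent) estimator of \(\sigma^2\).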

KL Divergence and Information-Theoretic Metrics

KL Divergence (relative entropy):

\[D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_P\left[\log \frac{P}{Q}\right]\]

Properties: \(D_{KL} \geq 0\) (Gibbs' inequality), asymmetric.

Cross-Entropy:

\[H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)\]

In deep learning, minimizing cross-entropy is equivalent to minimizing KL divergence (since \(H(P)\) is a constant).
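
A small numeric check of the identity \(H(P, Q) = H(P) + D_{KL}(P \| Q)\), with toy distributions:

```python
import math

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))    # D_KL(P || Q)
h_p = -sum(p * math.log(p) for p in P)                 # entropy H(P)
cross = -sum(p * math.log(q) for p, q in zip(P, Q))    # cross-entropy H(P, Q)
```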

Additional Common Distributions

Beta Distribution

\[\text{Beta}(\alpha, \beta): \quad p(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in [0,1]\]

The conjugate prior for Bernoulli/Binomial distributions. \(\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}\)

Gamma Distribution

\[\text{Gamma}(\alpha, \beta): \quad p(x) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}, \quad x > 0\]

The conjugate prior for the Poisson distribution. The exponential distribution is a special case: \(\text{Gamma}(1, \lambda)\).

Dirichlet Distribution

\[\text{Dir}(\alpha_1, \ldots, \alpha_K): \quad p(x) = \frac{\Gamma(\sum \alpha_k)}{\prod \Gamma(\alpha_k)} \prod x_k^{\alpha_k - 1}\]

The conjugate prior for the Multinomial distribution, used in topic models (LDA).

Fundamentals of Bayesian Inference

Bayes' formula from the parameter estimation perspective:

\[p(\theta | D) = \frac{p(D | \theta) \cdot p(\theta)}{p(D)}\]
  • Prior \(p(\theta)\): Prior belief about the parameters
  • Likelihood \(p(D|\theta)\): The probability of the data given the parameters
  • Posterior \(p(\theta|D)\): Updated belief after observing data
  • MAP Estimate: \(\hat{\theta}_{MAP} = \arg\max_\theta p(\theta|D)\)

MLE vs MAP: MAP = MLE + regularization (the prior acts as a regularization term).

See Information Theory for more on entropy and mutual information. See Bayesian Learning for MCMC and variational inference.

