Probability Theory
Fundamental Concepts
Probability Space
Core Definition: A triple \((\Omega, \mathcal{F}, P)\) that provides the rigorous mathematical structure for describing a random experiment.
- Sample Space (\(\Omega\)): The set of all possible outcomes of the experiment.
- Event Space (\(\mathcal{F}\)): The collection of all possible events (subsets of the sample space).
- Probability Function (\(P\)): A function that maps events to real numbers in \([0, 1]\).
Probability Axioms (Laws of Probability):
- Non-negativity: \(P(A) \ge 0\) for every event \(A \in \mathcal{F}\).
- Normalization: \(P(\Omega) = 1\).
- Addition Rule for Mutually Exclusive Events: If \(A \cap B = \emptyset\), then: $$ P(A \cup B) = P(A) + P(B) $$
Random Variable
Core Definition: A function that assigns a real numerical value to each outcome in the sample space.
- Notation: \(P(X=a)\) denotes the probability that the random variable \(X\) takes the value \(a\).
- Range: Denoted as \(Val(X)\).
- Types:
  - Discrete: The set of possible values is finite or countably infinite.
  - Continuous: Values are taken from a continuous interval.
Types of Probability Distributions
Core Definition: Mathematical expressions that describe the likelihood of a random variable taking different values.
- Probability Distribution \(P(X)\): Describes the likelihood of a single random variable \(X\) taking each value.
- Joint Distribution \(P(X, Y)\): Describes the likelihood of multiple random variables simultaneously taking specific values.
- Marginal Distribution \(P(X)\): The distribution of a single variable obtained by summing (or integrating) over the other variables in a joint distribution.
- Conditional Distribution \(P(X|Y)\): The probability distribution of variable \(X\) given that variable \(Y\) has been observed.
Definition of Conditional Probability: $$ P(X|Y) = \frac{P(X, Y)}{P(Y)} $$
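The relationships among joint, marginal, and conditional distributions can be made concrete on a tiny table. A minimal sketch with a hypothetical joint distribution over two binary variables (the numbers are made up for illustration):

```python
# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginal P(X): sum the joint over Y.
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# Conditional P(Y | X=1): joint divided by the marginal, P(Y|X) = P(X, Y) / P(X).
p_y_given_x1 = {y: joint[(1, y)] / p_x[1] for y in (0, 1)}

print(p_x)            # {0: 0.5, 1: 0.5}
print(p_y_given_x1)   # {0: 0.2, 1: 0.8}
```

Note that the conditional distribution renormalizes one row of the joint table so that it sums to 1.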
Bernoulli Distribution
The Bernoulli Distribution is the simplest discrete distribution, describing a single random trial with only two possible outcomes (e.g., a coin flip): \(P(X=1) = p\) and \(P(X=0) = 1-p\).
- Expectation: \(E[X] = p\)
- Variance: \(\text{Var}(X) = p(1-p)\)
Over \(n\) independent Bernoulli trials, the number of successes follows a Binomial Distribution: \(X \sim B(n, p)\).
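The moment formulas above can be checked by simulation. A minimal sketch (parameter values chosen arbitrarily): the sample mean of many Bernoulli draws should approach \(p\), and the sample variance should approach \(p(1-p)\).

```python
import random

random.seed(0)
p, n_trials = 0.3, 100_000

# Simulate Bernoulli(p) trials as 0/1 outcomes.
samples = [1 if random.random() < p else 0 for _ in range(n_trials)]

# Empirical mean and variance should be close to p and p(1 - p) = 0.21.
mean = sum(samples) / n_trials
var = sum((x - mean) ** 2 for x in samples) / n_trials
print(round(mean, 2), round(var, 2))
```

The count `sum(samples)` is itself one draw from the Binomial \(B(n, p)\), since it is the number of successes in \(n\) independent trials.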
Gaussian Distribution
The Gaussian Distribution, also known as the Normal Distribution, is the most important continuous probability distribution.
Denoted as \(X \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance.
Standard Normal Distribution: When \(\mu=0, \sigma=1\), we have \(X \sim \mathcal{N}(0, 1)\).
The importance of the Gaussian distribution in machine learning:
- Central Limit Theorem: The sum of a large number of independent random variables approximately follows a normal distribution.
- Maximum Entropy Principle: Given constraints on the mean and variance, the normal distribution has the maximum entropy among all continuous distributions.
- In Bayesian inference, the normal distribution is commonly used as a prior.
Multivariate Gaussian Distribution: \(\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\boldsymbol{\Sigma}\) is the covariance matrix.
Multinomial Distribution
The Multinomial Distribution is a generalization of the binomial distribution, describing how many times each of \(k\) possible outcomes occurs in \(n\) independent trials: $$ P(X_1 = n_1, \dots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!} \prod_{i=1}^{k} p_i^{n_i} $$ where \(\sum_{i=1}^k n_i = n\) and \(\sum_{i=1}^k p_i = 1\).
In NLP, the multinomial distribution is commonly used to model word frequencies (Bag-of-Words model). When \(n=1\), it reduces to the Categorical Distribution, i.e., the probability distribution output by Softmax.
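The Softmax connection can be shown directly: Softmax turns an arbitrary real-valued score vector into a valid categorical distribution. A minimal sketch (the logit values are arbitrary):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A categorical distribution over 3 classes, one draw per trial (n = 1).
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Each entry is non-negative and the entries sum to 1, which is exactly the constraint \(\sum_i p_i = 1\) on the multinomial parameters.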
Core Rules Connecting Probabilities
Core Definition: The computational logic that establishes conversion relationships among marginal, joint, and conditional probabilities.
Sum Rule
- Purpose: Used to derive marginal distributions from joint distributions (eliminating nuisance variables).
- Discrete Form: $$ P(X) = \sum_{Y} P(X, Y) $$
- Continuous Form: $$ f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy $$
Product Rule / Chain Rule
- Purpose: Used to decompose a joint distribution into the product of a marginal distribution and a conditional distribution.
- Core Formula: $$ P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X) $$
- General Form: $$ P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, \dots, X_{i-1}) $$
Bayes' Theorem
- Core Definition: Describes how prior probability is transformed into posterior probability after observing new data.
- Core Formula: $$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
Law of Total Probability
Core Definition: Computes the total probability of a target event through a weighted average of conditional probabilities over a complete partition of events.
- Condition: The event set \(B_1, B_2, \dots, B_n\) must be mutually exclusive and collectively exhaustive (i.e., their union is the entire sample space \(\Omega\)).
- Core Formula: $$ P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i) $$
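A small numeric example makes the weighted average concrete. The scenario and all numbers below are hypothetical: two machines produce items at different rates and with different defect probabilities, and the machines form a complete partition of "which machine made the item".

```python
# Hypothetical numbers: machine A makes 60% of items with a 1% defect rate,
# machine B makes 40% of items with a 5% defect rate.
p_machine = {"A": 0.6, "B": 0.4}           # partition: P(B_i)
p_defect_given = {"A": 0.01, "B": 0.05}    # conditionals: P(defect | B_i)

# Law of total probability: P(defect) = sum_i P(defect | B_i) P(B_i)
p_defect = sum(p_defect_given[m] * p_machine[m] for m in p_machine)
print(p_defect)  # ≈ 0.026
```

The result lies between the two conditional rates, weighted by how much each machine produces.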
Probability Distributions
Discrete Distribution: Probability Mass Function (PMF)
Core Concept: A discrete distribution is defined by directly assigning a specific probability "mass" to each possible value of the random variable.
- Definition: Probability Mass Function (PMF), denoted as \(P(X=x)\).
- Requirements:
- Non-negativity: \(P(X=x_i) \ge 0\)
- Normalization: The sum of probabilities over all possible values must equal 1.
- Formula: $$ \sum_{i} P(X = x_i) = 1 $$
- Example (Fair Coin Flip): \(P(X=\text{heads}) = P(X=\text{tails}) = 0.5\), and the two probabilities sum to \(1\).
Continuous Distribution: Probability Density Function (PDF)
Core Concept: A continuous distribution cannot assign probability to individual points (the probability of any single point is 0); instead, it is defined by describing the "density" of probability over intervals.
- Definition: Probability Density Function (PDF), denoted as \(f(x)\).
- Requirements:
- Non-negativity: \(f(x) \ge 0\)
- Normalization: The integral over the entire domain must equal 1.
- Formula (Interval Probability): $$ P(a \le X \le b) = \int_{a}^{b} f(x) \, dx $$
- Extension to Joint Distributions: $$ P((X, Y) \in R) = \iint_{R} f(x, y) \, dx \, dy $$
Cumulative Distribution Function (CDF)
Core Concept: Describes the total probability that a random variable falls within the range less than or equal to a specific value \(x\).
- Definition: Cumulative Distribution Function (CDF), denoted as \(F(x)\).
- Relationship between PDF and CDF: $$ F(x) = P(X \le x) = \int_{-\infty}^{x} f(t) \, dt $$
- Inverse Derivation: $$ f(x) = F'(x) $$
Expectation / Variance
Expectation
Core Definition: The first moment reflecting the average value of a random variable; its physical interpretation is the "center of mass" of the probability distribution.
- Basic Definition: $$ E[X] = \sum_{x} x \, P(X=x) \quad \text{(discrete)}, \qquad E[X] = \int_{-\infty}^{\infty} x f(x) \, dx \quad \text{(continuous)} $$
- Linearity of Expectation (General Rule): Regardless of whether the variables are independent, the expectation of a sum equals the sum of expectations: $$ E[X + Y] = E[X] + E[Y] $$
- Product of Independent Variables: If \(X \perp Y\), then: $$ E[XY] = E[X]E[Y] $$
Variance
Core Definition: Measures the degree of deviation of a random variable from its expected value, i.e., the dispersion of the distribution.
- Basic Definition: $$ \text{Var}(X) = E\left[(X - E[X])^2\right] $$
- Commonly Used Formula in Machine Learning: $$ \text{Var}(X) = E[X^2] - (E[X])^2 $$
- Linear Transformation Property: $$ \text{Var}(aX + b) = a^2 \text{Var}(X) $$
- Sum of Independent Variables: If \(X \perp Y\), then: $$ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) $$
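These identities can be verified exactly on a small discrete distribution. A minimal sketch using a fair six-sided die (the die is just an illustrative choice); exact rational arithmetic avoids floating-point noise:

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
values = range(1, 7)
p = Fraction(1, 6)

e_x = sum(p * x for x in values)         # E[X]   = 7/2
e_x2 = sum(p * x * x for x in values)    # E[X^2] = 91/6
var_x = e_x2 - e_x ** 2                  # Var(X) = E[X^2] - (E[X])^2 = 35/12

# Check Var(aX + b) = a^2 Var(X) by direct computation.
a, b = 3, 5
mean_ax_b = a * e_x + b
var_ax_b = sum(p * (a * x + b - mean_ax_b) ** 2 for x in values)
print(var_x, var_ax_b)  # 35/12 and 105/4 (= 9 * 35/12)
```

The shift \(b\) drops out entirely, and the scale \(a\) enters squared, matching the linear transformation property.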
Covariance
Core Definition: Measures the direction and strength of the linear relationship between two random variables.
- Core Formula: $$ \text{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y] $$
- Special Case: When \(X=Y\), \(\text{Cov}(X, X) = \text{Var}(X)\).
Independence
Binary Independence
Core Concept: The most fundamental criterion for determining whether two events have a causal or statistical association.
- Core Definition: $$ P(A \cap B) = P(A)P(B) $$
- Equivalent Conditional Probability Criterion: $$ P(A|B) = P(A) \quad (\text{provided } P(B) > 0) $$
- Property: If \(A\) and \(B\) are independent, then their complementary event combinations (e.g., \(\bar{A}\) and \(B\)) are also independent.
Pairwise Independence
Core Concept: In a set of events, any two events satisfy independence, but this does not necessarily extend to the group as a whole.
- Condition: For an event set \(\{A_1, A_2, \dots, A_n\}\), the following holds: $$ P(A_i A_j) = P(A_i)P(A_j) \quad (\forall i \neq j) $$
- Limitation: This does not imply the multiplicative probability relationship for three or more events occurring simultaneously.
Mutual Independence
Core Concept: The strictest form of independence, requiring that the occurrence of any subset of events does not affect the remaining events.
- Core Definition: For any \(k\) events among \(n\) events (\(2 \le k \le n\)), the following holds: $$ P(A_{i_1} A_{i_2} \dots A_{i_k}) = P(A_{i_1})P(A_{i_2}) \dots P(A_{i_k}) $$
- Implication: Mutual independence \(\implies\) pairwise independence (the converse does not hold).
Independence of Random Variables
Core Concept: Extends the notion of event independence to distribution functions of continuous or discrete random variables.
- Definition (Distribution Functions): $$ F(x, y) = F_X(x)F_Y(y) $$
- Definition (Density Functions): $$ f(x, y) = f_X(x)f_Y(y) $$
Conditional Independence
Given an observed third variable \(C\), events \(A\) and \(B\) become independent.
- Core Definition: $$ P(AB|C) = P(A|C)P(B|C) $$
- Equivalent Expression: \(P(A|BC) = P(A|C)\)
Correlation Coefficient
A dimensionless measure of the degree of linear dependence between two random variables.
- Formula: $$ \rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}} $$
- Properties: \(|\rho_{XY}| \le 1\); if \(|\rho_{XY}|=1\), then \(X\) and \(Y\) are linearly dependent; if \(\rho_{XY}=0\), then they are uncorrelated, which is weaker than independence.
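The extreme case \(|\rho_{XY}|=1\) is easy to demonstrate numerically. A minimal sketch with hypothetical paired data where \(Y = 2X\) exactly:

```python
import math

# Toy paired data with a perfect linear relationship: ys = 2 * xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Covariance and variances (1/n convention), then the correlation coefficient.
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
var_x = sum((x - mx) ** 2 for x in xs) / n
var_y = sum((y - my) ** 2 for y in ys) / n
rho = cov / math.sqrt(var_x * var_y)
print(rho)  # 1.0: perfectly linearly dependent
```

Scaling \(Y\) by any positive constant leaves \(\rho\) unchanged, which is why it is called a dimensionless measure.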
Bayes' Theorem
Conditional Probability
The probability that event \(A\) occurs given that event \(B\) has already occurred.
- Condition: \(P(B) > 0\)
- Formula: $$ P(A|B) = \frac{P(AB)}{P(B)} $$
- Multiplication Rule: \(P(AB) = P(B)P(A|B) = P(A)P(B|A)\)
Bayes' Theorem
Describes how to update one's prior belief about a hypothesis after observing new evidence (reasoning from effect to cause).
- Core Formula: $$ P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} $$
- Terminology: \(P(A_i)\) is the prior probability, \(P(A_i|B)\) is the posterior probability, and \(P(B|A_i)\) is the likelihood.
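A classic numeric illustration of reasoning from effect to cause is the diagnostic-test example; all numbers below are hypothetical. The denominator is computed with the law of total probability over the partition {disease, healthy}.

```python
# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95   # likelihood of a positive test if diseased
p_pos_given_healthy = 0.05   # false-positive rate

# Evidence P(positive) via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(disease | positive) by Bayes' theorem.
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # ≈ 0.161
```

Despite the accurate-sounding test, the posterior is only about 16%, because the low prior (1% prevalence) dominates: most positives come from the large healthy population.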
Law of Large Numbers
Describes the statistical law that the sample mean converges to the expected value as the number of trials tends to infinity.
- Chebyshev's Law of Large Numbers: For a sequence of independent variables with finite variance, the arithmetic mean converges in probability to the expected value. $$ \lim_{n \to \infty} P\left( \left| \frac{1}{n}\sum_{i=1}^n X_i - \frac{1}{n}\sum_{i=1}^n E(X_i) \right| < \epsilon \right) = 1 $$
- Bernoulli's Law of Large Numbers: The frequency of an event converges in probability to its probability \(p\).
Central Limit Theorem
The sum of a large number of independent and identically distributed random variables approaches a normal distribution, regardless of the original distribution.
- i.i.d. Case: Let \(X_1, \dots, X_n\) be independent and identically distributed with mean \(\mu\) and variance \(\sigma^2\).
- Standardized Convergence Formula: $$ \lim_{n \to \infty} P\left( \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} \le x \right) = \Phi(x) $$
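The theorem can be observed by simulation: standardized sums of i.i.d. Uniform(0, 1) variables (mean \(\mu = 1/2\), variance \(\sigma^2 = 1/12\)) should behave like a standard normal. A minimal sketch; the choice of the uniform distribution and the sample sizes are arbitrary.

```python
import random

random.seed(42)

# Each replicate: standardize the sum of n Uniform(0, 1) draws.
n, reps = 30, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5

zs = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * n ** 0.5)
      for _ in range(reps)]

# For a standard normal: P(Z <= 0) = 0.5, and about 68.3% of mass lies
# within one standard deviation of the mean.
frac_below_0 = sum(z <= 0 for z in zs) / reps
frac_within_1 = sum(abs(z) <= 1 for z in zs) / reps
print(round(frac_below_0, 2), round(frac_within_1, 2))
```

The same experiment with any other distribution that has finite variance gives essentially the same picture, which is the point of the theorem.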
Multivariate Random Variables
The study of the joint cumulative distribution and individual marginal distributions of multiple random variables defined on the same sample space.
Distributions
- Joint Distribution: \(F(x, y) = P(X \le x, Y \le y)\)
- Marginal Distribution: The single-variable distribution obtained by integrating out the other variables. $$ f_X(x) = \int_{-\infty}^{\infty} f(x, y) dy $$
Covariance Matrix
Arranges the variances and covariances among all dimensions of a multivariate random variable in matrix form, characterizing the dispersion and correlation of the vector.
- Matrix Definition: For a random vector \(\mathbf{X} = [X_1, X_2, \dots, X_n]^T\): $$ \Sigma_{ij} = \text{Cov}(X_i, X_j) = E[(X_i - E[X_i])(X_j - E[X_j])] $$
- Properties: The covariance matrix is symmetric and positive semi-definite.
Maximum Likelihood Estimation (MLE)
Given i.i.d. data \(D = \{x_1, \ldots, x_n\}\), the likelihood function of parameter \(\theta\): $$ L(\theta) = P(D|\theta) = \prod_{i=1}^{n} p(x_i | \theta) $$
Log-likelihood: $$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i | \theta) $$
MLE: \(\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)\)
Example: MLE for the Gaussian Distribution: for \(x_i \sim \mathcal{N}(\mu, \sigma^2)\), maximizing \(\ell(\mu, \sigma^2)\) gives $$ \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2 $$
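The Gaussian MLE can be computed and sanity-checked directly: the closed-form estimates are the sample mean and the \(1/n\) (biased) sample variance, and perturbing either one should not increase the log-likelihood. A minimal sketch with hypothetical observations:

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # hypothetical observations

n = len(data)
mu_hat = sum(data) / n                                  # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n   # 1/n sample variance

def log_lik(mu, s2):
    # Gaussian log-likelihood of the data under N(mu, s2).
    return sum(-0.5 * math.log(2 * math.pi * s2) - (x - mu) ** 2 / (2 * s2)
               for x in data)

# The MLE should beat nearby parameter values.
best = log_lik(mu_hat, sigma2_hat)
assert best >= log_lik(mu_hat + 0.1, sigma2_hat)
assert best >= log_lik(mu_hat, sigma2_hat + 0.1)
print(round(mu_hat, 2), round(sigma2_hat, 3))  # 2.0 0.068
```

Note the \(1/n\) divisor: the MLE of the variance is biased; the unbiased estimator divides by \(n-1\) instead.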
KL Divergence and Information-Theoretic Metrics
KL Divergence (relative entropy): $$ D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} $$
Properties: \(D_{KL} \geq 0\) (Gibbs' inequality), asymmetric.
Cross-Entropy: $$ H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{KL}(P \| Q) $$
In deep learning, minimizing cross-entropy is equivalent to minimizing KL divergence (since \(H(P)\) is a constant).
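The decomposition \(H(P, Q) = H(P) + D_{KL}(P \| Q)\) and Gibbs' inequality can both be checked numerically. A minimal sketch with two hypothetical distributions over three symbols:

```python
import math

# Two hypothetical distributions over three symbols.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))          # D_KL(P || Q)
h_p = -sum(p * math.log(p) for p in P)                       # entropy H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q)

# Gibbs' inequality and the entropy decomposition.
print(kl >= 0, abs(cross_entropy - (h_p + kl)) < 1e-9)  # True True
```

Since \(H(P)\) does not depend on \(Q\), minimizing \(H(P, Q)\) over \(Q\) and minimizing \(D_{KL}(P \| Q)\) over \(Q\) pick out the same minimizer, which is the equivalence the text refers to.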
Additional Common Distributions
Beta Distribution
The conjugate prior for the Bernoulli/Binomial distributions, with density \(f(x) \propto x^{\alpha-1}(1-x)^{\beta-1}\) on \([0, 1]\) and mean \(\mathbb{E}[X] = \frac{\alpha}{\alpha+\beta}\).
Gamma Distribution
The conjugate prior for the rate parameter of the Poisson distribution. The exponential distribution is a special case: \(\text{Gamma}(1, \lambda)\).
Dirichlet Distribution
The conjugate prior for the Multinomial distribution, used in topic models (LDA).
Fundamentals of Bayesian Inference
Bayes' formula from the parameter estimation perspective: $$ p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)} $$
- Prior \(p(\theta)\): Prior belief about the parameters
- Likelihood \(p(D|\theta)\): The probability of the data given the parameters
- Posterior \(p(\theta|D)\): Updated belief after observing data
- MAP Estimate: \(\hat{\theta}_{MAP} = \arg\max_\theta p(\theta|D)\)
MLE vs MAP: MAP = MLE + regularization (the prior acts as a regularization term).
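The regularization effect is easy to see in the Beta-Bernoulli model, where both estimates have closed forms: with a \(\text{Beta}(\alpha, \beta)\) prior on a coin's bias and observed counts, the posterior is \(\text{Beta}(\alpha + \text{heads}, \beta + \text{tails})\). A minimal sketch; the counts and hyperparameters are hypothetical.

```python
# Hypothetical coin-flip data and prior.
heads, tails = 7, 3
alpha, beta = 2.0, 2.0   # Beta(2, 2) prior pulls estimates toward 0.5

n = heads + tails
mle = heads / n                                        # argmax of likelihood
map_est = (heads + alpha - 1) / (n + alpha + beta - 2) # mode of the posterior

print(mle, map_est)  # 0.7 vs 0.666...: the prior shrinks the estimate
```

As \(n\) grows, the prior's pseudo-counts are swamped by the data and MAP converges to MLE, just as regularization matters less with more data.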
See Information Theory for more on entropy and mutual information. See Bayesian Learning for MCMC and variational inference.