
Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a probabilistic model. In simple terms, MLE takes the data we already have (the observed outcomes) and works backward to find the model parameters most likely to have produced that data (the cause). It is therefore a method of reasoning from effect to cause.

Whether it is something as simple as computing a mean or as complex as training ChatGPT, the core logic follows these steps:

  1. Design the model — assume the data follows a certain distribution, e.g., \(P(x|\theta)\), where \(\theta\) is the unknown parameter.
  2. Write the likelihood: multiply the probabilities of all data points to get the joint probability of observing the entire dataset.
  3. Take the logarithm (Log-Likelihood): for computational convenience, convert the product into a sum.
  4. Maximize: for simple models, solve analytically; for complex models, use gradient descent to approximate the optimum.
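These four steps can be sketched in miniature in Python. The dataset below is hypothetical, the variance is assumed known (\(\sigma = 1\)) so that only \(\theta = \mu\) is unknown, and a crude grid search stands in for the "maximize" step:

```python
import math

# Step 1: model -- assume data ~ Normal(theta, 1), theta unknown (hypothetical data)
data = [2.1, 1.9, 2.4, 2.0, 1.6]

def log_likelihood(theta):
    # Steps 2-3: joint probability of all points, taken in log form (product -> sum)
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2 for x in data)

# Step 4: maximize -- a grid search over theta in [0, 5) with step 0.001
best = max((theta / 1000 for theta in range(0, 5000)), key=log_likelihood)
print(round(best, 2))
```

For a normal model with known variance, the maximizer coincides with the sample mean, which the grid search recovers to within its step size.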

Normal Distribution Example

Here is an example involving the normal distribution. Imagine you are the manager of an orange juice factory. Your bottles are labeled with a net content of 500 ml. Regulations state that no more than 1% of bottles may contain less than 500 ml, or you face a hefty fine from the market regulator (i.e., the defect rate must not exceed 1%).

Your filling machine is imperfect — sometimes it fills 502 ml, sometimes 498 ml. The machine has two dials:

  1. Target setpoint (\(\mu\)): the volume the machine aims to fill.
  2. Precision / stability (\(\sigma\)): the more expensive and better-maintained the machine, the smaller the variation (\(\sigma\)).

After a recent repair, you randomly sample 10 bottles and measure their actual volumes. The results are (in ml):

\[ 502, 501, 499, 503, 498, 502, 500, 504, 497, 504 \]

You need to answer two questions:

  1. What is the current state of the machine? (Find \(\mu\) and \(\sigma\)) — because after the repair, we do not know the machine's state.
  2. Will we be fined? (Compute the probability \(P(X < 500)\))

Step 1: Model Assumption and Parameter Estimation

At a glance, the values seem to hover around 500, with the mean slightly above it. But is it safe? We cannot tell by eye alone. We assume that filling errors result from the superposition of countless tiny factors (voltage fluctuations, air pressure, mechanical wear). We therefore assume the data follows a normal distribution. The model is written as: \(X \sim N(\mu, \sigma^2)\), with two unknown parameters:

  • \(\mu\) (mean): represents the machine's true aim point.
  • \(\sigma\) (standard deviation): represents the true magnitude of the machine's fluctuations.

Step 2: Write the Likelihood Function

Since these 10 data points (502, 501, ...) have actually been observed, the machine's parameters (\(\mu\) and \(\sigma\)) must be the ones that maximize the probability of this dataset occurring. In other words, having seen the outcomes, we want to find the parameter values (here, \(\mu\) and \(\sigma\)) most likely to have caused them.

Under this "effect-to-cause" logic, the essence of the likelihood function is: express the joint probability of all 10 observations as a formula. Since we assume each bottle is filled independently, the joint probability of these 10 specific values is the product of each individual probability.

For any single bottle \(x_i\), its probability (probability density) is given by the normal distribution formula:

\[ f(x_i | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \]

We multiply these 10 probabilities together. The likelihood function is typically denoted \(L(\mu, \sigma)\):

\[ L(\mu, \sigma) = f(x_1 | \mu, \sigma) \times f(x_2 | \mu, \sigma) \times \dots \times f(x_{10} | \mu, \sigma) \]

Expressed using the product symbol \(\prod\):

\[ L(\mu, \sigma) = \prod_{i=1}^{10} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \]

To make its structure clearer, we use the property of exponents (same base, add exponents) to expand it:

\[ L(\mu, \sigma) = \underbrace{\left( \frac{1}{\sqrt{2\pi}\sigma} \right) \times \dots \times \left( \frac{1}{\sqrt{2\pi}\sigma} \right)}_{\text{10 times}} \times e^{\left[ -\frac{(x_1-\mu)^2}{2\sigma^2} - \frac{(x_2-\mu)^2}{2\sigma^2} - \dots - \frac{(x_{10}-\mu)^2}{2\sigma^2} \right]} \]

After simplification, this is the final likelihood function:

\[ L(\mu, \sigma) = \left( \frac{1}{2\pi\sigma^2} \right)^{5} \exp\left( -\sum_{i=1}^{10} \frac{(x_i - \mu)^2}{2\sigma^2} \right) \]

(Note: the prefactor is \(\left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{10}\); squaring eliminates the square root, so it simplifies to \(\left( \frac{1}{2\pi\sigma^2} \right)^{5}\). \(\exp\) denotes \(e\) raised to a power.)

This formula is now a function of \(\mu\) and \(\sigma\):

  • \(x_i\) are known: they are the 10 measured values (502, 501, ...).
  • \(\mu\) and \(\sigma\) are variables: think of them as two dials on a radio.

The goal of maximum likelihood estimation is: keep turning the \(\mu\) and \(\sigma\) dials until the value of \(L\) reaches its maximum. When \(L\) is maximized, the corresponding \(\mu\) and \(\sigma\) are the parameters "most likely to have produced these results."
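This "dial-turning" picture can be made literal with a brute-force sketch in Python: try a grid of candidate \(\mu\) and \(\sigma\) settings and keep the pair that maximizes \(L\) (evaluated in log form for numerical stability). The grid ranges and step sizes here are arbitrary choices for the sketch:

```python
import math

data = [502, 501, 499, 503, 498, 502, 500, 504, 497, 504]

def log_L(mu, sigma):
    # log of the likelihood: the product of normal densities becomes a sum
    return sum(
        -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

# "turn the dials": mu from 498.0 to 504.0 (step 0.1), sigma from 1.00 to 4.00 (step 0.01)
candidates = [(mu / 10, s / 100)
              for mu in range(4980, 5041)
              for s in range(100, 401)]
best_mu, best_sigma = max(candidates, key=lambda p: log_L(*p))
print(best_mu, best_sigma)  # → 501.0 2.32
```

The grid search lands on the same values the calculus in the next steps produces exactly, which is a useful sanity check for models where no closed form exists.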

Step 3: Take the Logarithm

Mathematically, we need to multiply the probabilities of these 10 data points together. The probability density function of the normal distribution \(f(x)\) involves \(e\) raised to a power.

\[ L(\mu, \sigma) = \prod_{i=1}^{10} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \]

If you try to differentiate this product directly, you will quickly find it unmanageable. So we take the logarithm (\(\ln\)), using the property \(\ln(A \cdot B) = \ln A + \ln B\) to convert the product into a sum:

\[ \ell(\mu, \sigma) = \ln(L) = \sum_{i=1}^{10} \left[ \ln\left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \frac{(x_i - \mu)^2}{2\sigma^2} \right] \]

Tidying up, writing \(n = 10\) for the sample size:

\[ \ell(\mu, \sigma) = - \frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \]

Step 4: Maximize

To find the \(\mu\) that maximizes the probability, we differentiate with respect to \(\mu\) and set the derivative to zero.

Notice that only the last term depends on \(\mu\). Taking the partial derivative with respect to \(\mu\):

\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{10} (x_i - \mu) = 0 \]

Solving the equation:

\[ \sum x_i - 10\mu = 0 \implies \mu = \frac{\sum x_i}{10} \]

This result tells us that the optimal \(\mu\) is simply the sample mean.

Similarly, taking the partial derivative with respect to \(\sigma^2\) yields:

\[ \sigma^2 = \frac{1}{10} \sum (x_i - \mu)^2 \]

This is the maximum-likelihood estimate of the variance (note that it divides by \(n = 10\), not the \(n - 1\) of the unbiased sample variance); its square root is the corresponding standard deviation.
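Given these closed-form results, the estimates reduce to two one-line computations. A minimal sketch in plain Python:

```python
data = [502, 501, 499, 503, 498, 502, 500, 504, 497, 504]

n = len(data)
mu = sum(data) / n                            # MLE of the mean: the sample average
var = sum((x - mu) ** 2 for x in data) / n    # MLE of the variance (divides by n, not n-1)
sigma = var ** 0.5

print(mu, round(sigma, 2))  # → 501.0 2.32
```

For reference, NumPy's `np.std(data)` uses this same divide-by-\(n\) convention by default (`ddof=0`), so it reproduces the MLE directly.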

Substituting our ten samples, we get \(\mu\) (mean):

\[ (502+501+499+503+498+502+500+504+497+504) / 10 = \mathbf{501.0} \]

For \(\sigma\) (standard deviation): the squared deviations from 501 sum to 54, so the variance is \(54/10 = 5.4\) and \(\sigma = \sqrt{5.4} \approx 2.32\).

We can now conclude: the machine is aimed at 501 ml, but due to post-repair stability issues, the variation (standard deviation) is 2.32 ml.

As for the second question — will we be fined?

Compute the Z-score (how many standard deviations 500 is from the mean):

\[ Z = \frac{500 - 501}{2.32} \approx -0.43 \]

Looking up the standard normal distribution table: \(P(Z < -0.43) \approx \mathbf{33.36\%}\)

This means about a third of the machine's output (roughly 33%) is under-filled, more than thirty times the allowed 1% defect rate: a serious compliance failure.
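The defect-rate computation can be checked without a printed table using Python's standard library (`statistics.NormalDist`); any small difference from the table figure comes from rounding the Z-score to two decimals:

```python
from statistics import NormalDist

mu, sigma = 501.0, 5.4 ** 0.5                  # MLE estimates from the 10 bottles
defect_rate = NormalDist(mu, sigma).cdf(500)   # P(X < 500)
print(f"{defect_rate:.1%}")                    # roughly a third of all bottles
```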

Since the current \(\sigma = 2.32\) is far too large, you have two options:

  1. Repair the machine: reduce \(\sigma\) to below 0.43 (keeping \(\mu = 501\) unchanged).
  2. Increase the fill volume: keep \(\sigma = 2.32\) unchanged, but raise the aim point \(\mu\) to above 505.4 ml.

The reasoning: in a normal distribution, if we want the left-tail area (defect rate) to be only 1%, the standard normal table gives a critical constant: \(Z = -2.326\). This means: 500 ml must lie at least 2.326 standard deviations (\(\sigma\)) to the left of the mean \(\mu\). Expressed as a formula:

\[ -2.326 = \frac{500 - \mu}{\sigma} \]

If we decide not to change the aim point (keep \(\mu = 501\)) and instead improve stability through repair (reduce \(\sigma\)):

Substituting the known values:

\[ -2.326 = \frac{500 - 501}{\sigma} \]
\[ -2.326 = \frac{-1}{\sigma} \]
\[ \sigma = \frac{1}{2.326} \approx \mathbf{0.43} \]

Conclusion: If you do not want to give away extra juice (keeping the aim at 501 ml), you must reduce the machine's variation (\(\sigma\)) from 2.32 down to 0.43. This requires a dramatic improvement in machine precision.
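The critical value can also be obtained in Python rather than from a table, and the required \(\sigma\) solved directly (a sketch using the standard library's `statistics.NormalDist`):

```python
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(0.01)   # 1% left-tail critical value, about -2.326
sigma_max = (500 - 501) / z_crit      # keep mu = 501 and solve -2.326 = (500 - mu) / sigma
print(round(sigma_max, 2))            # → 0.43, the largest acceptable sigma
```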

The brute-force alternative is to accept that the machine cannot be fixed (\(\sigma = 2.32\) stays) and, to remain compliant, raise the aim point (increase \(\mu\)) — essentially trading extra juice for regulatory compliance.

Substituting the known values:

\[ -2.326 = \frac{500 - \mu}{2.32} \]

Solving:

\[ -2.326 \times 2.32 = 500 - \mu \]
\[ -5.396 = 500 - \mu \]
\[ \mu = 500 + 5.396 \approx \mathbf{505.4} \]

Conclusion: If the machine remains this unstable, to ensure that 99% of bottles contain more than 500 ml, the aim point must be set to 505.4 ml. This means you would be giving away an average of 5.4 ml extra per bottle.
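Option 2 can be checked the same way, solving the critical-value equation for \(\mu\) instead (again a standard-library sketch):

```python
from statistics import NormalDist

sigma = 5.4 ** 0.5                    # current machine variation, about 2.32
z_crit = NormalDist().inv_cdf(0.01)   # about -2.326
mu_min = 500 - z_crit * sigma         # solve -2.326 = (500 - mu) / sigma for mu
print(round(mu_min, 1))               # → 505.4, the lowest compliant aim point
```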

