
Numerical Methods

Numerical methods study how to approximately solve mathematical problems using finite-precision computation. In practice, many mathematical problems (such as evaluating high-dimensional integrals, solving partial differential equations, or handling large-scale linear systems) lack closed-form analytical solutions and must rely on numerical approximation. The core concerns of numerical methods are: accuracy (the gap between the approximate and exact solutions), efficiency (computational complexity), and stability (whether errors amplify during computation).

In the AI domain, numerical methods form the mathematical foundation for deep learning optimizers, Neural ODEs, Physics-Informed Neural Networks (PINNs), and related techniques. These notes cover floating-point representation, interpolation, numerical integration, linear system solvers, nonlinear root finding, ODE numerical solutions, and numerical stability analysis.


1. Floating-Point Representation and Error

1.1 IEEE 754 Floating Point

Computers represent real numbers using a finite number of bits. The IEEE 754 standard defines floating-point formats:

\[ x = (-1)^s \times 1.f \times 2^{e - \text{bias}} \]

| Format   | Sign bits | Exponent bits | Mantissa bits | Bias | Decimal precision |
|----------|-----------|---------------|---------------|------|-------------------|
| float32  | 1         | 8             | 23            | 127  | ~7 digits         |
| float64  | 1         | 11            | 52            | 1023 | ~16 digits        |
| float16  | 1         | 5             | 10            | 15   | ~3 digits         |
| bfloat16 | 1         | 8             | 7             | 127  | ~2 digits         |

Numerical Precision in Deep Learning

  • Training typically uses mixed precision (float16 computation + float32 accumulation), yielding 2-3x speedup.
  • bfloat16 has the same exponent range as float32, reducing overflow risk; it is widely adopted by TPUs and next-generation GPUs.
  • Quantized inference (INT8/INT4) further reduces precision in exchange for speed and memory savings.

1.2 Types of Error

| Error type        | Source                                                    | Example                                            |
|-------------------|-----------------------------------------------------------|----------------------------------------------------|
| Truncation error  | Approximating an infinite series with finitely many terms | Taking the first \(n\) terms of a Taylor expansion |
| Rounding error    | Limited floating-point precision                          | \(0.1 + 0.2 \neq 0.3\) (in float64)                |
| Propagation error | Error accumulation during computation                     | Summing a long sequence of floating-point numbers  |

Absolute error: \(|x - \hat{x}|\), Relative error: \(\frac{|x - \hat{x}|}{|x|}\)
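Both effects can be checked directly in float64 — a minimal sketch of rounding and propagation error:

```python
import math

# Rounding error: 0.1 and 0.2 have no exact binary representation,
# so their float64 sum is not exactly 0.3.
abs_err = abs((0.1 + 0.2) - 0.3)
print(abs_err > 0)        # True: the sum differs from 0.3
print(abs_err < 1e-15)    # True: but only at the level of machine epsilon

# Propagation error: naive left-to-right summation accumulates rounding
# error, while math.fsum tracks it and returns a correctly rounded sum.
print(sum([0.1] * 10) == 1.0)        # False
print(math.fsum([0.1] * 10) == 1.0)  # True
```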


2. Interpolation

The goal of interpolation is to construct a function \(p(x)\) such that \(p(x_i) = y_i\) for all known data points \((x_i, y_i)\).

2.1 Lagrange Interpolation

Given \(n+1\) data points, the Lagrange interpolation polynomial is:

\[ L(x) = \sum_{i=0}^{n} y_i \prod_{j \neq i} \frac{x - x_j}{x_i - x_j} \]
  • Uniqueness: \(n+1\) distinct points uniquely determine an interpolating polynomial of degree at most \(n\)
  • Drawback: Adding new data points requires recomputing all basis functions
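The formula transcribes directly into code; the nested loop also makes the \(O(n^2)\) cost per evaluation visible (a minimal sketch):

```python
def lagrange(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial l_i(x): 1 at x_i, 0 at every other node.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

# Three points on y = x^2 reproduce the parabola exactly:
xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
print(lagrange(xs, ys, 1.5))  # 2.25
```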

2.2 Newton Interpolation

Constructed recursively using divided differences:

\[ N(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \cdots \]

where \(f[x_i, x_{i+1}] = \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i}\)

Advantage: When a new data point is added, only the new divided difference terms need to be computed.
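The divided-difference table can be built in place, and the Newton form evaluated with Horner-style nesting (a minimal sketch):

```python
def divided_differences(xs, ys):
    """Return the Newton coefficients f[x0], f[x0,x1], f[x0,x1,x2], ..."""
    coef = list(ys)
    n = len(xs)
    for k in range(1, n):
        # Update from the bottom up so lower-order differences survive.
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton form with Horner-like nesting."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result

# Same data as before: the polynomial through (0,0), (1,1), (2,4) is x^2.
xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
coef = divided_differences(xs, ys)
print(coef)                       # [0.0, 1.0, 1.0]: N(x) = x + x(x-1) = x^2
print(newton_eval(xs, coef, 1.5))  # 2.25
```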

2.3 Spline Interpolation

High-degree polynomial interpolation may exhibit the Runge phenomenon (severe oscillation near endpoints). Cubic splines use a cubic polynomial on each subinterval, subject to:

  • Interpolation conditions: \(S(x_i) = y_i\)
  • Continuity: \(S, S', S''\) are continuous at knot points
  • Boundary conditions: natural spline (\(S''(x_0) = S''(x_n) = 0\)) or clamped spline

3. Numerical Integration

3.1 The Trapezoidal Rule

Divide the interval \([a, b]\) into \(n\) equal subintervals and approximate using trapezoids:

\[ \int_a^b f(x)\,dx \approx \frac{h}{2}\left[f(a) + 2\sum_{i=1}^{n-1}f(x_i) + f(b)\right], \quad h = \frac{b-a}{n} \]

Error is \(O(h^2)\).

3.2 Simpson's Rule

Uses quadratic polynomial approximation over pairs of subintervals, so the number of subintervals \(n\) must be even:

\[ \int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(a) + 4\sum_{\text{odd } i} f(x_i) + 2\sum_{\text{even } i,\; 0 < i < n} f(x_i) + f(b)\right] \]

Error is \(O(h^4)\), substantially more accurate than the trapezoidal rule.
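Both composite rules are a few lines each; on \(\int_0^\pi \sin x\,dx = 2\) with \(n = 100\), the observed errors match the stated orders (a minimal sketch):

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule on n equal subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s

def simpson(f, a, b, n):
    """Composite Simpson's rule; n must be even."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))  # odd interior nodes
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))  # even interior nodes
    return h * s / 3

print(abs(trapezoid(math.sin, 0, math.pi, 100) - 2))  # ~1.6e-4, O(h^2)
print(abs(simpson(math.sin, 0, math.pi, 100) - 2))    # ~1.1e-8, O(h^4)
```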

3.3 Gaussian Quadrature

Selects optimal nodes (Gauss points) so that \(n\) points can exactly integrate polynomials of degree up to \(2n-1\):

\[ \int_{-1}^{1} f(x)\,dx \approx \sum_{i=1}^{n} w_i f(x_i) \]

| Points | Max exact polynomial degree | Typical nodes                             |
|--------|-----------------------------|-------------------------------------------|
| 1      | 1                           | \(x_1 = 0\)                               |
| 2      | 3                           | \(x_{1,2} = \pm 1/\sqrt{3}\)              |
| 3      | 5                           | \(x_1 = 0,\; x_{2,3} = \pm\sqrt{3/5}\)    |
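The two-point rule (nodes \(\pm 1/\sqrt{3}\), both weights equal to 1) already integrates cubics exactly; a general interval is handled by an affine change of variables (a minimal sketch):

```python
import math

def gauss2(f, a=-1.0, b=1.0):
    """Two-point Gauss-Legendre rule, mapped from [-1, 1] to [a, b]."""
    x = 1 / math.sqrt(3)
    mid, half = (a + b) / 2, (b - a) / 2
    return half * (f(mid - half * x) + f(mid + half * x))

# Exact (up to rounding) for polynomials of degree <= 3:
print(gauss2(lambda t: t**3 + t**2))       # 2/3 over [-1, 1]
print(gauss2(lambda t: t**3, 0.0, 1.0))    # 0.25 over [0, 1]
```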

4. Linear Systems

4.1 Gaussian Elimination

Transform the augmented matrix \([A | b]\) into upper triangular form via row operations, then solve by back-substitution:

  • Time complexity: \(O(n^3)\)
  • Partial pivoting: At each step, select the element with the largest absolute value in the column as the pivot to reduce rounding error
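Elimination, pivoting, and back-substitution fit in one short routine; the second example would divide by zero without the pivot swap (a minimal sketch using plain lists rather than a matrix library):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting; A is a list of rows."""
    n = len(A)
    # Build the augmented matrix [A | b] without mutating the inputs.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest |entry| in column k to row k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back-substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))  # ≈ [0.8, 1.4]
# A zero in the (0,0) position: pivoting swaps rows instead of dividing by 0.
print(solve([[0.0, 1.0], [1.0, 0.0]], [2.0, 3.0]))  # [3.0, 2.0]
```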

4.2 LU Decomposition

Factorize the matrix as \(A = LU\) (lower triangular \(\times\) upper triangular), effectively "storing" the Gaussian elimination process:

\[ Ax = b \;\Rightarrow\; Ly = b \;\text{(forward substitution)},\; Ux = y \;\text{(back-substitution)} \]

Advantage: For the same \(A\) with different right-hand sides \(b\), only one factorization is needed.

4.3 Iterative Methods

For large sparse systems, direct methods (\(O(n^3)\)) are too expensive; iterative methods are more practical.

Jacobi iteration:

\[ x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \right) \]

Gauss-Seidel iteration: Uses the most recently computed component values, typically converging faster.

| Method       | Parallelism                 | Convergence speed                | Storage                  |
|--------------|-----------------------------|----------------------------------|--------------------------|
| Jacobi       | High (parallelizable)       | Slower                           | Two vector copies needed |
| Gauss-Seidel | Low (sequential dependence) | Faster                           | In-place updates         |
| SOR          | Low                         | Fastest (with optimal \(\omega\)) | In-place updates         |

Convergence condition: The spectral radius of the iteration matrix must satisfy \(\rho(M) < 1\). Strict diagonal dominance of \(A\) guarantees convergence for both Jacobi and Gauss-Seidel.
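The Jacobi update is one line per component; note that the new iterate is built entirely from the old one, which is what makes it parallelizable (a minimal sketch on a small strictly diagonally dominant system):

```python
def jacobi(A, b, iters=50):
    """Jacobi iteration starting from the zero vector."""
    n = len(A)
    x = [0.0] * n
    for _ in range(iters):
        # Every component uses only the previous iterate x.
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# Strictly diagonally dominant, so convergence is guaranteed.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [9.0, 7.0]
print(jacobi(A, b))  # ≈ [1.8182, 1.7273], i.e. x = 20/11, y = 19/11
```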


5. Nonlinear Root Finding

Solve \(f(x) = 0\).

5.1 The Bisection Method

  • Prerequisite: \(f\) is continuous on \([a, b]\) with \(f(a) \cdot f(b) < 0\)
  • Each step halves the interval; convergence rate: linear, gaining 1 binary digit per iteration
  • Pros: simple and reliable; Cons: slow convergence
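The reliability comes from the sign-change invariant: the root stays bracketed at every step (a minimal sketch):

```python
def bisect(f, a, b, tol=1e-10):
    """Bisection: requires f(a) and f(b) to have opposite signs."""
    fa = f(a)
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fa * fm <= 0:
            b = m          # root is in [a, m]
        else:
            a, fa = m, fm  # root is in [m, b]
    return (a + b) / 2

# Root of x^2 - 2 on [1, 2] is sqrt(2) ≈ 1.41421356...
print(bisect(lambda x: x * x - 2, 1.0, 2.0))
```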

5.2 Newton's Method

\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]
  • Convergence rate: quadratic (near a simple root), meaning the error squares at each step
  • Requires computing the derivative \(f'(x)\)
  • May diverge (if the initial guess is poor or \(f'(x_n) \approx 0\))
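The quadratic convergence is easy to observe: from \(x_0 = 1\), the iteration for \(\sqrt{2}\) reaches machine precision in about five steps (a minimal sketch):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# sqrt(2) as the root of f(x) = x^2 - 2.
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))  # ≈ 1.41421356...
```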

Newton's Method Variants in Deep Learning

  • First-order optimizers (SGD, Adam) can be viewed as "incomplete" versions of Newton's method: they use only the gradient, not the Hessian.
  • Second-order optimizers (L-BFGS, K-FAC) approximately use Hessian information for faster convergence, but at higher memory cost.

5.3 The Secant Method

Replaces the derivative with a finite difference:

\[ x_{n+1} = x_n - f(x_n) \cdot \frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} \]

Convergence order is approximately \(\phi \approx 1.618\) (the golden ratio), between linear and quadratic.
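The update keeps the last two iterates instead of evaluating a derivative (a minimal sketch, with a guard against a flat secant line):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method: Newton's method with a finite-difference slope."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            break  # secant line is flat; cannot continue
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1

# Same root as before, but without needing f'(x).
print(secant(lambda x: x * x - 2, 1.0, 2.0))  # ≈ 1.41421356...
```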


6. Numerical Solutions of ODEs

Solve the initial value problem \(\frac{dy}{dt} = f(t, y)\), \(y(t_0) = y_0\).

6.1 Euler's Method

\[ y_{n+1} = y_n + h \cdot f(t_n, y_n) \]
  • First-order method; local truncation error \(O(h^2)\), global error \(O(h)\)
  • Simple but low accuracy; unsuitable for stiff problems
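On the test problem \(y' = y\), \(y(0) = 1\), the error at \(t = 1\) (exact value \(e\)) shrinks by roughly a factor of 10 each time the step size shrinks by 10, as expected for a first-order method (a minimal sketch):

```python
import math

def euler(f, t0, y0, h, steps):
    """Explicit Euler: y_{n+1} = y_n + h * f(t_n, y_n)."""
    t, y = t0, y0
    for _ in range(steps):
        y += h * f(t, y)
        t += h
    return y

for n in (10, 100, 1000):
    err = abs(euler(lambda t, y: y, 0.0, 1.0, 1.0 / n, n) - math.e)
    print(n, err)  # error decreases roughly linearly in h = 1/n
```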

6.2 The Runge-Kutta Method (RK4)

The classical fourth-order Runge-Kutta method:

\[ \begin{aligned} k_1 &= f(t_n, y_n) \\ k_2 &= f(t_n + h/2, y_n + h k_1/2) \\ k_3 &= f(t_n + h/2, y_n + h k_2/2) \\ k_4 &= f(t_n + h, y_n + h k_3) \\ y_{n+1} &= y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4) \end{aligned} \]

Local truncation error \(O(h^5)\), global error \(O(h^4)\).
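The four stages transcribe directly from the formulas above; on the same test problem \(y' = y\), ten steps of size \(h = 0.1\) already give about six correct digits (a minimal sketch):

```python
import math

def rk4_step(f, t, y, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4(f, t0, y0, h, steps):
    t, y = t0, y0
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y

# y' = y, y(0) = 1: compare y(1) against e.
print(abs(rk4(lambda t, y: y, 0.0, 1.0, 0.1, 10) - math.e))  # ~2e-6
```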

6.3 Stiff Problems and Implicit Methods

Stiff problems: ODEs containing both rapidly and slowly varying components; explicit methods require extremely small step sizes for stability.

  • Implicit Euler method: \(y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1})\), A-stable, suitable for stiff problems
  • BDF (Backward Differentiation Formula) methods: High-order implicit multistep methods widely used in stiff ODE solvers

| Method         | Order | Stability              | Suitable for               |
|----------------|-------|------------------------|----------------------------|
| Explicit Euler | 1     | Conditionally stable   | Pedagogical demonstrations |
| RK4            | 4     | Conditionally stable   | Non-stiff problems         |
| Implicit Euler | 1     | A-stable               | Stiff problems             |
| BDF            | 1-6   | A(\(\alpha\))-stable   | Stiff problems             |
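The stability gap shows up already on the linear test problem \(y' = \lambda y\) with \(\lambda = -50\) and \(h = 0.1\), whose true solution decays to essentially zero (a minimal sketch; for linear \(f\) the implicit update can be solved in closed form):

```python
# Explicit Euler multiplies y by (1 + h*lam) = -4 each step, so it blows up.
# Implicit Euler y_{n+1} = y_n + h*lam*y_{n+1} gives the bounded factor
# 1/(1 - h*lam) = 1/6, matching the decay of the true solution.
lam, h, steps = -50.0, 0.1, 20
y_exp, y_imp = 1.0, 1.0
for _ in range(steps):
    y_exp = y_exp + h * lam * y_exp  # explicit Euler: |factor| = 4 > 1
    y_imp = y_imp / (1 - h * lam)    # implicit Euler: |factor| = 1/6 < 1
print(abs(y_exp))  # huge: explicit Euler diverges at this step size
print(abs(y_imp))  # tiny: implicit Euler decays like the true solution
```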

7. Numerical Stability

7.1 Condition Number

The condition number of a matrix \(A\) measures the sensitivity of the output to perturbations in the input:

\[ \kappa(A) = \|A\| \cdot \|A^{-1}\| \]
  • \(\kappa(A) \approx 1\): well-conditioned problem
  • \(\kappa(A) \gg 1\): ill-conditioned problem; small input changes can cause large output changes
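A concrete ill-conditioned example, using the infinity norm (largest absolute row sum) so that \(\kappa\) can be computed by hand (a minimal sketch):

```python
# A = [[1, 1], [1, 1 + eps]] is nearly singular: det(A) = eps, and the
# inverse, computed by hand, is (1/eps) * [[1 + eps, -1], [-1, 1]].
eps = 1e-8
norm_A = 2 + eps               # largest row sum of |A|
norm_A_inv = (2 + eps) / eps   # largest row sum of |A^{-1}|
kappa = norm_A * norm_A_inv
print(kappa)  # ~4e8: tiny input perturbations can swing the solution wildly
```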

7.2 Backward Error Analysis

Core idea: The numerically computed result \(\hat{x}\) is not the exact solution to the original problem, but it is the exact solution to some "perturbed problem."

\[ (A + \Delta A)\hat{x} = b + \Delta b \]

If \(\|\Delta A\|\) and \(\|\Delta b\|\) are sufficiently small (on the order of machine precision), the algorithm is said to be backward stable.


8. Applications in AI

8.1 Neural ODEs

Neural ODEs generalize the discrete layers of residual networks into a continuous differential equation:

\[ \frac{dh}{dt} = f_\theta(h(t), t) \]
  • Forward propagation is performed using an ODE solver (e.g., adaptive RK45)
  • Gradients are computed via the adjoint method, achieving \(O(1)\) memory complexity
  • Applications: continuous normalizing flows, time-series modeling

8.2 Physics-Informed Neural Networks (PINNs)

PINNs embed physical equations as regularization terms in neural network training:

\[ \mathcal{L} = \mathcal{L}_{\text{data}} + \lambda \cdot \mathcal{L}_{\text{PDE}} \]

where \(\mathcal{L}_{\text{PDE}}\) computes PDE residuals via automatic differentiation.

Applications: fluid dynamics simulation, materials science, weather prediction.

8.3 Numerical Linear Algebra and Deep Learning

  • SVD (Singular Value Decomposition): Used for low-rank approximation (LoRA), dimensionality reduction (PCA)
  • Eigendecomposition: Hessian spectral analysis for understanding loss landscapes
  • Krylov subspace methods: Efficient solvers for large-scale sparse matrices

References

  • Burden & Faires, Numerical Analysis
  • Trefethen & Bau, Numerical Linear Algebra
  • Higham, Accuracy and Stability of Numerical Algorithms
  • Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018)
  • Raissi et al., "Physics-informed neural networks" (JCP 2019)
