Numerical Methods
Numerical methods study how to approximately solve mathematical problems using finite-precision computation. In practice, many mathematical problems (such as evaluating high-dimensional integrals, solving partial differential equations, or handling large-scale linear systems) lack closed-form analytical solutions and must rely on numerical approximation. The core concerns of numerical methods are: accuracy (the gap between the approximate and exact solutions), efficiency (computational complexity), and stability (whether errors amplify during computation).
In the AI domain, numerical methods form the mathematical foundation for deep learning optimizers, Neural ODEs, Physics-Informed Neural Networks (PINNs), and related techniques. These notes cover floating-point representation, interpolation, numerical integration, linear system solvers, nonlinear root finding, ODE numerical solutions, and numerical stability analysis.
1. Floating-Point Representation and Error
1.1 IEEE 754 Floating Point
Computers represent real numbers using a finite number of bits. The IEEE 754 standard defines floating-point formats:
| Format | Sign bits | Exponent bits | Mantissa bits | Bias | Decimal precision |
|---|---|---|---|---|---|
| float32 | 1 | 8 | 23 | 127 | ~7 digits |
| float64 | 1 | 11 | 52 | 1023 | ~16 digits |
| float16 | 1 | 5 | 10 | 15 | ~3 digits |
| bfloat16 | 1 | 8 | 7 | 127 | ~2 digits |
Numerical Precision in Deep Learning
- Training typically uses mixed precision (float16 computation + float32 accumulation), yielding 2-3x speedup.
- bfloat16 has the same exponent range as float32, reducing overflow risk; it is widely adopted by TPUs and next-generation GPUs.
- Quantized inference (INT8/INT4) further reduces precision in exchange for speed and memory savings.
1.2 Types of Error
| Error Type | Source | Example |
|---|---|---|
| Truncation error | Approximating an infinite series with finite terms | Taking the first \(n\) terms of a Taylor expansion |
| Rounding error | Limited floating-point precision | \(0.1 + 0.2 \neq 0.3\) (in float64) |
| Propagation error | Error accumulation during computation | Summing a long sequence of floating-point numbers |
Absolute error: \(|x - \hat{x}|\), Relative error: \(\frac{|x - \hat{x}|}{|x|}\)
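The following is a minimal sketch (assuming NumPy is installed) that makes the table and the error types above concrete: machine epsilon per format, the classic \(0.1 + 0.2\) rounding error, and error propagation in naive summation.

```python
import math
import numpy as np

# Machine epsilon shrinks as the mantissa gets wider (cf. the format table above).
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.finfo(dtype).eps)

# Rounding error: 0.1 and 0.2 have no exact binary representation.
print(0.1 + 0.2 == 0.3)           # False
print(abs((0.1 + 0.2) - 0.3))     # ~5.6e-17, on the order of float64 epsilon

# Propagation error: naive summation of many small terms drifts,
# while math.fsum (compensated summation) keeps essentially full precision.
xs = [0.1] * 1_000_000
print(sum(xs) - 100_000.0)        # visible drift
print(math.fsum(xs) - 100_000.0)  # only the representation error of 0.1 remains
```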
2. Interpolation
The goal of interpolation is to construct a function \(p(x)\) such that \(p(x_i) = y_i\) for all known data points \((x_i, y_i)\).
2.1 Lagrange Interpolation
Given \(n+1\) data points, the Lagrange interpolation polynomial is:
\[
p_n(x) = \sum_{i=0}^{n} y_i\, \ell_i(x), \qquad \ell_i(x) = \prod_{\substack{j=0 \\ j \neq i}}^{n} \frac{x - x_j}{x_i - x_j}
\]
- Uniqueness: \(n+1\) points with distinct \(x_i\) uniquely determine an interpolating polynomial of degree at most \(n\)
- Drawback: Adding new data points requires recomputing all basis functions
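A direct, unoptimized evaluation of the Lagrange form (a sketch assuming NumPy; the function name `lagrange_eval` is illustrative):

```python
import numpy as np

def lagrange_eval(xs, ys, x):
    """Evaluate p(x) = sum_i y_i * l_i(x) at a single point x."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    total = 0.0
    for i in range(len(xs)):
        # Basis polynomial l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j)
        li = np.prod([(x - xs[j]) / (xs[i] - xs[j]) for j in range(len(xs)) if j != i])
        total += ys[i] * li
    return total

xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 2.0]
print(lagrange_eval(xs, ys, 1.0))   # 3.0: the interpolant reproduces the data at the nodes
```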
2.2 Newton Interpolation
Constructed recursively using divided differences:
\[
p_n(x) = f[x_0] + f[x_0, x_1](x - x_0) + \cdots + f[x_0, \dots, x_n](x - x_0)\cdots(x - x_{n-1})
\]
where \(f[x_i] = f(x_i)\), \(f[x_i, x_{i+1}] = \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i}\), and higher-order divided differences are defined recursively in the same way.
Advantage: When a new data point is added, only the new divided difference terms need to be computed.
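A compact sketch of the divided-difference table and Horner-style evaluation of the Newton form (NumPy assumed; `divided_differences` and `newton_eval` are illustrative names):

```python
import numpy as np

def divided_differences(xs, ys):
    """Return the coefficients f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]."""
    xs = np.asarray(xs, float)
    coef = np.asarray(ys, float).copy()
    for k in range(1, len(xs)):
        # In-place update: coef[i] becomes f[x_{i-k}, ..., x_i] for i >= k.
        coef[k:] = (coef[k:] - coef[k - 1:-1]) / (xs[k:] - xs[:-k])
    return coef

def newton_eval(xs, coef, x):
    """Horner-style evaluation of the Newton form."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result

xs, ys = [0.0, 1.0, 2.0, 4.0], [1.0, 3.0, 2.0, 5.0]
coef = divided_differences(xs, ys)
print(newton_eval(xs, coef, 1.0))   # 3.0: reproduces the data at the node x = 1
```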
2.3 Spline Interpolation
High-degree polynomial interpolation may exhibit the Runge phenomenon (severe oscillation near endpoints). Cubic splines use a cubic polynomial on each subinterval, subject to:
- Interpolation conditions: \(S(x_i) = y_i\)
- Continuity: \(S, S', S''\) are continuous at knot points
- Boundary conditions: natural spline (\(S''(x_0) = S''(x_n) = 0\)) or clamped spline
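The Runge phenomenon is easy to reproduce numerically. A sketch assuming SciPy: interpolate \(1/(1+25x^2)\) on 11 equally spaced nodes with a degree-10 polynomial versus a natural cubic spline.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Runge's function on equally spaced nodes in [-1, 1].
x = np.linspace(-1, 1, 11)
y = 1.0 / (1.0 + 25.0 * x**2)

poly = np.polynomial.Polynomial.fit(x, y, deg=10)     # degree-10 interpolant
spline = CubicSpline(x, y, bc_type='natural')         # natural cubic spline

t = 0.95                                              # near the endpoint
exact = 1.0 / (1.0 + 25.0 * t**2)
print(abs(poly(t) - exact), abs(spline(t) - exact))   # the polynomial error is far larger
```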
3. Numerical Integration
3.1 The Trapezoidal Rule
Divide the interval \([a, b]\) into \(n\) equal subintervals of width \(h = (b-a)/n\) and approximate using trapezoids:
\[
\int_a^b f(x)\,dx \approx \frac{h}{2}\left[f(x_0) + 2\sum_{i=1}^{n-1} f(x_i) + f(x_n)\right]
\]
Error is \(O(h^2)\).
3.2 Simpson's Rule
Uses a quadratic polynomial approximation over pairs of subintervals (\(n\) even):
\[
\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 4\sum_{i\ \text{odd}} f(x_i) + 2\sum_{\substack{i\ \text{even} \\ 0 < i < n}} f(x_i) + f(x_n)\right]
\]
Error is \(O(h^4)\), substantially more accurate than the trapezoidal rule.
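Both composite rules fit in a few lines of NumPy. The sketch below (illustrative function names) integrates \(e^x\) on \([0,1]\) and shows the expected convergence orders as the step size is halved.

```python
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))

def simpson(f, a, b, n):               # n must be even
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h / 3 * (f(x[0]) + 4 * f(x[1:-1:2]).sum() + 2 * f(x[2:-1:2]).sum() + f(x[-1]))

exact = np.e - 1                        # integral of e^x over [0, 1]
for n in (4, 8, 16):
    et = abs(trapezoid(np.exp, 0, 1, n) - exact)
    es = abs(simpson(np.exp, 0, 1, n) - exact)
    print(n, et, es)                    # halving h cuts et by ~4x and es by ~16x
```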
3.3 Gaussian Quadrature
Selects optimal nodes (Gauss points) and weights so that \(n\) points exactly integrate polynomials of degree up to \(2n-1\):
\[
\int_{-1}^{1} f(x)\,dx \approx \sum_{i=1}^{n} w_i\, f(x_i)
\]
| Points | Max exact polynomial degree | Typical nodes |
|---|---|---|
| 1 | 1 | \(x_1 = 0\) |
| 2 | 3 | \(x_{1,2} = \pm 1/\sqrt{3}\) |
| 3 | 5 | \(x_1 = 0, x_{2,3} = \pm\sqrt{3/5}\) |
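NumPy provides Gauss-Legendre nodes and weights directly; the sketch below checks the degree-of-exactness claim on the monomial \(x^{2n-2}\).

```python
import numpy as np

for n in (1, 2, 3):
    x, w = np.polynomial.legendre.leggauss(n)   # nodes and weights on [-1, 1]
    approx = np.sum(w * x**(2 * n - 2))         # quadrature applied to x^(2n-2)
    exact = 2.0 / (2 * n - 1)                   # exact integral of x^(2n-2) over [-1, 1]
    print(n, x, approx, exact)                  # matches to rounding error
```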
4. Linear Systems
4.1 Gaussian Elimination
Transform the augmented matrix \([A | b]\) into upper triangular form via row operations, then solve by back-substitution:
- Time complexity: \(O(n^3)\)
- Partial pivoting: At each step, select the element with the largest absolute value in the column as the pivot to reduce rounding error
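A minimal sketch of elimination with partial pivoting and back-substitution (NumPy assumed; `gauss_solve` is an illustrative name, not a library routine):

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = np.asarray(A, float).copy()
    b = np.asarray(b, float).copy()
    n = len(b)
    for k in range(n - 1):
        # Partial pivoting: bring the largest |entry| in column k onto the diagonal.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    # Back-substitution on the upper-triangular system.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[1e-12, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
print(gauss_solve(A, b), np.linalg.solve(A, b))  # pivoting keeps the result accurate
```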
4.2 LU Decomposition
Factorize the matrix as \(A = LU\) (lower triangular \(\times\) upper triangular), effectively "storing" the Gaussian elimination process. Solving \(Ax = b\) then reduces to two triangular solves: \(Ly = b\) followed by \(Ux = y\).
Advantage: For the same \(A\) with different right-hand sides \(b\), only one factorization is needed; each additional solve costs only \(O(n^2)\).
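SciPy exposes this reuse pattern through `scipy.linalg.lu_factor` / `lu_solve`; a short sketch:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# Factor once (O(n^3)) ...
lu, piv = lu_factor(A)

# ... then each new right-hand side costs only two triangular solves (O(n^2)).
for _ in range(3):
    b = rng.standard_normal(4)
    x = lu_solve((lu, piv), b)
    print(np.allclose(A @ x, b))
```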
4.3 Iterative Methods
For large sparse systems, direct methods (\(O(n^3)\)) are too expensive; iterative methods are more practical.
Jacobi iteration:
\[
x_i^{(k+1)} = \frac{1}{a_{ii}}\left(b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)}\right)
\]
Gauss-Seidel iteration: Uses the most recently computed component values, typically converging faster.
| Method | Parallelism | Convergence speed | Storage |
|---|---|---|---|
| Jacobi | High (parallelizable) | Slower | Two vector copies needed |
| Gauss-Seidel | Low (sequential dependence) | Faster | In-place updates |
| SOR | Low | Fastest (with optimal \(\omega\)) | In-place updates |
Convergence condition: The spectral radius of the iteration matrix must satisfy \(\rho(M) < 1\). Strict diagonal dominance of \(A\) guarantees convergence for both Jacobi and Gauss-Seidel methods.
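A vectorized Jacobi sketch on a strictly diagonally dominant system (NumPy assumed; `jacobi` is an illustrative name):

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, max_iter=500):
    """x^(k+1) = D^{-1} (b - R x^(k)), where D is the diagonal and R the off-diagonal part."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    D = np.diag(A)
    R = A - np.diagflat(D)
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, float)
    for k in range(max_iter):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Strictly diagonally dominant, so convergence is guaranteed.
A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x, iters = jacobi(A, b)
print(x, iters, np.allclose(A @ x, b, atol=1e-8))
```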
5. Nonlinear Root Finding
Solve \(f(x) = 0\).
5.1 The Bisection Method
- Prerequisite: \(f\) is continuous on \([a, b]\) with \(f(a) \cdot f(b) < 0\)
- Each step halves the interval; convergence rate: linear, gaining 1 binary digit per iteration
- Pros: simple and reliable; Cons: slow convergence
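A minimal bisection sketch (plain Python; the function name is illustrative):

```python
def bisect(f, a, b, tol=1e-12):
    """Halve [a, b] until it is shorter than tol; assumes f(a) * f(b) < 0."""
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if fa * f(m) <= 0:      # sign change in the left half (or m is a root)
            b = m
        else:
            a, fa = m, f(m)
    return 0.5 * (a + b)

print(bisect(lambda x: x**2 - 2.0, 0.0, 2.0))  # ~1.41421356..., roughly 41 halvings
```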
5.2 Newton's Method
Linearize \(f\) at the current iterate and take the root of the tangent line as the next iterate:
\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
\]
- Convergence rate: quadratic (near a simple root), meaning the error squares at each step
- Requires computing the derivative \(f'(x)\)
- May diverge (if the initial guess is poor or \(f'(x_n) \approx 0\))
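A direct implementation of the iteration above (illustrative names; the stopping rule on the step size is one common choice):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - f(x)/f'(x); quadratic convergence near a simple root."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")

# Root of x^2 - 2: each iteration roughly doubles the number of correct digits.
print(newton(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x0=1.0))
```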
Newton's Method Variants in Deep Learning
- First-order optimizers (SGD, Adam) can be viewed as "incomplete" versions of Newton's method: they use only the gradient, not the Hessian.
- Second-order optimizers (L-BFGS, K-FAC) approximately use Hessian information for faster convergence, but at higher memory cost.
5.3 The Secant Method
Replaces the derivative with a finite difference:
\[
x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}
\]
Convergence order is approximately \(\phi \approx 1.618\) (the golden ratio), between linear and quadratic.
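The same root-finding problem without a derivative, as a short sketch:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Newton-like iteration with f'(x) replaced by a finite-difference slope."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("secant method did not converge")

print(secant(lambda x: x**2 - 2.0, 1.0, 2.0))  # ~1.41421356..., no derivative needed
```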
6. Numerical Solutions of ODEs
Solve the initial value problem \(\frac{dy}{dt} = f(t, y)\), \(y(t_0) = y_0\).
6.1 Euler's Method
\[
y_{n+1} = y_n + h\, f(t_n, y_n)
\]
- First-order method; local truncation error \(O(h^2)\), global error \(O(h)\)
- Simple but low accuracy; unsuitable for stiff problems
6.2 The Runge-Kutta Method (RK4)
The classical fourth-order Runge-Kutta method:
\[
\begin{aligned}
k_1 &= f(t_n,\, y_n) \\
k_2 &= f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_1\right) \\
k_3 &= f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_2\right) \\
k_4 &= f(t_n + h,\, y_n + h\, k_3) \\
y_{n+1} &= y_n + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)
\end{aligned}
\]
Local truncation error \(O(h^5)\), global error \(O(h^4)\).
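A single RK4 step is a direct transcription of the formulas above; the sketch below (NumPy assumed) integrates \(y' = -y\), \(y(0) = 1\) to \(t = 1\).

```python
import numpy as np

def rk4_step(f, t, y, h):
    """One classical RK4 step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda t, y: -y                 # exact solution: e^{-t}
y, t, h = 1.0, 0.0, 0.1
while t < 1.0 - 1e-12:
    y = rk4_step(f, t, y, h)
    t += h
print(abs(y - np.exp(-1.0)))        # small global error, consistent with O(h^4)
```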
6.3 Stiff Problems and Implicit Methods
Stiff problems: ODEs containing both rapidly and slowly varying components; explicit methods require extremely small step sizes for stability.
- Implicit Euler method: \(y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1})\), A-stable, suitable for stiff problems
- BDF (Backward Differentiation Formula) methods: High-order implicit multistep methods widely used in stiff ODE solvers
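A sketch of why implicit methods pay off on stiff problems, using the hypothetical linear test problem \(y' = -1000\,(y - \cos t) - \sin t\) (slow solution \(\cos t\), fast transient \(e^{-1000t}\)); because the problem is linear, the implicit Euler update can be solved for \(y_{n+1}\) in closed form.

```python
import numpy as np

lam = 1000.0
f = lambda t, y: -lam * (y - np.cos(t)) - np.sin(t)

def implicit_euler(y, t, h):
    # y_{n+1} = y_n + h * f(t_{n+1}, y_{n+1}); linear in y_{n+1}, so solve directly.
    t1 = t + h
    return (y + h * (lam * np.cos(t1) - np.sin(t1))) / (1.0 + h * lam)

def explicit_euler(y, t, h):
    return y + h * f(t, y)

h, T = 0.01, 1.0
y_imp = y_exp = 1.5                 # start off the slow solution
for n in range(int(T / h)):
    t = n * h
    y_imp = implicit_euler(y_imp, t, h)
    y_exp = explicit_euler(y_exp, t, h)
print(y_imp, np.cos(T))             # implicit Euler tracks cos(t) even with h = 0.01
print(y_exp)                        # explicit Euler blows up: h = 0.01 > 2/lam = 0.002
```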
| Method | Order | Stability | Suitable for |
|---|---|---|---|
| Explicit Euler | 1 | Conditionally stable | Pedagogical demonstrations |
| RK4 | 4 | Conditionally stable | Non-stiff problems |
| Implicit Euler | 1 | A-stable | Stiff problems |
| BDF | 1-6 | A(\(\alpha\))-stable | Stiff problems |
7. Numerical Stability
7.1 Condition Number
The condition number of a matrix \(A\) measures the sensitivity of the output to perturbations in the input:
\[
\kappa(A) = \|A\|\,\|A^{-1}\|
\]
- \(\kappa(A) \approx 1\): well-conditioned problem
- \(\kappa(A) \gg 1\): ill-conditioned problem; small input changes can cause large output changes
7.2 Backward Error Analysis
Core idea: The numerically computed result \(\hat{x}\) is not the exact solution to the original problem, but it is the exact solution to some "perturbed problem," e.g. \((A + \Delta A)\hat{x} = b + \Delta b\) for a linear system.
If \(\|\Delta A\|\) and \(\|\Delta b\|\) are sufficiently small (on the order of machine precision relative to \(\|A\|\) and \(\|b\|\)), the algorithm is said to be backward stable.
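The two ideas meet in the classic Hilbert-matrix example (SciPy assumed): the solver is backward stable, so the residual stays tiny, yet the problem's huge condition number means the forward error is large anyway.

```python
import numpy as np
from scipy.linalg import hilbert

# Hilbert matrices are the standard ill-conditioned example.
for n in (4, 8, 12):
    print(n, np.linalg.cond(hilbert(n)))   # grows rapidly; ~1e16 already at n = 12

A = hilbert(12)
x_true = np.ones(12)
b = A @ x_true
x_hat = np.linalg.solve(A, b)
print(np.linalg.norm(A @ x_hat - b))                            # residual (backward error): tiny
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # forward error: many digits lost
```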
8. Applications in AI
8.1 Neural ODEs
Neural ODEs generalize the discrete layers of residual networks into a continuous differential equation:
\[
\frac{dh(t)}{dt} = f_\theta(h(t), t), \qquad h(0) = x
\]
- Forward propagation is performed using an ODE solver (e.g., adaptive RK45)
- Gradients are computed via the adjoint method, achieving \(O(1)\) memory complexity
- Applications: continuous normalizing flows, time-series modeling
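A minimal sketch, assuming PyTorch: a learned vector field integrated with the fixed-step RK4 scheme from Section 6.2, with gradients obtained by backpropagating through the solver steps. The class and function names are illustrative; note that this stores every step, unlike the adjoint method mentioned above (in practice, libraries such as torchdiffeq provide adaptive solvers and the adjoint).

```python
import torch
import torch.nn as nn

class Dynamics(nn.Module):
    """Learned vector field f_theta(h, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):
        t_col = torch.full_like(h[:, :1], t)            # broadcast scalar time to the batch
        return self.net(torch.cat([h, t_col], dim=-1))

def odeint_rk4(f, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step RK4 integration of dh/dt = f(t, h) from t0 to t1."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = f(t, h)
        k2 = f(t + dt / 2, h + dt / 2 * k1)
        k3 = f(t + dt / 2, h + dt / 2 * k2)
        k4 = f(t + dt, h + dt * k3)
        h = h + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + dt
    return h

f_theta = Dynamics(dim=2)
x = torch.randn(8, 2)                # batch of inputs = initial states h(0)
out = odeint_rk4(f_theta, x)         # "depth" is now continuous integration time
loss = out.pow(2).mean()
loss.backward()                      # gradients flow through every solver step
```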
8.2 Physics-Informed Neural Networks (PINNs)
PINNs embed physical equations as regularization terms in neural network training:
\[
\mathcal{L} = \mathcal{L}_{\text{data}} + \lambda\, \mathcal{L}_{\text{PDE}}
\]
where \(\mathcal{L}_{\text{PDE}}\) computes PDE residuals via automatic differentiation.
Applications: fluid dynamics simulation, materials science, weather prediction.
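A toy PINN-style sketch, assuming PyTorch: the equation \(u'(x) + u(x) = 0\) with \(u(0) = 1\) (exact solution \(e^{-x}\)), the network size, and the loss weighting are all illustrative choices, not taken from the references. The key step is computing the residual with `torch.autograd.grad`.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    # Random collocation points on [0, 2].
    x = (2.0 * torch.rand(128, 1)).requires_grad_(True)
    u = net(x)
    # PDE residual via automatic differentiation: du/dx at the collocation points.
    du_dx, = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)
    loss_pde = ((du_dx + u) ** 2).mean()
    # Initial-condition term u(0) = 1.
    u0 = net(torch.zeros(1, 1))
    loss = loss_pde + (u0 - 1.0).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item())   # approaches e^{-1} ~ 0.368 as training converges
```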
8.3 Numerical Linear Algebra and Deep Learning
- SVD (Singular Value Decomposition): Used for low-rank approximation (LoRA), dimensionality reduction (PCA)
- Eigendecomposition: Hessian spectral analysis for understanding loss landscapes
- Krylov subspace methods: Efficient solvers for large-scale sparse matrices
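A small NumPy sketch of the low-rank idea behind the first bullet: truncated SVD gives the best rank-\(r\) approximation (Eckart-Young). The synthetic "rank-16 signal plus noise" matrix below is illustrative and only loosely mimics the low-rank structure that LoRA-style methods exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 512))   # rank-16 signal
W = signal + 0.01 * rng.standard_normal((512, 512))                        # plus small noise

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 16
W_r = (U[:, :r] * s[:r]) @ Vt[:r]          # store ~2 * 512 * 16 numbers instead of 512^2
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))   # small: the signal lives in 16 directions
```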
References
- Burden & Faires, Numerical Analysis
- Trefethen & Bau, Numerical Linear Algebra
- Higham, Accuracy and Stability of Numerical Algorithms
- Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018)
- Raissi et al., "Physics-informed neural networks" (JCP 2019)