Numerical Methods
Numerical methods study how to approximately solve mathematical problems using finite-precision computation. In practice, many mathematical problems (such as evaluating high-dimensional integrals, solving partial differential equations, or handling large-scale linear systems) lack closed-form analytical solutions and must rely on numerical approximation. The core concerns of numerical methods are: accuracy (the gap between the approximate and exact solutions), efficiency (computational complexity), and stability (whether errors amplify during computation).
In the AI domain, numerical methods form the mathematical foundation for deep learning optimizers, Neural ODEs, Physics-Informed Neural Networks (PINNs), and related techniques. These notes cover floating-point representation, interpolation, numerical integration, linear system solvers, nonlinear root finding, ODE numerical solutions, and numerical stability analysis.
1. Floating-Point Representation and Error
1.1 IEEE 754 Floating Point
Computers represent real numbers using a finite number of bits. The IEEE 754 standard defines floating-point formats:
| Format | Sign bits | Exponent bits | Mantissa bits | Bias | Decimal precision |
|---|---|---|---|---|---|
| float32 | 1 | 8 | 23 | 127 | ~7 digits |
| float64 | 1 | 11 | 52 | 1023 | ~16 digits |
| float16 | 1 | 5 | 10 | 15 | ~3 digits |
| bfloat16 | 1 | 8 | 7 | 127 | ~2 digits |
Numerical Precision in Deep Learning
- Training typically uses mixed precision (float16 computation + float32 accumulation), yielding 2-3x speedup.
- bfloat16 has the same exponent range as float32, reducing overflow risk; it is widely adopted by TPUs and next-generation GPUs.
- Quantized inference (INT8/INT4) further reduces precision in exchange for speed and memory savings.
1.2 Types of Error
| Error Type | Source | Example |
|---|---|---|
| Truncation error | Approximating an infinite series with finite terms | Taking the first \(n\) terms of a Taylor expansion |
| Rounding error | Limited floating-point precision | \(0.1 + 0.2 \neq 0.3\) (in float64) |
| Propagation error | Error accumulation during computation | Summing a long sequence of floating-point numbers |
Absolute error: \(|x - \hat{x}|\), Relative error: \(\frac{|x - \hat{x}|}{|x|}\)
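The following is a minimal sketch (assuming NumPy is installed) that makes the table and the error types above concrete: machine epsilon per format, the classic \(0.1 + 0.2\) rounding error, and error propagation in naive summation.

```python
import math
import numpy as np

# Machine epsilon shrinks as the mantissa gets wider (cf. the format table above).
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.finfo(dtype).eps)

# Rounding error: 0.1 and 0.2 have no exact binary representation.
print(0.1 + 0.2 == 0.3)           # False
print(abs((0.1 + 0.2) - 0.3))     # ~5.6e-17, on the order of float64 epsilon

# Propagation error: naive summation of many small terms drifts,
# while math.fsum (compensated summation) keeps essentially full precision.
xs = [0.1] * 1_000_000
print(sum(xs) - 100_000.0)        # visible drift
print(math.fsum(xs) - 100_000.0)  # only the representation error of 0.1 remains
```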
2. Interpolation
The goal of interpolation is to construct a function \(p(x)\) such that \(p(x_i) = y_i\) for all known data points \((x_i, y_i)\).
2.1 Lagrange Interpolation
Given \(n+1\) data points, the Lagrange interpolation polynomial is:
\[
p_n(x) = \sum_{i=0}^{n} y_i\, \ell_i(x), \qquad \ell_i(x) = \prod_{\substack{j=0 \\ j \neq i}}^{n} \frac{x - x_j}{x_i - x_j}
\]
- Uniqueness: \(n+1\) points with distinct \(x_i\) uniquely determine an interpolating polynomial of degree at most \(n\)
- Drawback: Adding new data points requires recomputing all basis functions
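A direct, unoptimized evaluation of the Lagrange form (a sketch assuming NumPy; the function name `lagrange_eval` is illustrative):

```python
import numpy as np

def lagrange_eval(xs, ys, x):
    """Evaluate p(x) = sum_i y_i * l_i(x) at a single point x."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    total = 0.0
    for i in range(len(xs)):
        # Basis polynomial l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j)
        li = np.prod([(x - xs[j]) / (xs[i] - xs[j]) for j in range(len(xs)) if j != i])
        total += ys[i] * li
    return total

xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 2.0]
print(lagrange_eval(xs, ys, 1.0))   # 3.0: the interpolant reproduces the data at the nodes
```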
2.2 Newton Interpolation
Constructed recursively using divided differences:
\[
p_n(x) = f[x_0] + f[x_0, x_1](x - x_0) + \cdots + f[x_0, \dots, x_n](x - x_0)\cdots(x - x_{n-1})
\]
where \(f[x_i] = f(x_i)\), \(f[x_i, x_{i+1}] = \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i}\), and higher-order divided differences are defined recursively in the same way.
Advantage: When a new data point is added, only the new divided difference terms need to be computed.
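A compact sketch of the divided-difference table and Horner-style evaluation of the Newton form (NumPy assumed; `divided_differences` and `newton_eval` are illustrative names):

```python
import numpy as np

def divided_differences(xs, ys):
    """Return the coefficients f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]."""
    xs = np.asarray(xs, float)
    coef = np.asarray(ys, float).copy()
    for k in range(1, len(xs)):
        # In-place update: coef[i] becomes f[x_{i-k}, ..., x_i] for i >= k.
        coef[k:] = (coef[k:] - coef[k - 1:-1]) / (xs[k:] - xs[:-k])
    return coef

def newton_eval(xs, coef, x):
    """Horner-style evaluation of the Newton form."""
    result = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result

xs, ys = [0.0, 1.0, 2.0, 4.0], [1.0, 3.0, 2.0, 5.0]
coef = divided_differences(xs, ys)
print(newton_eval(xs, coef, 1.0))   # 3.0: reproduces the data at the node x = 1
```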
2.3 Spline Interpolation
High-degree polynomial interpolation may exhibit the Runge phenomenon (severe oscillation near endpoints). Cubic splines use a cubic polynomial on each subinterval, subject to:
- Interpolation conditions: \(S(x_i) = y_i\)
- Continuity: \(S, S', S''\) are continuous at knot points
- Boundary conditions: natural spline (\(S''(x_0) = S''(x_n) = 0\)) or clamped spline
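The Runge phenomenon is easy to reproduce numerically. A sketch assuming SciPy: interpolate \(1/(1+25x^2)\) on 11 equally spaced nodes with a degree-10 polynomial versus a natural cubic spline.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Runge's function on equally spaced nodes in [-1, 1].
x = np.linspace(-1, 1, 11)
y = 1.0 / (1.0 + 25.0 * x**2)

poly = np.polynomial.Polynomial.fit(x, y, deg=10)     # degree-10 interpolant
spline = CubicSpline(x, y, bc_type='natural')         # natural cubic spline

t = 0.95                                              # near the endpoint
exact = 1.0 / (1.0 + 25.0 * t**2)
print(abs(poly(t) - exact), abs(spline(t) - exact))   # the polynomial error is far larger
```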
3. Numerical Integration
3.1 The Trapezoidal Rule
Divide the interval \([a, b]\) into \(n\) equal subintervals of width \(h = (b-a)/n\) and approximate using trapezoids:
\[
\int_a^b f(x)\,dx \approx \frac{h}{2}\left[f(x_0) + 2\sum_{i=1}^{n-1} f(x_i) + f(x_n)\right]
\]
Error is \(O(h^2)\).
3.2 Simpson's Rule
Uses a quadratic polynomial approximation over pairs of subintervals (\(n\) even):
\[
\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 4\sum_{i\ \text{odd}} f(x_i) + 2\sum_{\substack{i\ \text{even} \\ 0 < i < n}} f(x_i) + f(x_n)\right]
\]
Error is \(O(h^4)\), substantially more accurate than the trapezoidal rule.
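Both composite rules fit in a few lines of NumPy. The sketch below (illustrative function names) integrates \(e^x\) on \([0,1]\) and shows the expected convergence orders as the step size is halved.

```python
import numpy as np

def trapezoid(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))

def simpson(f, a, b, n):               # n must be even
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    return h / 3 * (f(x[0]) + 4 * f(x[1:-1:2]).sum() + 2 * f(x[2:-1:2]).sum() + f(x[-1]))

exact = np.e - 1                        # integral of e^x over [0, 1]
for n in (4, 8, 16):
    et = abs(trapezoid(np.exp, 0, 1, n) - exact)
    es = abs(simpson(np.exp, 0, 1, n) - exact)
    print(n, et, es)                    # halving h cuts et by ~4x and es by ~16x
```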
3.3 Gaussian Quadrature
Selects optimal nodes (Gauss points) and weights so that \(n\) points exactly integrate polynomials of degree up to \(2n-1\):
\[
\int_{-1}^{1} f(x)\,dx \approx \sum_{i=1}^{n} w_i\, f(x_i)
\]
| Points | Max exact polynomial degree | Typical nodes |
|---|---|---|
| 1 | 1 | \(x_1 = 0\) |
| 2 | 3 | \(x_{1,2} = \pm 1/\sqrt{3}\) |
| 3 | 5 | \(x_1 = 0, x_{2,3} = \pm\sqrt{3/5}\) |
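NumPy provides Gauss-Legendre nodes and weights directly; the sketch below checks the degree-of-exactness claim on the monomial \(x^{2n-2}\).

```python
import numpy as np

for n in (1, 2, 3):
    x, w = np.polynomial.legendre.leggauss(n)   # nodes and weights on [-1, 1]
    approx = np.sum(w * x**(2 * n - 2))         # quadrature applied to x^(2n-2)
    exact = 2.0 / (2 * n - 1)                   # exact integral of x^(2n-2) over [-1, 1]
    print(n, x, approx, exact)                  # matches to rounding error
```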
4. Linear Systems
4.1 Gaussian Elimination
Transform the augmented matrix \([A | b]\) into upper triangular form via row operations, then solve by back-substitution:
- Time complexity: \(O(n^3)\)
- Partial pivoting: At each step, select the element with the largest absolute value in the column as the pivot to reduce rounding error
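A minimal sketch of elimination with partial pivoting and back-substitution (NumPy assumed; `gauss_solve` is an illustrative name, not a library routine):

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = np.asarray(A, float).copy()
    b = np.asarray(b, float).copy()
    n = len(b)
    for k in range(n - 1):
        # Partial pivoting: bring the largest |entry| in column k onto the diagonal.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    # Back-substitution on the upper-triangular system.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[1e-12, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
print(gauss_solve(A, b), np.linalg.solve(A, b))  # pivoting keeps the result accurate
```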
4.2 LU Decomposition
Factorize the matrix as \(A = LU\) (lower triangular \(\times\) upper triangular), effectively "storing" the Gaussian elimination process. Solving \(Ax = b\) then reduces to two triangular solves: \(Ly = b\) followed by \(Ux = y\).
Advantage: For the same \(A\) with different right-hand sides \(b\), only one factorization is needed; each additional solve costs only \(O(n^2)\).
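SciPy exposes this reuse pattern through `scipy.linalg.lu_factor` / `lu_solve`; a short sketch:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# Factor once (O(n^3)) ...
lu, piv = lu_factor(A)

# ... then each new right-hand side costs only two triangular solves (O(n^2)).
for _ in range(3):
    b = rng.standard_normal(4)
    x = lu_solve((lu, piv), b)
    print(np.allclose(A @ x, b))
```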
4.3 Iterative Methods
For large sparse systems, direct methods (\(O(n^3)\)) are too expensive; iterative methods are more practical.
Jacobi iteration:
\[
x_i^{(k+1)} = \frac{1}{a_{ii}}\left(b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)}\right)
\]
Gauss-Seidel iteration: Uses the most recently computed component values, typically converging faster.
| Method | Parallelism | Convergence speed | Storage |
|---|---|---|---|
| Jacobi | High (parallelizable) | Slower | Two vector copies needed |
| Gauss-Seidel | Low (sequential dependence) | Faster | In-place updates |
| SOR | Low | Fastest (with optimal \(\omega\)) | In-place updates |
Convergence condition: The spectral radius of the iteration matrix must satisfy \(\rho(M) < 1\). Strict diagonal dominance of \(A\) guarantees convergence for both Jacobi and Gauss-Seidel methods.
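A vectorized Jacobi sketch on a strictly diagonally dominant system (NumPy assumed; `jacobi` is an illustrative name):

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, max_iter=500):
    """x^(k+1) = D^{-1} (b - R x^(k)), where D is the diagonal and R the off-diagonal part."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    D = np.diag(A)
    R = A - np.diagflat(D)
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, float)
    for k in range(max_iter):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Strictly diagonally dominant, so convergence is guaranteed.
A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x, iters = jacobi(A, b)
print(x, iters, np.allclose(A @ x, b, atol=1e-8))
```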
5. Nonlinear Root Finding
Solve \(f(x) = 0\).
5.1 The Bisection Method
- Prerequisite: \(f\) is continuous on \([a, b]\) with \(f(a) \cdot f(b) < 0\)
- Each step halves the interval; convergence rate: linear, gaining 1 binary digit per iteration
- Pros: simple and reliable; Cons: slow convergence
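A minimal bisection sketch (plain Python; the function name is illustrative):

```python
def bisect(f, a, b, tol=1e-12):
    """Halve [a, b] until it is shorter than tol; assumes f(a) * f(b) < 0."""
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if fa * f(m) <= 0:      # sign change in the left half (or m is a root)
            b = m
        else:
            a, fa = m, f(m)
    return 0.5 * (a + b)

print(bisect(lambda x: x**2 - 2.0, 0.0, 2.0))  # ~1.41421356..., roughly 41 halvings
```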
5.2 Newton's Method
Linearize \(f\) at the current iterate and take the root of the tangent line as the next iterate:
\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
\]
- Convergence rate: quadratic (near a simple root), meaning the error squares at each step
- Requires computing the derivative \(f'(x)\)
- May diverge (if the initial guess is poor or \(f'(x_n) \approx 0\))
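A direct implementation of the iteration above (illustrative names; the stopping rule on the step size is one common choice):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - f(x)/f'(x); quadratic convergence near a simple root."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")

# Root of x^2 - 2: each iteration roughly doubles the number of correct digits.
print(newton(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x0=1.0))
```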
Newton's Method Variants in Deep Learning
- First-order optimizers (SGD, Adam) can be viewed as "incomplete" versions of Newton's method: they use only the gradient, not the Hessian.
- Second-order optimizers (L-BFGS, K-FAC) approximately use Hessian information for faster convergence, but at higher memory cost.
5.3 The Secant Method
Replaces the derivative with a finite difference:
\[
x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}
\]
Convergence order is approximately \(\phi \approx 1.618\) (the golden ratio), between linear and quadratic.
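The same root-finding problem without a derivative, as a short sketch:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Newton-like iteration with f'(x) replaced by a finite-difference slope."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("secant method did not converge")

print(secant(lambda x: x**2 - 2.0, 1.0, 2.0))  # ~1.41421356..., no derivative needed
```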
6. Numerical Solutions of ODEs
Solve the initial value problem \(\frac{dy}{dt} = f(t, y)\), \(y(t_0) = y_0\).
6.1 Euler's Method
\[
y_{n+1} = y_n + h\, f(t_n, y_n)
\]
- First-order method; local truncation error \(O(h^2)\), global error \(O(h)\)
- Simple but low accuracy; unsuitable for stiff problems
6.2 The Runge-Kutta Method (RK4)
The classical fourth-order Runge-Kutta method:
\[
\begin{aligned}
k_1 &= f(t_n,\, y_n) \\
k_2 &= f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_1\right) \\
k_3 &= f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_2\right) \\
k_4 &= f(t_n + h,\, y_n + h\, k_3) \\
y_{n+1} &= y_n + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)
\end{aligned}
\]
Local truncation error \(O(h^5)\), global error \(O(h^4)\).
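A single RK4 step is a direct transcription of the formulas above; the sketch below (NumPy assumed) integrates \(y' = -y\), \(y(0) = 1\) to \(t = 1\).

```python
import numpy as np

def rk4_step(f, t, y, h):
    """One classical RK4 step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

f = lambda t, y: -y                 # exact solution: e^{-t}
y, t, h = 1.0, 0.0, 0.1
while t < 1.0 - 1e-12:
    y = rk4_step(f, t, y, h)
    t += h
print(abs(y - np.exp(-1.0)))        # small global error, consistent with O(h^4)
```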
6.3 Stiff Problems and Implicit Methods
Stiff problems: ODEs containing both rapidly and slowly varying components; explicit methods require extremely small step sizes for stability.
- Implicit Euler method: \(y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1})\), A-stable, suitable for stiff problems
- BDF (Backward Differentiation Formula) methods: High-order implicit multistep methods widely used in stiff ODE solvers
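A sketch of why implicit methods pay off on stiff problems, using the hypothetical linear test problem \(y' = -1000\,(y - \cos t) - \sin t\) (slow solution \(\cos t\), fast transient \(e^{-1000t}\)); because the problem is linear, the implicit Euler update can be solved for \(y_{n+1}\) in closed form.

```python
import numpy as np

lam = 1000.0
f = lambda t, y: -lam * (y - np.cos(t)) - np.sin(t)

def implicit_euler(y, t, h):
    # y_{n+1} = y_n + h * f(t_{n+1}, y_{n+1}); linear in y_{n+1}, so solve directly.
    t1 = t + h
    return (y + h * (lam * np.cos(t1) - np.sin(t1))) / (1.0 + h * lam)

def explicit_euler(y, t, h):
    return y + h * f(t, y)

h, T = 0.01, 1.0
y_imp = y_exp = 1.5                 # start off the slow solution
for n in range(int(T / h)):
    t = n * h
    y_imp = implicit_euler(y_imp, t, h)
    y_exp = explicit_euler(y_exp, t, h)
print(y_imp, np.cos(T))             # implicit Euler tracks cos(t) even with h = 0.01
print(y_exp)                        # explicit Euler blows up: h = 0.01 > 2/lam = 0.002
```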
| Method | Order | Stability | Suitable for |
|---|---|---|---|
| Explicit Euler | 1 | Conditionally stable | Pedagogical demonstrations |
| RK4 | 4 | Conditionally stable | Non-stiff problems |
| Implicit Euler | 1 | A-stable | Stiff problems |
| BDF | 1-6 | A(\(\alpha\))-stable | Stiff problems |
7. Numerical Stability
7.1 Condition Number
The condition number of a matrix \(A\) measures the sensitivity of the output to perturbations in the input:
\[
\kappa(A) = \|A\|\,\|A^{-1}\|
\]
- \(\kappa(A) \approx 1\): well-conditioned problem
- \(\kappa(A) \gg 1\): ill-conditioned problem; small input changes can cause large output changes
7.2 Backward Error Analysis
Core idea: The numerically computed result \(\hat{x}\) is not the exact solution to the original problem, but it is the exact solution to some "perturbed problem," e.g. \((A + \Delta A)\hat{x} = b + \Delta b\) for a linear system.
If \(\|\Delta A\|\) and \(\|\Delta b\|\) are sufficiently small (on the order of machine precision relative to \(\|A\|\) and \(\|b\|\)), the algorithm is said to be backward stable.
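The two ideas meet in the classic Hilbert-matrix example (SciPy assumed): the solver is backward stable, so the residual stays tiny, yet the problem's huge condition number means the forward error is large anyway.

```python
import numpy as np
from scipy.linalg import hilbert

# Hilbert matrices are the standard ill-conditioned example.
for n in (4, 8, 12):
    print(n, np.linalg.cond(hilbert(n)))   # grows rapidly; ~1e16 already at n = 12

A = hilbert(12)
x_true = np.ones(12)
b = A @ x_true
x_hat = np.linalg.solve(A, b)
print(np.linalg.norm(A @ x_hat - b))                            # residual (backward error): tiny
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # forward error: many digits lost
```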
8. Applications in AI
8.1 Neural ODEs
Neural ODEs generalize the discrete layers of residual networks into a continuous differential equation:
\[
\frac{dh(t)}{dt} = f_\theta(h(t), t), \qquad h(0) = x
\]
- Forward propagation is performed using an ODE solver (e.g., adaptive RK45)
- Gradients are computed via the adjoint method, achieving \(O(1)\) memory complexity
- Applications: continuous normalizing flows, time-series modeling
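A minimal sketch, assuming PyTorch: a learned vector field integrated with the fixed-step RK4 scheme from Section 6.2, with gradients obtained by backpropagating through the solver steps. The class and function names are illustrative; note that this stores every step, unlike the adjoint method mentioned above (in practice, libraries such as torchdiffeq provide adaptive solvers and the adjoint).

```python
import torch
import torch.nn as nn

class Dynamics(nn.Module):
    """Learned vector field f_theta(h, t)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):
        t_col = torch.full_like(h[:, :1], t)            # broadcast scalar time to the batch
        return self.net(torch.cat([h, t_col], dim=-1))

def odeint_rk4(f, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step RK4 integration of dh/dt = f(t, h) from t0 to t1."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = f(t, h)
        k2 = f(t + dt / 2, h + dt / 2 * k1)
        k3 = f(t + dt / 2, h + dt / 2 * k2)
        k4 = f(t + dt, h + dt * k3)
        h = h + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + dt
    return h

f_theta = Dynamics(dim=2)
x = torch.randn(8, 2)                # batch of inputs = initial states h(0)
out = odeint_rk4(f_theta, x)         # "depth" is now continuous integration time
loss = out.pow(2).mean()
loss.backward()                      # gradients flow through every solver step
```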
8.2 Physics-Informed Neural Networks (PINNs)
PINNs embed physical equations as regularization terms in neural network training:
\[
\mathcal{L} = \mathcal{L}_{\text{data}} + \lambda\, \mathcal{L}_{\text{PDE}}
\]
where \(\mathcal{L}_{\text{PDE}}\) computes PDE residuals via automatic differentiation.
Applications: fluid dynamics simulation, materials science, weather prediction.
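A toy PINN-style sketch, assuming PyTorch: the equation \(u'(x) + u(x) = 0\) with \(u(0) = 1\) (exact solution \(e^{-x}\)), the network size, and the loss weighting are all illustrative choices, not taken from the references. The key step is computing the residual with `torch.autograd.grad`.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    # Random collocation points on [0, 2].
    x = (2.0 * torch.rand(128, 1)).requires_grad_(True)
    u = net(x)
    # PDE residual via automatic differentiation: du/dx at the collocation points.
    du_dx, = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)
    loss_pde = ((du_dx + u) ** 2).mean()
    # Initial-condition term u(0) = 1.
    u0 = net(torch.zeros(1, 1))
    loss = loss_pde + (u0 - 1.0).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item())   # approaches e^{-1} ~ 0.368 as training converges
```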
8.3 Numerical Linear Algebra and Deep Learning
- SVD (Singular Value Decomposition): Used for low-rank approximation (LoRA), dimensionality reduction (PCA)
- Eigendecomposition: Hessian spectral analysis for understanding loss landscapes
- Krylov subspace methods: Efficient solvers for large-scale sparse matrices
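A small NumPy sketch of the low-rank idea behind the first bullet: truncated SVD gives the best rank-\(r\) approximation (Eckart-Young). The synthetic "rank-16 signal plus noise" matrix below is illustrative and only loosely mimics the low-rank structure that LoRA-style methods exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 512))   # rank-16 signal
W = signal + 0.01 * rng.standard_normal((512, 512))                        # plus small noise

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 16
W_r = (U[:, :r] * s[:r]) @ Vt[:r]          # store ~2 * 512 * 16 numbers instead of 512^2
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))   # small: the signal lives in 16 directions
```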
References
- Burden & Faires, Numerical Analysis
- Trefethen & Bau, Numerical Linear Algebra
- Higham, Accuracy and Stability of Numerical Algorithms
- Chen et al., "Neural Ordinary Differential Equations" (NeurIPS 2018)
- Raissi et al., "Physics-informed neural networks" (JCP 2019)