Optimization Theory
Overview
Optimization theory is a core branch of mathematics and engineering that studies how to find the optimal solution of an objective function under given constraints. In machine learning and deep learning, virtually all model training reduces to an optimization problem.
1. Convex Sets and Convex Functions
1.1 Convex Sets
A set \(C \subseteq \mathbb{R}^n\) is a convex set if and only if for any \(x, y \in C\) and \(\theta \in [0, 1]\):

$$\theta x + (1 - \theta) y \in C$$
Common convex sets:
- Hyperplane \(\{x \mid a^T x = b\}\)
- Half-space \(\{x \mid a^T x \leq b\}\)
- Ball \(\{x \mid \|x - x_c\| \leq r\}\)
- Polyhedron (intersection of linear inequalities)
Property: The intersection of convex sets is still convex.
1.2 Convex Functions
A function \(f: \mathbb{R}^n \to \mathbb{R}\) is a convex function if and only if its domain is a convex set and for any \(x, y\) in the domain and \(\theta \in [0, 1]\):

$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)$$
Equivalent conditions (the first requires \(f\) differentiable, the second twice differentiable):
- First-order condition: \(f(y) \geq f(x) + \nabla f(x)^T (y - x)\)
- Second-order condition: \(\nabla^2 f(x) \succeq 0\) (Hessian is positive semidefinite)
Common convex functions:
| Function | Domain | Convexity |
|---|---|---|
| \(\|x\|_p\) (\(p \geq 1\)) | \(\mathbb{R}^n\) | Convex |
| \(e^{ax}\) | \(\mathbb{R}\) | Convex |
| \(x \log x\) | \(\mathbb{R}_{++}\) | Convex |
| \(-\log x\) | \(\mathbb{R}_{++}\) | Convex |
| \(\max(x_1, \ldots, x_n)\) | \(\mathbb{R}^n\) | Convex |
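The defining inequality can be spot-checked numerically. The sketch below (the helper name `is_convex_on_samples` is our own) draws random pairs and a random \(\theta\); passing every sample is only evidence of convexity, while a single violation certifies non-convexity:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_convex_on_samples(f, dim, n_samples=1000, positive=False):
    """Spot-check f(tx + (1-t)y) <= t f(x) + (1-t) f(y) on random pairs."""
    for _ in range(n_samples):
        x = rng.standard_normal(dim)
        y = rng.standard_normal(dim)
        if positive:  # restrict samples to the domain R_++
            x, y = np.abs(x) + 0.1, np.abs(y) + 0.1
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return False  # found a certified violation of convexity
    return True

print(is_convex_on_samples(np.max, dim=3))                        # max is convex
print(is_convex_on_samples(lambda x: -np.log(x[0]), 1, positive=True))
print(is_convex_on_samples(lambda x: np.sin(x[0]), 1))            # sin is not convex
```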
1.3 Convex Optimization Problems
Standard form:

$$\begin{aligned} \min_x \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & h_j(x) = 0, \quad j = 1, \ldots, p \end{aligned}$$

where \(f\) and all \(g_i\) are convex functions and the equality constraints \(h_j\) are affine.
Key property: Any local optimum of a convex optimization problem is also the global optimum.
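This property can be illustrated empirically: plain gradient descent started from several random points on a convex function lands on the same minimizer every time. A minimal sketch, using \(f(x) = x^2 + e^x\) as the example:

```python
import numpy as np

def f_grad(x):
    # Gradient of the strictly convex function f(x) = x^2 + e^x
    return 2 * x + np.exp(x)

rng = np.random.default_rng(1)
minima = []
for _ in range(5):
    x = rng.uniform(-3, 3, size=1)       # random starting point
    for _ in range(5000):
        x = x - 0.1 * f_grad(x)          # plain gradient descent
    minima.append(x[0])

# Every run converges to the unique global minimizer (root of 2x + e^x = 0)
print(np.ptp(minima))                    # spread across starts: essentially zero
```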
2. Unconstrained Optimization
2.1 Gradient Descent
Basic iteration:

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where \(\alpha > 0\) is the learning rate (step size).
Convergence analysis:
Assume \(f\) is \(L\)-smooth (\(\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|\)):
- With step size \(\alpha = \frac{1}{L}\): \(f(x_k) - f^* \leq \frac{L\|x_0 - x^*\|^2}{2k}\)
- Convergence rate: \(O(1/k)\) (sublinear)
If \(f\) is additionally \(\mu\)-strongly convex (\(\nabla^2 f(x) \succeq \mu I\)):
- Convergence rate: \(O\left(\left(\frac{L - \mu}{L + \mu}\right)^k\right)\) (linear convergence)
- The larger the condition number \(\kappa = L / \mu\), the slower the convergence
```python
import numpy as np

def gradient_descent(f_grad, x0, lr=0.01, tol=1e-6, max_iter=10000):
    """Basic gradient descent; stops when the gradient norm drops below tol."""
    x = x0.copy()
    for k in range(max_iter):
        grad = f_grad(x)
        if np.linalg.norm(grad) < tol:
            break
        x = x - lr * grad
    return x, k
```
2.2 Learning Rate Selection
Fixed learning rate: Simple but requires tuning
Backtracking line search:
Given a descent direction \(d\), find a step size \(\alpha\) satisfying the Armijo condition:

$$f(x + \alpha d) \leq f(x) + c_1 \alpha \nabla f(x)^T d$$

where \(c_1 \in (0, 1)\), typically \(c_1 = 10^{-4}\).
Exact line search: \(\alpha = \arg\min_{\alpha \geq 0} f(x + \alpha d)\); usually too expensive outside of special cases such as quadratics.
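A backtracking sketch of the Armijo test (the function name and the shrink factor `rho` below are illustrative choices, not a fixed standard):

```python
import numpy as np

def backtracking(f, grad_x, x, d, c1=1e-4, rho=0.5, alpha=1.0):
    """Shrink alpha by rho until f(x + alpha d) <= f(x) + c1*alpha*grad^T d."""
    fx = f(x)
    slope = grad_x @ d      # directional derivative; negative for a descent direction
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

# Example: f(x) = ||x||^2 at x = (2, 0) along the steepest-descent direction
f = lambda x: x @ x
x = np.array([2.0, 0.0])
g = 2 * x
alpha = backtracking(f, g, x, -g)
print(alpha)   # 0.5: the full step alpha = 1 overshoots, one halving suffices
```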
2.3 Gradient Descent Variants
```mermaid
graph TD
    GD[Gradient Descent] --> SGD[Stochastic Gradient Descent]
    GD --> Mini[Mini-Batch Gradient Descent]
    GD --> Momentum[Momentum]
    Momentum --> NAG[Nesterov Acceleration]
    SGD --> Adam[Adam Optimizer]
    SGD --> Adagrad[Adagrad]
    Adagrad --> RMSProp[RMSProp]
    RMSProp --> Adam
```
Momentum method:

$$v_{k+1} = \beta v_k - \alpha \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1}$$

Nesterov Accelerated Gradient (NAG) evaluates the gradient at the look-ahead point:

$$v_{k+1} = \beta v_k - \alpha \nabla f(x_k + \beta v_k), \qquad x_{k+1} = x_k + v_{k+1}$$
For convex functions, NAG achieves a convergence rate of \(O(1/k^2)\), improving upon the \(O(1/k)\) rate of standard gradient descent.
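Both updates can be sketched with the common convention \(v \leftarrow \beta v - \alpha \nabla f\), \(x \leftarrow x + v\); the look-ahead gradient in `nag` is the only difference. The test problem is an ill-conditioned quadratic:

```python
import numpy as np

def momentum(f_grad, x0, lr=0.01, beta=0.9, iters=500):
    """Heavy-ball momentum: v is an exponentially weighted sum of past gradients."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = beta * v - lr * f_grad(x)
        x = x + v
    return x

def nag(f_grad, x0, lr=0.01, beta=0.9, iters=500):
    """Nesterov: evaluate the gradient at the look-ahead point x + beta*v."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = beta * v - lr * f_grad(x + beta * v)
        x = x + v
    return x

# Ill-conditioned quadratic f(x) = 0.5*(100 x1^2 + x2^2), minimizer at the origin
grad = lambda x: np.array([100.0, 1.0]) * x
x0 = np.array([1.0, 1.0])
print(momentum(grad, x0), nag(grad, x0))   # both converge to ~[0, 0]
```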
3. Constrained Optimization
3.1 Lagrange Multiplier Method
For equality-constrained optimization problems:

$$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \quad j = 1, \ldots, p$$

Construct the Lagrangian:

$$L(x, \nu) = f(x) + \sum_{j=1}^{p} \nu_j h_j(x)$$

Necessary conditions:

$$\nabla_x L(x, \nu) = 0, \qquad h_j(x) = 0, \quad j = 1, \ldots, p$$
3.2 KKT Conditions
For general constrained optimization problems:

$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad h_j(x) = 0, \quad j = 1, \ldots, p$$

KKT conditions (Karush-Kuhn-Tucker):

$$\begin{aligned} &\text{Stationarity:} && \nabla f(x^*) + \sum_{i} \lambda_i \nabla g_i(x^*) + \sum_{j} \nu_j \nabla h_j(x^*) = 0 \\ &\text{Primal feasibility:} && g_i(x^*) \leq 0, \quad h_j(x^*) = 0 \\ &\text{Dual feasibility:} && \lambda_i \geq 0 \\ &\text{Complementary slackness:} && \lambda_i g_i(x^*) = 0 \end{aligned}$$
Meaning of complementary slackness: If \(\lambda_i > 0\), then \(g_i(x^*) = 0\) (the constraint is active).
Importance of KKT Conditions
For convex optimization problems satisfying a constraint qualification such as Slater's condition, the KKT conditions are both necessary and sufficient: any point satisfying them is a global optimum. The solution of Support Vector Machines (SVMs) relies on the KKT conditions.
3.3 Constrained Optimization Example
Problem: Minimize a linear function (\(c \neq 0\)) on the unit ball:

$$\min_x c^T x \quad \text{s.t.} \quad \|x\|^2 \leq 1$$

KKT conditions:
- Stationarity: \(c + 2\lambda x^* = 0 \implies x^* = -\frac{c}{2\lambda}\)
- Complementary slackness: \(\lambda(\|x^*\|^2 - 1) = 0\)
- Since \(c \neq 0\), stationarity forces \(\lambda > 0\), so the constraint is active: \(\|x^*\| = 1\), giving \(\lambda = \frac{\|c\|}{2}\) and \(x^* = -\frac{c}{\|c\|}\)
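The closed-form solution can be verified numerically. The check below (with the arbitrary example \(c = (3, 4)\)) confirms stationarity, the active constraint, and that no sampled feasible point does better, consistent with Cauchy-Schwarz (\(c^T x \geq -\|c\|\)):

```python
import numpy as np

c = np.array([3.0, 4.0])
lam = np.linalg.norm(c) / 2              # lambda* = ||c|| / 2 = 2.5
x_star = -c / np.linalg.norm(c)          # x* = -c / ||c||, on the boundary

print(c + 2 * lam * x_star)              # stationarity: [0. 0.]
print(np.linalg.norm(x_star))            # constraint active: 1.0
print(c @ x_star)                        # optimal value -||c|| = -5.0

# No random feasible point attains a lower objective value
rng = np.random.default_rng(2)
pts = rng.standard_normal((1000, 2))
pts /= np.maximum(np.linalg.norm(pts, axis=1, keepdims=True), 1.0)  # project into ball
print((pts @ c).min() >= c @ x_star - 1e-9)   # True
```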
4. Duality Theory
4.1 Lagrangian Duality
Lagrangian dual function:

$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left( f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x) \right)$$
Property: \(g(\lambda, \nu)\) is a concave function (even if the original problem is non-convex).
Weak duality: for any \(\lambda \geq 0\) and any \(\nu\), \(g(\lambda, \nu) \leq p^*\), hence

$$d^* \leq p^*$$

Duality gap: \(p^* - d^* \geq 0\).
Strong duality:
When Slater's condition holds (there exists a strictly feasible point \(\tilde{x}\) with \(g_i(\tilde{x}) < 0\) for all \(i\)), the duality gap is zero:

$$d^* = p^*$$
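A tiny worked example of our own, in one variable: minimize \(x^2\) subject to \(x \geq 1\), so \(p^* = 1\). The Lagrangian \(L(x, \lambda) = x^2 + \lambda(1 - x)\) is minimized at \(x = \lambda/2\), giving the concave dual \(g(\lambda) = \lambda - \lambda^2/4\); since \(x = 2\) is strictly feasible, Slater's condition guarantees \(d^* = p^*\):

```python
import numpy as np

# Dual function g(lam) = lam - lam^2/4 evaluated on a grid of lam >= 0
lams = np.linspace(0.0, 4.0, 4001)
g = lams - lams**2 / 4

p_star = 1.0                 # primal optimum, attained at x = 1
d_star = g.max()             # dual optimum, attained at lam = 2

print(np.all(g <= p_star + 1e-12))   # weak duality: every g(lam) <= p*  -> True
print(d_star)                        # strong duality: d* = 1.0 = p*
```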
4.2 Applications of Duality
The dual form of the SVM is one of the classic applications of duality theory.
Primal problem (hard margin):

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \quad i = 1, \ldots, n$$

Dual problem:

$$\max_{\alpha \geq 0} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
5. Newton's Method and Quasi-Newton Methods
5.1 Newton's Method
Leveraging second-order information for optimization:

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
Convergence: Exhibits quadratic convergence near the optimum.
Advantages:
- Fast convergence (quadratic)
- Insensitive to the condition number
Disadvantages:
- Requires computing and storing the \(n \times n\) Hessian matrix
- Per-step computational complexity of \(O(n^3)\)
- The Hessian may not be positive definite
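A minimal sketch of the pure Newton iteration (no damping or Hessian modification, so it assumes a well-behaved convex \(f\) and a reasonable starting point):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: solve H(x) p = -grad(x) and step x <- x + p."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x

# f(x) = sum_i (exp(x_i) - 2 x_i): smooth, strictly convex, minimized at ln 2
grad = lambda x: np.exp(x) - 2.0
hess = lambda x: np.diag(np.exp(x))
x = newton(grad, hess, np.zeros(3))
print(x)   # each coordinate -> ln 2 ≈ 0.6931, in a handful of iterations
```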
5.2 Quasi-Newton Methods (BFGS)
Approximate the Hessian with a matrix \(B_k\) that is updated at each step.
BFGS update formula:

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$

where \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\).
L-BFGS: A limited-memory variant that stores only the most recent \(m\) pairs of \(\{s_k, y_k\}\), suitable for large-scale problems.
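A compact BFGS sketch (unit steps for brevity; a practical implementation would add a Wolfe line search to guarantee the curvature condition \(y_k^T s_k > 0\) and global convergence):

```python
import numpy as np

def bfgs(f_grad, x0, tol=1e-8, max_iter=100):
    """Minimal BFGS with unit steps, applying the update formula above."""
    x = x0.copy()
    B = np.eye(x.size)                 # initial Hessian approximation
    g = f_grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(B, g)     # quasi-Newton direction
        x_new = x + p
        g_new = f_grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:              # curvature condition; skip update otherwise
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x

# Quadratic f(x) = 0.5 x^T A x - b^T x with minimizer A^{-1} b = (1, 1)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
x = bfgs(lambda x: A @ x - b, np.zeros(2))
print(x)   # ≈ [1. 1.]
```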
5.3 Method Comparison
| Method | Convergence Rate | Per-Step Complexity | Memory | Use Case |
|---|---|---|---|---|
| Gradient Descent | \(O(1/k)\) | \(O(n)\) | \(O(n)\) | Large-scale, simple |
| Newton's Method | Quadratic | \(O(n^3)\) | \(O(n^2)\) | Small-scale, precise |
| BFGS | Superlinear | \(O(n^2)\) | \(O(n^2)\) | Medium-scale |
| L-BFGS | Superlinear | \(O(mn)\) | \(O(mn)\) | Large-scale |
6. Special Convex Optimization Problems
6.1 Linear Programming (LP)

$$\min_x c^T x \quad \text{s.t.} \quad Ax \leq b$$

Solution methods: simplex method, interior-point methods.
6.2 Quadratic Programming (QP)

$$\min_x \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax \leq b$$

with \(Q \succeq 0\). The SVM training problem is a quadratic program.
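For the equality-constrained special case (\(Ax = b\) only), the KKT conditions from Section 3 reduce to a single linear system, so the QP can be solved directly (the helper name `eq_qp` is our own):

```python
import numpy as np

def eq_qp(Q, c, A, b):
    """Solve min 0.5 x^T Q x + c^T x  s.t.  A x = b via the KKT system
    [[Q, A^T], [A, 0]] [x; nu] = [-c; b]."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:n], sol[n:]            # primal solution x*, multipliers nu*

# min x1^2 + x2^2  s.t.  x1 + x2 = 1  ->  x* = (0.5, 0.5)
Q = 2 * np.eye(2)
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, nu = eq_qp(Q, c, A, b)
print(x)   # [0.5 0.5]
```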
6.3 Semidefinite Programming (SDP)

$$\min_X \operatorname{tr}(CX) \quad \text{s.t.} \quad \operatorname{tr}(A_i X) = b_i, \quad X \succeq 0$$

Applications: relaxations of combinatorial optimization problems, control theory.
7. Non-Convex Optimization
7.1 Challenges
- Multiple local minima exist
- Saddle point problem (ubiquitous in high-dimensional spaces)
- NP-hard in the general case
7.2 Non-Convex Optimization in Deep Learning
The loss function of deep neural networks is highly non-convex, yet SGD and its variants remain effective in practice:
- Over-parameterization: Far more parameters than data points; local minima tend to have similar quality
- Implicit regularization of SGD: Noise helps escape sharp minima
- Loss surface structure: In practice, saddle points are more common than poor local minima
```python
import numpy as np

# Adam optimizer implementation
def adam(f_grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, max_iter=10000):
    x = x0.copy()
    m = np.zeros_like(x)  # First moment estimate
    v = np.zeros_like(x)  # Second moment estimate
    for t in range(1, max_iter + 1):
        grad = f_grad(x)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)  # Bias correction
        v_hat = v / (1 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```
8. Connection Between Optimization Theory and Machine Learning
```mermaid
graph LR
    A[Optimization Theory] --> B[Model Training]
    A --> C[Hyperparameter Tuning]
    A --> D[Constrained Learning]
    B --> E[Loss Function Minimization]
    B --> F[Regularization = Constrained Optimization]
    C --> G[Bayesian Optimization]
    D --> H[Fairness Constraints]
    D --> I[Physics Constraints]
```
| Machine Learning Concept | Optimization Correspondence |
|---|---|
| Training | Unconstrained/Constrained optimization |
| Regularization | Lagrangian duality |
| SVM | Quadratic programming + KKT |
| Neural networks | Non-convex optimization |
| Hyperparameter search | Black-box optimization |
References
- Boyd & Vandenberghe, Convex Optimization (classic textbook, freely available online)
- Nocedal & Wright, Numerical Optimization
- Bertsekas, Nonlinear Programming