Optimization Theory
Overview
Optimization theory is a core branch of mathematics and engineering that studies how to find the optimal solution of an objective function under given constraints. In machine learning and deep learning, virtually all model training reduces to an optimization problem.
1. Convex Sets and Convex Functions
1.1 Convex Sets
A set \(C \subseteq \mathbb{R}^n\) is a convex set if and only if for any \(x, y \in C\) and \(\theta \in [0, 1]\):

$$\theta x + (1 - \theta) y \in C$$
Common convex sets:
- Hyperplane \(\{x \mid a^T x = b\}\)
- Half-space \(\{x \mid a^T x \leq b\}\)
- Ball \(\{x \mid \|x - x_c\| \leq r\}\)
- Polyhedron (intersection of linear inequalities)
Property: The intersection of convex sets is still convex.
1.2 Convex Functions
A function \(f: \mathbb{R}^n \to \mathbb{R}\) is a convex function if and only if its domain is a convex set and for any \(x, y\) in the domain and \(\theta \in [0, 1]\):

$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)$$
Equivalent conditions (the first requires \(f\) differentiable, the second twice differentiable):
- First-order condition: \(f(y) \geq f(x) + \nabla f(x)^T (y - x)\)
- Second-order condition: \(\nabla^2 f(x) \succeq 0\) (Hessian is positive semidefinite)
Common convex functions:
| Function | Domain | Convexity |
|---|---|---|
| \(\|x\|_p\) (\(p \geq 1\)) | \(\mathbb{R}^n\) | Convex |
| \(e^{ax}\) | \(\mathbb{R}\) | Convex |
| \(x \log x\) | \(\mathbb{R}_{++}\) | Convex |
| \(-\log x\) | \(\mathbb{R}_{++}\) | Convex |
| \(\max(x_1, \ldots, x_n)\) | \(\mathbb{R}^n\) | Convex |
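The defining inequality can be spot-checked numerically. The sketch below (the helper name `is_convex_on_samples` is our own) draws random pairs and a random \(\theta\); passing every sample is only evidence of convexity, while a single violation certifies non-convexity:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_convex_on_samples(f, dim, n_samples=1000, positive=False):
    """Spot-check f(tx + (1-t)y) <= t f(x) + (1-t) f(y) on random pairs."""
    for _ in range(n_samples):
        x = rng.standard_normal(dim)
        y = rng.standard_normal(dim)
        if positive:  # restrict samples to the domain R_++
            x, y = np.abs(x) + 0.1, np.abs(y) + 0.1
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            return False  # found a certified violation of convexity
    return True

print(is_convex_on_samples(np.max, dim=3))                        # max is convex
print(is_convex_on_samples(lambda x: -np.log(x[0]), 1, positive=True))
print(is_convex_on_samples(lambda x: np.sin(x[0]), 1))            # sin is not convex
```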
1.3 Convex Optimization Problems
Standard form:

$$\begin{aligned} \min_x \quad & f(x) \\ \text{s.t.} \quad & g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & h_j(x) = 0, \quad j = 1, \ldots, p \end{aligned}$$

where \(f\) and all \(g_i\) are convex functions and the equality constraints \(h_j\) are affine.
Key property: Any local optimum of a convex optimization problem is also the global optimum.
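This property can be illustrated empirically: plain gradient descent started from several random points on a convex function lands on the same minimizer every time. A minimal sketch, using \(f(x) = x^2 + e^x\) as the example:

```python
import numpy as np

def f_grad(x):
    # Gradient of the strictly convex function f(x) = x^2 + e^x
    return 2 * x + np.exp(x)

rng = np.random.default_rng(1)
minima = []
for _ in range(5):
    x = rng.uniform(-3, 3, size=1)       # random starting point
    for _ in range(5000):
        x = x - 0.1 * f_grad(x)          # plain gradient descent
    minima.append(x[0])

# Every run converges to the unique global minimizer (root of 2x + e^x = 0)
print(np.ptp(minima))                    # spread across starts: essentially zero
```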
2. Unconstrained Optimization
2.1 Gradient Descent
Basic iteration:

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where \(\alpha > 0\) is the learning rate (step size).
Convergence analysis:
Assume \(f\) is \(L\)-smooth (\(\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|\)):
- With step size \(\alpha = \frac{1}{L}\): \(f(x_k) - f^* \leq \frac{L\|x_0 - x^*\|^2}{2k}\)
- Convergence rate: \(O(1/k)\) (sublinear)
If \(f\) is additionally \(\mu\)-strongly convex (\(\nabla^2 f(x) \succeq \mu I\)):
- Convergence rate: \(O\left(\left(\frac{L - \mu}{L + \mu}\right)^k\right)\) (linear convergence)
- The larger the condition number \(\kappa = L / \mu\), the slower the convergence
```python
import numpy as np

def gradient_descent(f_grad, x0, lr=0.01, tol=1e-6, max_iter=10000):
    """Basic gradient descent; stops when the gradient norm drops below tol."""
    x = x0.copy()
    for k in range(max_iter):
        grad = f_grad(x)
        if np.linalg.norm(grad) < tol:
            break
        x = x - lr * grad
    return x, k
```
2.2 Learning Rate Selection
Fixed learning rate: Simple but requires tuning
Backtracking line search:
Given a descent direction \(d\), find a step size \(\alpha\) satisfying the Armijo condition:

$$f(x + \alpha d) \leq f(x) + c_1 \alpha \nabla f(x)^T d$$

where \(c_1 \in (0, 1)\), typically \(c_1 = 10^{-4}\).
Exact line search: \(\alpha = \arg\min_{\alpha \geq 0} f(x + \alpha d)\); usually too expensive outside of special cases such as quadratics.
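A backtracking sketch of the Armijo test (the function name and the shrink factor `rho` below are illustrative choices, not a fixed standard):

```python
import numpy as np

def backtracking(f, grad_x, x, d, c1=1e-4, rho=0.5, alpha=1.0):
    """Shrink alpha by rho until f(x + alpha d) <= f(x) + c1*alpha*grad^T d."""
    fx = f(x)
    slope = grad_x @ d      # directional derivative; negative for a descent direction
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

# Example: f(x) = ||x||^2 at x = (2, 0) along the steepest-descent direction
f = lambda x: x @ x
x = np.array([2.0, 0.0])
g = 2 * x
alpha = backtracking(f, g, x, -g)
print(alpha)   # 0.5: the full step alpha = 1 overshoots, one halving suffices
```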
2.3 Gradient Descent Variants
```mermaid
graph TD
    GD[Gradient Descent] --> SGD[Stochastic Gradient Descent]
    GD --> Mini[Mini-Batch Gradient Descent]
    GD --> Momentum[Momentum]
    Momentum --> NAG[Nesterov Acceleration]
    SGD --> Adam[Adam Optimizer]
    SGD --> Adagrad[Adagrad]
    Adagrad --> RMSProp[RMSProp]
    RMSProp --> Adam
```
Momentum method:

$$v_{k+1} = \beta v_k - \alpha \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1}$$

Nesterov Accelerated Gradient (NAG) evaluates the gradient at the look-ahead point:

$$v_{k+1} = \beta v_k - \alpha \nabla f(x_k + \beta v_k), \qquad x_{k+1} = x_k + v_{k+1}$$
For convex functions, NAG achieves a convergence rate of \(O(1/k^2)\), improving upon the \(O(1/k)\) rate of standard gradient descent.
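Both updates can be sketched with the common convention \(v \leftarrow \beta v - \alpha \nabla f\), \(x \leftarrow x + v\); the look-ahead gradient in `nag` is the only difference. The test problem is an ill-conditioned quadratic:

```python
import numpy as np

def momentum(f_grad, x0, lr=0.01, beta=0.9, iters=500):
    """Heavy-ball momentum: v is an exponentially weighted sum of past gradients."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = beta * v - lr * f_grad(x)
        x = x + v
    return x

def nag(f_grad, x0, lr=0.01, beta=0.9, iters=500):
    """Nesterov: evaluate the gradient at the look-ahead point x + beta*v."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = beta * v - lr * f_grad(x + beta * v)
        x = x + v
    return x

# Ill-conditioned quadratic f(x) = 0.5*(100 x1^2 + x2^2), minimizer at the origin
grad = lambda x: np.array([100.0, 1.0]) * x
x0 = np.array([1.0, 1.0])
print(momentum(grad, x0), nag(grad, x0))   # both converge to ~[0, 0]
```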
3. Constrained Optimization
3.1 Lagrange Multiplier Method
For equality-constrained optimization problems:

$$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \quad j = 1, \ldots, p$$

Construct the Lagrangian:

$$L(x, \nu) = f(x) + \sum_{j=1}^{p} \nu_j h_j(x)$$

Necessary conditions:

$$\nabla_x L(x, \nu) = 0, \qquad h_j(x) = 0, \quad j = 1, \ldots, p$$
3.2 KKT Conditions
For general constrained optimization problems:

$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad h_j(x) = 0, \quad j = 1, \ldots, p$$

KKT conditions (Karush-Kuhn-Tucker):

$$\begin{aligned} &\text{Stationarity:} && \nabla f(x^*) + \sum_{i} \lambda_i \nabla g_i(x^*) + \sum_{j} \nu_j \nabla h_j(x^*) = 0 \\ &\text{Primal feasibility:} && g_i(x^*) \leq 0, \quad h_j(x^*) = 0 \\ &\text{Dual feasibility:} && \lambda_i \geq 0 \\ &\text{Complementary slackness:} && \lambda_i g_i(x^*) = 0 \end{aligned}$$
Meaning of complementary slackness: If \(\lambda_i > 0\), then \(g_i(x^*) = 0\) (the constraint is active).
Importance of KKT Conditions
For convex optimization problems satisfying a constraint qualification such as Slater's condition, the KKT conditions are both necessary and sufficient: any point satisfying them is a global optimum. The solution of Support Vector Machines (SVMs) relies on the KKT conditions.
3.3 Constrained Optimization Example
Problem: Minimize a linear function (\(c \neq 0\)) on the unit ball:

$$\min_x c^T x \quad \text{s.t.} \quad \|x\|^2 \leq 1$$

KKT conditions:
- Stationarity: \(c + 2\lambda x^* = 0 \implies x^* = -\frac{c}{2\lambda}\)
- Complementary slackness: \(\lambda(\|x^*\|^2 - 1) = 0\)
- Since \(c \neq 0\), stationarity forces \(\lambda > 0\), so the constraint is active: \(\|x^*\| = 1\), giving \(\lambda = \frac{\|c\|}{2}\) and \(x^* = -\frac{c}{\|c\|}\)
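The closed-form solution can be verified numerically. The check below (with the arbitrary example \(c = (3, 4)\)) confirms stationarity, the active constraint, and that no sampled feasible point does better, consistent with Cauchy-Schwarz (\(c^T x \geq -\|c\|\)):

```python
import numpy as np

c = np.array([3.0, 4.0])
lam = np.linalg.norm(c) / 2              # lambda* = ||c|| / 2 = 2.5
x_star = -c / np.linalg.norm(c)          # x* = -c / ||c||, on the boundary

print(c + 2 * lam * x_star)              # stationarity: [0. 0.]
print(np.linalg.norm(x_star))            # constraint active: 1.0
print(c @ x_star)                        # optimal value -||c|| = -5.0

# No random feasible point attains a lower objective value
rng = np.random.default_rng(2)
pts = rng.standard_normal((1000, 2))
pts /= np.maximum(np.linalg.norm(pts, axis=1, keepdims=True), 1.0)  # project into ball
print((pts @ c).min() >= c @ x_star - 1e-9)   # True
```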
4. Duality Theory
4.1 Lagrangian Duality
Lagrangian dual function:

$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left( f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x) \right)$$
Property: \(g(\lambda, \nu)\) is a concave function (even if the original problem is non-convex).
Weak duality: for any \(\lambda \geq 0\) and any \(\nu\), \(g(\lambda, \nu) \leq p^*\), hence

$$d^* \leq p^*$$

Duality gap: \(p^* - d^* \geq 0\).
Strong duality:
When Slater's condition holds (there exists a strictly feasible point \(\tilde{x}\) with \(g_i(\tilde{x}) < 0\) for all \(i\)), the duality gap is zero:

$$d^* = p^*$$
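A tiny worked example of our own, in one variable: minimize \(x^2\) subject to \(x \geq 1\), so \(p^* = 1\). The Lagrangian \(L(x, \lambda) = x^2 + \lambda(1 - x)\) is minimized at \(x = \lambda/2\), giving the concave dual \(g(\lambda) = \lambda - \lambda^2/4\); since \(x = 2\) is strictly feasible, Slater's condition guarantees \(d^* = p^*\):

```python
import numpy as np

# Dual function g(lam) = lam - lam^2/4 evaluated on a grid of lam >= 0
lams = np.linspace(0.0, 4.0, 4001)
g = lams - lams**2 / 4

p_star = 1.0                 # primal optimum, attained at x = 1
d_star = g.max()             # dual optimum, attained at lam = 2

print(np.all(g <= p_star + 1e-12))   # weak duality: every g(lam) <= p*  -> True
print(d_star)                        # strong duality: d* = 1.0 = p*
```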
4.2 Applications of Duality
The dual form of the SVM is one of the classic applications of duality theory.
Primal problem (hard margin):

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \quad i = 1, \ldots, n$$

Dual problem:

$$\max_{\alpha \geq 0} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
5. Newton's Method and Quasi-Newton Methods
5.1 Newton's Method
Leveraging second-order information for optimization:

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
Convergence: Exhibits quadratic convergence near the optimum.
Advantages:
- Fast convergence (quadratic)
- Insensitive to the condition number
Disadvantages:
- Requires computing and storing the \(n \times n\) Hessian matrix
- Per-step computational complexity of \(O(n^3)\)
- The Hessian may not be positive definite
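A minimal sketch of the pure Newton iteration (no damping or Hessian modification, so it assumes a well-behaved convex \(f\) and a reasonable starting point):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: solve H(x) p = -grad(x) and step x <- x + p."""
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x

# f(x) = sum_i (exp(x_i) - 2 x_i): smooth, strictly convex, minimized at ln 2
grad = lambda x: np.exp(x) - 2.0
hess = lambda x: np.diag(np.exp(x))
x = newton(grad, hess, np.zeros(3))
print(x)   # each coordinate -> ln 2 ≈ 0.6931, in a handful of iterations
```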
5.2 Quasi-Newton Methods (BFGS)
Approximate the Hessian with a matrix \(B_k\) that is updated at each step.
BFGS update formula:

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$

where \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\).
L-BFGS: A limited-memory variant that stores only the most recent \(m\) pairs of \(\{s_k, y_k\}\), suitable for large-scale problems.
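A compact BFGS sketch (unit steps for brevity; a practical implementation would add a Wolfe line search to guarantee the curvature condition \(y_k^T s_k > 0\) and global convergence):

```python
import numpy as np

def bfgs(f_grad, x0, tol=1e-8, max_iter=100):
    """Minimal BFGS with unit steps, applying the update formula above."""
    x = x0.copy()
    B = np.eye(x.size)                 # initial Hessian approximation
    g = f_grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(B, g)     # quasi-Newton direction
        x_new = x + p
        g_new = f_grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:              # curvature condition; skip update otherwise
            Bs = B @ s
            B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
        x, g = x_new, g_new
    return x

# Quadratic f(x) = 0.5 x^T A x - b^T x with minimizer A^{-1} b = (1, 1)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
x = bfgs(lambda x: A @ x - b, np.zeros(2))
print(x)   # ≈ [1. 1.]
```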
5.3 Method Comparison
| Method | Convergence Rate | Per-Step Complexity | Memory | Use Case |
|---|---|---|---|---|
| Gradient Descent | \(O(1/k)\) | \(O(n)\) | \(O(n)\) | Large-scale, simple |
| Newton's Method | Quadratic | \(O(n^3)\) | \(O(n^2)\) | Small-scale, precise |
| BFGS | Superlinear | \(O(n^2)\) | \(O(n^2)\) | Medium-scale |
| L-BFGS | Superlinear | \(O(mn)\) | \(O(mn)\) | Large-scale |
6. Special Convex Optimization Problems
6.1 Linear Programming (LP)

$$\min_x c^T x \quad \text{s.t.} \quad Ax \leq b$$

Solution methods: simplex method, interior-point methods.
6.2 Quadratic Programming (QP)

$$\min_x \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax \leq b$$

with \(Q \succeq 0\). The SVM training problem is a quadratic program.
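For the equality-constrained special case (\(Ax = b\) only), the KKT conditions from Section 3 reduce to a single linear system, so the QP can be solved directly (the helper name `eq_qp` is our own):

```python
import numpy as np

def eq_qp(Q, c, A, b):
    """Solve min 0.5 x^T Q x + c^T x  s.t.  A x = b via the KKT system
    [[Q, A^T], [A, 0]] [x; nu] = [-c; b]."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-c, b]))
    return sol[:n], sol[n:]            # primal solution x*, multipliers nu*

# min x1^2 + x2^2  s.t.  x1 + x2 = 1  ->  x* = (0.5, 0.5)
Q = 2 * np.eye(2)
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, nu = eq_qp(Q, c, A, b)
print(x)   # [0.5 0.5]
```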
6.3 Semidefinite Programming (SDP)

$$\min_X \operatorname{tr}(CX) \quad \text{s.t.} \quad \operatorname{tr}(A_i X) = b_i, \quad X \succeq 0$$

Applications: relaxations of combinatorial optimization problems, control theory.
7. Non-Convex Optimization
7.1 Challenges
- Multiple local minima exist
- Saddle point problem (ubiquitous in high-dimensional spaces)
- NP-hard in the general case
7.2 Non-Convex Optimization in Deep Learning
The loss function of deep neural networks is highly non-convex, yet SGD and its variants remain effective in practice:
- Over-parameterization: Far more parameters than data points; local minima tend to have similar quality
- Implicit regularization of SGD: Noise helps escape sharp minima
- Loss surface structure: In practice, saddle points are more common than poor local minima
```python
import numpy as np

# Adam optimizer implementation
def adam(f_grad, x0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, max_iter=10000):
    x = x0.copy()
    m = np.zeros_like(x)  # First moment estimate
    v = np.zeros_like(x)  # Second moment estimate
    for t in range(1, max_iter + 1):
        grad = f_grad(x)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)  # Bias correction
        v_hat = v / (1 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```
8. Connection Between Optimization Theory and Machine Learning
```mermaid
graph LR
    A[Optimization Theory] --> B[Model Training]
    A --> C[Hyperparameter Tuning]
    A --> D[Constrained Learning]
    B --> E[Loss Function Minimization]
    B --> F[Regularization = Constrained Optimization]
    C --> G[Bayesian Optimization]
    D --> H[Fairness Constraints]
    D --> I[Physics Constraints]
```
| Machine Learning Concept | Optimization Correspondence |
|---|---|
| Training | Unconstrained/Constrained optimization |
| Regularization | Lagrangian duality |
| SVM | Quadratic programming + KKT |
| Neural networks | Non-convex optimization |
| Hyperparameter search | Black-box optimization |
References
- Boyd & Vandenberghe, Convex Optimization (classic textbook, freely available online)
- Nocedal & Wright, Numerical Optimization
- Bertsekas, Nonlinear Programming