Evolutionaries
"If nature can do it, so can we." — The methodological manifesto of the Evolutionaries
The Evolutionaries are one of the five tribes Domingos identifies in The Master Algorithm. Their core belief can be summarized as: intelligence is not designed, it is evolved. The biological brain is the product of 3.5 billion years of natural selection. Rather than hand-specifying priors or assuming an analytic form, why not define learning itself as a genetic search through hypothesis space—fitness as the pressure, variation as the source of innovation, selection as the filter—and let the answers "emerge automatically"?
Together with the Symbolists, Connectionists, Bayesians, and Analogizers, the Evolutionaries form Domingos's "five tribes of machine learning". This directory systematically organizes the algorithmic lineage, theoretical foundations, and modern revival of the Evolutionaries, and forms a cross-reference network with discussions of "heuristic search", "reinforcement learning", and "automated machine learning (AutoML)" elsewhere in these notes.
1. Tribal Profile
| Dimension | The Evolutionary Answer |
|---|---|
| Core metaphor | Natural selection, biological evolution |
| Nature of learning | Parallel stochastic search in hypothesis space |
| Knowledge carrier | Chromosome / genotype—bit string, real vector, tree, or graph |
| Master algorithm | Genetic Algorithm (GA) and its derivatives |
| Mathematical tools | Probability theory, combinatorial optimization, stochastic processes |
| Representative figures | John Holland, David Goldberg, John Koza, Hans-Paul Schwefel, Ingo Rechenberg, Kenneth Stanley |
| Representative algorithms | GA, GP, ES, CMA-ES, NEAT, PSO, ACO, NAS |
| Strengths | Black-box optimization, non-differentiable objectives, combinatorial spaces, multimodality |
| Weaknesses | Low sample efficiency, slow convergence, hyperparameter sensitivity |
| Modern flagship scenarios | Neural Architecture Search (NAS), hyperparameter optimization, robot morphology evolution, symbolic regression |
The most critical feature distinguishing the Evolutionaries from other tribes is: they do not require gradient information. This gives them a structural advantage on non-differentiable, discrete, deceptive, or noisy objectives.
2. The Master Algorithm: The Genetic Search Loop
The skeleton of the Evolutionary "master algorithm" can be expressed as a minimal loop diagram—every concrete method (GA, ES, GP, NEAT, etc.) is a specialization of this skeleton along three axes: encoding, variation operator, and selection pressure.
```mermaid
flowchart TD
    A["Initialize population P_0"] --> B["Evaluate fitness f(x_i)"]
    B --> C{"Terminate?"}
    C -- No --> D["Parent selection"]
    D --> E["Crossover & mutation"]
    E --> F["Generate offspring"]
    F --> G["Survivor selection"]
    G --> H["Elitism"]
    H --> B
    C -- Yes --> I["Return best individual"]
```
Formally, given a hypothesis space \(\mathcal{H}\) and a fitness function \(f: \mathcal{H} \to \mathbb{R}\), an evolutionary algorithm maintains a population \(P_t = \{x_1, \dots, x_N\} \subset \mathcal{H}\) and iterates

\[
P_{t+1} = \mathcal{S}_{\text{env}}\big(P_t \cup \mathcal{V}(\mathcal{S}_{\text{par}}(P_t))\big),
\]

where \(\mathcal{S}_{\text{par}}\) is parent selection, \(\mathcal{V}\) is the variation/recombination operator, and \(\mathcal{S}_{\text{env}}\) is environmental (survivor) selection. The whole loop can be viewed as a generalized EM process: selection is the E step (inferring which hypotheses are promising), and variation is the M step (generating new hypotheses near the promising regions).
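This skeleton can be sketched as a higher-order function whose three parameters are exactly the three axes of specialization—parent selection, variation, and survivor selection. A minimal sketch (function names and the OneMax instantiation are ours, not from the book):

```python
import random
from typing import Callable, List

def evolve(init: Callable[[], list],
           fitness: Callable[[list], float],
           select_parent: Callable[[List[list]], list],
           vary: Callable[[list, list], list],
           select_survivors: Callable[[List[list], List[list]], List[list]],
           pop_size: int = 50,
           generations: int = 100) -> list:
    """Generic EA skeleton: P_{t+1} = S_env(P_t U V(S_par(P_t)))."""
    pop = [init() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [vary(select_parent(pop), select_parent(pop))
                     for _ in range(pop_size)]
        pop = select_survivors(pop, offspring)
    return max(pop, key=fitness)

# Instantiate the three axes for OneMax (maximize the number of 1-bits).
random.seed(42)
L = 20
f = lambda c: sum(c)
init = lambda: [random.randint(0, 1) for _ in range(L)]
tournament = lambda pop: max(random.sample(pop, 2), key=f)

def vary(p1, p2):
    cut = random.randint(1, L - 1)                  # 1-point crossover
    child = p1[:cut] + p2[cut:]
    return [1 - g if random.random() < 1 / L else g for g in child]  # bit-flip

survivors = lambda pop, off: sorted(pop + off, key=f, reverse=True)[:len(pop)]

best = evolve(init, f, tournament, vary, survivors)
```

Swapping in a tree encoding with subtree crossover yields GP; swapping in Gaussian self-adaptive mutation yields an ES—only the three parameters change, never the loop.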
3. Algorithmic Lineage
The diagram below shows the family relationships among Evolutionary algorithms. Note that this is not a strict timeline but rather an idea-inheritance graph—some branches (e.g., swarm intelligence) developed in parallel with the main GA/ES trunk and borrowed mutually.
```mermaid
flowchart LR
    GA[GA<br/>Holland 1975] --> GP[GP<br/>Koza 1992]
    GA --> NSGA[NSGA-II<br/>Deb 2002]
    GA --> MEM[Memetic Algorithms]
    ES[ES<br/>Rechenberg 1973] --> CMA[CMA-ES<br/>Hansen 2001]
    CMA --> NES[NES / xNES]
    NES --> OAIES[OpenAI ES<br/>Salimans 2017]
    GA --> NEAT[NEAT<br/>Stanley 2002]
    NEAT --> HYP[HyperNEAT<br/>Stanley 2009]
    HYP --> ESHYP[ES-HyperNEAT]
    PSO[PSO<br/>Kennedy 1995] --> SI[Swarm Intelligence SI]
    ACO[ACO<br/>Dorigo 1992] --> SI
    SI --> MOO[Multimodal / Multi-objective]
    NEAT --> NE[Neuroevolution NE]
    OAIES --> NE
    NE --> NAS[Neural Architecture Search NAS]
    GA --> AML[AutoML / HPO]
    NAS --> AML
```
Major branches:
- GA family: Originating with Holland, emphasizing bit-string encoding and the schema theorem; Goldberg's 1989 textbook engineered and popularized it.
- GP family: Koza took "programs" as evolvable objects, with structured variation acting on abstract syntax trees.
- ES family: Rechenberg and Schwefel's roots in fluid-mechanics optimization, emphasizing real-valued variables and adaptive Gaussian perturbations; CMA-ES is the modern champion of this branch.
- Swarm intelligence: PSO, ACO, ABC, FA, CS, and others, drawing on the collective behavior of bird flocks, ant colonies, and bee swarms.
- Neuroevolution: Searching neural network weights and topology with evolutionary methods; NEAT/HyperNEAT are the milestones.
- AutoML / NAS: The revival of evolutionary thinking in the deep-learning era, forming—together with RL and Bayesian optimization—the three major methodologies of NAS.
4. Comparison with Gradient Optimization
Evolutionary algorithms are often misread as a "degenerate version of gradient descent". In fact they are structurally different.
| Dimension | Gradient Optimization (SGD/Adam) | Evolutionary Algorithm (EA) |
|---|---|---|
| Gradient required? | Must be differentiable | Not needed at all |
| Signal type | First-/second-order analytic gradient | Scalar fitness |
| Black-box friendliness | Requires analytic (white-box) access to the objective | Naturally black-box-friendly |
| Parallelism | Limited by mini-batch | Extreme (entire population evaluated independently) |
| Communication cost | Gradient tensors | Just seeds or a few scalars |
| Sample efficiency | High | Low (typically \(10^4 \sim 10^6\) evaluations) |
| Local minima | Easy to get stuck | Population diversity provides escape |
| Hyperparameter sensitivity | Learning rate, momentum, etc. | Population size, mutation rate, selection pressure |
| Suitable scenarios | Differentiable losses, large-scale supervised learning | Non-differentiable, discrete, combinatorial, black-box RL |
| Convergence guarantees | Bounded in convex case | Asymptotic only (Markov-chain stationary distribution) |
Rule of thumb: when the objective is differentiable, high-dimensional, and data is abundant, use SGD; when the objective is black-box, rewards are sparse, large-scale parallel CPUs are available, and robustness matters, evolutionary algorithms remain a competitive choice (Salimans et al. 2017 demonstrated this on Atari).
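The "scalar fitness, no gradients" column of the table can be made concrete with the antithetic ES gradient estimator in the spirit of Salimans et al. (2017). The sketch below is a toy version on a quadratic (function and variable names are ours; the real system distributes the \(\varepsilon_i\) draws across workers and shares only seeds):

```python
import random

def es_step(theta, f, sigma=0.1, lr=0.05, n_pairs=50, rng=random):
    """One antithetic evolution-strategies step:
    g ~= (1 / (2 n sigma)) * sum_i [f(theta + sigma*eps_i) - f(theta - sigma*eps_i)] * eps_i
    Only scalar fitness values are used—no gradients of f."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(n_pairs):
        eps = [rng.gauss(0, 1) for _ in range(d)]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        grad = [g + (plus - minus) * e for g, e in zip(grad, eps)]
    scale = lr / (2 * n_pairs * sigma)
    return [t + scale * g for t, g in zip(theta, grad)]  # ascend fitness

# Toy objective: maximize f(x) = -||x||^2, optimum at the origin.
random.seed(0)
f = lambda x: -sum(v * v for v in x)
theta = [1.0] * 5
for _ in range(300):
    theta = es_step(theta, f)
```

Because each worker only needs to report a scalar fitness (and the noise can be reconstructed from a shared seed), the per-step communication cost is a few bytes per worker—the "just seeds or a few scalars" row of the table.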
5. The Modern Revival: From Niche to Mainstream
Evolutionary algorithms went through a relatively quiet two decades after 1990, as SVMs, boosting, and deep learning took the stage in succession. But after 2015, they returned in force in three directions.
The first wave came from Neural Architecture Search (NAS). In the AmoebaNet line of work (2017–2019), Real et al. used large-scale evolutionary search to discover network architectures on ImageNet that surpassed hand-designed ones; the follow-up Regularized Evolution (Real 2019) replaced traditional elitism with "age regularization" and was shown to be more stable and efficient than the contemporaneous RL-based NAS-RL (Zoph & Le 2017). The second wave was OpenAI ES (Salimans et al. 2017): treating evolution strategies as a "scalable alternative to reinforcement learning", they ran Atari and MuJoCo on 1,440 CPUs and achieved scores comparable to A3C with shorter training time and near-zero parallel-communication overhead. The third wave is the industrialization of AutoML: frameworks such as Optuna, NNI, AutoGluon, DEAP, and pymoo packaged evolution together with Bayesian optimization, Hyperband, and other methods into out-of-the-box services, enabling non-experts to use EAs for hyperparameter tuning.
The common driver behind all three waves was hardware—cheap CPU clusters turned "evaluating thousands of individuals" from a luxury into a routine operation. The Evolutionaries thus shed the label of "theoretical toy" and once again became part of the engineering practice of machine learning.
6. Subpage Navigation
This directory drills down along three themes:
- Genetic Algorithms and Genetic Programming — The full GA pipeline, encoding schemes, selection/crossover/mutation operators, the Schema theorem, GP tree evolution, NSGA-II for multi-objective optimization, and classical case studies
- Evolution Strategies and Neuroevolution — ES origins, the full CMA-ES algorithm, OpenAI ES, NEAT/HyperNEAT, and a neuroevolution-vs-RL decision tree
- Swarm Intelligence and AutoML — PSO, ACO, ABC/FA/CS, an AutoML survey, comparison of HPO methods, the three NAS strategies, and industrial frameworks
7. Division of Labor with Existing Pages
This directory overlaps in content with the §evolutionary section of ../../02_Symbolic_AI/1_search.md, but differs in both perspective and depth:
| Dimension | 02_Symbolic_AI/1_search.md | This directory |
|---|---|---|
| Context | A branch of classical search algorithms | The overall philosophy and algorithmic ecosystem of the Evolutionaries |
| GA introduction | Shallow (pipeline + a single example) | Full pipeline, operator details, derivation of the Schema theorem |
| Algorithm coverage | Only the GA mainline | GA / GP / ES / CMA-ES / NEAT / PSO / ACO / NAS |
| Comparison axis | Compared with A* / simulated annealing | Compared with gradient optimization, Bayesian optimization, RL |
| Modern applications | Briefly mentioned | Detailed narrative of OpenAI ES, AmoebaNet, DARTS |
Reading suggestion: if you only want to know "what is a GA", 1_search.md is enough; if you intend to apply evolutionary methods in engineering, or want to understand their place in the deep-learning era, start with this directory.
References
- Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. — The original book that defines the five tribes; Chapter 5 is devoted to the Evolutionaries.
- Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press. — The foundational work on GA and the schema theorem.
- Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. — The engineering textbook for GA.
- Eiben, A. E., & Smith, J. E. (2015). Introduction to Evolutionary Computing (2nd ed.). Springer. — A modern textbook on evolutionary computation, covering the full GA/ES/GP/SI spectrum.
- Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
- Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772.
- Salimans, T., Ho, J., Chen, X., et al. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864.
- Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127.
- Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2019). Regularized evolution for image classifier architecture search. AAAI.
Genetic Algorithms and Genetic Programming
This page goes deep into the algorithmic details, theoretical foundations, and engineering practice of Genetic Algorithms (GA) and Genetic Programming (GP). Together they form the "trunk" of the Evolutionaries—GA provides a general bit-string/real-vector chromosomal evolution framework, and GP generalizes the evolved object from vectors to program structures themselves.
Before reading this page, we recommend looking at §1 of the directory page above (the tribe's entry point) for the overall positioning of the Evolutionaries; this page goes deeper and is more technical than the §evolutionary section in ../../02_Symbolic_AI/1_search.md.
1. The Complete GA Pipeline
GA was formally proposed by John Holland in Adaptation in Natural and Artificial Systems (1975). Its core idea: encode candidate solutions as "chromosomes" and search in parallel through solution space using a process that mimics natural selection.
1.1 Formal Definition
Given:

- a hypothesis space \(\mathcal{H}\) (the solution space),
- an encoding function \(\phi: \mathcal{H} \to \mathcal{G}\) that maps solutions into the genotype space \(\mathcal{G}\),
- a fitness function \(f: \mathcal{H} \to \mathbb{R}\) (larger is better),
GA maintains a population \(P_t = \{g_1, \dots, g_N\} \subset \mathcal{G}\), with size \(N\) typically in \(50 \sim 500\). Each generation executes:
```mermaid
flowchart TD
    A["Initialize population P_0<br/>random or heuristic"] --> B["Decode & evaluate<br/>f(x_i) = f(phi^-1(g_i))"]
    B --> C{"Terminate?<br/>max gen / fitness target / no improvement"}
    C -- No --> D["Parent selection<br/>roulette / tournament / rank"]
    D --> E["Crossover<br/>1-point / 2-point / uniform / SBX"]
    E --> F["Mutation<br/>bit-flip / Gaussian / polynomial"]
    F --> G["Evaluate offspring"]
    G --> H["Survivor selection<br/>generational / steady-state"]
    H --> I["Elitism<br/>top-k pass directly to next generation"]
    I --> B
    C -- Yes --> J["Return best-so-far"]
```
1.2 Relation to General Search
A GA can be viewed as a stochastic beam search with a population, but with three essential differences:
- Multi-solution parallelism: \(N\) candidates are maintained in parallel, not a single frontier.
- Recombination: the crossover operator mixes information from two parents—a feature absent from other heuristic searches like simulated annealing or hill climbing.
- Implicit parallelism: by the schema theorem, each generation effectively evaluates \(\mathcal{O}(N^3)\) schemata simultaneously (see §7).
2. Encoding Schemes
Encoding (representation) is the most critical decision in GA design. The wrong encoding renders all operators ineffective.
| Encoding | Chromosome form | Typical use case | Matching crossover/mutation |
|---|---|---|---|
| Binary | \(\{0, 1\}^l\) | Subset selection, Boolean rules, combinatorial optimization | bit-flip mutation, 1-point/uniform crossover |
| Real-valued | \(\mathbb{R}^d\) | Continuous function optimization, neural network weights | Gaussian mutation, BLX-α / SBX |
| Permutation | A permutation in \(S_n\) | TSP, scheduling, assembly sequences | order crossover (OX), PMX, swap mutation |
| Tree | Abstract syntax tree | GP program evolution, symbolic regression | subtree crossover, point mutation |
| Graph | Nodes + edges | Neural network topology (NEAT), circuits | NEAT-style add-node / add-link |
| Gray code | \(\{0,1\}^l\) | Numerical optimization, avoiding the Hamming cliff | Same as binary |
Hamming cliff: under standard binary encoding, adjacent integers (e.g., 7 = 0111 and 8 = 1000) can have a large Hamming distance, and a single bit-flip cannot bridge them. Gray code guarantees a Hamming distance of exactly 1 between adjacent integers and is a common trick for numerical optimization.
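The cliff and its Gray-code fix are easy to verify directly. A minimal sketch (helper names are ours):

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")

cliff = hamming(7, 8)                         # 0111 vs 1000: all four bits differ
gray_step = hamming(to_gray(7), to_gray(8))   # adjacent Gray codes differ by one bit
```

Under Gray encoding every ±1 step in the integer domain is reachable by a single bit-flip mutation, which restores the locality principle below.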
Encoding design principles:
- Completeness: every legal solution has at least one encoding.
- Non-redundancy: encoding and solution should be roughly one-to-one (avoiding many equivalent genotypes).
- Locality: small changes in genotype should correspond to small changes in phenotype—otherwise mutation degenerates into shuffling.
3. Selection Operators
The selection operator determines the selection pressure—the reproductive advantage of high-fitness individuals over low-fitness ones. Pressure that is too low (near-random search) yields slow convergence; pressure that is too high (only the best gets copied) leads to premature convergence.
3.1 Roulette Wheel
Each individual is selected with probability proportional to its fitness:

\[
p_i = \frac{f_i}{\sum_{j=1}^N f_j}
\]

Issues: collapses when \(f\) takes negative values or has wildly different scales; pressure is hard to control under fitness scale drift.
3.2 Tournament Selection
Randomly draw \(k\) individuals and pick the one with the highest fitness as parent. The tournament size \(k\) directly controls selection pressure:
- \(k=1\): pure random
- \(k=2\): mild pressure, the most common choice
- \(k=N\): always picks the best (degenerates to truncation selection)
Takeover time (a standard measure of selection pressure): in the limit of no mutation and no crossover, \(k\)-tournament selection drives the entire population to the best individual in \(\mathcal{O}(\log N / \log k)\) generations.
3.3 Rank-based
Probabilities are assigned by fitness rank, independent of the actual \(f\) values. Linear ranking:

\[
p_i = \frac{1}{N}\Big[\eta^- + (\eta^+ - \eta^-)\frac{i-1}{N-1}\Big]
\]

where \(i\) is the rank (1 worst, \(N\) best), \(\eta^+ \in [1, 2]\) controls pressure, and \(\eta^+ + \eta^- = 2\).
3.4 Boltzmann Selection
Mimics simulated annealing, with temperature \(T\) decreasing over generations:

\[
p_i \propto \exp(f_i / T)
\]

Large \(T\) approaches uniform selection; \(T \to 0\) degenerates to greedy. Combined with simulated annealing this gives the GASA family.
| Selection operator | Pressure tunability | Computational cost | Sensitive to fitness scale |
|---|---|---|---|
| Roulette | Hard to tune | \(\mathcal{O}(N)\) | Highly sensitive |
| Tournament | Easy to tune (\(k\)) | \(\mathcal{O}(k)\) | Insensitive |
| Rank | Easy to tune (\(\eta^+\)) | \(\mathcal{O}(N \log N)\) for sort | Completely insensitive |
| Boltzmann | Schedulable (\(T\)) | \(\mathcal{O}(N)\) | Sensitive (but \(T\) helps) |
Practical recommendation: default to tournament (\(k=2\) or \(k=3\))—pressure is controllable, robust to fitness scale, and easy to implement.
4. Crossover Operators
Crossover (recombination) is the core innovation that distinguishes GA from other random searches: by mixing two parents' genes, "good components (building blocks)" can combine to produce a better child.
4.1 1-point Crossover
Randomly choose a cut point \(c \in \{1, \dots, l-1\}\); child 1 = \(p_1[1:c] + p_2[c+1:l]\), and child 2 the reverse. Issue: genes at the head and tail of the chromosome are harder to separate (positional bias); genes far apart are unlikely to be inherited together.
4.2 2-point Crossover
Two cut points are chosen and the middle segment is exchanged. This mitigates positional bias and is one of the engineering defaults.
4.3 Uniform Crossover
Each position is independently inherited from \(p_1\) or \(p_2\) with probability \(p\) (often \(p=0.5\)).
- Pro: completely eliminates positional bias.
- Con: highly destructive to long building blocks—only suitable for problems with strong gene independence.
4.4 Simulated Binary Crossover (SBX)
The de facto standard for real-valued GAs (Deb & Agrawal 1995). Given two parents \(x_1, x_2\), the offspring are:

\[
c_{1,2} = 0.5\,\big[(1 \mp \beta_q)\, x_1 + (1 \pm \beta_q)\, x_2\big]
\]

where \(\beta_q\) is determined by the distribution parameter \(\eta_c\) (typically \(\eta_c \in [2, 5]\)):

\[
\beta_q = \begin{cases}(2u)^{1/(\eta_c+1)} & u \le 0.5 \\ \big(1/(2(1-u))\big)^{1/(\eta_c+1)} & u > 0.5\end{cases}, \quad u \sim \mathcal{U}[0,1]
\]

Key property of SBX: the offspring distribution is centered on the parents, and larger \(\eta_c\) pulls offspring closer to the parents—mimicking the "locality" of binary 1-point crossover in the real-valued domain.
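The formulas above translate almost line for line into code. A single-gene sketch (function name is ours; real implementations apply it per dimension with a per-gene crossover probability):

```python
import random

def sbx_pair(x1: float, x2: float, eta_c: float = 2.0, rng=random):
    """Simulated binary crossover (Deb & Agrawal 1995) for one real gene."""
    u = rng.random()
    if u <= 0.5:
        beta = (2 * u) ** (1 / (eta_c + 1))
    else:
        beta = (1 / (2 * (1 - u))) ** (1 / (eta_c + 1))
    c1 = 0.5 * ((1 - beta) * x1 + (1 + beta) * x2)
    c2 = 0.5 * ((1 + beta) * x1 + (1 - beta) * x2)
    return c1, c2

random.seed(0)
c1, c2 = sbx_pair(1.0, 3.0)
```

Note the mean-preservation property: whatever \(\beta_q\) is drawn, \(c_1 + c_2 = x_1 + x_2\), so crossover explores around the parents without drifting their center of mass.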
5. Mutation Operators
Mutation provides exploration—pushing the population into regions not yet covered, preventing premature convergence.
5.1 Bit-flip (binary)
Each bit is flipped independently with probability \(p_m\) (typically \(1/l\), so the expected number of flips per chromosome is 1).
5.2 Gaussian Mutation (real-valued)
\(x_i \leftarrow x_i + \mathcal{N}(0, \sigma_i^2)\). \(\sigma_i\) may be globally shared, per-dimension independent, or self-adaptive (this transitions into ES; see Evolution Strategies and Neuroevolution).
5.3 Polynomial Mutation
The standard mutation for real-valued GAs proposed by Deb. Given bounds \([x^L, x^U]\) and a distribution parameter \(\eta_m\) (typically \(20\)):

\[
x_i' = x_i + (x^U - x^L)\,\delta, \quad \delta = \begin{cases}(2u)^{1/(\eta_m+1)} - 1 & u < 0.5 \\ 1 - \big(2(1-u)\big)^{1/(\eta_m+1)} & u \ge 0.5\end{cases}
\]

Larger \(\eta_m\) makes mutation more conservative.
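A direct sketch of the formula (the final clipping to \([x^L, x^U]\) is a common practical addition, not part of the formula itself):

```python
import random

def polynomial_mutation(x: float, low: float, up: float,
                        eta_m: float = 20.0, rng=random) -> float:
    """Deb's polynomial mutation for one bounded real gene."""
    u = rng.random()
    if u < 0.5:
        delta = (2 * u) ** (1 / (eta_m + 1)) - 1          # delta in (-1, 0)
    else:
        delta = 1 - (2 * (1 - u)) ** (1 / (eta_m + 1))    # delta in [0, 1)
    # Clip to the bounds—a common engineering addition.
    return min(max(x + (up - low) * delta, low), up)

random.seed(0)
samples = [polynomial_mutation(0.5, 0.0, 1.0) for _ in range(1000)]
```

With \(\eta_m = 20\) most samples land very close to the original value—the "conservative" behavior the text describes.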
5.4 Self-adaptive Mutation
Record each individual's mutation rate \(p_m^{(i)}\) as an extra gene that itself evolves—successful \(p_m\) values get selected and preserved. This is the bridge connecting GA to ES.
6. Elitism and Diversity Preservation
6.1 Elitism
Each generation directly copies the top-\(k\) individuals to the next, ensuring best-so-far is monotonically non-decreasing. \(k\) is usually \(1\% \sim 10\%\) of the population. Too large \(k\) destroys diversity; too small fails to protect good solutions.
6.2 The Diversity Crisis
Without any safeguard, GAs tend toward premature convergence—the population rapidly collapses around the first local optimum found. Two mainstream remedies:
Fitness sharing (Goldberg & Richardson 1987): nearby individuals "share" fitness with each other, simulating ecological niche competition.

\[
f'_i = \frac{f_i}{\sum_{j=1}^N \mathrm{sh}(d_{ij})}, \quad \mathrm{sh}(d) = \max\Big(0,\ 1 - (d/\sigma_{\text{share}})^\alpha\Big)
\]
Crowding (De Jong 1975): offspring compete for survival only with the parent most similar to them, thereby preserving different niches.
Island model: split the population into \(K\) isolated subpopulations with occasional migration. Naturally suited to distributed implementations.
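The fitness-sharing formula above can be sketched in a few lines (function name is ours; `dist` is any user-supplied genotype or phenotype distance):

```python
def shared_fitness(fit, dist, sigma_share=1.0, alpha=1.0):
    """Fitness sharing (Goldberg & Richardson 1987): divide each raw fitness
    by its niche count  sum_j sh(d_ij),  sh(d) = max(0, 1 - (d/sigma)^alpha)."""
    n = len(fit)
    shared = []
    for i in range(n):
        niche = sum(max(0.0, 1.0 - (dist(i, j) / sigma_share) ** alpha)
                    for j in range(n))  # includes sh(0) = 1 for j == i
        shared.append(fit[i] / niche)
    return shared

# Three individuals crowd one peak, one sits alone: the loner keeps its fitness.
pos = [0.0, 0.0, 0.0, 5.0]
out = shared_fitness([1.0, 1.0, 1.0, 1.0], lambda i, j: abs(pos[i] - pos[j]))
```

The crowded individuals each keep only a third of their raw fitness while the isolated one keeps all of it—exactly the niche-competition pressure that keeps multiple peaks populated.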
7. The Schema Theorem (Holland 1975)
The schema theorem is GA's only classical theoretical guarantee. It formalizes GA's "implicit parallelism".
7.1 Definition of a Schema
A schema (pattern) is a string over the alphabet \(\{0, 1, *\}^l\), where \(*\) is a wildcard. For example, \(H = 1\!*\!*\!0\!*\) denotes the set of all bit strings whose first bit is 1 and whose fourth bit is 0 (containing \(2^3 = 8\) strings when \(l=5\)).
Two key attributes:
- Order \(o(H)\): the number of fixed (non-\(*\)) positions in \(H\). E.g., \(o(1\!*\!*\!0\!*) = 2\).
- Defining length \(\delta(H)\): the distance between the first and the last fixed positions. E.g., \(\delta(1\!*\!*\!0\!*) = 4 - 1 = 3\).
7.2 Statement of the Theorem
Let \(m(H, t)\) be the number of individuals in generation \(t\) that match schema \(H\), \(f(H)\) their average fitness, \(\bar f\) the population average fitness, \(p_c\) and \(p_m\) the crossover and mutation probabilities, and \(l\) the chromosome length. Under roulette-wheel selection + 1-point crossover + bit-flip mutation:

\[
\mathbb{E}\,[m(H, t+1)] \;\ge\; m(H, t)\cdot\frac{f(H)}{\bar f}\cdot\Big[1 - p_c \frac{\delta(H)}{l-1}\Big]\cdot\big[1 - o(H)\, p_m\big]
\]
7.3 Term-by-term Interpretation
- \(\frac{f(H)}{\bar f}\): the selection-pressure term. If schema \(H\) is above average (\(f(H) > \bar f\)), it is amplified by selection.
- \(1 - p_c \frac{\delta(H)}{l-1}\): the crossover-disruption term. The wider the schema span \(\delta(H)\), the more easily it is cut and disrupted by single-point crossover.
- \(1 - o(H) p_m\): the mutation-disruption term. The higher the schema order \(o(H)\), the more fixed positions are exposed to mutation.
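Plugging illustrative numbers (ours, not from the text) into the bound shows how a wide defining length can overwhelm a fitness advantage:

```python
# Illustrative numbers for the schema H = 1**0* from §7.1:
l = 5            # chromosome length
delta_H = 3      # defining length of H
o_H = 2          # order of H
m_t = 20         # copies of H in the current generation
ratio = 1.2      # f(H) / f-bar: H is 20% above average
p_c, p_m = 0.9, 0.01

bound = m_t * ratio * (1 - p_c * delta_H / (l - 1)) * (1 - o_H * p_m)
# Selection alone would grow H to 24 expected copies, but the wide defining
# length (3 out of 4 possible cut points hit H) lets 1-point crossover
# disrupt it often: the guaranteed lower bound drops to about 7.6 copies.
```

This is the linkage problem of §7.5 in miniature: the same schema with its fixed bits adjacent (\(\delta(H) = 1\)) would keep a bound of about 18 copies.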
7.4 Corollary: The Building-Block Hypothesis
Short (small \(\delta\)), low-order (small \(o\)), above-average-fitness schemata—called building blocks—grow at an exponential rate. GA constructs increasingly larger good solutions by recombining building blocks. This is the explanation Holland and Goldberg gave for "why GA works".
7.5 The Linkage Problem
The building-block hypothesis carries an implicit assumption: related genes must be placed close together on the chromosome. Otherwise the crossover probability \(p_c \delta(H)/(l-1)\) will repeatedly disrupt them. This is the linkage problem.
Solutions:
- Hand-engineered encoding: domain knowledge places related genes close together.
- mGA / fmGA (messy GA): let the GA learn the linkage structure on its own.
- EDA (Estimation of Distribution Algorithms): skip explicit crossover; directly learn a probabilistic model of the population (BOA, UMDA, etc.) and sample new generations from the distribution.
The schema theorem gives a necessary, not sufficient condition. It explains "under what conditions GA can work" but does not guarantee that GA converges to the global optimum. This differs in nature from the convex-optimization convergence proofs for SGD.
8. Genetic Programming (GP)
Genetic Programming (GP) was proposed by John Koza in 1992. It generalizes the evolved object from bit strings/real vectors to computer programs—specifically, the abstract syntax trees (ASTs) that represent programs.
8.1 Function Set and Terminal Set
A GP program consists of two types of nodes:
| Node type | Role | Examples |
|---|---|---|
| Function set | Internal nodes | \(\{+, -, \times, /, \sin, \cos, \mathrm{if}, \mathrm{and}\}\) |
| Terminal set | Leaf nodes | \(\{x_1, x_2, \dots, c_1, c_2, \dots\}\) |
Closure property: every function in the function set must be defined for all outputs of terminals and other functions—e.g., \(/\) must handle division by zero (commonly via protected division: \(x/y = 1\) if \(y=0\)).
Sufficiency property: the combinatorial space spanned by the function set and terminal set must contain the target program.
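The protected-division convention mentioned above is a one-liner (function name is ours):

```python
def pdiv(x: float, y: float) -> float:
    """Protected division: total over all inputs, returning 1.0 when y == 0,
    so that '/' satisfies the closure property."""
    return x / y if y != 0 else 1.0
```

Every function in the set gets the same treatment where needed (e.g., a protected `log` on non-positive inputs), so that any randomly assembled tree evaluates without runtime errors.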
8.2 Tree Crossover and Mutation
```mermaid
flowchart LR
    subgraph P1["Parent 1: x*x + sin(x)"]
        A1["+"] --> B1["*"]
        A1 --> C1["sin"]
        B1 --> D1["x"]
        B1 --> E1["x"]
        C1 --> F1["x"]
    end
    subgraph P2["Parent 2: (x+1) / 2"]
        A2["/"] --> B2["+"]
        A2 --> C2["2"]
        B2 --> D2["x"]
        B2 --> E2["1"]
    end
    subgraph C["Offspring (swap sin subtree with x+1 subtree)"]
        AC["+"] --> BC["*"]
        AC --> CC["+"]
        BC --> DC["x"]
        BC --> EC["x"]
        CC --> FC["x"]
        CC --> GC["1"]
    end
```
Subtree crossover: randomly pick a subtree from each parent and swap them. This is the standard recombination operator in GP.
Point mutation: replace a node with another function of the same arity; or replace a leaf with another terminal of the same type.
Subtree mutation: delete a subtree and grow a new random subtree in its place.
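Subtree crossover is straightforward once trees are concrete data. A minimal sketch using nested lists as ASTs (representation and helper names are ours; real GP systems also enforce depth limits here):

```python
import random

# GP trees as nested lists: [op, child, ...]; leaves are plain strings.
def subtrees(tree, path=()):
    """Yield (path, node) for every node, depth-first."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    out = list(tree)
    out[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return out

def subtree_crossover(t1, t2, rng=random):
    """Standard GP recombination: swap one random subtree of each parent."""
    path1, sub1 = rng.choice(list(subtrees(t1)))
    path2, sub2 = rng.choice(list(subtrees(t2)))
    return replace_at(t1, path1, sub2), replace_at(t2, path2, sub1)

random.seed(3)
parent1 = ['+', ['*', 'x', 'x'], ['sin', 'x']]   # x*x + sin(x)
parent2 = ['/', ['+', 'x', '1'], '2']            # (x+1) / 2
child1, child2 = subtree_crossover(parent1, parent2)
```

Because the swap moves whole subtrees, every offspring is again a syntactically valid program—the closure property does the rest at evaluation time.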
8.3 Bloat and Its Control
Bloat is the most thorny phenomenon in GP: tree size grows unboundedly across generations while fitness fails to improve correspondingly. One cause is "neutral mutation"—code that does not affect output (introns) shields against crossover and is therefore selectively retained.
Control techniques:
- Depth limit: hard cap on tree depth. Crude but effective.
- Parsimony pressure: add a complexity penalty to fitness, \(f' = f - \alpha \cdot \text{size}(\text{tree})\).
- Tarpeian method: kill individuals exceeding a threshold with some probability.
- Operator equalization: maintain fixed quotas across size bins.
8.4 Classical Applications
- Symbolic regression: given data \(\{(x_i, y_i)\}\), evolve a function \(f\) that minimizes \(\sum (f(x_i) - y_i)^2\). Eureqa / PySR are modern representatives.
- Classification rule evolution: evolve decision lists or if-then rules.
- Circuit design: Koza's group used GP to reinvent several patented analog circuits (low-pass filter, amplifier topologies)—termed "human-competitive results".
- Game strategies: evolve heuristics for board games / poker strategies.
9. Multi-objective Evolutionary Algorithms: NSGA-II
Real-world problems often have conflicting objectives (accuracy vs. model size, throughput vs. cost). Multi-objective optimization (MOO) does not seek a single best solution but the Pareto front.
Pareto dominance: solution \(a\) dominates \(b\) iff \(a\) is no worse than \(b\) on all objectives and strictly better on at least one. The Pareto front is the set of all solutions not dominated by any other solution.
NSGA-II (Deb et al. 2002) is one of the most highly cited algorithms in MOO. Its core mechanisms:
- Fast non-dominated sorting: stratify the population by dominance (\(F_1\) is the current Pareto front, \(F_2\) is the front after removing \(F_1\), and so on). Complexity \(\mathcal{O}(MN^2)\).
- Crowding distance: within each layer, sort by neighborhood sparsity to encourage uniform spread along the front.
- Selection: compare layer first (lower is better); within the same layer, compare by crowding distance (larger is better).
- Survivor selection: combine parents and offspring, then take the top \(N\).
NSGA-II is widely used in NAS, hyperparameter optimization, and machine-learning model trade-offs (e.g., NSGA-Net directly searches the Pareto front of model families).
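Pareto dominance and front stratification can be sketched directly from the definitions above. This is the naive version (re-scanning survivors each pass), not NSGA-II's bookkeeping-optimized \(\mathcal{O}(MN^2)\) fast sort, and it omits crowding distance:

```python
def dominates(a, b):
    """Pareto dominance (minimization): a is no worse on every objective
    and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Stratify points into fronts F1, F2, ... (as sorted index lists)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

# Two objectives, both minimized (e.g. error and model size).
pts = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
fronts = non_dominated_sort(pts)
```

Here the three mutually non-dominated points form \(F_1\); \((3, 3)\) is dominated by \((2, 2)\) and lands in \(F_2\); \((5, 5)\) is dominated by everything and lands in \(F_3\).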
10. Examples
10.1 0/1 Knapsack
Item set \(\{(w_i, v_i)\}_{i=1}^n\), capacity \(W\).
- Encoding: binary string \(x \in \{0,1\}^n\), with \(x_i=1\) meaning item \(i\) is packed.
- Fitness: \(f(x) = \sum v_i x_i\) if \(\sum w_i x_i \le W\), otherwise heavily penalize \(f = -\infty\) or \(f = -\sum w_i x_i\).
- Operators: 1-point crossover + bit-flip.
- Typical performance: 100 items, 200 population, 500 generations can approach 95%+ of the optimum.
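The fitness rule above, including the milder weight-based penalty for infeasible packings, is a few lines (item data below is illustrative):

```python
def knapsack_fitness(x, weights, values, capacity):
    """Penalty-style 0/1 knapsack fitness from the text: total value when
    feasible, negative total weight when capacity is exceeded."""
    w = sum(wi for wi, xi in zip(weights, x) if xi)
    v = sum(vi for vi, xi in zip(values, x) if xi)
    return v if w <= capacity else -w

weights, values, capacity = [2, 3, 4], [3, 4, 5], 5
feasible = knapsack_fitness([1, 1, 0], weights, values, capacity)    # weight 5, value 7
infeasible = knapsack_fitness([1, 1, 1], weights, values, capacity)  # weight 9, penalized
```

The graded penalty \(-\sum w_i x_i\) is usually preferable to a flat \(-\infty\): it gives selection a gradient back toward feasibility instead of discarding near-feasible genomes outright.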
10.2 TSP
Visit-all-cities problem on \(n\) cities.
- Encoding: permutation \(\pi \in S_n\).
- Fitness: \(f(\pi) = -\sum d(\pi_i, \pi_{i+1})\).
- Operators: order crossover (OX) or PMX; combine with 2-opt local search (memetic algorithm).
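Permutation encodings need crossover operators that always yield valid tours. A sketch of order crossover in a common variant (copy a slice from one parent, fill the rest in the other parent's order; function name is ours):

```python
import random

def order_crossover(p1, p2, rng=random):
    """Order crossover (OX): inherit a contiguous slice from p1,
    then fill the remaining positions with p2's genes in their p2 order."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n + 1), 2))  # two distinct cut points
    child = [None] * n
    child[a:b] = p1[a:b]
    taken = set(p1[a:b])
    fill = iter(g for g in p2 if g not in taken)
    for i in list(range(b, n)) + list(range(a)):  # wrap around from cut b
        child[i] = next(fill)
    return child

random.seed(7)
tour1 = [0, 1, 2, 3, 4, 5, 6, 7]
tour2 = [7, 6, 5, 4, 3, 2, 1, 0]
child = order_crossover(tour1, tour2)
```

By construction the child visits every city exactly once—no repair step is needed, unlike naive 1-point crossover on permutations.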
10.3 Symbolic Regression
Data \(\{(x_i, y_i)\}\), the goal is to find \(f\).
- GP tree encoding, function set \(\{+, -, \times, /, \sin, \exp, \log\}\) (protected).
- Fitness: \(-\mathrm{MSE}(f, \text{data})\), optionally with a parsimony penalty.
- Modern practice: PySR (Cranmer 2023) and Eureqa combine GP with symbolic simplification, successfully rediscovering known formulas on physics-equation benchmarks (Feynman dataset).
11. Python Example (DEAP-style)
Below is a minimal runnable GA main loop: a binary GA on OneMax, maximizing \(f(x) = \sum_{i=1}^{20} x_i\) (the number of 1-bits).
```python
import random
from typing import List, Tuple

POP_SIZE = 100
CHROM_LEN = 20
P_CROSS = 0.9
P_MUT = 1.0 / CHROM_LEN
GENERATIONS = 50
TOURN_K = 3

Chrom = List[int]

def fitness(c: Chrom) -> int:
    return sum(c)  # OneMax: the more 1s, the better

def tournament(pop: List[Chrom]) -> Chrom:
    contestants = random.sample(pop, TOURN_K)
    return max(contestants, key=fitness)

def one_point_crossover(p1: Chrom, p2: Chrom) -> Tuple[Chrom, Chrom]:
    if random.random() >= P_CROSS:
        return p1[:], p2[:]
    cut = random.randint(1, CHROM_LEN - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def bitflip_mutation(c: Chrom) -> Chrom:
    return [1 - g if random.random() < P_MUT else g for g in c]

def init_population() -> List[Chrom]:
    return [[random.randint(0, 1) for _ in range(CHROM_LEN)]
            for _ in range(POP_SIZE)]

def run_ga() -> Chrom:
    pop = init_population()
    best = max(pop, key=fitness)
    for gen in range(GENERATIONS):
        new_pop = [best]  # 1-elitism
        while len(new_pop) < POP_SIZE:
            p1, p2 = tournament(pop), tournament(pop)
            c1, c2 = one_point_crossover(p1, p2)
            new_pop.append(bitflip_mutation(c1))
            if len(new_pop) < POP_SIZE:
                new_pop.append(bitflip_mutation(c2))
        pop = new_pop
        cur_best = max(pop, key=fitness)
        if fitness(cur_best) > fitness(best):
            best = cur_best
        print(f"Gen {gen:3d}  best={fitness(best)}  avg={sum(map(fitness, pop)) / POP_SIZE:.2f}")
    return best

if __name__ == "__main__":
    run_ga()
```
In production it is preferable to use DEAP (creator + tools + algorithms.eaSimple), pymoo (multi-objective first), or PyGAD (lightweight, integrates with sklearn / Keras).
12. Hyperparameter Tuning and Pitfalls
| Hyperparameter | Typical range | Tuning intuition |
|---|---|---|
| Population size \(N\) | \(50 \sim 500\) | High dimensionality, multimodality → larger; expensive evaluations → smaller |
| Crossover probability \(p_c\) | \(0.6 \sim 0.95\) | Usually high, letting recombination dominate |
| Mutation probability \(p_m\) | \(1/l \sim 5/l\) | Too low: no exploration; too high: random search |
| Tournament size \(k\) | \(2 \sim 5\) | Larger → stronger pressure → premature convergence |
| Elitism ratio | \(1\% \sim 10\%\) | Too large: lose diversity |
| Maximum generations | \(10^2 \sim 10^4\) | Add early stopping |
Common pitfalls:
- Wildly different fitness scales → replace roulette with rank/tournament.
- Premature convergence → add fitness sharing / increase mutation rate / use island model.
- Slow evaluation → use a surrogate model or cache duplicates.
- GP bloat → add depth limit + parsimony pressure.
- Discrete constraints → use repair operators rather than penalty functions.
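The first pitfall can be handled by making selection depend only on rank order, never on raw fitness magnitude. A minimal sketch in the style of the GA listing above (the `rank_select` helper and the demo fitness are illustrative, not from any library):

```python
import random

def rank_select(population, fitness, rng=random):
    """Linear rank-based selection: choice probability depends only on rank
    order, so it is invariant to any monotone rescaling of fitness."""
    ranked = sorted(population, key=fitness)     # worst ... best
    n = len(ranked)
    weights = [i + 1 for i in range(n)]          # rank 1 .. n
    return rng.choices(ranked, weights=weights, k=1)[0]

# Demo: fitness values spanning three orders of magnitude
random.seed(0)
pop = [0, 1, 2, 3]
fit = lambda x: 10.0 ** x                        # 1, 10, 100, 1000
picks = [rank_select(pop, fit) for _ in range(2000)]
# Roulette selection would pick individual 3 in ~99.9% of draws;
# rank selection picks it with probability 4/10.
share = picks.count(3) / len(picks)
print(round(share, 2))
```

The same idea underlies tournament selection: both remove the dependence on fitness scale entirely.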
13. Cross-references
- Parent page: the tribe entry page (§1) — Overall positioning and algorithmic lineage of the Evolutionaries
- Same directory: Evolution Strategies and Neuroevolution — ES/CMA-ES/NEAT generalize GA to continuous and topological settings
- Same directory: Swarm Intelligence and AutoML — Comparison of PSO/ACO/NAS with GA
- Existing page: ../../02_Symbolic_AI/1_search.md §evolutionary — GA seen from the heuristic-search perspective
- Related page: BOHB / TPE from the Bayesian camp are strong competitors of GA in HPO; see the Bayesians page
References
- Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press. — Original source of GA and the schema theorem.
- Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. — The engineering textbook for GA.
- Goldberg, D. E., & Richardson, J. (1987). Genetic algorithms with sharing for multimodal function optimization. ICGA.
- De Jong, K. A. (1975). An Analysis of the Behavior of a Class of Genetic Adaptive Systems. PhD thesis, University of Michigan.
- Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
- Koza, J. R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press.
- Deb, K., & Agrawal, R. B. (1995). Simulated binary crossover for continuous search space. Complex Systems, 9, 115–148.
- Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.
- Eiben, A. E., & Smith, J. E. (2015). Introduction to Evolutionary Computing (2nd ed.). Springer.
- Cranmer, M. (2023). Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv:2305.01582.
- Fortin, F.-A., et al. (2012). DEAP: Evolutionary algorithms made easy. JMLR, 13, 2171–2175.
Evolution Strategies and Neuroevolution
This page goes deep into representative algorithms developed by the Evolutionaries along two directions: continuous optimization and neural network evolution. These include Evolution Strategies (ES), Covariance Matrix Adaptation ES (CMA-ES), OpenAI ES, NEAT, and HyperNEAT. Their common feature is treating "mutation" as the primary evolutionary operator (in contrast to GA's "crossover-driven" view) and devoting substantial work to adapting the mutation distribution.
Before reading this page, we recommend looking at the tribe entry page (§1) and Genetic Algorithms and Genetic Programming to understand the basic GA terminology.
1. Origins of Evolution Strategies
Evolution Strategies (ES) were developed by Ingo Rechenberg and Hans-Paul Schwefel at the Technical University of Berlin in the late 1960s, independently of Holland's GA. The earliest applications were wind-tunnel airfoil optimization—objectives that were continuous, non-analytically-differentiable, and expensive to evaluate.
Key differences between ES and GA:
| Dimension | GA (Holland) | ES (Rechenberg/Schwefel) |
|---|---|---|
| Origin | Adaptive systems theory | Engineering optimization (fluid mechanics) |
| Chromosome | Mainly binary strings | Mainly real vectors |
| Primary operator | Crossover | Mutation |
| Selection | Fitness-proportional | Truncation (\(\mu\) chosen from \(\lambda\)) |
| Self-adaptation | Mutation rate hand-tuned | Mutation step size \(\sigma\) self-adaptive |
| Theory | Schema theorem | 1/5 success rule, IGO/natural gradient |
2. Classical ES Variants
2.1 (1+1)-ES: The Minimal Working Model
The population is a single individual \(\theta_t \in \mathbb{R}^d\). Each generation produces one mutated offspring: $\(\theta'_t = \theta_t + \sigma_t \cdot \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I_d)\)$ If \(f(\theta'_t) \ge f(\theta_t)\), set \(\theta_{t+1} = \theta'_t\); otherwise keep \(\theta_t\).
2.2 (\(\mu\), \(\lambda\))-ES vs (\(\mu\) + \(\lambda\))-ES
- (\(\mu\), \(\lambda\)): \(\mu\) parents produce \(\lambda > \mu\) offspring; the next generation is chosen only from the offspring—parents do not participate, forcing updates.
- (\(\mu\) + \(\lambda\)): parents and offspring are merged and the top \(\mu\) are kept, giving monotonic non-decrease.
In practice, (\(\mu\), \(\lambda\)) is more common—it avoids the problem of "stale \(\sigma\) dragging \(\theta\)" under self-adaptive step sizes. A typical ratio is \(\lambda / \mu \approx 4 \sim 7\).
2.3 The 1/5 Success Rule (Rechenberg 1973)
Too small a \(\sigma\) converges slowly; too large a \(\sigma\) has a low success rate. Rechenberg's empirical rule: target success rate \(\approx 1/5\).
- If the recent success rate is \(> 1/5\) → \(\sigma \leftarrow \sigma / c\) (enlarge step, \(c \in (0.8, 1)\))
- If \(< 1/5\) → \(\sigma \leftarrow \sigma \cdot c\)
- If \(\approx 1/5\) → keep
Historically, this was the first self-adaptive step-size mechanism in evolutionary algorithms. It has theoretical support on the sphere function (from asymptotic convergence-rate analysis) and is the conceptual root of CMA-ES, xNES, and other adaptive schemes.
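The (1+1)-ES of §2.1 plus the 1/5 rule fits in a few lines. A sketch minimizing the 2-D sphere function (so a mutant succeeds when its cost is lower; the adaptation constant `c = 0.85` and the 10-trial window are common textbook choices, not mandated by the rule itself):

```python
import random

def one_plus_one_es(f, theta, sigma=1.0, c=0.85, window=10, iters=500, rng=random):
    """(1+1)-ES minimizing f, with 1/5-success-rule step-size adaptation."""
    successes = 0
    for t in range(1, iters + 1):
        child = [x + sigma * rng.gauss(0, 1) for x in theta]
        if f(child) < f(theta):      # success: offspring replaces parent
            theta = child
            successes += 1
        if t % window == 0:          # adapt sigma every `window` trials
            rate = successes / window
            if rate > 0.2:
                sigma /= c           # succeeding too often -> enlarge step
            elif rate < 0.2:
                sigma *= c           # failing too often -> shrink step
            successes = 0
    return theta, sigma

random.seed(0)
sphere = lambda x: sum(v * v for v in x)
best, sigma = one_plus_one_es(sphere, [5.0, -3.0])
print(sphere(best) < 1e-3)
```

Note the rule's asymmetry in action: far from the optimum almost every step succeeds and σ grows; near the optimum the success rate collapses and σ shrinks geometrically.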
3. CMA-ES: The Modern Champion of ES
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) was developed by Nikolaus Hansen and Andreas Ostermeier between 1996 and 2001, and is widely regarded as one of the strongest black-box continuous optimization algorithms today. On benchmarks such as BBOB (Black-Box Optimization Benchmark), it has long sat in the top tier.
3.1 Algorithmic Skeleton
Each generation, CMA-ES samples \(\lambda\) candidates from a multivariate Gaussian: $\(\mathbf{x}_i^{(g+1)} \sim \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}(\mathbf{0}, \mathbf{C}^{(g)}), \quad i=1,\dots,\lambda\)$
where:
- \(\mathbf{m}^{(g)} \in \mathbb{R}^d\): the distribution mean (current best estimate)
- \(\sigma^{(g)} \in \mathbb{R}_+\): overall step size
- \(\mathbf{C}^{(g)} \in \mathbb{R}^{d \times d}\): covariance matrix (shape/scale prior)
After evaluation, the top-\(\mu\) of the \(\lambda\) offspring update \(\mathbf{m}, \sigma, \mathbf{C}\). The essence of CMA-ES lies entirely in how these three quantities are adapted.
3.2 Mean Update
$\(\mathbf{m}^{(g+1)} = \sum_{i=1}^\mu w_i\, \mathbf{x}_{i:\lambda}^{(g+1)}, \quad \sum w_i = 1,\ w_1 \ge \dots \ge w_\mu \ge 0\)$ where \(\mathbf{x}_{i:\lambda}\) is the \(i\)-th offspring after sorting by fitness. Weights \(w_i\) usually decay logarithmically. \(\mu_{\mathrm{eff}} = 1/\sum w_i^2\) is called the effective selection mass.
3.3 Covariance Matrix Update: rank-1 and rank-\(\mu\)
Rank-\(\mu\) update: estimate the covariance from this generation's top-\(\mu\) offspring, $\(\mathbf{C}_\mu^{(g+1)} = \frac{1}{(\sigma^{(g)})^2}\sum_{i=1}^\mu w_i (\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})(\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})^\top\)$ But a single generation gives too few samples and the estimate is noisy.
Rank-1 update: use an evolution path \(\mathbf{p}_c\) to accumulate displacement information across generations: $\(\mathbf{p}_c^{(g+1)} = (1-c_c)\mathbf{p}_c^{(g)} + \sqrt{c_c(2-c_c)\mu_{\mathrm{eff}}} \cdot \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}\)$ $\(\mathbf{C}^{(g+1)} = (1-c_1-c_\mu)\mathbf{C}^{(g)} + c_1\mathbf{p}_c\mathbf{p}_c^\top + c_\mu \mathbf{C}_\mu^{(g+1)}\)$
Intuition: if the mean drifts in the same direction over several generations, the path \(\mathbf{p}_c\) becomes long; CMA-ES "etches" this direction into the covariance, biasing future samples toward it—capturing long-range correlations.
3.4 Global Step-Size Control (CSA)
Similarly, maintain a conjugate evolution path \(\mathbf{p}_\sigma\): $\(\mathbf{p}_\sigma^{(g+1)} = (1-c_\sigma)\mathbf{p}_\sigma^{(g)} + \sqrt{c_\sigma(2-c_\sigma)\mu_{\mathrm{eff}}}\cdot \mathbf{C}^{(g)\,-1/2} \frac{\mathbf{m}^{(g+1)}-\mathbf{m}^{(g)}}{\sigma^{(g)}}\)$ $\(\sigma^{(g+1)} = \sigma^{(g)} \exp\!\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|\mathbf{p}_\sigma\|}{E\|\mathcal{N}(0,I)\|}-1\right)\right)\)$
Intuition: if successive steps are almost collinear (the path is long), \(\sigma\) is too small → enlarge it; if they cancel each other out (the path is short), \(\sigma\) is too large → shrink it. This is the continuous, anisotropic generalization of the 1/5 rule.
3.5 Connection to Natural Gradients: the IGO Framework
In the NES paper (2014), Wierstra et al. showed that CMA-ES can, under a particular parameterization, be interpreted as natural-gradient ascent in the parameter space of the search distribution—where the Fisher information metric automatically yields the rank-1/rank-\(\mu\) update form. Ollivier et al. (2017) then unified an entire family of evolutionary algorithms (CMA-ES, xNES, PBIL, EDA) into a single Information-Geometric Optimization (IGO) flow on the same parameter space.
Engineering takeaway: the "magic" of CMA-ES is almost entirely determined by Fisher geometry. Few of its hyperparameters need tuning—the library defaults work directly in 80% of scenarios.
3.6 Suitable Problem Sizes
- Dimension \(d \le 100\): standard CMA-ES handles it fully.
- \(d \in [100, 1000]\): use sep-CMA-ES (diagonal covariance) or LM-CMA-ES (low-rank approximation).
- \(d > 1000\): the \(\mathcal{O}(d^2)\) memory of the covariance matrix breaks down—switch to NES variants or simplified schemes like OpenAI ES that only use isotropic Gaussian perturbations.
4. ES from a Modern Lens: as a Black-Box Optimizer
Viewing ES as "simulated annealing + adaptive variance + multiple sampling" is a reasonable engineering picture. Its engineering advantages include:
- Zeroth-order: needs only \(f(\theta)\), not \(\nabla f\).
- Parallelizable: \(\lambda\) offspring are evaluated independently.
- Invariance: CMA-ES is invariant under monotonic transforms of fitness (it uses ranking only) and invariant under linear transforms (covariance adaptation)—two properties that make it more robust than SGD on poorly scaled problems.
- No reliance on backprop: the evaluator can be a simulator, a real robot, or human ratings.
5. OpenAI ES: Pushing ES to Large-Scale RL
The 2017 paper Evolution Strategies as a Scalable Alternative to Reinforcement Learning by Salimans, Ho, Chen, Sidor, and Sutskever (OpenAI) is the landmark of ES's revival in the deep-learning era.
5.1 The Algorithm
Parameter \(\theta \in \mathbb{R}^d\) (the flattened vector of an entire neural network). Each step:
- Sample \(n\) Gaussian perturbations \(\epsilon_1, \dots, \epsilon_n \sim \mathcal{N}(0, I)\).
- Evaluate \(F_i = F(\theta + \sigma \epsilon_i)\) (the return from one rollout in the environment).
- Estimate the gradient and update: $\(\nabla_\theta \mathbb{E}_{\epsilon}[F(\theta + \sigma\epsilon)] \approx \frac{1}{n\sigma}\sum_{i=1}^n F_i\, \epsilon_i\)$ $\(\theta \leftarrow \theta + \alpha \cdot \frac{1}{n\sigma}\sum_{i=1}^n F_i\, \epsilon_i\)$
This is essentially the fixed-variance version of NES (Natural Evolution Strategies)—a score-function gradient (REINFORCE-style) estimate over \(\theta\).
5.2 The Key Engineering Innovation: Seed Communication
Inter-worker communication cost is the bottleneck of large-scale ES with thousands of CPU workers. The OpenAI trick: all workers share the same RNG seed table, and communication only requires broadcasting tuples \((i, F_i)\)—each worker regenerates \(\epsilon_i\) from seed \(i\) on its own. This cuts communication from \(\mathcal{O}(d)\) to \(\mathcal{O}(1)\).
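The trick is easy to demonstrate with the stdlib RNG: two workers that share only an integer seed reproduce the same perturbation vector exactly, so broadcasting \((i, F_i)\) is all the communication needed (a sketch; real implementations pre-generate a large shared noise table and index into it for speed):

```python
import random

def regen_eps(seed, dim):
    """Deterministically regenerate the Gaussian perturbation for `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(dim)]

# Worker A samples its perturbation from seed i and reports only (i, F_i)
seed, dim = 12345, 8
eps_worker_a = regen_eps(seed, dim)

# The master (or any other worker) rebuilds the identical vector locally
eps_master = regen_eps(seed, dim)

print(eps_worker_a == eps_master)   # -> True: no O(d) vector ever sent
```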
5.3 ES vs PPO
| Dimension | OpenAI ES | PPO / A3C |
|---|---|---|
| Signal | Episode-level scalar return | Per-step advantage |
| Gradient | Finite-difference estimate | Analytic gradient via backprop |
| Sample efficiency | Low (\(10^7 \sim 10^9\) frames) | High (\(10^6 \sim 10^8\)) |
| Wall-clock efficiency | Fast (CPU-parallelizable) | Bottlenecked by GPU/data efficiency |
| Communication | \(\mathcal{O}(1)\) per step | Gradient \(\mathcal{O}(d)\) per step |
| Exploration | Parameter-space perturbations | Action-space stochasticity |
| Sparse rewards | Robust (episode-level returns) | Brittle |
| Long horizons | Robust | Hard credit assignment |
| Implementation complexity | Extremely simple (~50 lines) | Moderate (replay/clipping/GAE) |
OpenAI ran Atari on 1,440 CPUs and achieved scores comparable to A3C, with shorter wall-clock time. Conclusion: when massive parallelism is available and rewards are sparse with long horizons, ES is a strong RL alternative.
6. NEAT: Co-evolving Topology and Weights
NeuroEvolution of Augmenting Topologies (NEAT) was proposed by Kenneth Stanley and Risto Miikkulainen in 2002. It addresses three core problems that had long plagued evolving neural network topologies: competing conventions, initial explosion, and early death of good structures.
6.1 Three Innovations
(1) Innovation Number (historical marking): each newly added connection is given a globally increasing ID. Whether two connections are "homologous" is determined by ID, not by structural alignment—definitively solving the alignment problem in topological crossover.
(2) Speciation: divide the population into species by structural distance. The distance formula: $\(\delta = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 \bar{W}\)$ where \(E\) = excess genes (those held by one parent but not the other at all), \(D\) = disjoint genes (mismatched in the middle), and \(\bar W\) = the mean weight difference of shared genes. Individuals with \(\delta < \delta_t\) are deemed to be of the same species.
Explicit fitness sharing: each member's fitness is divided by its species size (\(f_i' = f_i / |S|\)), preventing small species (new topologies) from being swamped by large ones—the core protective mechanism for "gradual complexification".
(3) Gradual complexification: start from a minimal topology (input directly to output) and monotonically increase structural complexity through add-node / add-link mutations only. This avoids the wasteful "complexify then prune" cycle.
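The distance formula in (2) can be computed directly from the gene lists. A sketch in which a genome is simplified to a `{innovation_id: weight}` dict (an illustrative reduction of NEAT's full genome encoding, which also stores node genes and enable flags):

```python
def compatibility(g1, g2, c1=1.0, c2=1.0, c3=0.4):
    """NEAT compatibility distance between two genomes, each a dict
    mapping innovation number -> connection weight."""
    ids1, ids2 = set(g1), set(g2)
    shared = ids1 & ids2
    cutoff = min(max(ids1), max(ids2))
    # Excess: genes beyond the other genome's highest innovation number
    E = sum(1 for i in ids1 ^ ids2 if i > cutoff)
    # Disjoint: non-matching genes inside the overlapping id range
    D = sum(1 for i in ids1 ^ ids2 if i <= cutoff)
    W = sum(abs(g1[i] - g2[i]) for i in shared) / len(shared)
    N = max(len(g1), len(g2))
    return c1 * E / N + c2 * D / N + c3 * W

ga = {1: 0.5, 2: -0.3, 4: 0.8}          # innovation ids 1, 2, 4
gb = {1: 0.6, 3: 0.1, 4: 0.8, 6: -0.2}  # innovation ids 1, 3, 4, 6
delta = compatibility(ga, gb)
print(round(delta, 3))
```

Here gene 6 is excess (beyond `ga`'s highest id 4), genes 2 and 3 are disjoint, and genes 1 and 4 are shared, giving \(\delta = 1/4 + 2/4 + 0.4 \cdot 0.05 = 0.77\).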
6.2 The NEAT Speciation Loop
flowchart TD
A[New generation of offspring] --> B[For each individual]
B --> C{delta distance to<br/>existing species representative}
C -- delta < delta_t --> D[Add to that species]
C -- No species matches --> E[Create new species<br/>this individual as representative]
D --> F[Within-species fitness sharing<br/>f_i / |S|]
E --> F
F --> G[Allocate reproductive<br/>quotas per species]
G --> H[Cull species stagnant for too long]
H --> I[Next-generation selection / mutation]
6.3 Applications of NEAT
- OpenAI Gym control tasks: NEAT approaches optimum within hundreds of generations on CartPole and Pendulum.
- MarI/O: YouTuber SethBling used NEAT to play Super Mario Bros, with enormous popular-science influence.
- Robot control: competitive with traditional RL on continuous control.
7. HyperNEAT: Indirect Encoding and Geometric Priors
Direct encoding (NEAT) does not scale to neural networks with millions of parameters—every connection must be represented separately. HyperNEAT (Stanley, D'Ambrosio, Gauci 2009) solves this with indirect encoding.
7.1 CPPN: Pattern-Generating Networks
A Compositional Pattern Producing Network (CPPN) is an ordinary neural network, but its "task" is not classification—it is producing the connection weights of other neural networks. Specifically, for each pair of neurons \((u, v)\) in the target network, their positions \((x_u, y_u, x_v, y_v)\) are fed into the CPPN, which outputs the connection weight \(w_{uv}\): $\(w_{uv} = \mathrm{CPPN}(x_u, y_u, x_v, y_v)\)$
CPPNs are evolved with NEAT (a CPPN is itself a neural network). Beyond sigmoid, the activation set includes sin, Gaussian, and abs—these function families help produce symmetric, repetitive, and gradient-like geometric patterns.
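A toy CPPN can be sketched as a fixed composition of such activations. The specific composition below is invented for illustration (an evolved CPPN would have learned structure), but it shows how one small function paints an entire weight matrix over neuron coordinates:

```python
import math

def cppn(xu, yu, xv, yv):
    """Toy CPPN: maps the coordinates of a neuron pair to a weight.
    sin gives repetition along the layer; the Gaussian of the distance
    gives locality -- the geometric priors HyperNEAT exploits."""
    d = math.hypot(xu - xv, yu - yv)
    return math.sin(3.0 * (xu + xv)) * math.exp(-d * d)

# A 4-neuron input layer at y=0 and a 4-neuron output layer at y=1
coords = [i / 3 for i in range(4)]
W = [[cppn(xu, 0.0, xv, 1.0) for xv in coords] for xu in coords]

# Locality prior in action: a connection between vertically aligned
# neurons is stronger in magnitude than one spanning the whole layer.
print(abs(W[1][1]) > abs(W[1][3]))
```

The payoff is compression: the target network here has 16 weights but the generator is one closed-form function, and the same CPPN could fill in a 1000x1000 substrate without growing.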
7.2 Geometric Priors
The core insight of HyperNEAT: in real problems, the connectivity patterns of neural networks often have geometric structure (the retinotopic mapping from retina to V1, the somatotopic map of motor cortex, etc.). Placing neurons in a geometric coordinate space and letting the CPPN learn the coordinate-to-weight mapping amounts to imposing the positional inductive bias that "neighboring neurons should have correlated weights".
7.3 Subsequent Developments
- ES-HyperNEAT (Risi & Stanley 2010): lets the CPPN itself decide how many neurons to place and where, no longer fixing the substrate.
- Adaptive HyperNEAT: makes the learning rate one of the CPPN's outputs.
- This family of ideas is directly related to the modern deep-learning notion of HyperNetworks (Ha et al. 2016)—a neural network that generates the weights of another neural network.
8. Neuroevolution vs Reinforcement Learning: Which to Choose
flowchart TD
A[Decision starting point] --> B{Reward dense<br/>observable per step?}
B -- No / sparse --> NE1[Lean toward NE / ES]
B -- Yes --> C{Action space discrete<br/>and small?}
C -- Yes --> RL1[Lean toward RL DQN/PPO]
C -- No / continuous high-dim --> D{Policy needs<br/>long-term memory?}
D -- Yes --> NE2[NE / NEAT topology search]
D -- No --> E{Compute: GPU vs CPU?}
E -- Mostly GPU --> RL2[Lean toward RL]
E -- Lots of CPU --> NE3[Lean toward ES]
A --> F{Need to evolve network architecture?}
F -- Yes --> NE4[NEAT / NAS]
F -- No --> G{Hyperparameter sensitive?}
G -- Yes --> NE5[ES is more robust]
G -- No --> RL3[RL has better sample efficiency]
Summary heuristics:
| Lean toward NE/ES | Lean toward RL |
|---|---|
| Sparse rewards / final-only outcome | Dense rewards |
| High-dim continuous control, long horizons | Discrete small action spaces, short horizons |
| Need to evolve topology | Fixed topology |
| Plenty of CPUs, scarce GPUs | Plenty of GPUs |
| Black-box simulator (no gradient) | Differentiable simulator |
| Tolerant of low sample efficiency | Care about sample efficiency |
9. Practical Considerations
9.1 Population Size and Parallelism
- CMA-ES: default \(\lambda = 4 + \lfloor 3\ln d \rfloor\), e.g., \(\lambda \approx 18\) for \(d=100\). Can be enlarged to a few hundred to accelerate convergence (when parallelism is sufficient).
- OpenAI ES: typically \(n = 1000 \sim 10000\). The paper used 1,440 CPU workers, each evaluating several perturbations per generation.
- NEAT: \(N = 150 \sim 300\). The number of species is typically held between 5 and 15 (with adaptive \(\delta_t\)).
9.2 Hybrids with SGD
- Genetic SGD: each generation fine-tunes offspring with SGD before evaluating—a typical memetic recipe.
- PBT (Population-Based Training): a GA + SGD hybrid by DeepMind for online tuning of hyperparameters/learning rates (Jaderberg et al. 2017).
- ES + backprop: Conti et al. (2018) backpropagated through the ES loss, forming an ES-Gradient hybrid.
9.3 Implementation Pitfalls
- The variance of Gaussian perturbations must be regularized—without normalization, perturbation magnitudes blow up as \(\sqrt{d}\) in high dimensions.
- ES's invariance to fitness scaling holds only when rank-based fitness shaping is used (subtract mean and divide by std, or convert to ranks).
- NEAT's \(\delta_t\) cannot be fixed—as the population evolves, the scale of structural distance drifts, requiring adaptation.
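The rank-based shaping mentioned in the second point (and invoked as `rank_normalize` in the §10 skeleton) is a one-liner in spirit: replace each return by its centered rank. A sketch (ties are broken by index, which is fine for illustration):

```python
def rank_normalize(rewards):
    """Centered rank transform used as fitness shaping in OpenAI-style ES:
    maps raw returns to evenly spaced values in [-0.5, 0.5], making the
    gradient estimate invariant to monotone rescaling of the reward."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    shaped = [0.0] * n
    for rank, i in enumerate(order):
        shaped[i] = rank / (n - 1) - 0.5
    return shaped

# The outlier return -100 contributes exactly one rank step, no more:
print(rank_normalize([3.0, -100.0, 7.0, 0.5]))
```

Because the shaped values sum to zero, the estimator also gains a built-in baseline: a constant shift of all returns changes nothing.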
10. Example: ES vs PPO Pseudocode in a Gymnasium Environment
Below is a minimal comparison skeleton of OpenAI ES vs PPO on the same Gym environment—the structures are highly similar, with the difference lying only in the update rule.
# ============ OpenAI ES ============
def openai_es(env, policy_net, n_workers=64, sigma=0.1, alpha=0.01, gens=500):
    theta = flatten(policy_net.parameters())
    for g in range(gens):
        seeds = [rng.randint(2**31) for _ in range(n_workers)]
        rewards = []
        for s in seeds:  # parallelizable, e.g. via ray.remote
            eps = sample_normal_with_seed(s, theta.shape)
            theta_perturbed = theta + sigma * eps
            r = rollout(env, unflatten(theta_perturbed))
            rewards.append(r)
        rewards = rank_normalize(rewards)  # fitness shaping
        grad_est = sum(r * regen_eps(s, theta.shape)
                       for r, s in zip(rewards, seeds)) / (n_workers * sigma)
        theta = theta + alpha * grad_est

# ============ PPO (skeleton) ============
def ppo(env, policy_net, value_net, T=2048, K=10, eps=0.2, gens=500):
    for g in range(gens):
        traj = collect_trajectories(env, policy_net, T)
        adv = gae(traj, value_net)
        for _ in range(K):
            ratio = exp(policy_net.logp(traj) - traj.old_logp)
            surr = min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)
            loss = -surr.mean() + 0.5 * (value_net(traj.s) - traj.ret).pow(2).mean()
            sgd_step(loss)
In production, the ES part is typically run with ray / torch.distributed worker pools, while PPO uses Stable-Baselines3 / RLlib.
11. Cross-references
- Parent page: the tribe entry page (§1) — Overall positioning of the Evolutionaries
- Same directory: Genetic Algorithms and Genetic Programming — The bit-string/tree encoded version of GA
- Same directory: Swarm Intelligence and AutoML — Comparison of PSO/ACO/NAS
- Reinforcement learning: ../../../2_ReinforcementLearning/ — Details of PPO/A3C/SAC; this page only does comparisons
- Deep learning: ../../../1_DeepLearning/ — Pages on HyperNetworks and neural architecture search
References
- Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog. — The original ES PhD thesis.
- Schwefel, H.-P. (1981). Numerical Optimization of Computer Models. Wiley.
- Hansen, N., & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 159–195.
- Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772. — The authoritative engineering tutorial for CMA-ES.
- Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., & Schmidhuber, J. (2014). Natural Evolution Strategies. JMLR, 15, 949–980.
- Ollivier, Y., Arnold, L., Auger, A., & Hansen, N. (2017). Information-geometric optimization algorithms: A unifying picture via invariance principles. JMLR, 18, 1–65.
- Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864.
- Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127.
- Stanley, K. O., D'Ambrosio, D. B., & Gauci, J. (2009). A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2), 185–212.
- Risi, S., & Stanley, K. O. (2010). Evolving plastic neural networks with novel adaptation rules. GECCO.
- Conti, E., Madhavan, V., Petroski Such, F., Lehman, J., Stanley, K. O., & Clune, J. (2018). Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. NeurIPS.
- Jaderberg, M., et al. (2017). Population Based Training of Neural Networks. arXiv:1711.09846.
- Ha, D., Dai, A., & Le, Q. V. (2016). HyperNetworks. arXiv:1609.09106.
Swarm Intelligence and AutoML
This page focuses on two parallel threads within the Evolutionaries: Swarm Intelligence (SI) and Automated Machine Learning (AutoML). The former is a family of decentralized optimization methods that emulate the collective behavior of ant colonies, bird flocks, and bee swarms; the latter applies evolution and other black-box optimization techniques to the machine-learning pipeline itself—hyperparameters, model families, and neural network architectures all become objects of search.
Before reading this page, we recommend looking at the tribe entry page (§1). Together with Genetic Algorithms and Genetic Programming and Evolution Strategies and Neuroevolution, this page completes the Evolutionary series.
1. Swarm Intelligence: Decentralization, Local Interaction, Emergence
Swarm Intelligence (SI) was proposed by Beni and Wang in 1989 as a research paradigm for collaborative robot swarms. Its three pillars are:
- Decentralization: there is no global scheduler; each individual sees only the state of its local neighborhood.
- Local interaction: individuals communicate using simple rules and limited information.
- Emergence: global intelligent behavior arises "spontaneously" from the interactions of many simple individuals.
Key differences between SI and GA:
| Dimension | GA | SI (PSO/ACO) |
|---|---|---|
| Inter-individual relations | Selective competition (survival of the fittest) | Cooperative communication (pheromone / global-best sharing) |
| Primary operator | Crossover + mutation | Update rule (move toward the best) |
| Has explicit "generations"? | Yes, with discrete generations | Usually continuous updates |
| Primary metaphor | Natural selection | Swarm behavior |
2. Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) was proposed by James Kennedy and Russell Eberhart in 1995, inspired by the cooperative foraging of bird flocks.
2.1 The Algorithm
Each particle \(i\) maintains:
- position \(\mathbf{x}_i \in \mathbb{R}^d\)
- velocity \(\mathbf{v}_i \in \mathbb{R}^d\)
- personal best \(\mathbf{p}_i\)
- global best \(\mathbf{g}\) (or neighborhood best \(\mathbf{l}_i\))
Each step: $\(\mathbf{v}_i^{t+1} = \omega \mathbf{v}_i^t + c_1 r_1 (\mathbf{p}_i - \mathbf{x}_i^t) + c_2 r_2 (\mathbf{g} - \mathbf{x}_i^t)\)$ $\(\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \mathbf{v}_i^{t+1}\)$
where \(r_1, r_2 \sim \mathcal{U}[0,1]^d\) are sampled component-wise; \(\omega\) is the inertia weight; \(c_1, c_2\) are the cognitive and social learning factors, with typical values \(c_1 = c_2 = 1.49\) and \(\omega \in [0.4, 0.9]\).
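The update rule above translates almost line-for-line into code. A minimal gbest PSO minimizing the 2-D sphere function (search box \([-5,5]^d\), constants from the typical values just given; the function and box are illustrative):

```python
import random

def pso(f, dim=2, n=20, iters=200, w=0.7, c1=1.49, c2=1.49, rng=random):
    """Minimal gbest-topology PSO minimizing f over [-5, 5]^dim."""
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    V = [[0.0] * dim for _ in range(n)]
    P = [x[:] for x in X]            # personal bests
    g = min(P, key=f)[:]             # global best
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()   # per-component randomness
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            if f(X[i]) < f(P[i]):
                P[i] = X[i][:]
                if f(P[i]) < f(g):
                    g = P[i][:]
    return g

random.seed(1)
sphere = lambda x: sum(v * v for v in x)
best = pso(sphere)
print(sphere(best) < 1e-4)
```

Swapping the gbest `g` for a per-particle neighborhood best is all it takes to get the lbest ring topology of §2.3.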
2.2 Parameter Intuitions
- Large \(\omega\) → strong inertia, strong exploration; small \(\omega\) → strong convergence. A common linear decay schedule is from 0.9 down to 0.4.
- Large \(c_1\) → particles trust their own history → multimodal protection.
- Large \(c_2\) → particles trust the global best → fast convergence but easy premature convergence.
- Clerc constriction factor (\(\chi\)): a more stable setting \(\mathbf{v} \leftarrow \chi[\mathbf{v} + c_1 r_1 (\mathbf{p}_i - \mathbf{x}) + c_2 r_2 (\mathbf{g} - \mathbf{x})]\), with \(\chi = 2/|2 - \varphi - \sqrt{\varphi^2 - 4\varphi}|\) and \(\varphi = c_1+c_2 > 4\).
2.3 Topologies
- gbest (star): all particles share the global best → fast convergence, prone to premature convergence.
- lbest (ring): each particle sees only its left/right neighbors → slow convergence but resists premature convergence.
- Von Neumann lattice: each particle sees its four (top/bottom/left/right) neighbors.
- Random topology: dynamically changing neighbor relations.
2.4 Important Variants
- CLPSO (Comprehensive Learning PSO, Liang 2006): each dimension picks its own learning target, improving multimodal search.
- APSO (Adaptive PSO): adapts \(\omega, c_1, c_2\) based on the evolutionary state.
- Bare-bones PSO: removes velocity entirely, sampling directly \(\mathbf{x}_i^{t+1} \sim \mathcal{N}((\mathbf{p}_i + \mathbf{g})/2, |\mathbf{p}_i - \mathbf{g}|)\).
3. Ant Colony Optimization (ACO)
Ant Colony Optimization (ACO) was introduced in Marco Dorigo's 1992 PhD thesis. It is inspired by ants communicating indirectly through pheromones and ultimately converging to the shortest path.
3.1 Key Mechanisms of ACO
- Solution construction: each ant incrementally builds a complete solution (e.g., a TSP tour), making decisions according to: $\(p_{ij}^k = \frac{[\tau_{ij}]^\alpha [\eta_{ij}]^\beta}{\sum_{l \in N_i^k}[\tau_{il}]^\alpha [\eta_{il}]^\beta}\)$ where \(\tau_{ij}\) is the pheromone concentration on edge \((i,j)\), \(\eta_{ij}\) is a heuristic (e.g., \(1/d_{ij}\)), and \(\alpha, \beta\) control the influence of pheromone and heuristic respectively.
- Pheromone update: after all ants finish a round: $\(\tau_{ij} \leftarrow (1-\rho)\tau_{ij} + \sum_{k=1}^m \Delta\tau_{ij}^k\)$ \(\rho \in (0, 1]\) is the evaporation rate, \(\Delta\tau_{ij}^k = Q/L^k\) if ant \(k\) traversed \((i,j)\) and 0 otherwise, and \(L^k\) is the total cost of ant \(k\)'s tour.
- Positive feedback: edges of good solutions accumulate more pheromone → subsequent ants are more inclined to traverse them → reinforcing good solutions. The evaporation mechanism prevents over-reinforcement of suboptimal solutions discovered early.
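Both mechanisms fit in one short listing. A sketch of the basic Ant System on a 4-city TSP (the city layout and all constants are illustrative; a serious run would use MMAS bounds from §3.2):

```python
import random

def ant_system(dist, n_ants=10, iters=100, alpha=1.0, beta=2.0,
               rho=0.5, Q=1.0, rng=random):
    """Basic Ant System (Dorigo-style) for TSP on a symmetric distance matrix."""
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]                         # pheromone
    eta = [[0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(iters):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            while len(tour) < n:                                # construct solution
                i = tour[-1]
                cand = [j for j in range(n) if j not in tour]
                wts = [tau[i][j] ** alpha * eta[i][j] ** beta for j in cand]
                tour.append(rng.choices(cand, weights=wts, k=1)[0])
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        for i in range(n):                                      # evaporation
            for j in range(n):
                tau[i][j] *= (1 - rho)
        for tour, length in tours:                              # deposit Q / L^k
            for k in range(n):
                i, j = tour[k], tour[(k + 1) % n]
                tau[i][j] += Q / length
                tau[j][i] += Q / length
    return best_tour, best_len

# 4 cities on a unit square (Manhattan distances): the optimal tour is
# the perimeter, length 4; any tour using a diagonal costs 6.
random.seed(3)
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
D = [[abs(a[0] - b[0]) + abs(a[1] - b[1]) for b in pts] for a in pts]
tour, length = ant_system(D)
print(length)
```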
3.2 ACO Variants
- AS (Ant System): the original version (Dorigo 1992); all ants update.
- MMAS (MAX-MIN Ant System, Stützle 2000): pheromone is bounded to \([\tau_{\min}, \tau_{\max}]\), and only the best-so-far ant updates—greatly mitigating premature convergence.
- ACS (Ant Colony System, Dorigo & Gambardella 1997): introduces a local pheromone update + a pseudo-random proportional selection rule.
- Rank-based AS: only the top-\(\sigma\) ants update, weighted by rank.
3.3 Application Scenarios
ACO is best suited to discrete combinatorial optimization:
- TSP, vehicle routing (VRP), scheduling, network routing (AntNet), assembly sequences, feature selection.
- It is less competitive than PSO/CMA-ES on continuous optimization—the discrete pheromone structure is awkward in continuous space.
4. Other Swarm Algorithms (Brief Tour)
4.1 Artificial Bee Colony (ABC, Karaboga 2005)
Mimics bee foraging. Three roles:
- Employed bees: each maintains a food source (candidate solution) and performs local search.
- Onlooker bees: choose food sources based on the dance (fitness) of employed bees.
- Scout bees: when a food source is exhausted (no improvement for *limit* generations), scout bees randomly restart it.
Update rule: \(x'_{ij} = x_{ij} + \phi(x_{ij} - x_{kj})\), \(\phi \in [-1, 1]\), with \(k\) being another randomly chosen food source.
ABC's strengths are few parameters (mainly limit and population size) and good resistance to premature convergence; its weakness is that it offers no clear edge over PSO/CMA-ES outside high-dimensional, multimodal problems.
4.2 Firefly Algorithm (FA, Yang 2008)
Each firefly is attracted to brighter ones: $\(\mathbf{x}_i \leftarrow \mathbf{x}_i + \beta_0 e^{-\gamma r_{ij}^2}(\mathbf{x}_j - \mathbf{x}_i) + \alpha \boldsymbol{\epsilon}_i\)$ where \(r_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|\), \(\beta_0\) is the base attraction, \(\gamma\) is the decay coefficient, and \(\alpha\) is the random perturbation magnitude.
Its hallmark is distance-based decay—brighter fireflies attract only nearby fireflies strongly, naturally forming multiple niches → suitable for multimodal optimization.
4.3 Cuckoo Search (CS, Yang & Deb 2009)
Inspired by cuckoos laying eggs in other birds' nests, with eggs ejected if discovered. Algorithm steps:
- Generate new solutions via Lévy flight: \(\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \alpha \otimes L(s, \lambda)\), where \(L\) is a Lévy distribution (heavy-tailed, with long-range jumps).
- Randomly replace solutions in some "nests".
- With probability \(p_a\), abandon the worst "discovered" eggs and regenerate them.
The heavy-tailed nature of Lévy flights gives CS stronger global exploration than PSO.
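The heavy-tailed steps are usually generated with Mantegna's algorithm, a standard recipe in the cuckoo-search literature. A sketch for \(\lambda = 1.5\) (the sample size and threshold in the demo are illustrative):

```python
import math
import random

def levy_step(lam=1.5, rng=random):
    """One approximately Levy-stable step via Mantegna's algorithm:
    step = u / |v|^(1/lam), with u ~ N(0, sigma_u^2) and v ~ N(0, 1)."""
    sigma_u = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
               / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0, sigma_u)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / lam)

random.seed(7)
steps = [levy_step() for _ in range(20000)]
typical = sorted(abs(s) for s in steps)[len(steps) // 2]   # median |step|
extreme = max(abs(s) for s in steps)
# Heavy tail: rare jumps land orders of magnitude beyond the typical step,
# which is exactly what gives CS its long-range exploration.
print(extreme / typical > 100)
```

A Gaussian walk with the same median step would essentially never produce such outliers (its maximum over 20,000 draws sits around 4 standard deviations).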
4.4 Comparison of Swarm Algorithms
| Algorithm | Main inspiration | Information sharing | Strengths | Weaknesses |
|---|---|---|---|---|
| PSO | Bird flocks | Shares personal/global best | Continuous optimization, simple to implement | Premature convergence in high dimensions |
| ACO | Ant pheromones | Pheromone matrix | Discrete combinatorial, TSP | Awkward for continuous optimization |
| ABC | Bee division of labor | Fitness-proportional dance | Multimodal, few parameters | Slow convergence in high dimensions |
| FA | Firefly brightness decay | Distance-based attraction decay | Multimodal | Sensitive to parameter \(\gamma\) |
| CS | Cuckoo + Lévy | Replaces worst nest | Strong global exploration | Poor local refinement, often hybridized |
5. Swarm Intelligence vs GA: An Overview
| Dimension | GA | SI (PSO/ACO) |
|---|---|---|
| Encoding | Bit string / real / tree | Real / path / discrete structure |
| Primary operations | Selection + crossover + mutation | Update rule (move toward best) |
| Information flow | Parent → child (generational) | Continuous sharing among individuals |
| Convergence speed | Slower | Faster (but premature convergence is common) |
| Diversity preservation | sharing/crowding/island | Topology/inertia |
| Applicability | General | Depends on the specific algorithm |
6. AutoML: Automating the Machine Learning Pipeline
AutoML (Automated Machine Learning) aims to let non-experts build high-quality machine-learning systems by automating the entire pipeline: data preprocessing, feature engineering, model selection, hyperparameter optimization, ensembling, and deployment.
6.1 Subproblems of AutoML
flowchart LR
A[Raw data] --> B[Data cleaning / missing-value handling]
B --> C[Feature engineering / feature selection]
C --> D[Model family selection]
D --> E[Hyperparameter optimization HPO]
E --> F[Neural architecture search NAS]
E --> G[Model ensemble / distillation]
F --> H[Deployment / monitoring]
G --> H
subgraph Meta-learning
M1[Meta-data]
M2[warm-start]
end
M1 -.- B
M1 -.- D
M2 -.- E
Main subproblems:
- HPO (Hyperparameter Optimization): learning rate, regularization, network width, etc.
- NAS (Neural Architecture Search): the network architecture itself.
- Meta-learning: leverages experience from prior tasks to speed up new tasks.
- Data-augmentation search: AutoAugment, RandAugment.
- Feature engineering: FeatureTools, AutoCross.
6.2 The Three Schools of AutoML
- Evolutionary: encode the ML pipeline as a chromosome and evolve it (TPOT, AutoML-Zero).
- Bayesian: fit a hyperparameter→performance surrogate using a Gaussian process or TPE (Hyperopt, Optuna).
- Reinforcement learning: model pipeline construction as an MDP (Zoph & Le NAS-RL, Google Vizier).
7. Comparison of Hyperparameter Optimization Methods
| Method | Principle | Dimensions | Smoothness needed? | Typical tools |
|---|---|---|---|---|
| Grid search | Discrete grid traversal | Low (\(\le 4\)) | No | sklearn |
| Random search | Uniform sampling | Medium (\(\le 20\)) | No | sklearn |
| Evolutionary | GA / CMA-ES | Medium–high | No | DEAP, nevergrad |
| Bayesian optimization (BO) | GP/TPE/SMAC | Medium (\(\le 50\)) | Soft (kernel assumption) | scikit-optimize, Optuna, BoTorch |
| Hyperband | Multi-armed bandit + early stopping | Any | No | hyperband, HpBandSter |
| BOHB | BO + Hyperband hybrid | Medium–high | Soft | HpBandSter |
| PBT | Evolution + online training | Any | No | Ray Tune |
Rules of thumb:
- Dimensions \(\le 5\), cheap evaluations: random search is enough.
- Dimensions 5–30, expensive evaluations: Bayesian optimization / TPE.
- Training-curve information available: Hyperband / BOHB (more resource-efficient).
- Hyperparameters tuned alongside training: PBT.
- Non-differentiable objectives / many discrete choices: CMA-ES, GA.
In Random Search for Hyper-Parameter Optimization, Bergstra & Bengio (2012) showed that in many scenarios random search beats grid search: most hyperparameters are unimportant ("low effective dimensionality"), and a grid wastes evaluations on the irrelevant dimensions. This is essential reading for HPO.
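The low-effective-dimensionality argument can be seen in a toy example (the objective `f` and the 3×3 grid below are invented for illustration): with the same budget of 9 evaluations, a grid probes only 3 distinct values of the one hyperparameter that matters, while random search probes 9.

```python
import numpy as np

# Toy objective with low effective dimensionality: only x matters.
def f(x, y):
    return -(x - 0.7) ** 2          # y is irrelevant; maximum at x = 0.7

rng = np.random.default_rng(0)

# Grid search: 3 x 3 grid -> only 3 distinct values of the important x.
grid = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
best_grid = max(f(x, y) for x, y in grid)

# Random search: 9 samples -> 9 distinct values of x.
pts = rng.uniform(0.0, 1.0, size=(9, 2))
best_rand = max(f(x, y) for x, y in pts)

print(best_grid, best_rand)   # random search usually lands closer to 0
```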
8. Neural Architecture Search (NAS)
NAS is the most-watched subproblem in AutoML. Its goal is to automatically discover network architectures that surpass human designs.
8.1 Three Dimensions
Every NAS method must answer three questions:
| Dimension | Options |
|---|---|
| Search space | macro (full net) / cell-based / hierarchical / morphism-based |
| Search strategy | RL / evolution / Bayesian / differentiable / random |
| Performance estimation | Full training / early stopping / weight sharing / performance predictor |
8.2 Search Spaces
- Macro: specify the entire network layer by layer (Zoph & Le 2017 original). The space is huge and search is hard.
- Cell-based: search small cells, then stack them into a network (NASNet, AmoebaNet, DARTS). The mainstream choice—shrinks the space dramatically and transfers across resolutions/tasks.
- Hierarchical: cells themselves contain hierarchical motifs.
- Morphism-based: start from a known model and expand it via function-preserving transforms (network morphism, e.g. Net2Net).
8.3 Search Strategies
8.3.1 Evolutionary: AmoebaNet (Real et al. 2019)
Encode the architecture as a DAG chromosome. The key innovation of Regularized Evolution is replacing elitism with "age-based" selection—always retiring the oldest individual to avoid being dominated long-term by an early lucky solution. Under the same compute budget, it beat NAS-RL, discovering AmoebaNet-A with 83.9% top-1 on ImageNet.
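The aging-evolution loop can be sketched over an abstract encoding. This is a paraphrase of the published pseudocode, not the authors' code; `fitness`, `random_arch`, and `mutate` are caller-supplied placeholders:

```python
import collections
import random

def regularized_evolution(fitness, random_arch, mutate,
                          pop_size=20, cycles=200, sample_size=5, seed=0):
    """Aging evolution in the spirit of Real et al. (2019).

    The population is a FIFO queue: each cycle, a tournament picks a
    parent, its mutated child joins, and the OLDEST member is retired
    regardless of fitness (no elitism).
    """
    rng = random.Random(seed)
    population = collections.deque()
    history = []
    for _ in range(pop_size):                        # random initialization
        arch = random_arch(rng)
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])     # tournament winner
        child = mutate(parent[0], rng)
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()                         # retire the oldest
    return max(history, key=lambda p: p[1])          # best ever seen
```

On a toy "onemax" bit-string problem (fitness = number of ones, mutation = flip one bit), the loop climbs steadily toward the all-ones string.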
8.3.2 RL: NAS-RL (Zoph & Le 2017)
A controller RNN samples architecture descriptions step by step; each sampled architecture is trained on CIFAR-10 and its validation accuracy is used as the reward; REINFORCE / PPO updates the controller.
- Original version: 800 GPUs × 28 days.
- Subsequent NASNet, ENAS dramatically reduced cost.
8.3.3 Differentiable: DARTS (Liu et al. 2019)
Turn "select one of the candidate ops" into a softmax-weighted sum: \(\bar o^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})}\, o(x)\), where the \(\alpha\) are architecture parameters, trained jointly with the weights \(w\) via SGD (bilevel optimization). At the end, the architecture is discretized by taking \(\arg\max_o \alpha_o^{(i,j)}\) on each edge.
DARTS reduced NAS cost from thousands of GPU-days to a few GPU-days on a single GPU and has been widely adopted. But it has known issues, such as the search-evaluation gap and skip-connection collapse, spawning improvements like PC-DARTS, Fair-DARTS, and SDARTS.
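The continuous relaxation can be sketched in NumPy with toy 1-D "ops" standing in for real convolutions (`OPS`, `mixed_op`, and `discretize` are illustrative names, not DARTS's API):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on one edge (toy stand-ins for conv / pool / skip).
OPS = {
    "identity": lambda x: x,
    "zero":     lambda x: np.zeros_like(x),
    "scale2":   lambda x: 2.0 * x,
}

def mixed_op(x, alpha):
    """DARTS relaxation: softmax-weighted sum of all candidate ops."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS.values()))

def discretize(alpha):
    """After search, keep only the op with the largest architecture weight."""
    return list(OPS)[int(np.argmax(alpha))]

x = np.ones(4)
alpha = np.array([2.0, -1.0, 0.5])   # architecture parameters for this edge
print(mixed_op(x, alpha), discretize(alpha))
```

Because `mixed_op` is differentiable in `alpha`, the architecture parameters can be trained by gradient descent alongside the weights.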
8.4 Comparison Diagram of the Three NAS Strategies
flowchart TD
SP[Search space] --> S1[Search strategy]
S1 --> RL[Reinforcement learning<br/>RNN controller<br/>NAS-RL Zoph 2017]
S1 --> EV[Evolution<br/>Chromosome + mutation / crossover<br/>AmoebaNet Real 2019]
S1 --> DG[Differentiable<br/>softmax over ops<br/>DARTS Liu 2019]
RL --> P1[Full training<br/>per sampled architecture]
EV --> P1
DG --> P2[Weight sharing<br/>One-shot]
P1 --> EVAL[Accuracy reward]
P2 --> EVAL
EVAL --> S1
| Dimension | NAS-RL | AmoebaNet | DARTS |
|---|---|---|---|
| Strategy | Reinforcement learning | Evolution (Regularized Evolution) | Differentiable (bilevel) |
| Compute | 800 GPU·days | \(\sim\) 3000 GPU·days | 1.5–4 GPU·days |
| One-shot? | No (its successor ENAS is) | No | Yes |
| Reproduction stability | Poor (PPO tuning) | Good | Medium (skip-connection collapse) |
| Architectures found | NASNet | AmoebaNet | DARTS-cells |
8.5 Performance Predictors and Weight Sharing
Fully training each candidate is extremely expensive. Three major cost-reduction techniques:
- Performance Predictor: train a regressor on historical (architecture, accuracy) data and directly predict for new architectures → PNAS, Neural Predictor.
- Weight Sharing / One-Shot: one supernet contains all candidates; after training the supernet, all subnets inherit weights → ENAS, SPOS, ProxylessNAS.
- Early Stopping / Learning Curve Extrapolation: terminate low-potential training early → Hyperband / Successive Halving.
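Successive Halving, the kernel of the Hyperband-style early stopping above, can be sketched as follows (the `partial_eval(config, budget)` interface is an assumption for illustration):

```python
import numpy as np

def successive_halving(configs, partial_eval, budget=1, eta=3, rounds=3):
    """Successive Halving: evaluate all configs on a small budget,
    keep the top 1/eta, multiply the budget by eta, and repeat.

    partial_eval(config, budget) -> score after `budget` units of training.
    """
    for _ in range(rounds):
        scores = [partial_eval(c, budget) for c in configs]
        k = max(1, len(configs) // eta)
        keep = np.argsort(scores)[::-1][:k]      # indices of the best k
        configs = [configs[i] for i in keep]
        budget *= eta
    return configs[0]
```

Most of the total budget is thus spent on the few survivors, which is why Hyperband-style methods often save 5–10x over training every candidate to completion.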
8.6 ProxylessNAS and Hardware Awareness
ProxylessNAS (Cai et al. 2019) searches directly on the target hardware (phone/GPU/FPGA), incorporating latency into the loss: \(\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_1 \|w\|^2 + \lambda_2 \mathbb{E}[\mathrm{Latency}]\). Latency is obtained per submodule via lookup tables and is differentiable with respect to the architecture parameters. This idea drove the development of phone-side SOTA networks like MobileNetV3 and EfficientNet.
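The expected-latency term can be sketched with a per-op lookup table. The `LATENCY_MS` values below are made up for illustration; in practice the table is measured on the target device:

```python
import numpy as np

# Hypothetical per-op latencies (ms) measured on the target device,
# e.g. skip-connection, 3x3 conv, 5x5 conv, 7x7 conv.
LATENCY_MS = np.array([0.0, 1.2, 3.5, 7.8])

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def expected_latency(alpha):
    """E[Latency] for one edge: probability-weighted sum over candidate
    ops. Differentiable in alpha, so it can enter the training loss."""
    return float(softmax(alpha) @ LATENCY_MS)

def total_loss(ce_loss, alphas, lam2=0.1):
    """L = L_CE + lam2 * sum of expected edge latencies
    (the weight-decay term lambda_1 ||w||^2 is omitted here)."""
    return ce_loss + lam2 * sum(expected_latency(a) for a in alphas)
```

Pushing down `total_loss` thus trades a little cross-entropy for ops that are cheap on the lookup table, which is the mechanism behind hardware-aware search.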
9. Comparison of Industrial AutoML Frameworks
| Framework | Main algorithms | Strengths | Suitable scenarios |
|---|---|---|---|
| DEAP | GA/GP/ES (Python) | Flexible, textbook-style API | Research, teaching, custom EAs |
| pymoo | NSGA-II/III, MOEA/D | Multi-objective optimization | Engineering optimization, Pareto fronts |
| nevergrad (Facebook) | CMA-ES, TBPSA, bayes | Many black-box benchmarks | General black-box optimization |
| Optuna | TPE, CMA-ES, NSGA-II | Seamless sklearn/PyTorch, pruning | Industrial first choice for HPO |
| Hyperopt | TPE | Early classic | HPO, gradually being replaced by Optuna |
| Ray Tune | PBT, ASHA, BOHB, Optuna | Distributed + scheduling | Large-scale distributed HPO |
| NNI (Microsoft) | TPE, SMAC, PPO, Evolution, DARTS | NAS + HPO unified | NAS experiments |
| AutoGluon (AWS) | Model ensembling + meta-learning | One-click on tabular/image/text | End-to-end engineering deployment |
| AutoKeras | NAS + Bayesian | Friendly to Keras users | Quick onboarding |
| AutoSklearn | SMAC + meta-learning | sklearn-compatible | Small tabular data |
| TPOT | GP evolves sklearn pipelines | Pipeline-level evolution | Tabular ML pipeline optimization |
| AutoML-Zero | Evolves basic instructions | Evolves the algorithm itself from scratch | Research / extreme automation |
10. Limitations of AutoML
AutoML is no silver bullet. Three fundamental limitations:
10.1 Search Cost
- AmoebaNet: \(\sim\) 3000 GPU·days
- Early NAS-RL: 800 GPU·days
- DARTS / one-shot dramatically reduce cost, but it is still no "free lunch"
The carbon cost was quantified by Strubell et al. (2019): training a single large model with NAS emits roughly as much CO2 as five cars over their entire lifetimes.
10.2 Generalization and Transfer
- Architectures searched on CIFAR-10 are not necessarily optimal on ImageNet (although in practice transfer is usually effective).
- Architectures searched on ImageNet-1k are not necessarily optimal on medical imaging or satellite imagery.
- Proxy-task bias: using small datasets / few epochs as a proxy can produce preferences that misalign with the actual large-scale task.
10.3 Interpretability
- Architectures searched by evolution/RL often have "weird" connections (skip chains, irregular op combinations) that humans struggle to understand.
- HPO returns "\(\eta = 3.7 \times 10^{-4}\)" as the optimum without offering principled insight.
- This creates tension with the Symbolists' / Bayesians' pursuit of interpretability.
10.4 Fairness and Reproducibility
Lindauer & Hutter (2020) systematically pointed out: many NAS papers compare different search algorithms with different search spaces, making it impossible to attribute the gains. The arrival of benchmarks such as NAS-Bench-101/201/301 finally enabled fair comparisons of search algorithms themselves.
11. Engineering Practice Checklist
A checklist to walk through when deploying an AutoML system:
- [ ] Is the search space reasonable? Too large → wasteful; too small → unable to find good solutions.
- [ ] Correlation between proxy task and target task? Run transfer experiments.
- [ ] Compute budget cap? Constrain both wall-clock and GPU-days.
- [ ] Early-stopping strategy? Hyperband / ASHA usually saves 5–10×.
- [ ] One-shot? Weigh supernet training cost against subnet inheritance accuracy.
- [ ] Multi-objective (accuracy + latency + memory)? Use NSGA-II / Pareto.
- [ ] Meta-learning warm-start? Can speed up cold starts.
- [ ] Monitor the search process: top-\(k\) diversity, convergence curves, hyperparameter importance.
- [ ] Final validation: full training + held-out test set + multiple seeds.
- [ ] Reproducibility: fix seeds, log all hyperparameters, containerize.
12. Cross-references
- Parent page: this page, §1 (tribe entry point) — Overall positioning and algorithmic lineage of the Evolutionaries
- Same directory: Genetic Algorithms and Genetic Programming — GA/GP is the engine of AutoML's evolutionary branch
- Same directory: Evolution Strategies and Neuroevolution — ES/CMA-ES are important in HPO and NAS
- Deep learning: ../../../1_DeepLearning/ — Details of NAS-discovered networks like AmoebaNet/DARTS
- Bayesians: The Bayesians — TPE/SMAC/BO form the Bayesian camp of HPO
References
- Beni, G., & Wang, J. (1989). Swarm intelligence in cellular robotic systems. NATO Advanced Workshop on Robots and Biological Systems.
- Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. Proceedings of ICNN, 4, 1942–1948. — The original PSO paper.
- Clerc, M., & Kennedy, J. (2002). The particle swarm: explosion, stability, and convergence in a multidimensional complex space. IEEE TEC, 6(1), 58–73.
- Liang, J. J., Qin, A. K., Suganthan, P. N., & Baskar, S. (2006). Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE TEC, 10(3), 281–295.
- Dorigo, M. (1992). Optimization, Learning and Natural Algorithms. PhD thesis, Politecnico di Milano. — The original ACO PhD thesis.
- Dorigo, M., & Gambardella, L. M. (1997). Ant colony system: A cooperative learning approach to TSP. IEEE TEC, 1(1), 53–66.
- Stützle, T., & Hoos, H. H. (2000). MAX-MIN ant system. Future Generation Computer Systems, 16(8), 889–914.
- Karaboga, D. (2005). An idea based on honey bee swarm for numerical optimization. Technical Report, Erciyes University.
- Yang, X.-S. (2008). Nature-Inspired Metaheuristic Algorithms. Luniver Press. — The Firefly Algorithm.
- Yang, X.-S., & Deb, S. (2009). Cuckoo search via Lévy flights. World Congress on Nature & Biologically Inspired Computing.
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, 13, 281–305.
- Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NeurIPS.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2017). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18, 1–52.
- Falkner, S., Klein, A., & Hutter, F. (2018). BOHB: Robust and efficient hyperparameter optimization at scale. ICML.
- Zoph, B., & Le, Q. V. (2017). Neural architecture search with reinforcement learning. ICLR. — The original NAS-RL paper.
- Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2019). Regularized evolution for image classifier architecture search. AAAI. — AmoebaNet.
- Liu, H., Simonyan, K., & Yang, Y. (2019). DARTS: Differentiable architecture search. ICLR.
- Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. ICML. — ENAS.
- Cai, H., Zhu, L., & Han, S. (2019). ProxylessNAS: Direct neural architecture search on target task and hardware. ICLR.
- Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. ACL.
- Lindauer, M., & Hutter, F. (2020). Best practices for scientific research on neural architecture search. JMLR, 21, 1–18.
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. KDD.
- Olson, R. S., & Moore, J. H. (2016). TPOT: A tree-based pipeline optimization tool for automating machine learning. Automated Machine Learning Workshop.