Evolutionaries
"If nature can do it, so can we." — The methodological manifesto of the Evolutionaries
The Evolutionaries are one of the five tribes Domingos identifies in The Master Algorithm. Their core belief can be summarized as: intelligence is not designed, it is evolved. The biological brain is the product of 3.5 billion years of natural selection. Rather than hand-specifying priors or assuming an analytic form, why not define learning itself as a genetic search through hypothesis space—fitness as the pressure, variation as the source of innovation, selection as the filter—and let the answers "emerge automatically"?
Together with the Symbolists, Connectionists, Bayesians, and Analogizers, the Evolutionaries form Domingos's "five tribes of machine learning". This directory systematically organizes the algorithmic lineage, theoretical foundations, and modern revival of the Evolutionaries, and forms a cross-reference network with discussions of "heuristic search", "reinforcement learning", and "automated machine learning (AutoML)" elsewhere in these notes.
1. Tribal Profile
| Dimension | The Evolutionary Answer |
|---|---|
| Core metaphor | Natural selection, biological evolution |
| Nature of learning | Parallel stochastic search in hypothesis space |
| Knowledge carrier | Chromosome / genotype—bit string, real vector, tree, or graph |
| Master algorithm | Genetic Algorithm (GA) and its derivatives |
| Mathematical tools | Probability theory, combinatorial optimization, stochastic processes |
| Representative figures | John Holland, David Goldberg, John Koza, Hans-Paul Schwefel, Ingo Rechenberg, Kenneth Stanley |
| Representative algorithms | GA, GP, ES, CMA-ES, NEAT, PSO, ACO, NAS |
| Strengths | Black-box optimization, non-differentiable objectives, combinatorial spaces, multimodality |
| Weaknesses | Low sample efficiency, slow convergence, hyperparameter sensitivity |
| Modern flagship scenarios | Neural Architecture Search (NAS), hyperparameter optimization, robot morphology evolution, symbolic regression |
The most critical feature distinguishing the Evolutionaries from other tribes is: they do not require gradient information. This gives them a structural advantage on non-differentiable, discrete, deceptive, or noisy objectives.
2. The Master Algorithm: The Genetic Search Loop
The skeleton of the Evolutionary "master algorithm" can be expressed as a minimal loop diagram—every concrete method (GA, ES, GP, NEAT, etc.) is a specialization of this skeleton along three axes: encoding, variation operator, and selection pressure.
```mermaid
flowchart TD
    A["Initialize population P_0"] --> B["Evaluate fitness f(x_i)"]
    B --> C{"Terminate?"}
    C -- No --> D["Parent selection"]
    D --> E["Crossover & mutation"]
    E --> F["Generate offspring"]
    F --> G["Survivor selection"]
    G --> H["Elitism"]
    H --> B
    C -- Yes --> I["Return best individual"]
```
Formally, given a hypothesis space \(\mathcal{H}\) and a fitness function \(f: \mathcal{H} \to \mathbb{R}\), an evolutionary algorithm maintains a population \(P_t = \{x_1, \dots, x_N\} \subset \mathcal{H}\) and iterates

\[
P_{t+1} = \mathcal{S}_{\text{env}}\big(P_t \cup \mathcal{V}(\mathcal{S}_{\text{par}}(P_t))\big),
\]

where \(\mathcal{S}_{\text{par}}\) is parent selection, \(\mathcal{V}\) is the variation/recombination operator, and \(\mathcal{S}_{\text{env}}\) is environmental (survivor) selection. The whole loop can be viewed as a generalized EM process: selection is the E step (inferring which hypotheses are promising), and variation is the M step (generating new hypotheses near the promising regions).
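This skeleton can be sketched as a higher-order function whose three parameters are exactly the three axes of specialization—parent selection, variation, and survivor selection. A minimal sketch (function names and the OneMax instantiation are ours, not from the book):

```python
import random
from typing import Callable, List

def evolve(init: Callable[[], list],
           fitness: Callable[[list], float],
           select_parent: Callable[[List[list]], list],
           vary: Callable[[list, list], list],
           select_survivors: Callable[[List[list], List[list]], List[list]],
           pop_size: int = 50,
           generations: int = 100) -> list:
    """Generic EA skeleton: P_{t+1} = S_env(P_t U V(S_par(P_t)))."""
    pop = [init() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [vary(select_parent(pop), select_parent(pop))
                     for _ in range(pop_size)]
        pop = select_survivors(pop, offspring)
    return max(pop, key=fitness)

# Instantiate the three axes for OneMax (maximize the number of 1-bits).
random.seed(42)
L = 20
f = lambda c: sum(c)
init = lambda: [random.randint(0, 1) for _ in range(L)]
tournament = lambda pop: max(random.sample(pop, 2), key=f)

def vary(p1, p2):
    cut = random.randint(1, L - 1)                  # 1-point crossover
    child = p1[:cut] + p2[cut:]
    return [1 - g if random.random() < 1 / L else g for g in child]  # bit-flip

survivors = lambda pop, off: sorted(pop + off, key=f, reverse=True)[:len(pop)]

best = evolve(init, f, tournament, vary, survivors)
```

Swapping in a tree encoding with subtree crossover yields GP; swapping in Gaussian self-adaptive mutation yields an ES—only the three parameters change, never the loop.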
3. Algorithmic Lineage
The diagram below shows the family relationships among Evolutionary algorithms. Note that this is not a strict timeline but rather an idea-inheritance graph—some branches (e.g., swarm intelligence) developed in parallel with the main GA/ES trunk and borrowed mutually.
```mermaid
flowchart LR
    GA[GA<br/>Holland 1975] --> GP[GP<br/>Koza 1992]
    GA --> NSGA[NSGA-II<br/>Deb 2002]
    GA --> MEM[Memetic Algorithms]
    ES[ES<br/>Rechenberg 1973] --> CMA[CMA-ES<br/>Hansen 2001]
    CMA --> NES[NES / xNES]
    NES --> OAIES[OpenAI ES<br/>Salimans 2017]
    GA --> NEAT[NEAT<br/>Stanley 2002]
    NEAT --> HYP[HyperNEAT<br/>Stanley 2009]
    HYP --> ESHYP[ES-HyperNEAT]
    PSO[PSO<br/>Kennedy 1995] --> SI[Swarm Intelligence SI]
    ACO[ACO<br/>Dorigo 1992] --> SI
    SI --> MOO[Multimodal / Multi-objective]
    NEAT --> NE[Neuroevolution NE]
    OAIES --> NE
    NE --> NAS[Neural Architecture Search NAS]
    GA --> AML[AutoML / HPO]
    NAS --> AML
```
Major branches:
- GA family: Originating with Holland, emphasizing bit-string encoding and the schema theorem; Goldberg's 1989 textbook engineered and popularized it.
- GP family: Koza took "programs" as evolvable objects, with structured variation acting on abstract syntax trees.
- ES family: Rechenberg and Schwefel's roots in fluid-mechanics optimization, emphasizing real-valued variables and adaptive Gaussian perturbations; CMA-ES is the modern champion of this branch.
- Swarm intelligence: PSO, ACO, ABC, FA, CS, and others, drawing on the collective behavior of bird flocks, ant colonies, and bee swarms.
- Neuroevolution: Searching neural network weights and topology with evolutionary methods; NEAT/HyperNEAT are the milestones.
- AutoML / NAS: The revival of evolutionary thinking in the deep-learning era, forming—together with RL and Bayesian optimization—the three major methodologies of NAS.
4. Comparison with Gradient Optimization
Evolutionary algorithms are often misread as a "degenerate version of gradient descent". In fact they are structurally different.
| Dimension | Gradient Optimization (SGD/Adam) | Evolutionary Algorithm (EA) |
|---|---|---|
| Gradient required? | Must be differentiable | Not needed at all |
| Signal type | First-/second-order analytic gradient | Scalar fitness |
| Black-box friendliness | Requires analytic (white-box) access to the objective | Naturally black-box-friendly |
| Parallelism | Limited by mini-batch | Extreme (entire population evaluated independently) |
| Communication cost | Gradient tensors | Just seeds or a few scalars |
| Sample efficiency | High | Low (typically \(10^4 \sim 10^6\) evaluations) |
| Local minima | Easy to get stuck | Population diversity provides escape |
| Hyperparameter sensitivity | Learning rate, momentum, etc. | Population size, mutation rate, selection pressure |
| Suitable scenarios | Differentiable losses, large-scale supervised learning | Non-differentiable, discrete, combinatorial, black-box RL |
| Convergence guarantees | Bounded in convex case | Asymptotic only (Markov-chain stationary distribution) |
Rule of thumb: when the objective is differentiable, high-dimensional, and data is abundant, use SGD; when the objective is black-box, rewards are sparse, large-scale parallel CPUs are available, and robustness matters, evolutionary algorithms remain a competitive choice (Salimans et al. 2017 demonstrated this on Atari).
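The "scalar fitness, no gradients" column of the table can be made concrete with the antithetic ES gradient estimator in the spirit of Salimans et al. (2017). The sketch below is a toy version on a quadratic (function and variable names are ours; the real system distributes the \(\varepsilon_i\) draws across workers and shares only seeds):

```python
import random

def es_step(theta, f, sigma=0.1, lr=0.05, n_pairs=50, rng=random):
    """One antithetic evolution-strategies step:
    g ~= (1 / (2 n sigma)) * sum_i [f(theta + sigma*eps_i) - f(theta - sigma*eps_i)] * eps_i
    Only scalar fitness values are used—no gradients of f."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(n_pairs):
        eps = [rng.gauss(0, 1) for _ in range(d)]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        grad = [g + (plus - minus) * e for g, e in zip(grad, eps)]
    scale = lr / (2 * n_pairs * sigma)
    return [t + scale * g for t, g in zip(theta, grad)]  # ascend fitness

# Toy objective: maximize f(x) = -||x||^2, optimum at the origin.
random.seed(0)
f = lambda x: -sum(v * v for v in x)
theta = [1.0] * 5
for _ in range(300):
    theta = es_step(theta, f)
```

Because each worker only needs to report a scalar fitness (and the noise can be reconstructed from a shared seed), the per-step communication cost is a few bytes per worker—the "just seeds or a few scalars" row of the table.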
5. The Modern Revival: From Niche to Mainstream
Evolutionary algorithms went through a relatively quiet two decades after 1990, as SVMs, boosting, and deep learning took the stage in succession. But after 2015, they returned in force in three directions.
The first wave came from Neural Architecture Search (NAS). In the AmoebaNet line of work (2017–2019), Real et al. used large-scale evolutionary search to discover network architectures on ImageNet that surpassed hand-designed ones; the follow-up Regularized Evolution (Real 2019) replaced traditional elitism with "age regularization" and was shown to be more stable and efficient than the contemporaneous RL-based NAS-RL (Zoph & Le 2017). The second wave was OpenAI ES (Salimans et al. 2017): treating evolution strategies as a "scalable alternative to reinforcement learning", they ran Atari and MuJoCo on 1,440 CPUs and achieved scores comparable to A3C with shorter training time and near-zero parallel-communication overhead. The third wave is the industrialization of AutoML: frameworks such as Optuna, NNI, AutoGluon, DEAP, and pymoo packaged evolution together with Bayesian optimization, Hyperband, and other methods into out-of-the-box services, enabling non-experts to use EAs for hyperparameter tuning.
The common driver behind all three waves was hardware—cheap CPU clusters turned "evaluating thousands of individuals" from a luxury into a routine operation. The Evolutionaries thus shed the label of "theoretical toy" and once again became part of the engineering practice of machine learning.
6. Subpage Navigation
This directory drills down along three themes:
- Genetic Algorithms and Genetic Programming — The full GA pipeline, encoding schemes, selection/crossover/mutation operators, the Schema theorem, GP tree evolution, NSGA-II for multi-objective optimization, and classical case studies
- Evolution Strategies and Neuroevolution — ES origins, the full CMA-ES algorithm, OpenAI ES, NEAT/HyperNEAT, and a neuroevolution-vs-RL decision tree
- Swarm Intelligence and AutoML — PSO, ACO, ABC/FA/CS, an AutoML survey, comparison of HPO methods, the three NAS strategies, and industrial frameworks
7. Division of Labor with Existing Pages
This directory overlaps in content with the §evolutionary section of ../../02_Symbolic_AI/1_search.md, but differs in both perspective and depth:
| Dimension | 02_Symbolic_AI/1_search.md | This directory |
|---|---|---|
| Context | A branch of classical search algorithms | The overall philosophy and algorithmic ecosystem of the Evolutionaries |
| GA introduction | Shallow (pipeline + a single example) | Full pipeline, operator details, derivation of the Schema theorem |
| Algorithm coverage | Only the GA mainline | GA / GP / ES / CMA-ES / NEAT / PSO / ACO / NAS |
| Comparison axis | Compared with A* / simulated annealing | Compared with gradient optimization, Bayesian optimization, RL |
| Modern applications | Briefly mentioned | Detailed narrative of OpenAI ES, AmoebaNet, DARTS |
Reading suggestion: if you only want to know "what is a GA", 1_search.md is enough; if you intend to apply evolutionary methods in engineering, or want to understand their place in the deep-learning era, start with this directory.
References
- Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. — The original book that defines the five tribes; Chapter 5 is devoted to the Evolutionaries.
- Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press. — The foundational work on GA and the schema theorem.
- Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. — The engineering textbook for GA.
- Eiben, A. E., & Smith, J. E. (2015). Introduction to Evolutionary Computing (2nd ed.). Springer. — A modern textbook on evolutionary computation, covering the full GA/ES/GP/SI spectrum.
- Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
- Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772.
- Salimans, T., Ho, J., Chen, X., et al. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864.
- Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127.
- Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2019). Regularized evolution for image classifier architecture search. AAAI.
Genetic Algorithms and Genetic Programming
This page goes deep into the algorithmic details, theoretical foundations, and engineering practice of Genetic Algorithms (GA) and Genetic Programming (GP). Together they form the "trunk" of the Evolutionaries—GA provides a general bit-string/real-vector chromosomal evolution framework, and GP generalizes the evolved object from vectors to program structures themselves.
Before reading this page, we recommend looking at §1 of the directory page above (the tribe's entry point) for the overall positioning of the Evolutionaries; this page goes deeper and is more technical than the §evolutionary section in ../../02_Symbolic_AI/1_search.md.
1. The Complete GA Pipeline
GA was formally proposed by John Holland in Adaptation in Natural and Artificial Systems (1975). Its core idea: encode candidate solutions as "chromosomes" and search in parallel through solution space using a process that mimics natural selection.
1.1 Formal Definition
Given:

- a hypothesis space \(\mathcal{H}\) (the solution space),
- an encoding function \(\phi: \mathcal{H} \to \mathcal{G}\) that maps solutions into the genotype space \(\mathcal{G}\),
- a fitness function \(f: \mathcal{H} \to \mathbb{R}\) (larger is better),
GA maintains a population \(P_t = \{g_1, \dots, g_N\} \subset \mathcal{G}\), with size \(N\) typically in \(50 \sim 500\). Each generation executes:
```mermaid
flowchart TD
    A["Initialize population P_0<br/>random or heuristic"] --> B["Decode & evaluate<br/>f(x_i) = f(phi^-1(g_i))"]
    B --> C{"Terminate?<br/>max gen / fitness target / no improvement"}
    C -- No --> D["Parent selection<br/>roulette / tournament / rank"]
    D --> E["Crossover<br/>1-point / 2-point / uniform / SBX"]
    E --> F["Mutation<br/>bit-flip / Gaussian / polynomial"]
    F --> G["Evaluate offspring"]
    G --> H["Survivor selection<br/>generational / steady-state"]
    H --> I["Elitism<br/>top-k pass directly to next generation"]
    I --> B
    C -- Yes --> J["Return best-so-far"]
```
1.2 Relation to General Search
A GA can be viewed as a stochastic beam search with a population, but with three essential differences:
- Multi-solution parallelism: \(N\) candidates are maintained in parallel, not a single frontier.
- Recombination: the crossover operator mixes information from two parents—a feature absent from other heuristic searches like simulated annealing or hill climbing.
- Implicit parallelism: by the schema theorem, each generation effectively evaluates \(\mathcal{O}(N^3)\) schemata simultaneously (see §7).
2. Encoding Schemes
Encoding (representation) is the most critical decision in GA design. The wrong encoding renders all operators ineffective.
| Encoding | Chromosome form | Typical use case | Matching crossover/mutation |
|---|---|---|---|
| Binary | \(\{0, 1\}^l\) | Subset selection, Boolean rules, combinatorial optimization | bit-flip mutation, 1-point/uniform crossover |
| Real-valued | \(\mathbb{R}^d\) | Continuous function optimization, neural network weights | Gaussian mutation, BLX-α / SBX |
| Permutation | A permutation in \(S_n\) | TSP, scheduling, assembly sequences | order crossover (OX), PMX, swap mutation |
| Tree | Abstract syntax tree | GP program evolution, symbolic regression | subtree crossover, point mutation |
| Graph | Nodes + edges | Neural network topology (NEAT), circuits | NEAT-style add-node / add-link |
| Gray code | \(\{0,1\}^l\) | Numerical optimization, avoiding the Hamming cliff | Same as binary |
Hamming cliff: under standard binary encoding, adjacent integers (e.g., 7 = 0111 and 8 = 1000) can have a large Hamming distance, and a single bit-flip cannot bridge them. Gray code guarantees a Hamming distance of exactly 1 between adjacent integers and is a common trick for numerical optimization.
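The cliff and its Gray-code fix are easy to verify directly. A minimal sketch (helper names are ours):

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")

cliff = hamming(7, 8)                         # 0111 vs 1000: all four bits differ
gray_step = hamming(to_gray(7), to_gray(8))   # adjacent Gray codes differ by one bit
```

Under Gray encoding every ±1 step in the integer domain is reachable by a single bit-flip mutation, which restores the locality principle below.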
Encoding design principles:
- Completeness: every legal solution has at least one encoding.
- Non-redundancy: encoding and solution should be roughly one-to-one (avoiding many equivalent genotypes).
- Locality: small changes in genotype should correspond to small changes in phenotype—otherwise mutation degenerates into shuffling.
3. Selection Operators
The selection operator determines the selection pressure—the reproductive advantage of high-fitness individuals over low-fitness ones. Pressure that is too low (near-random search) yields slow convergence; pressure that is too high (only the best gets copied) leads to premature convergence.
3.1 Roulette Wheel
Each individual is selected with probability proportional to its fitness:

\[
p_i = \frac{f_i}{\sum_{j=1}^N f_j}
\]

Issues: collapses when \(f\) takes negative values or has wildly different scales; pressure is hard to control under fitness scale drift.
3.2 Tournament Selection
Randomly draw \(k\) individuals and pick the one with the highest fitness as parent. The tournament size \(k\) directly controls selection pressure:
- \(k=1\): pure random
- \(k=2\): mild pressure, the most common choice
- \(k=N\): always picks the best (degenerates to truncation selection)
Takeover time (a standard measure of selection pressure): in the limit of no mutation and no crossover, \(k\)-tournament selection drives the entire population to the best individual in \(\mathcal{O}(\log N / \log k)\) generations.
3.3 Rank-based
Probabilities are assigned by fitness rank, independent of the actual \(f\) values. Linear ranking:

\[
p_i = \frac{1}{N}\Big[\eta^- + (\eta^+ - \eta^-)\frac{i-1}{N-1}\Big]
\]

where \(i\) is the rank (1 worst, \(N\) best), \(\eta^+ \in [1, 2]\) controls pressure, and \(\eta^+ + \eta^- = 2\).
3.4 Boltzmann Selection
Mimics simulated annealing, with temperature \(T\) decreasing over generations:

\[
p_i \propto \exp(f_i / T)
\]

Large \(T\) approaches uniform selection; \(T \to 0\) degenerates to greedy. Combined with simulated annealing this gives the GASA family.
| Selection operator | Pressure tunability | Computational cost | Sensitive to fitness scale |
|---|---|---|---|
| Roulette | Hard to tune | \(\mathcal{O}(N)\) | Highly sensitive |
| Tournament | Easy to tune (\(k\)) | \(\mathcal{O}(k)\) | Insensitive |
| Rank | Easy to tune (\(\eta^+\)) | \(\mathcal{O}(N \log N)\) for sort | Completely insensitive |
| Boltzmann | Schedulable (\(T\)) | \(\mathcal{O}(N)\) | Sensitive (but \(T\) helps) |
Practical recommendation: default to tournament (\(k=2\) or \(k=3\))—pressure is controllable, robust to fitness scale, and easy to implement.
4. Crossover Operators
Crossover (recombination) is the core innovation that distinguishes GA from other random searches: by mixing two parents' genes, "good components (building blocks)" can combine to produce a better child.
4.1 1-point Crossover
Randomly choose a cut point \(c \in \{1, \dots, l-1\}\); child 1 = \(p_1[1:c] + p_2[c+1:l]\), and child 2 the reverse. Issue: genes at the head and tail of the chromosome are harder to separate (positional bias); genes far apart are unlikely to be inherited together.
4.2 2-point Crossover
Two cut points are chosen and the middle segment is exchanged. This mitigates positional bias and is one of the engineering defaults.
4.3 Uniform Crossover
Each position is independently inherited from \(p_1\) or \(p_2\) with probability \(p\) (often \(p=0.5\)).
- Pro: completely eliminates positional bias.
- Con: highly destructive to long building blocks—only suitable for problems with strong gene independence.
4.4 Simulated Binary Crossover (SBX)
The de facto standard for real-valued GAs (Deb & Agrawal 1995). Given two parents \(x_1, x_2\), the offspring are:

\[
c_{1,2} = 0.5\,\big[(1 \mp \beta_q)\, x_1 + (1 \pm \beta_q)\, x_2\big]
\]

where \(\beta_q\) is determined by the distribution parameter \(\eta_c\) (typically \(\eta_c \in [2, 5]\)):

\[
\beta_q = \begin{cases}(2u)^{1/(\eta_c+1)} & u \le 0.5 \\ \big(1/(2(1-u))\big)^{1/(\eta_c+1)} & u > 0.5\end{cases}, \quad u \sim \mathcal{U}[0,1]
\]

Key property of SBX: the offspring distribution is centered on the parents, and larger \(\eta_c\) pulls offspring closer to the parents—mimicking the "locality" of binary 1-point crossover in the real-valued domain.
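The formulas above translate almost line for line into code. A single-gene sketch (function name is ours; real implementations apply it per dimension with a per-gene crossover probability):

```python
import random

def sbx_pair(x1: float, x2: float, eta_c: float = 2.0, rng=random):
    """Simulated binary crossover (Deb & Agrawal 1995) for one real gene."""
    u = rng.random()
    if u <= 0.5:
        beta = (2 * u) ** (1 / (eta_c + 1))
    else:
        beta = (1 / (2 * (1 - u))) ** (1 / (eta_c + 1))
    c1 = 0.5 * ((1 - beta) * x1 + (1 + beta) * x2)
    c2 = 0.5 * ((1 + beta) * x1 + (1 - beta) * x2)
    return c1, c2

random.seed(0)
c1, c2 = sbx_pair(1.0, 3.0)
```

Note the mean-preservation property: whatever \(\beta_q\) is drawn, \(c_1 + c_2 = x_1 + x_2\), so crossover explores around the parents without drifting their center of mass.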
5. Mutation Operators
Mutation provides exploration—pushing the population into regions not yet covered, preventing premature convergence.
5.1 Bit-flip (binary)
Each bit is flipped independently with probability \(p_m\) (typically \(1/l\), so the expected number of flips per chromosome is 1).
5.2 Gaussian Mutation (real-valued)
\(x_i \leftarrow x_i + \mathcal{N}(0, \sigma_i^2)\). \(\sigma_i\) may be globally shared, per-dimension independent, or self-adaptive (this transitions into ES; see Evolution Strategies and Neuroevolution).
5.3 Polynomial Mutation
The standard mutation for real-valued GAs proposed by Deb. Given bounds \([x^L, x^U]\) and a distribution parameter \(\eta_m\) (typically \(20\)):

\[
x_i' = x_i + (x^U - x^L)\,\delta, \quad \delta = \begin{cases}(2u)^{1/(\eta_m+1)} - 1 & u < 0.5 \\ 1 - \big(2(1-u)\big)^{1/(\eta_m+1)} & u \ge 0.5\end{cases}
\]

Larger \(\eta_m\) makes mutation more conservative.
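A direct sketch of the formula (the final clipping to \([x^L, x^U]\) is a common practical addition, not part of the formula itself):

```python
import random

def polynomial_mutation(x: float, low: float, up: float,
                        eta_m: float = 20.0, rng=random) -> float:
    """Deb's polynomial mutation for one bounded real gene."""
    u = rng.random()
    if u < 0.5:
        delta = (2 * u) ** (1 / (eta_m + 1)) - 1          # delta in (-1, 0)
    else:
        delta = 1 - (2 * (1 - u)) ** (1 / (eta_m + 1))    # delta in [0, 1)
    # Clip to the bounds—a common engineering addition.
    return min(max(x + (up - low) * delta, low), up)

random.seed(0)
samples = [polynomial_mutation(0.5, 0.0, 1.0) for _ in range(1000)]
```

With \(\eta_m = 20\) most samples land very close to the original value—the "conservative" behavior the text describes.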
5.4 Self-adaptive Mutation
Record each individual's mutation rate \(p_m^{(i)}\) as an extra gene that itself evolves—successful \(p_m\) values get selected and preserved. This is the bridge connecting GA to ES.
6. Elitism and Diversity Preservation
6.1 Elitism
Each generation directly copies the top-\(k\) individuals to the next, ensuring best-so-far is monotonically non-decreasing. \(k\) is usually \(1\% \sim 10\%\) of the population. Too large \(k\) destroys diversity; too small fails to protect good solutions.
6.2 The Diversity Crisis
Without any safeguard, GAs tend toward premature convergence—the population rapidly collapses around the first local optimum found. Two mainstream remedies:
Fitness sharing (Goldberg & Richardson 1987): nearby individuals "share" fitness with each other, simulating ecological niche competition.

\[
f'_i = \frac{f_i}{\sum_{j=1}^N \mathrm{sh}(d_{ij})}, \quad \mathrm{sh}(d) = \max\Big(0,\ 1 - (d/\sigma_{\text{share}})^\alpha\Big)
\]
Crowding (De Jong 1975): offspring compete for survival only with the parent most similar to them, thereby preserving different niches.
Island model: split the population into \(K\) isolated subpopulations with occasional migration. Naturally suited to distributed implementations.
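The fitness-sharing formula above can be sketched in a few lines (function name is ours; `dist` is any user-supplied genotype or phenotype distance):

```python
def shared_fitness(fit, dist, sigma_share=1.0, alpha=1.0):
    """Fitness sharing (Goldberg & Richardson 1987): divide each raw fitness
    by its niche count  sum_j sh(d_ij),  sh(d) = max(0, 1 - (d/sigma)^alpha)."""
    n = len(fit)
    shared = []
    for i in range(n):
        niche = sum(max(0.0, 1.0 - (dist(i, j) / sigma_share) ** alpha)
                    for j in range(n))  # includes sh(0) = 1 for j == i
        shared.append(fit[i] / niche)
    return shared

# Three individuals crowd one peak, one sits alone: the loner keeps its fitness.
pos = [0.0, 0.0, 0.0, 5.0]
out = shared_fitness([1.0, 1.0, 1.0, 1.0], lambda i, j: abs(pos[i] - pos[j]))
```

The crowded individuals each keep only a third of their raw fitness while the isolated one keeps all of it—exactly the niche-competition pressure that keeps multiple peaks populated.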
7. The Schema Theorem (Holland 1975)
The schema theorem is GA's only classical theoretical guarantee. It formalizes GA's "implicit parallelism".
7.1 Definition of a Schema
A schema (pattern) is a string over the alphabet \(\{0, 1, *\}^l\), where \(*\) is a wildcard. For example, \(H = 1\!*\!*\!0\!*\) denotes the set of all bit strings whose first bit is 1 and whose fourth bit is 0 (containing \(2^3 = 8\) strings when \(l=5\)).
Two key attributes:
- Order \(o(H)\): the number of fixed (non-\(*\)) positions in \(H\). E.g., \(o(1\!*\!*\!0\!*) = 2\).
- Defining length \(\delta(H)\): the distance between the first and the last fixed positions. E.g., \(\delta(1\!*\!*\!0\!*) = 4 - 1 = 3\).
7.2 Statement of the Theorem
Let \(m(H, t)\) be the number of individuals in generation \(t\) that match schema \(H\), \(f(H)\) their average fitness, \(\bar f\) the population average fitness, \(p_c\) and \(p_m\) the crossover and mutation probabilities, and \(l\) the chromosome length. Under roulette-wheel selection + 1-point crossover + bit-flip mutation:

\[
\mathbb{E}\,[m(H, t+1)] \;\ge\; m(H, t)\cdot\frac{f(H)}{\bar f}\cdot\Big[1 - p_c \frac{\delta(H)}{l-1}\Big]\cdot\big[1 - o(H)\, p_m\big]
\]
7.3 Term-by-term Interpretation
- \(\frac{f(H)}{\bar f}\): the selection-pressure term. If schema \(H\) is above average (\(f(H) > \bar f\)), it is amplified by selection.
- \(1 - p_c \frac{\delta(H)}{l-1}\): the crossover-disruption term. The wider the schema span \(\delta(H)\), the more easily it is cut and disrupted by single-point crossover.
- \(1 - o(H) p_m\): the mutation-disruption term. The higher the schema order \(o(H)\), the more fixed positions are exposed to mutation.
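Plugging illustrative numbers (ours, not from the text) into the bound shows how a wide defining length can overwhelm a fitness advantage:

```python
# Illustrative numbers for the schema H = 1**0* from §7.1:
l = 5            # chromosome length
delta_H = 3      # defining length of H
o_H = 2          # order of H
m_t = 20         # copies of H in the current generation
ratio = 1.2      # f(H) / f-bar: H is 20% above average
p_c, p_m = 0.9, 0.01

bound = m_t * ratio * (1 - p_c * delta_H / (l - 1)) * (1 - o_H * p_m)
# Selection alone would grow H to 24 expected copies, but the wide defining
# length (3 out of 4 possible cut points hit H) lets 1-point crossover
# disrupt it often: the guaranteed lower bound drops to about 7.6 copies.
```

This is the linkage problem of §7.5 in miniature: the same schema with its fixed bits adjacent (\(\delta(H) = 1\)) would keep a bound of about 18 copies.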
7.4 Corollary: The Building-Block Hypothesis
Short (small \(\delta\)), low-order (small \(o\)), above-average-fitness schemata—called building blocks—grow at an exponential rate. GA constructs increasingly larger good solutions by recombining building blocks. This is the explanation Holland and Goldberg gave for "why GA works".
7.5 The Linkage Problem
The building-block hypothesis carries an implicit assumption: related genes must be placed close together on the chromosome. Otherwise the crossover probability \(p_c \delta(H)/(l-1)\) will repeatedly disrupt them. This is the linkage problem.
Solutions:
- Hand-engineered encoding: domain knowledge places related genes close together.
- mGA / fmGA (messy GA): let the GA learn the linkage structure on its own.
- EDA (Estimation of Distribution Algorithms): skip explicit crossover; directly learn a probabilistic model of the population (BOA, UMDA, etc.) and sample new generations from the distribution.
The schema theorem gives a necessary, not sufficient condition. It explains "under what conditions GA can work" but does not guarantee that GA converges to the global optimum. This differs in nature from the convex-optimization convergence proofs for SGD.
8. Genetic Programming (GP)
Genetic Programming (GP) was proposed by John Koza in 1992. It generalizes the evolved object from bit strings/real vectors to computer programs—specifically, the abstract syntax trees (ASTs) that represent programs.
8.1 Function Set and Terminal Set
A GP program consists of two types of nodes:
| Node type | Role | Examples |
|---|---|---|
| Function set | Internal nodes | \(\{+, -, \times, /, \sin, \cos, \mathrm{if}, \mathrm{and}\}\) |
| Terminal set | Leaf nodes | \(\{x_1, x_2, \dots, c_1, c_2, \dots\}\) |
Closure property: every function in the function set must be defined for all outputs of terminals and other functions—e.g., \(/\) must handle division by zero (commonly via protected division: \(x/y = 1\) if \(y=0\)).
Sufficiency property: the combinatorial space spanned by the function set and terminal set must contain the target program.
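The protected-division convention mentioned above is a one-liner (function name is ours):

```python
def pdiv(x: float, y: float) -> float:
    """Protected division: total over all inputs, returning 1.0 when y == 0,
    so that '/' satisfies the closure property."""
    return x / y if y != 0 else 1.0
```

Every function in the set gets the same treatment where needed (e.g., a protected `log` on non-positive inputs), so that any randomly assembled tree evaluates without runtime errors.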
8.2 Tree Crossover and Mutation
```mermaid
flowchart LR
    subgraph P1["Parent 1: x*x + sin(x)"]
        A1["+"] --> B1["*"]
        A1 --> C1["sin"]
        B1 --> D1["x"]
        B1 --> E1["x"]
        C1 --> F1["x"]
    end
    subgraph P2["Parent 2: (x+1) / 2"]
        A2["/"] --> B2["+"]
        A2 --> C2["2"]
        B2 --> D2["x"]
        B2 --> E2["1"]
    end
    subgraph C["Offspring (swap sin subtree with x+1 subtree)"]
        AC["+"] --> BC["*"]
        AC --> CC["+"]
        BC --> DC["x"]
        BC --> EC["x"]
        CC --> FC["x"]
        CC --> GC["1"]
    end
```
Subtree crossover: randomly pick a subtree from each parent and swap them. This is the standard recombination operator in GP.
Point mutation: replace a node with another function of the same arity; or replace a leaf with another terminal of the same type.
Subtree mutation: delete a subtree and grow a new random subtree in its place.
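Subtree crossover is straightforward once trees are concrete data. A minimal sketch using nested lists as ASTs (representation and helper names are ours; real GP systems also enforce depth limits here):

```python
import random

# GP trees as nested lists: [op, child, ...]; leaves are plain strings.
def subtrees(tree, path=()):
    """Yield (path, node) for every node, depth-first."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    out = list(tree)
    out[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return out

def subtree_crossover(t1, t2, rng=random):
    """Standard GP recombination: swap one random subtree of each parent."""
    path1, sub1 = rng.choice(list(subtrees(t1)))
    path2, sub2 = rng.choice(list(subtrees(t2)))
    return replace_at(t1, path1, sub2), replace_at(t2, path2, sub1)

random.seed(3)
parent1 = ['+', ['*', 'x', 'x'], ['sin', 'x']]   # x*x + sin(x)
parent2 = ['/', ['+', 'x', '1'], '2']            # (x+1) / 2
child1, child2 = subtree_crossover(parent1, parent2)
```

Because the swap moves whole subtrees, every offspring is again a syntactically valid program—the closure property does the rest at evaluation time.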
8.3 Bloat and Its Control
Bloat is the most thorny phenomenon in GP: tree size grows unboundedly across generations while fitness fails to improve correspondingly. One cause is "neutral mutation"—code that does not affect output (introns) shields against crossover and is therefore selectively retained.
Control techniques:
- Depth limit: hard cap on tree depth. Crude but effective.
- Parsimony pressure: add a complexity penalty to fitness, \(f' = f - \alpha \cdot \text{size}(\text{tree})\).
- Tarpeian method: kill individuals exceeding a threshold with some probability.
- Operator equalization: maintain fixed quotas across size bins.
8.4 Classical Applications
- Symbolic regression: given data \(\{(x_i, y_i)\}\), evolve a function \(f\) that minimizes \(\sum (f(x_i) - y_i)^2\). Eureqa / PySR are modern representatives.
- Classification rule evolution: evolve decision lists or if-then rules.
- Circuit design: Koza's group used GP to reinvent several patented analog circuits (low-pass filter, amplifier topologies)—termed "human-competitive results".
- Game strategies: evolve heuristics for board games / poker strategies.
9. Multi-objective Evolutionary Algorithms: NSGA-II
Real-world problems often have conflicting objectives (accuracy vs. model size, throughput vs. cost). Multi-objective optimization (MOO) does not seek a single best solution but the Pareto front.
Pareto dominance: solution \(a\) dominates \(b\) iff \(a\) is no worse than \(b\) on all objectives and strictly better on at least one. The Pareto front is the set of all solutions not dominated by any other solution.
NSGA-II (Deb et al. 2002) is one of the most highly cited algorithms in MOO. Its core mechanisms:
- Fast non-dominated sorting: stratify the population by dominance (\(F_1\) is the current Pareto front, \(F_2\) is the front after removing \(F_1\), and so on). Complexity \(\mathcal{O}(MN^2)\).
- Crowding distance: within each layer, sort by neighborhood sparsity to encourage uniform spread along the front.
- Selection: compare layer first (lower is better); within the same layer, compare by crowding distance (larger is better).
- Survivor selection: combine parents and offspring, then take the top \(N\).
NSGA-II is widely used in NAS, hyperparameter optimization, and machine-learning model trade-offs (e.g., NSGA-Net directly searches the Pareto front of model families).
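Pareto dominance and front stratification can be sketched directly from the definitions above. This is the naive version (re-scanning survivors each pass), not NSGA-II's bookkeeping-optimized \(\mathcal{O}(MN^2)\) fast sort, and it omits crowding distance:

```python
def dominates(a, b):
    """Pareto dominance (minimization): a is no worse on every objective
    and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Stratify points into fronts F1, F2, ... (as sorted index lists)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

# Two objectives, both minimized (e.g. error and model size).
pts = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
fronts = non_dominated_sort(pts)
```

Here the three mutually non-dominated points form \(F_1\); \((3, 3)\) is dominated by \((2, 2)\) and lands in \(F_2\); \((5, 5)\) is dominated by everything and lands in \(F_3\).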
10. Examples
10.1 0/1 Knapsack
Item set \(\{(w_i, v_i)\}_{i=1}^n\), capacity \(W\).
- Encoding: binary string \(x \in \{0,1\}^n\), with \(x_i=1\) meaning item \(i\) is packed.
- Fitness: \(f(x) = \sum v_i x_i\) if \(\sum w_i x_i \le W\), otherwise heavily penalize \(f = -\infty\) or \(f = -\sum w_i x_i\).
- Operators: 1-point crossover + bit-flip.
- Typical performance: 100 items, 200 population, 500 generations can approach 95%+ of the optimum.
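The fitness rule above, including the milder weight-based penalty for infeasible packings, is a few lines (item data below is illustrative):

```python
def knapsack_fitness(x, weights, values, capacity):
    """Penalty-style 0/1 knapsack fitness from the text: total value when
    feasible, negative total weight when capacity is exceeded."""
    w = sum(wi for wi, xi in zip(weights, x) if xi)
    v = sum(vi for vi, xi in zip(values, x) if xi)
    return v if w <= capacity else -w

weights, values, capacity = [2, 3, 4], [3, 4, 5], 5
feasible = knapsack_fitness([1, 1, 0], weights, values, capacity)    # weight 5, value 7
infeasible = knapsack_fitness([1, 1, 1], weights, values, capacity)  # weight 9, penalized
```

The graded penalty \(-\sum w_i x_i\) is usually preferable to a flat \(-\infty\): it gives selection a gradient back toward feasibility instead of discarding near-feasible genomes outright.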
10.2 TSP
Visit-all-cities problem on \(n\) cities.
- Encoding: permutation \(\pi \in S_n\).
- Fitness: \(f(\pi) = -\sum d(\pi_i, \pi_{i+1})\).
- Operators: order crossover (OX) or PMX; combine with 2-opt local search (memetic algorithm).
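Permutation encodings need crossover operators that always yield valid tours. A sketch of order crossover in a common variant (copy a slice from one parent, fill the rest in the other parent's order; function name is ours):

```python
import random

def order_crossover(p1, p2, rng=random):
    """Order crossover (OX): inherit a contiguous slice from p1,
    then fill the remaining positions with p2's genes in their p2 order."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n + 1), 2))  # two distinct cut points
    child = [None] * n
    child[a:b] = p1[a:b]
    taken = set(p1[a:b])
    fill = iter(g for g in p2 if g not in taken)
    for i in list(range(b, n)) + list(range(a)):  # wrap around from cut b
        child[i] = next(fill)
    return child

random.seed(7)
tour1 = [0, 1, 2, 3, 4, 5, 6, 7]
tour2 = [7, 6, 5, 4, 3, 2, 1, 0]
child = order_crossover(tour1, tour2)
```

By construction the child visits every city exactly once—no repair step is needed, unlike naive 1-point crossover on permutations.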
10.3 Symbolic Regression
Data \(\{(x_i, y_i)\}\), the goal is to find \(f\).
- GP tree encoding, function set \(\{+, -, \times, /, \sin, \exp, \log\}\) (protected).
- Fitness: \(-\mathrm{MSE}(f, \text{data})\), optionally with a parsimony penalty.
- Modern practice: PySR (Cranmer 2023) and Eureqa combine GP with symbolic simplification, successfully rediscovering known formulas on physics-equation benchmarks (Feynman dataset).
11. Python Example (DEAP-style)
Below is a minimal runnable GA main loop: a binary GA on OneMax, maximizing \(f(x) = \sum_{i=1}^{20} x_i\) (the number of 1-bits).
```python
import random
from typing import List, Tuple

POP_SIZE = 100
CHROM_LEN = 20
P_CROSS = 0.9
P_MUT = 1.0 / CHROM_LEN
GENERATIONS = 50
TOURN_K = 3

Chrom = List[int]

def fitness(c: Chrom) -> int:
    return sum(c)  # OneMax: the more 1s, the better

def tournament(pop: List[Chrom]) -> Chrom:
    contestants = random.sample(pop, TOURN_K)
    return max(contestants, key=fitness)

def one_point_crossover(p1: Chrom, p2: Chrom) -> Tuple[Chrom, Chrom]:
    if random.random() >= P_CROSS:
        return p1[:], p2[:]
    cut = random.randint(1, CHROM_LEN - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def bitflip_mutation(c: Chrom) -> Chrom:
    return [1 - g if random.random() < P_MUT else g for g in c]

def init_population() -> List[Chrom]:
    return [[random.randint(0, 1) for _ in range(CHROM_LEN)]
            for _ in range(POP_SIZE)]

def run_ga() -> Chrom:
    pop = init_population()
    best = max(pop, key=fitness)
    for gen in range(GENERATIONS):
        new_pop = [best]  # 1-elitism
        while len(new_pop) < POP_SIZE:
            p1, p2 = tournament(pop), tournament(pop)
            c1, c2 = one_point_crossover(p1, p2)
            new_pop.append(bitflip_mutation(c1))
            if len(new_pop) < POP_SIZE:
                new_pop.append(bitflip_mutation(c2))
        pop = new_pop
        cur_best = max(pop, key=fitness)
        if fitness(cur_best) > fitness(best):
            best = cur_best
        print(f"Gen {gen:3d}  best={fitness(best)}  avg={sum(map(fitness, pop)) / POP_SIZE:.2f}")
    return best

if __name__ == "__main__":
    run_ga()
```
In production it is preferable to use DEAP (creator + tools + algorithms.eaSimple), pymoo (multi-objective first), or PyGAD (lightweight, integrates with sklearn / Keras).
12. Hyperparameter Tuning and Pitfalls
| Hyperparameter | Typical range | Tuning intuition |
|---|---|---|
| Population size \(N\) | \(50 \sim 500\) | High dimensionality, multimodality → larger; expensive evaluations → smaller |
| Crossover probability \(p_c\) | \(0.6 \sim 0.95\) | Usually high, letting recombination dominate |
| Mutation probability \(p_m\) | \(1/l \sim 5/l\) | Too low: no exploration; too high: random search |
| Tournament size \(k\) | \(2 \sim 5\) | Larger → stronger pressure → premature convergence |
| Elitism ratio | \(1\% \sim 10\%\) | Too large: lose diversity |
| Maximum generations | \(10^2 \sim 10^4\) | Add early stopping |
Common pitfalls:
- Wildly different fitness scales → replace roulette with rank/tournament.
- Premature convergence → add fitness sharing / increase mutation rate / use island model.
- Slow evaluation → use a surrogate model or cache duplicates.
- GP bloat → add depth limit + parsimony pressure.
- Discrete constraints → use repair operators rather than penalty functions.
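The first pitfall can be handled by making selection depend only on rank order, never on raw fitness magnitude. A minimal sketch in the style of the GA listing above (the `rank_select` helper and the demo fitness are illustrative, not from any library):

```python
import random

def rank_select(population, fitness, rng=random):
    """Linear rank-based selection: choice probability depends only on rank
    order, so it is invariant to any monotone rescaling of fitness."""
    ranked = sorted(population, key=fitness)     # worst ... best
    n = len(ranked)
    weights = [i + 1 for i in range(n)]          # rank 1 .. n
    return rng.choices(ranked, weights=weights, k=1)[0]

# Demo: fitness values spanning three orders of magnitude
random.seed(0)
pop = [0, 1, 2, 3]
fit = lambda x: 10.0 ** x                        # 1, 10, 100, 1000
picks = [rank_select(pop, fit) for _ in range(2000)]
# Roulette selection would pick individual 3 in ~99.9% of draws;
# rank selection picks it with probability 4/10.
share = picks.count(3) / len(picks)
print(round(share, 2))
```

The same idea underlies tournament selection: both remove the dependence on fitness scale entirely.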
13. Cross-references
- Parent page: the tribe entry page (§1) — Overall positioning and algorithmic lineage of the Evolutionaries
- Same directory: Evolution Strategies and Neuroevolution — ES/CMA-ES/NEAT generalize GA to continuous and topological settings
- Same directory: Swarm Intelligence and AutoML — Comparison of PSO/ACO/NAS with GA
- Existing page: ../../02_Symbolic_AI/1_search.md §evolutionary — GA seen from the heuristic-search perspective
- Related page: BOHB / TPE from the Bayesian camp are strong competitors of GA in HPO; see the Bayesians page
References
- Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press. — Original source of GA and the schema theorem.
- Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. — The engineering textbook for GA.
- Goldberg, D. E., & Richardson, J. (1987). Genetic algorithms with sharing for multimodal function optimization. ICGA.
- De Jong, K. A. (1975). An Analysis of the Behavior of a Class of Genetic Adaptive Systems. PhD thesis, University of Michigan.
- Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
- Koza, J. R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press.
- Deb, K., & Agrawal, R. B. (1995). Simulated binary crossover for continuous search space. Complex Systems, 9, 115–148.
- Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.
- Eiben, A. E., & Smith, J. E. (2015). Introduction to Evolutionary Computing (2nd ed.). Springer.
- Cranmer, M. (2023). Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv:2305.01582.
- Fortin, F.-A., et al. (2012). DEAP: Evolutionary algorithms made easy. JMLR, 13, 2171–2175.
Evolution Strategies and Neuroevolution
This page goes deep into representative algorithms developed by the Evolutionaries along two directions: continuous optimization and neural network evolution. These include Evolution Strategies (ES), Covariance Matrix Adaptation ES (CMA-ES), OpenAI ES, NEAT, and HyperNEAT. Their common feature is treating "mutation" as the primary evolutionary operator (in contrast to GA's "crossover-driven" view) and devoting substantial work to adapting the mutation distribution.
Before reading this page, we recommend looking at the tribe entry page (§1) and Genetic Algorithms and Genetic Programming to understand the basic GA terminology.
1. Origins of Evolution Strategies
Evolution Strategies (ES) were developed by Ingo Rechenberg and Hans-Paul Schwefel at the Technical University of Berlin in the late 1960s, independently of Holland's GA. The earliest applications were wind-tunnel airfoil optimization—objectives that were continuous, non-analytically-differentiable, and expensive to evaluate.
Key differences between ES and GA:
| Dimension | GA (Holland) | ES (Rechenberg/Schwefel) |
|---|---|---|
| Origin | Adaptive systems theory | Engineering optimization (fluid mechanics) |
| Chromosome | Mainly binary strings | Mainly real vectors |
| Primary operator | Crossover | Mutation |
| Selection | Fitness-proportional | Truncation (\(\mu\) chosen from \(\lambda\)) |
| Self-adaptation | Mutation rate hand-tuned | Mutation step size \(\sigma\) self-adaptive |
| Theory | Schema theorem | 1/5 success rule, IGO/natural gradient |
2. Classical ES Variants
2.1 (1+1)-ES: The Minimal Working Model
The population is a single individual \(\theta_t \in \mathbb{R}^d\). Each generation produces one mutated offspring: $\(\theta'_t = \theta_t + \sigma_t \cdot \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I_d)\)$ If \(f(\theta'_t) \ge f(\theta_t)\), set \(\theta_{t+1} = \theta'_t\); otherwise keep \(\theta_t\).
2.2 (\(\mu\), \(\lambda\))-ES vs (\(\mu\) + \(\lambda\))-ES
- (\(\mu\), \(\lambda\)): \(\mu\) parents produce \(\lambda > \mu\) offspring; the next generation is chosen only from the offspring—parents do not participate, forcing updates.
- (\(\mu\) + \(\lambda\)): parents and offspring are merged and the top \(\mu\) are kept, giving monotonic non-decrease.
In practice, (\(\mu\), \(\lambda\)) is more common—it avoids the problem of "stale \(\sigma\) dragging \(\theta\)" under self-adaptive step sizes. A typical ratio is \(\lambda / \mu \approx 4 \sim 7\).
2.3 The 1/5 Success Rule (Rechenberg 1973)
Too small a \(\sigma\) converges slowly; too large a \(\sigma\) has a low success rate. Rechenberg's empirical rule: target success rate \(\approx 1/5\).
- If the recent success rate is \(> 1/5\) → \(\sigma \leftarrow \sigma / c\) (enlarge step, \(c \in (0.8, 1)\))
- If \(< 1/5\) → \(\sigma \leftarrow \sigma \cdot c\)
- If \(\approx 1/5\) → keep
Historically, this was the first self-adaptive step-size mechanism in evolutionary algorithms. It has theoretical support on the sphere function (from asymptotic convergence-rate analysis) and is the conceptual root of CMA-ES, xNES, and other adaptive schemes.
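The (1+1)-ES of §2.1 plus the 1/5 rule fits in a few lines. A sketch minimizing the 2-D sphere function (so a mutant succeeds when its cost is lower; the adaptation constant `c = 0.85` and the 10-trial window are common textbook choices, not mandated by the rule itself):

```python
import random

def one_plus_one_es(f, theta, sigma=1.0, c=0.85, window=10, iters=500, rng=random):
    """(1+1)-ES minimizing f, with 1/5-success-rule step-size adaptation."""
    successes = 0
    for t in range(1, iters + 1):
        child = [x + sigma * rng.gauss(0, 1) for x in theta]
        if f(child) < f(theta):      # success: offspring replaces parent
            theta = child
            successes += 1
        if t % window == 0:          # adapt sigma every `window` trials
            rate = successes / window
            if rate > 0.2:
                sigma /= c           # succeeding too often -> enlarge step
            elif rate < 0.2:
                sigma *= c           # failing too often -> shrink step
            successes = 0
    return theta, sigma

random.seed(0)
sphere = lambda x: sum(v * v for v in x)
best, sigma = one_plus_one_es(sphere, [5.0, -3.0])
print(sphere(best) < 1e-3)
```

Note the rule's asymmetry in action: far from the optimum almost every step succeeds and σ grows; near the optimum the success rate collapses and σ shrinks geometrically.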
3. CMA-ES: The Modern Champion of ES
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) was developed by Nikolaus Hansen and Andreas Ostermeier between 1996 and 2001, and is widely regarded as one of the strongest black-box continuous optimization algorithms today. On benchmarks such as BBOB (Black-Box Optimization Benchmark), it has long sat in the top tier.
3.1 Algorithmic Skeleton
Each generation, CMA-ES samples \(\lambda\) candidates from a multivariate Gaussian: $\(\mathbf{x}_i^{(g+1)} \sim \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}(\mathbf{0}, \mathbf{C}^{(g)}), \quad i=1,\dots,\lambda\)$
where:
- \(\mathbf{m}^{(g)} \in \mathbb{R}^d\): the distribution mean (current best estimate)
- \(\sigma^{(g)} \in \mathbb{R}_+\): overall step size
- \(\mathbf{C}^{(g)} \in \mathbb{R}^{d \times d}\): covariance matrix (shape/scale prior)
After evaluation, the top-\(\mu\) of the \(\lambda\) offspring update \(\mathbf{m}, \sigma, \mathbf{C}\). The essence of CMA-ES lies entirely in how these three quantities are adapted.
3.2 Mean Update
$\(\mathbf{m}^{(g+1)} = \sum_{i=1}^\mu w_i\, \mathbf{x}_{i:\lambda}^{(g+1)}, \quad \sum w_i = 1,\ w_1 \ge \dots \ge w_\mu \ge 0\)$ where \(\mathbf{x}_{i:\lambda}\) is the \(i\)-th offspring after sorting by fitness. Weights \(w_i\) usually decay logarithmically. \(\mu_{\mathrm{eff}} = 1/\sum w_i^2\) is called the effective selection mass.
3.3 Covariance Matrix Update: rank-1 and rank-\(\mu\)
Rank-\(\mu\) update: estimate the covariance from this generation's top-\(\mu\) offspring, $\(\mathbf{C}_\mu^{(g+1)} = \frac{1}{(\sigma^{(g)})^2}\sum_{i=1}^\mu w_i (\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})(\mathbf{x}_{i:\lambda} - \mathbf{m}^{(g)})^\top\)$ But a single generation gives too few samples and the estimate is noisy.
Rank-1 update: use an evolution path \(\mathbf{p}_c\) to accumulate displacement information across generations: $\(\mathbf{p}_c^{(g+1)} = (1-c_c)\mathbf{p}_c^{(g)} + \sqrt{c_c(2-c_c)\mu_{\mathrm{eff}}} \cdot \frac{\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}}{\sigma^{(g)}}\)$ $\(\mathbf{C}^{(g+1)} = (1-c_1-c_\mu)\mathbf{C}^{(g)} + c_1\mathbf{p}_c\mathbf{p}_c^\top + c_\mu \mathbf{C}_\mu^{(g+1)}\)$
Intuition: if the mean drifts in the same direction over several generations, the path \(\mathbf{p}_c\) becomes long; CMA-ES "etches" this direction into the covariance, biasing future samples toward it—capturing long-range correlations.
3.4 Global Step-Size Control (CSA)
Similarly, maintain a conjugate evolution path \(\mathbf{p}_\sigma\): $\(\mathbf{p}_\sigma^{(g+1)} = (1-c_\sigma)\mathbf{p}_\sigma^{(g)} + \sqrt{c_\sigma(2-c_\sigma)\mu_{\mathrm{eff}}}\cdot \mathbf{C}^{(g)\,-1/2} \frac{\mathbf{m}^{(g+1)}-\mathbf{m}^{(g)}}{\sigma^{(g)}}\)$ $\(\sigma^{(g+1)} = \sigma^{(g)} \exp\!\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|\mathbf{p}_\sigma\|}{E\|\mathcal{N}(0,I)\|}-1\right)\right)\)$
Intuition: if successive steps are almost collinear (the path is long), \(\sigma\) is too small → enlarge it; if they cancel each other out (the path is short), \(\sigma\) is too large → shrink it. This is the continuous, anisotropic generalization of the 1/5 rule.
3.5 Connection to Natural Gradients: the IGO Framework
In the NES paper (2014), Wierstra et al. showed that CMA-ES can, under a particular parameterization, be interpreted as natural-gradient ascent in the parameter space of the search distribution—where the Fisher information metric automatically yields the rank-1/rank-\(\mu\) update form. Ollivier et al. (2017) then unified an entire family of evolutionary algorithms (CMA-ES, xNES, PBIL, EDA) into a single Information-Geometric Optimization (IGO) flow on the same parameter space.
Engineering takeaway: the "magic" of CMA-ES is almost entirely determined by Fisher geometry. Few of its hyperparameters need tuning—the library defaults work directly in 80% of scenarios.
3.6 Suitable Problem Sizes
- Dimension \(d \le 100\): standard CMA-ES handles it fully.
- \(d \in [100, 1000]\): use sep-CMA-ES (diagonal covariance) or LM-CMA-ES (low-rank approximation).
- \(d > 1000\): the \(\mathcal{O}(d^2)\) memory of the covariance matrix breaks down—switch to NES variants or simplified schemes like OpenAI ES that only use isotropic Gaussian perturbations.
4. ES from a Modern Lens: as a Black-Box Optimizer
Viewing ES as "simulated annealing + adaptive variance + multiple sampling" is a reasonable engineering picture. Its engineering advantages include:
- Zeroth-order: needs only \(f(\theta)\), not \(\nabla f\).
- Parallelizable: \(\lambda\) offspring are evaluated independently.
- Invariance: CMA-ES is invariant under monotonic transforms of fitness (it uses ranking only) and invariant under linear transforms (covariance adaptation)—two properties that make it more robust than SGD on poorly scaled problems.
- No reliance on backprop: the evaluator can be a simulator, a real robot, or human ratings.
5. OpenAI ES: Pushing ES to Large-Scale RL
The 2017 paper Evolution Strategies as a Scalable Alternative to Reinforcement Learning by Salimans, Ho, Chen, Sidor, and Sutskever (OpenAI) is the landmark of ES's revival in the deep-learning era.
5.1 The Algorithm
Parameter \(\theta \in \mathbb{R}^d\) (the flattened vector of an entire neural network). Each step:
- Sample \(n\) Gaussian perturbations \(\epsilon_1, \dots, \epsilon_n \sim \mathcal{N}(0, I)\).
- Evaluate \(F_i = F(\theta + \sigma \epsilon_i)\) (the return from one rollout in the environment).
- Estimate the gradient and update: $\(\nabla_\theta \mathbb{E}_{\epsilon}[F(\theta + \sigma\epsilon)] \approx \frac{1}{n\sigma}\sum_{i=1}^n F_i\, \epsilon_i\)$ $\(\theta \leftarrow \theta + \alpha \cdot \frac{1}{n\sigma}\sum_{i=1}^n F_i\, \epsilon_i\)$
This is essentially the fixed-variance version of NES (Natural Evolution Strategies)—a score-function gradient (REINFORCE-style) estimate over \(\theta\).
5.2 The Key Engineering Innovation: Seed Communication
Inter-worker communication cost is the bottleneck of large-scale ES with thousands of CPU workers. The OpenAI trick: all workers share the same RNG seed table, and communication only requires broadcasting tuples \((i, F_i)\)—each worker regenerates \(\epsilon_i\) from seed \(i\) on its own. This cuts communication from \(\mathcal{O}(d)\) to \(\mathcal{O}(1)\).
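The trick is easy to demonstrate with the stdlib RNG: two workers that share only an integer seed reproduce the same perturbation vector exactly, so broadcasting \((i, F_i)\) is all the communication needed (a sketch; real implementations pre-generate a large shared noise table and index into it for speed):

```python
import random

def regen_eps(seed, dim):
    """Deterministically regenerate the Gaussian perturbation for `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(dim)]

# Worker A samples its perturbation from seed i and reports only (i, F_i)
seed, dim = 12345, 8
eps_worker_a = regen_eps(seed, dim)

# The master (or any other worker) rebuilds the identical vector locally
eps_master = regen_eps(seed, dim)

print(eps_worker_a == eps_master)   # -> True: no O(d) vector ever sent
```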
5.3 ES vs PPO
| Dimension | OpenAI ES | PPO / A3C |
|---|---|---|
| Signal | Episode-level scalar return | Per-step advantage |
| Gradient | Finite-difference estimate | Analytic gradient via backprop |
| Sample efficiency | Low (\(10^7 \sim 10^9\) frames) | High (\(10^6 \sim 10^8\)) |
| Wall-clock efficiency | Fast (CPU-parallelizable) | Bottlenecked by GPU/data efficiency |
| Communication | \(\mathcal{O}(1)\) per step | Gradient \(\mathcal{O}(d)\) per step |
| Exploration | Parameter-space perturbations | Action-space stochasticity |
| Sparse rewards | Robust (episode-level returns) | Brittle |
| Long horizons | Robust | Hard credit assignment |
| Implementation complexity | Extremely simple (~50 lines) | Moderate (replay/clipping/GAE) |
OpenAI ran Atari on 1,440 CPUs and achieved scores comparable to A3C, with shorter wall-clock time. Conclusion: when massive parallelism is available and rewards are sparse with long horizons, ES is a strong RL alternative.
6. NEAT: Co-evolving Topology and Weights
NeuroEvolution of Augmenting Topologies (NEAT) was proposed by Kenneth Stanley and Risto Miikkulainen in 2002. It addresses three core problems that had long plagued evolving neural network topologies: competing conventions, initial explosion, and early death of good structures.
6.1 Three Innovations
(1) Innovation Number (historical marking): each newly added connection is given a globally increasing ID. Whether two connections are "homologous" is determined by ID, not by structural alignment—definitively solving the alignment problem in topological crossover.
(2) Speciation: divide the population into species by structural distance. The distance formula: $\(\delta = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 \bar{W}\)$ where \(E\) = excess genes (those held by one parent but not the other at all), \(D\) = disjoint genes (mismatched in the middle), and \(\bar W\) = the mean weight difference of shared genes. Individuals with \(\delta < \delta_t\) are deemed to be of the same species.
Explicit fitness sharing: each member's fitness is divided by its species size (\(f_i' = f_i / |S|\)), preventing small species (new topologies) from being swamped by large ones—the core protective mechanism for "gradual complexification".
(3) Gradual complexification: start from a minimal topology (input directly to output) and monotonically increase structural complexity through add-node / add-link mutations only. This avoids the wasteful "complexify then prune" cycle.
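The distance formula in (2) can be computed directly from the gene lists. A sketch in which a genome is simplified to a `{innovation_id: weight}` dict (an illustrative reduction of NEAT's full genome encoding, which also stores node genes and enable flags):

```python
def compatibility(g1, g2, c1=1.0, c2=1.0, c3=0.4):
    """NEAT compatibility distance between two genomes, each a dict
    mapping innovation number -> connection weight."""
    ids1, ids2 = set(g1), set(g2)
    shared = ids1 & ids2
    cutoff = min(max(ids1), max(ids2))
    # Excess: genes beyond the other genome's highest innovation number
    E = sum(1 for i in ids1 ^ ids2 if i > cutoff)
    # Disjoint: non-matching genes inside the overlapping id range
    D = sum(1 for i in ids1 ^ ids2 if i <= cutoff)
    W = sum(abs(g1[i] - g2[i]) for i in shared) / len(shared)
    N = max(len(g1), len(g2))
    return c1 * E / N + c2 * D / N + c3 * W

ga = {1: 0.5, 2: -0.3, 4: 0.8}          # innovation ids 1, 2, 4
gb = {1: 0.6, 3: 0.1, 4: 0.8, 6: -0.2}  # innovation ids 1, 3, 4, 6
delta = compatibility(ga, gb)
print(round(delta, 3))
```

Here gene 6 is excess (beyond `ga`'s highest id 4), genes 2 and 3 are disjoint, and genes 1 and 4 are shared, giving \(\delta = 1/4 + 2/4 + 0.4 \cdot 0.05 = 0.77\).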
6.2 The NEAT Speciation Loop
flowchart TD
A[New generation of offspring] --> B[For each individual]
B --> C{delta distance to<br/>existing species representative}
C -- delta < delta_t --> D[Add to that species]
C -- No species matches --> E[Create new species<br/>this individual as representative]
D --> F[Within-species fitness sharing<br/>f_i / |S|]
E --> F
F --> G[Allocate reproductive<br/>quotas per species]
G --> H[Cull species stagnant for too long]
H --> I[Next-generation selection / mutation]
6.3 Applications of NEAT
- OpenAI Gym control tasks: NEAT approaches optimum within hundreds of generations on CartPole and Pendulum.
- MarI/O: YouTuber SethBling used NEAT to play Super Mario Bros, with enormous popular-science influence.
- Robot control: competitive with traditional RL on continuous control.
7. HyperNEAT: Indirect Encoding and Geometric Priors
Direct encoding (NEAT) does not scale to neural networks with millions of parameters—every connection must be represented separately. HyperNEAT (Stanley, D'Ambrosio, Gauci 2009) solves this with indirect encoding.
7.1 CPPN: Pattern-Generating Networks
A Compositional Pattern Producing Network (CPPN) is an ordinary neural network, but its "task" is not classification—it is producing the connection weights of other neural networks. Specifically, for each pair of neurons \((u, v)\) in the target network, their positions \((x_u, y_u, x_v, y_v)\) are fed into the CPPN, which outputs the connection weight \(w_{uv}\): $\(w_{uv} = \mathrm{CPPN}(x_u, y_u, x_v, y_v)\)$
CPPNs are evolved with NEAT (a CPPN is itself a neural network). Beyond sigmoid, the activation set includes sin, Gaussian, and abs—these function families help produce symmetric, repetitive, and gradient-like geometric patterns.
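A toy CPPN can be sketched as a fixed composition of such activations. The specific composition below is invented for illustration (an evolved CPPN would have learned structure), but it shows how one small function paints an entire weight matrix over neuron coordinates:

```python
import math

def cppn(xu, yu, xv, yv):
    """Toy CPPN: maps the coordinates of a neuron pair to a weight.
    sin gives repetition along the layer; the Gaussian of the distance
    gives locality -- the geometric priors HyperNEAT exploits."""
    d = math.hypot(xu - xv, yu - yv)
    return math.sin(3.0 * (xu + xv)) * math.exp(-d * d)

# A 4-neuron input layer at y=0 and a 4-neuron output layer at y=1
coords = [i / 3 for i in range(4)]
W = [[cppn(xu, 0.0, xv, 1.0) for xv in coords] for xu in coords]

# Locality prior in action: a connection between vertically aligned
# neurons is stronger in magnitude than one spanning the whole layer.
print(abs(W[1][1]) > abs(W[1][3]))
```

The payoff is compression: the target network here has 16 weights but the generator is one closed-form function, and the same CPPN could fill in a 1000x1000 substrate without growing.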
7.2 Geometric Priors
The core insight of HyperNEAT: in real problems, the connectivity patterns of neural networks often have geometric structure (the retinotopic mapping from retina to V1, the somatotopic map of motor cortex, etc.). Placing neurons in a geometric coordinate space and letting the CPPN learn the coordinate-to-weight mapping amounts to imposing the positional inductive bias that "neighboring neurons should have correlated weights".
7.3 Subsequent Developments
- ES-HyperNEAT (Risi & Stanley 2010): lets the CPPN itself decide how many neurons to place and where, no longer fixing the substrate.
- Adaptive HyperNEAT: makes the learning rate one of the CPPN's outputs.
- This family of ideas is directly related to the modern deep-learning notion of HyperNetworks (Ha et al. 2016)—a neural network that generates the weights of another neural network.
8. Neuroevolution vs Reinforcement Learning: Which to Choose
flowchart TD
A[Decision starting point] --> B{Reward dense<br/>observable per step?}
B -- No / sparse --> NE1[Lean toward NE / ES]
B -- Yes --> C{Action space discrete<br/>and small?}
C -- Yes --> RL1[Lean toward RL DQN/PPO]
C -- No / continuous high-dim --> D{Policy needs<br/>long-term memory?}
D -- Yes --> NE2[NE / NEAT topology search]
D -- No --> E{Compute: GPU vs CPU?}
E -- Mostly GPU --> RL2[Lean toward RL]
E -- Lots of CPU --> NE3[Lean toward ES]
A --> F{Need to evolve network architecture?}
F -- Yes --> NE4[NEAT / NAS]
F -- No --> G{Hyperparameter sensitive?}
G -- Yes --> NE5[ES is more robust]
G -- No --> RL3[RL has better sample efficiency]
Summary heuristics:
| Lean toward NE/ES | Lean toward RL |
|---|---|
| Sparse rewards / final-only outcome | Dense rewards |
| High-dim continuous control, long horizons | Discrete small action spaces, short horizons |
| Need to evolve topology | Fixed topology |
| Plenty of CPUs, scarce GPUs | Plenty of GPUs |
| Black-box simulator (no gradient) | Differentiable simulator |
| Tolerant of low sample efficiency | Care about sample efficiency |
9. Practical Considerations
9.1 Population Size and Parallelism
- CMA-ES: default \(\lambda = 4 + \lfloor 3\ln d \rfloor\), e.g., \(\lambda \approx 18\) for \(d=100\). Can be enlarged to a few hundred to accelerate convergence (when parallelism is sufficient).
- OpenAI ES: typically \(n = 1000 \sim 10000\). The paper used 1,440 CPU workers, each evaluating several perturbations per generation.
- NEAT: \(N = 150 \sim 300\). The number of species is typically held between 5 and 15 (with adaptive \(\delta_t\)).
9.2 Hybrids with SGD
- Genetic SGD: each generation fine-tunes offspring with SGD before evaluating—a typical memetic recipe.
- PBT (Population-Based Training): a GA + SGD hybrid by DeepMind for online tuning of hyperparameters/learning rates (Jaderberg et al. 2017).
- ES + backprop: Conti et al. (2018) backpropagated through the ES loss, forming an ES-Gradient hybrid.
9.3 Implementation Pitfalls
- The variance of Gaussian perturbations must be regularized—without normalization, perturbation magnitudes blow up as \(\sqrt{d}\) in high dimensions.
- ES's invariance to fitness scaling holds only when rank-based fitness shaping is used (subtract mean and divide by std, or convert to ranks).
- NEAT's \(\delta_t\) cannot be fixed—as the population evolves, the scale of structural distance drifts, requiring adaptation.
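The rank-based shaping mentioned in the second point (and invoked as `rank_normalize` in the §10 skeleton) is a one-liner in spirit: replace each return by its centered rank. A sketch (ties are broken by index, which is fine for illustration):

```python
def rank_normalize(rewards):
    """Centered rank transform used as fitness shaping in OpenAI-style ES:
    maps raw returns to evenly spaced values in [-0.5, 0.5], making the
    gradient estimate invariant to monotone rescaling of the reward."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    shaped = [0.0] * n
    for rank, i in enumerate(order):
        shaped[i] = rank / (n - 1) - 0.5
    return shaped

# The outlier return -100 contributes exactly one rank step, no more:
print(rank_normalize([3.0, -100.0, 7.0, 0.5]))
```

Because the shaped values sum to zero, the estimator also gains a built-in baseline: a constant shift of all returns changes nothing.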
10. Example: ES vs PPO Pseudocode in a Gymnasium Environment
Below is a minimal comparison skeleton of OpenAI ES vs PPO on the same Gym environment—the structures are highly similar, with the difference lying only in the update rule.
# ============ OpenAI ES ============
def openai_es(env, policy_net, n_workers=64, sigma=0.1, alpha=0.01, gens=500):
    theta = flatten(policy_net.parameters())
    for g in range(gens):
        seeds = [rng.randint(2**31) for _ in range(n_workers)]
        rewards = []
        for s in seeds:  # parallelizable, e.g. via ray.remote
            eps = sample_normal_with_seed(s, theta.shape)
            theta_perturbed = theta + sigma * eps
            r = rollout(env, unflatten(theta_perturbed))
            rewards.append(r)
        rewards = rank_normalize(rewards)  # fitness shaping
        grad_est = sum(r * regen_eps(s, theta.shape)
                       for r, s in zip(rewards, seeds)) / (n_workers * sigma)
        theta = theta + alpha * grad_est

# ============ PPO (skeleton) ============
def ppo(env, policy_net, value_net, T=2048, K=10, eps=0.2, gens=500):
    for g in range(gens):
        traj = collect_trajectories(env, policy_net, T)
        adv = gae(traj, value_net)
        for _ in range(K):
            ratio = exp(policy_net.logp(traj) - traj.old_logp)
            surr = min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)
            loss = -surr.mean() + 0.5 * (value_net(traj.s) - traj.ret).pow(2).mean()
            sgd_step(loss)
In production, the ES part is typically run with ray / torch.distributed worker pools, while PPO uses Stable-Baselines3 / RLlib.
11. Cross-references
- Parent page: the tribe entry page (§1) — Overall positioning of the Evolutionaries
- Same directory: Genetic Algorithms and Genetic Programming — The bit-string/tree encoded version of GA
- Same directory: Swarm Intelligence and AutoML — Comparison of PSO/ACO/NAS
- Reinforcement learning: ../../../2_ReinforcementLearning/ — Details of PPO/A3C/SAC; this page only does comparisons
- Deep learning: ../../../1_DeepLearning/ — Pages on HyperNetworks and neural architecture search
References
- Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog. — The original ES PhD thesis.
- Schwefel, H.-P. (1981). Numerical Optimization of Computer Models. Wiley.
- Hansen, N., & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 159–195.
- Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772. — The authoritative engineering tutorial for CMA-ES.
- Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., & Schmidhuber, J. (2014). Natural Evolution Strategies. JMLR, 15, 949–980.
- Ollivier, Y., Arnold, L., Auger, A., & Hansen, N. (2017). Information-geometric optimization algorithms: A unifying picture via invariance principles. JMLR, 18, 1–65.
- Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864.
- Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127.
- Stanley, K. O., D'Ambrosio, D. B., & Gauci, J. (2009). A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2), 185–212.
- Risi, S., & Stanley, K. O. (2010). Evolving plastic neural networks with novel adaptation rules. GECCO.
- Conti, E., Madhavan, V., Petroski Such, F., Lehman, J., Stanley, K. O., & Clune, J. (2018). Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. NeurIPS.
- Jaderberg, M., et al. (2017). Population Based Training of Neural Networks. arXiv:1711.09846.
- Ha, D., Dai, A., & Le, Q. V. (2016). HyperNetworks. arXiv:1609.09106.
Swarm Intelligence and AutoML
This page focuses on two parallel threads within the Evolutionaries: Swarm Intelligence (SI) and Automated Machine Learning (AutoML). The former is a family of decentralized optimization methods that emulate the collective behavior of ant colonies, bird flocks, and bee swarms; the latter applies evolution and other black-box optimization techniques to the machine-learning pipeline itself—hyperparameters, model families, and neural network architectures all become objects of search.
Before reading this page, we recommend looking at the tribe entry page (§1). Together with Genetic Algorithms and Genetic Programming and Evolution Strategies and Neuroevolution, this page completes the Evolutionary series.
1. Swarm Intelligence: Decentralization, Local Interaction, Emergence
Swarm Intelligence (SI) was proposed by Beni and Wang in 1989 as a research paradigm for collaborative robot swarms. Its three pillars are:
- Decentralization: there is no global scheduler; each individual sees only the state of its local neighborhood.
- Local interaction: individuals communicate using simple rules and limited information.
- Emergence: global intelligent behavior arises "spontaneously" from the interactions of many simple individuals.
Key differences between SI and GA:
| Dimension | GA | SI (PSO/ACO) |
|---|---|---|
| Inter-individual relations | Selective competition (survival of the fittest) | Cooperative communication (pheromone / global-best sharing) |
| Primary operator | Crossover + mutation | Update rule (move toward the best) |
| Has explicit "generations"? | Yes, with discrete generations | Usually continuous updates |
| Primary metaphor | Natural selection | Swarm behavior |
2. Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) was proposed by James Kennedy and Russell Eberhart in 1995, inspired by the cooperative foraging of bird flocks.
2.1 The Algorithm
Each particle \(i\) maintains:
- position \(\mathbf{x}_i \in \mathbb{R}^d\)
- velocity \(\mathbf{v}_i \in \mathbb{R}^d\)
- personal best \(\mathbf{p}_i\)
- global best \(\mathbf{g}\) (or neighborhood best \(\mathbf{l}_i\))
Each step: $\(\mathbf{v}_i^{t+1} = \omega \mathbf{v}_i^t + c_1 r_1 (\mathbf{p}_i - \mathbf{x}_i^t) + c_2 r_2 (\mathbf{g} - \mathbf{x}_i^t)\)$ $\(\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \mathbf{v}_i^{t+1}\)$
where \(r_1, r_2 \sim \mathcal{U}[0,1]^d\) are sampled component-wise; \(\omega\) is the inertia weight; \(c_1, c_2\) are the cognitive and social learning factors, with typical values \(c_1 = c_2 = 1.49\) and \(\omega \in [0.4, 0.9]\).
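The update rule above translates almost line-for-line into code. A minimal gbest PSO minimizing the 2-D sphere function (search box \([-5,5]^d\), constants from the typical values just given; the function and box are illustrative):

```python
import random

def pso(f, dim=2, n=20, iters=200, w=0.7, c1=1.49, c2=1.49, rng=random):
    """Minimal gbest-topology PSO minimizing f over [-5, 5]^dim."""
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    V = [[0.0] * dim for _ in range(n)]
    P = [x[:] for x in X]            # personal bests
    g = min(P, key=f)[:]             # global best
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()   # per-component randomness
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            if f(X[i]) < f(P[i]):
                P[i] = X[i][:]
                if f(P[i]) < f(g):
                    g = P[i][:]
    return g

random.seed(1)
sphere = lambda x: sum(v * v for v in x)
best = pso(sphere)
print(sphere(best) < 1e-4)
```

Swapping the gbest `g` for a per-particle neighborhood best is all it takes to get the lbest ring topology of §2.3.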
2.2 Parameter Intuitions
- Large \(\omega\) → strong inertia, strong exploration; small \(\omega\) → strong convergence. A common linear decay schedule is from 0.9 down to 0.4.
- Large \(c_1\) → particles trust their own history → multimodal protection.
- Large \(c_2\) → particles trust the global best → fast convergence but easy premature convergence.
- Clerc constriction factor (\(\chi\)): a more stable setting \(\mathbf{v} \leftarrow \chi[\mathbf{v} + c_1 r_1 (\mathbf{p}_i - \mathbf{x}) + c_2 r_2 (\mathbf{g} - \mathbf{x})]\), with \(\chi = 2/|2 - \varphi - \sqrt{\varphi^2 - 4\varphi}|\) and \(\varphi = c_1+c_2 > 4\).
2.3 Topologies
- gbest (star): all particles share the global best → fast convergence, prone to premature convergence.
- lbest (ring): each particle sees only its left/right neighbors → slow convergence but resists premature convergence.
- Von Neumann lattice: each particle sees its four (top/bottom/left/right) neighbors.
- Random topology: dynamically changing neighbor relations.
2.4 Important Variants
- CLPSO (Comprehensive Learning PSO, Liang 2006): each dimension picks its own learning target, improving multimodal search.
- APSO (Adaptive PSO): adapts \(\omega, c_1, c_2\) based on the evolutionary state.
- Bare-bones PSO: removes velocity entirely, sampling directly \(\mathbf{x}_i^{t+1} \sim \mathcal{N}((\mathbf{p}_i + \mathbf{g})/2, |\mathbf{p}_i - \mathbf{g}|)\).
3. Ant Colony Optimization (ACO)
Ant Colony Optimization (ACO) was introduced in Marco Dorigo's 1992 PhD thesis. It is inspired by ants communicating indirectly through pheromones and ultimately converging to the shortest path.
3.1 Key Mechanisms of ACO
- Solution construction: each ant incrementally builds a complete solution (e.g., a TSP tour), making decisions according to: $\(p_{ij}^k = \frac{[\tau_{ij}]^\alpha [\eta_{ij}]^\beta}{\sum_{l \in N_i^k}[\tau_{il}]^\alpha [\eta_{il}]^\beta}\)$ where \(\tau_{ij}\) is the pheromone concentration on edge \((i,j)\), \(\eta_{ij}\) is a heuristic (e.g., \(1/d_{ij}\)), and \(\alpha, \beta\) control the influence of pheromone and heuristic respectively.
- Pheromone update: after all ants finish a round: $\(\tau_{ij} \leftarrow (1-\rho)\tau_{ij} + \sum_{k=1}^m \Delta\tau_{ij}^k\)$ \(\rho \in (0, 1]\) is the evaporation rate, \(\Delta\tau_{ij}^k = Q/L^k\) if ant \(k\) traversed \((i,j)\) and 0 otherwise, and \(L^k\) is the total cost of ant \(k\)'s tour.
- Positive feedback: edges of good solutions accumulate more pheromone → subsequent ants are more inclined to traverse them → reinforcing good solutions. The evaporation mechanism prevents over-reinforcement of suboptimal solutions discovered early.
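Both mechanisms fit in one short listing. A sketch of the basic Ant System on a 4-city TSP (the city layout and all constants are illustrative; a serious run would use MMAS bounds from §3.2):

```python
import random

def ant_system(dist, n_ants=10, iters=100, alpha=1.0, beta=2.0,
               rho=0.5, Q=1.0, rng=random):
    """Basic Ant System (Dorigo-style) for TSP on a symmetric distance matrix."""
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]                         # pheromone
    eta = [[0 if i == j else 1.0 / dist[i][j] for j in range(n)] for i in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(iters):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            while len(tour) < n:                                # construct solution
                i = tour[-1]
                cand = [j for j in range(n) if j not in tour]
                wts = [tau[i][j] ** alpha * eta[i][j] ** beta for j in cand]
                tour.append(rng.choices(cand, weights=wts, k=1)[0])
            length = sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        for i in range(n):                                      # evaporation
            for j in range(n):
                tau[i][j] *= (1 - rho)
        for tour, length in tours:                              # deposit Q / L^k
            for k in range(n):
                i, j = tour[k], tour[(k + 1) % n]
                tau[i][j] += Q / length
                tau[j][i] += Q / length
    return best_tour, best_len

# 4 cities on a unit square (Manhattan distances): the optimal tour is
# the perimeter, length 4; any tour using a diagonal costs 6.
random.seed(3)
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
D = [[abs(a[0] - b[0]) + abs(a[1] - b[1]) for b in pts] for a in pts]
tour, length = ant_system(D)
print(length)
```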
3.2 ACO Variants
- AS (Ant System): the original version (Dorigo 1992); all ants update.
- MMAS (MAX-MIN Ant System, Stützle 2000): pheromone is bounded to \([\tau_{\min}, \tau_{\max}]\), and only the best-so-far ant updates—greatly mitigating premature convergence.
- ACS (Ant Colony System, Dorigo & Gambardella 1997): introduces a local pheromone update + a pseudo-random proportional selection rule.
- Rank-based AS: only the top-\(\sigma\) ants update, weighted by rank.
3.3 Application Scenarios
ACO is best suited to discrete combinatorial optimization:
- TSP, vehicle routing (VRP), scheduling, network routing (AntNet), assembly sequences, feature selection.
- It is less competitive than PSO/CMA-ES on continuous optimization—the discrete pheromone structure is awkward in continuous space.
4. Other Swarm Algorithms (Brief Tour)
4.1 Artificial Bee Colony (ABC, Karaboga 2005)
Mimics bee foraging. Three roles:
- Employed bees: each maintains a food source (candidate solution) and performs local search.
- Onlooker bees: choose food sources based on the dance (fitness) of employed bees.
- Scout bees: when a food source is exhausted (no improvement for *limit* generations), scout bees randomly restart it.
Update rule: \(x'_{ij} = x_{ij} + \phi(x_{ij} - x_{kj})\), \(\phi \in [-1, 1]\), with \(k\) being another randomly chosen food source.
ABC's strengths are few parameters (mainly limit and population size) and good resistance to premature convergence; its weakness is that it offers no clear edge over PSO/CMA-ES outside high-dimensional, multimodal problems.
4.2 Firefly Algorithm (FA, Yang 2008)
Each firefly is attracted to brighter ones: $\(\mathbf{x}_i \leftarrow \mathbf{x}_i + \beta_0 e^{-\gamma r_{ij}^2}(\mathbf{x}_j - \mathbf{x}_i) + \alpha \boldsymbol{\epsilon}_i\)$ where \(r_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|\), \(\beta_0\) is the base attraction, \(\gamma\) is the decay coefficient, and \(\alpha\) is the random perturbation magnitude.
Its hallmark is distance-based decay—brighter fireflies attract only nearby fireflies strongly, naturally forming multiple niches → suitable for multimodal optimization.
4.3 Cuckoo Search (CS, Yang & Deb 2009)
Inspired by cuckoos laying eggs in other birds' nests, with eggs ejected if discovered. Algorithm steps:
- Generate new solutions via Lévy flight: \(\mathbf{x}_i^{t+1} = \mathbf{x}_i^t + \alpha \otimes L(s, \lambda)\), where \(L\) is a Lévy distribution (heavy-tailed, with long-range jumps).
- Randomly replace solutions in some "nests".
- With probability \(p_a\), abandon the worst "discovered" eggs and regenerate them.
The heavy-tailed nature of Lévy flights gives CS stronger global exploration than PSO.
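The heavy-tailed steps are usually generated with Mantegna's algorithm, a standard recipe in the cuckoo-search literature. A sketch for \(\lambda = 1.5\) (the sample size and threshold in the demo are illustrative):

```python
import math
import random

def levy_step(lam=1.5, rng=random):
    """One approximately Levy-stable step via Mantegna's algorithm:
    step = u / |v|^(1/lam), with u ~ N(0, sigma_u^2) and v ~ N(0, 1)."""
    sigma_u = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
               / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0, sigma_u)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / lam)

random.seed(7)
steps = [levy_step() for _ in range(20000)]
typical = sorted(abs(s) for s in steps)[len(steps) // 2]   # median |step|
extreme = max(abs(s) for s in steps)
# Heavy tail: rare jumps land orders of magnitude beyond the typical step,
# which is exactly what gives CS its long-range exploration.
print(extreme / typical > 100)
```

A Gaussian walk with the same median step would essentially never produce such outliers (its maximum over 20,000 draws sits around 4 standard deviations).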
4.4 Comparison of Swarm Algorithms
| Algorithm | Main inspiration | Information sharing | Strengths | Weaknesses |
|---|---|---|---|---|
| PSO | Bird flocks | Shares personal/global best | Continuous optimization, simple to implement | Premature convergence in high dimensions |
| ACO | Ant pheromones | Pheromone matrix | Discrete combinatorial, TSP | Awkward for continuous optimization |
| ABC | Bee division of labor | Fitness-proportional dance | Multimodal, few parameters | Slow convergence in high dimensions |
| FA | Firefly brightness decay | Distance-based attraction decay | Multimodal | Sensitive to parameter \(\gamma\) |
| CS | Cuckoo + Lévy | Replaces worst nest | Strong global exploration | Poor local refinement, often hybridized |
5. Swarm Intelligence vs GA: An Overview
| Dimension | GA | SI (PSO/ACO) |
|---|---|---|
| Encoding | Bit string / real / tree | Real / path / discrete structure |
| Primary operations | Selection + crossover + mutation | Update rule (move toward best) |
| Information flow | Parent → child (generational) | Continuous sharing among individuals |
| Convergence speed | Slower | Faster (but premature convergence is common) |
| Diversity preservation | sharing/crowding/island | Topology/inertia |
| Applicability | General | Depends on the specific algorithm |
6. AutoML: Automating the Machine Learning Pipeline
AutoML (Automated Machine Learning) aims to let non-experts build high-quality machine-learning systems by automating the entire pipeline: data preprocessing, feature engineering, model selection, hyperparameter optimization, ensembling, and deployment.
6.1 Subproblems of AutoML
flowchart LR
A[Raw data] --> B[Data cleaning / missing-value handling]
B --> C[Feature engineering / feature selection]
C --> D[Model family selection]
D --> E[Hyperparameter optimization HPO]
E --> F[Neural architecture search NAS]
E --> G[Model ensemble / distillation]
F --> H[Deployment / monitoring]
G --> H
subgraph Meta-learning
M1[Meta-data]
M2[warm-start]
end
M1 -.- B
M1 -.- D
M2 -.- E
Main subproblems:
- HPO (Hyperparameter Optimization): learning rate, regularization, network width, etc.
- NAS (Neural Architecture Search): the network architecture itself.
- Meta-learning: leverages experience from prior tasks to speed up new tasks.
- Data-augmentation search: AutoAugment, RandAugment.
- Feature engineering: FeatureTools, AutoCross.
6.2 The Three Schools of AutoML
- Evolutionary: encode the ML pipeline as a chromosome and evolve it (TPOT, AutoML-Zero).
- Bayesian: fit a hyperparameter→performance surrogate using a Gaussian process or TPE (Hyperopt, Optuna).
- Reinforcement learning: model pipeline construction as an MDP (Zoph & Le NAS-RL, Google Vizier).
7. Comparison of Hyperparameter Optimization Methods
| Method | Principle | Dimensions | Smoothness needed? | Typical tools |
|---|---|---|---|---|
| Grid search | Discrete grid traversal | Low (\(\le 4\)) | No | sklearn |
| Random search | Uniform sampling | Medium (\(\le 20\)) | No | sklearn |
| Evolutionary | GA / CMA-ES | Medium–high | No | DEAP, nevergrad |
| Bayesian optimization (BO) | GP/TPE/SMAC | Medium (\(\le 50\)) | Soft (kernel assumption) | scikit-optimize, Optuna, BoTorch |
| Hyperband | Multi-armed bandit + early stopping | Any | No | hyperband, HpBandSter |
| BOHB | BO + Hyperband hybrid | Medium–high | Soft | HpBandSter |
| PBT | Evolution + online training | Any | No | Ray Tune |
Rules of thumb:
- Dimensions \(\le 5\), cheap evaluations: random search is enough.
- Dimensions 5–30, expensive evaluations: Bayesian optimization / TPE.
- Training-curve information available: Hyperband / BOHB (more resource-efficient).
- Hyperparameters tuned alongside training: PBT.
- Non-differentiable objectives / many discrete choices: CMA-ES, GA.
In Random Search for Hyper-Parameter Optimization, Bergstra & Bengio (2012) showed that in many scenarios random search beats grid search: most hyperparameters are unimportant ("low effective dimensionality"), and a grid wastes evaluations on the irrelevant dimensions. This is essential reading for HPO.
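The low-effective-dimensionality argument can be seen in a toy example (the objective `f` and the 3×3 grid below are invented for illustration): with the same budget of 9 evaluations, a grid probes only 3 distinct values of the one hyperparameter that matters, while random search probes 9.

```python
import numpy as np

# Toy objective with low effective dimensionality: only x matters.
def f(x, y):
    return -(x - 0.7) ** 2          # y is irrelevant; maximum at x = 0.7

rng = np.random.default_rng(0)

# Grid search: 3 x 3 grid -> only 3 distinct values of the important x.
grid = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
best_grid = max(f(x, y) for x, y in grid)

# Random search: 9 samples -> 9 distinct values of x.
pts = rng.uniform(0.0, 1.0, size=(9, 2))
best_rand = max(f(x, y) for x, y in pts)

print(best_grid, best_rand)   # random search usually lands closer to 0
```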
8. Neural Architecture Search (NAS)
NAS is the most-watched subproblem in AutoML. Its goal is to automatically discover network architectures that surpass human designs.
8.1 Three Dimensions
Every NAS method must answer three questions:
| Dimension | Options |
|---|---|
| Search space | macro (full net) / cell-based / hierarchical / morphism-based |
| Search strategy | RL / evolution / Bayesian / differentiable / random |
| Performance estimation | Full training / early stopping / weight sharing / performance predictor |
8.2 Search Spaces
- Macro: specify the entire network layer by layer (Zoph & Le 2017 original). The space is huge and search is hard.
- Cell-based: search small cells, then stack them into a network (NASNet, AmoebaNet, DARTS). The mainstream choice—shrinks the space dramatically and transfers across resolutions/tasks.
- Hierarchical: cells themselves contain hierarchical motifs.
- Morphism-based: start from a known model and expand it via function-preserving transforms (network morphism, e.g. Net2Net).
8.3 Search Strategies
8.3.1 Evolutionary: AmoebaNet (Real et al. 2019)
Encode the architecture as a DAG chromosome. The key innovation of Regularized Evolution is replacing elitism with "age-based" selection—always retiring the oldest individual to avoid being dominated long-term by an early lucky solution. Under the same compute budget, it beat NAS-RL, discovering AmoebaNet-A with 83.9% top-1 on ImageNet.
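The aging-evolution loop can be sketched over an abstract encoding. This is a paraphrase of the published pseudocode, not the authors' code; `fitness`, `random_arch`, and `mutate` are caller-supplied placeholders:

```python
import collections
import random

def regularized_evolution(fitness, random_arch, mutate,
                          pop_size=20, cycles=200, sample_size=5, seed=0):
    """Aging evolution in the spirit of Real et al. (2019).

    The population is a FIFO queue: each cycle, a tournament picks a
    parent, its mutated child joins, and the OLDEST member is retired
    regardless of fitness (no elitism).
    """
    rng = random.Random(seed)
    population = collections.deque()
    history = []
    for _ in range(pop_size):                        # random initialization
        arch = random_arch(rng)
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])     # tournament winner
        child = mutate(parent[0], rng)
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()                         # retire the oldest
    return max(history, key=lambda p: p[1])          # best ever seen
```

On a toy "onemax" bit-string problem (fitness = number of ones, mutation = flip one bit), the loop climbs steadily toward the all-ones string.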
8.3.2 RL: NAS-RL (Zoph & Le 2017)
A controller RNN samples architecture descriptions step by step; each sampled architecture is trained on CIFAR-10 and its validation accuracy is used as the reward; REINFORCE / PPO updates the controller.
- Original version: 800 GPUs × 28 days.
- Subsequent NASNet, ENAS dramatically reduced cost.
8.3.3 Differentiable: DARTS (Liu et al. 2019)
Turn "select one of the candidate ops" into a softmax-weighted sum: \(\bar o^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})}\, o(x)\), where the \(\alpha\) are architecture parameters, trained jointly with the weights \(w\) via SGD (bilevel optimization). At the end, the architecture is discretized by taking \(\arg\max_o \alpha_o^{(i,j)}\) on each edge.
DARTS reduced NAS cost from thousands of GPU-days to a few GPU-days on a single GPU and has been widely adopted. But it has known issues, such as the search-evaluation gap and skip-connection collapse, spawning improvements like PC-DARTS, Fair-DARTS, and SDARTS.
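The continuous relaxation can be sketched in NumPy with toy 1-D "ops" standing in for real convolutions (`OPS`, `mixed_op`, and `discretize` are illustrative names, not DARTS's API):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations on one edge (toy stand-ins for conv / pool / skip).
OPS = {
    "identity": lambda x: x,
    "zero":     lambda x: np.zeros_like(x),
    "scale2":   lambda x: 2.0 * x,
}

def mixed_op(x, alpha):
    """DARTS relaxation: softmax-weighted sum of all candidate ops."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS.values()))

def discretize(alpha):
    """After search, keep only the op with the largest architecture weight."""
    return list(OPS)[int(np.argmax(alpha))]

x = np.ones(4)
alpha = np.array([2.0, -1.0, 0.5])   # architecture parameters for this edge
print(mixed_op(x, alpha), discretize(alpha))
```

Because `mixed_op` is differentiable in `alpha`, the architecture parameters can be trained by gradient descent alongside the weights.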
8.4 Comparison Diagram of the Three NAS Strategies
flowchart TD
SP[Search space] --> S1[Search strategy]
S1 --> RL[Reinforcement learning<br/>RNN controller<br/>NAS-RL Zoph 2017]
S1 --> EV[Evolution<br/>Chromosome + mutation / crossover<br/>AmoebaNet Real 2019]
S1 --> DG[Differentiable<br/>softmax over ops<br/>DARTS Liu 2019]
RL --> P1[Full training<br/>per sampled architecture]
EV --> P1
DG --> P2[Weight sharing<br/>One-shot]
P1 --> EVAL[Accuracy reward]
P2 --> EVAL
EVAL --> S1
| Dimension | NAS-RL | AmoebaNet | DARTS |
|---|---|---|---|
| Strategy | Reinforcement learning | Evolution (Regularized Evolution) | Differentiable (bilevel) |
| Compute | 800 GPU·days | \(\sim\) 3000 GPU·days | 1.5–4 GPU·days |
| One-shot? | No (its successor ENAS is) | No | Yes |
| Reproduction stability | Poor (PPO tuning) | Good | Medium (skip-connection collapse) |
| Architectures found | NASNet | AmoebaNet | DARTS-cells |
8.5 Performance Predictors and Weight Sharing
Fully training each candidate is extremely expensive. Three major cost-reduction techniques:
- Performance Predictor: train a regressor on historical (architecture, accuracy) data and directly predict for new architectures → PNAS, Neural Predictor.
- Weight Sharing / One-Shot: one supernet contains all candidates; after training the supernet, all subnets inherit weights → ENAS, SPOS, ProxylessNAS.
- Early Stopping / Learning Curve Extrapolation: terminate low-potential training early → Hyperband / Successive Halving.
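Successive Halving, the kernel of the Hyperband-style early stopping above, can be sketched as follows (the `partial_eval(config, budget)` interface is an assumption for illustration):

```python
import numpy as np

def successive_halving(configs, partial_eval, budget=1, eta=3, rounds=3):
    """Successive Halving: evaluate all configs on a small budget,
    keep the top 1/eta, multiply the budget by eta, and repeat.

    partial_eval(config, budget) -> score after `budget` units of training.
    """
    for _ in range(rounds):
        scores = [partial_eval(c, budget) for c in configs]
        k = max(1, len(configs) // eta)
        keep = np.argsort(scores)[::-1][:k]      # indices of the best k
        configs = [configs[i] for i in keep]
        budget *= eta
    return configs[0]
```

Most of the total budget is thus spent on the few survivors, which is why Hyperband-style methods often save 5–10x over training every candidate to completion.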
8.6 ProxylessNAS and Hardware Awareness
ProxylessNAS (Cai et al. 2019) searches directly on the target hardware (phone/GPU/FPGA), incorporating latency into the loss: \(\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_1 \|w\|^2 + \lambda_2 \mathbb{E}[\mathrm{Latency}]\). Latency is obtained per submodule via lookup tables and is differentiable with respect to the architecture parameters. This idea drove the development of phone-side SOTA networks like MobileNetV3 and EfficientNet.
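The expected-latency term can be sketched with a per-op lookup table. The `LATENCY_MS` values below are made up for illustration; in practice the table is measured on the target device:

```python
import numpy as np

# Hypothetical per-op latencies (ms) measured on the target device,
# e.g. skip-connection, 3x3 conv, 5x5 conv, 7x7 conv.
LATENCY_MS = np.array([0.0, 1.2, 3.5, 7.8])

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def expected_latency(alpha):
    """E[Latency] for one edge: probability-weighted sum over candidate
    ops. Differentiable in alpha, so it can enter the training loss."""
    return float(softmax(alpha) @ LATENCY_MS)

def total_loss(ce_loss, alphas, lam2=0.1):
    """L = L_CE + lam2 * sum of expected edge latencies
    (the weight-decay term lambda_1 ||w||^2 is omitted here)."""
    return ce_loss + lam2 * sum(expected_latency(a) for a in alphas)
```

Pushing down `total_loss` thus trades a little cross-entropy for ops that are cheap on the lookup table, which is the mechanism behind hardware-aware search.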
9. Comparison of Industrial AutoML Frameworks
| Framework | Main algorithms | Strengths | Suitable scenarios |
|---|---|---|---|
| DEAP | GA/GP/ES (Python) | Flexible, textbook-style API | Research, teaching, custom EAs |
| pymoo | NSGA-II/III, MOEA/D | Multi-objective optimization | Engineering optimization, Pareto fronts |
| nevergrad (Facebook) | CMA-ES, TBPSA, bayes | Many black-box benchmarks | General black-box optimization |
| Optuna | TPE, CMA-ES, NSGA-II | Seamless sklearn/PyTorch, pruning | Industrial first choice for HPO |
| Hyperopt | TPE | Early classic | HPO, gradually being replaced by Optuna |
| Ray Tune | PBT, ASHA, BOHB, Optuna | Distributed + scheduling | Large-scale distributed HPO |
| NNI (Microsoft) | TPE, SMAC, PPO, Evolution, DARTS | NAS + HPO unified | NAS experiments |
| AutoGluon (AWS) | Model ensembling + meta-learning | One-click on tabular/image/text | End-to-end engineering deployment |
| AutoKeras | NAS + Bayesian | Friendly to Keras users | Quick onboarding |
| AutoSklearn | SMAC + meta-learning | sklearn-compatible | Small tabular data |
| TPOT | GP evolves sklearn pipelines | Pipeline-level evolution | Tabular ML pipeline optimization |
| AutoML-Zero | Evolves basic instructions | Evolves the algorithm itself from scratch | Research / extreme automation |
10. Limitations of AutoML
AutoML is no silver bullet. Three fundamental limitations:
10.1 Search Cost
- AmoebaNet: \(\sim\) 3000 GPU·days
- Early NAS-RL: 800 GPU·days
- DARTS / one-shot dramatically reduce cost, but it is still no "free lunch"
The carbon cost was quantified by Strubell et al. (2019): training a single large model with NAS emits roughly as much CO2 as five cars over their entire lifetimes.
10.2 Generalization and Transfer
- Architectures searched on CIFAR-10 are not necessarily optimal on ImageNet (although in practice transfer is usually effective).
- Architectures searched on ImageNet-1k are not necessarily optimal on medical imaging or satellite imagery.
- Proxy-task bias: using small datasets / few epochs as a proxy can produce preferences that misalign with the actual large-scale task.
10.3 Interpretability
- Architectures searched by evolution/RL often have "weird" connections (skip chains, irregular op combinations) that humans struggle to understand.
- HPO returns "\(\eta = 3.7 \times 10^{-4}\)" as the optimum without offering principled insight.
- This creates tension with the Symbolists' / Bayesians' pursuit of interpretability.
10.4 Fairness and Reproducibility
Lindauer & Hutter (2020) systematically pointed out: many NAS papers compare different search algorithms with different search spaces, making it impossible to attribute the gains. The arrival of benchmarks such as NAS-Bench-101/201/301 finally enabled fair comparisons of search algorithms themselves.
11. Engineering Practice Checklist
A checklist to walk through when deploying an AutoML system:
- [ ] Is the search space reasonable? Too large → wasteful; too small → unable to find good solutions.
- [ ] Correlation between proxy task and target task? Run transfer experiments.
- [ ] Compute budget cap? Constrain both wall-clock and GPU-days.
- [ ] Early-stopping strategy? Hyperband / ASHA usually saves 5–10×.
- [ ] One-shot? Weigh supernet training cost against subnet inheritance accuracy.
- [ ] Multi-objective (accuracy + latency + memory)? Use NSGA-II / Pareto.
- [ ] Meta-learning warm-start? Can speed up cold starts.
- [ ] Monitor the search process: top-\(k\) diversity, convergence curves, hyperparameter importance.
- [ ] Final validation: full training + held-out test set + multiple seeds.
- [ ] Reproducibility: fix seeds, log all hyperparameters, containerize.
12. Cross-references
- Parent page: this page, §1 (tribe entry point) — Overall positioning and algorithmic lineage of the Evolutionaries
- Same directory: Genetic Algorithms and Genetic Programming — GA/GP is the engine of AutoML's evolutionary branch
- Same directory: Evolution Strategies and Neuroevolution — ES/CMA-ES are important in HPO and NAS
- Deep learning: ../../../1_DeepLearning/ — Details of NAS-discovered networks like AmoebaNet/DARTS
- Bayesians: The Bayesians — TPE/SMAC/BO form the Bayesian camp of HPO
References
- Beni, G., & Wang, J. (1989). Swarm intelligence in cellular robotic systems. NATO Advanced Workshop on Robots and Biological Systems.
- Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. Proceedings of ICNN, 4, 1942–1948. — The original PSO paper.
- Clerc, M., & Kennedy, J. (2002). The particle swarm: explosion, stability, and convergence in a multidimensional complex space. IEEE TEC, 6(1), 58–73.
- Liang, J. J., Qin, A. K., Suganthan, P. N., & Baskar, S. (2006). Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE TEC, 10(3), 281–295.
- Dorigo, M. (1992). Optimization, Learning and Natural Algorithms. PhD thesis, Politecnico di Milano. — The original ACO PhD thesis.
- Dorigo, M., & Gambardella, L. M. (1997). Ant colony system: A cooperative learning approach to TSP. IEEE TEC, 1(1), 53–66.
- Stützle, T., & Hoos, H. H. (2000). MAX-MIN ant system. Future Generation Computer Systems, 16(8), 889–914.
- Karaboga, D. (2005). An idea based on honey bee swarm for numerical optimization. Technical Report, Erciyes University.
- Yang, X.-S. (2008). Nature-Inspired Metaheuristic Algorithms. Luniver Press. — The Firefly Algorithm.
- Yang, X.-S., & Deb, S. (2009). Cuckoo search via Lévy flights. World Congress on Nature & Biologically Inspired Computing.
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, 13, 281–305.
- Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NeurIPS.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2017). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18, 1–52.
- Falkner, S., Klein, A., & Hutter, F. (2018). BOHB: Robust and efficient hyperparameter optimization at scale. ICML.
- Zoph, B., & Le, Q. V. (2017). Neural architecture search with reinforcement learning. ICLR. — The original NAS-RL paper.
- Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2019). Regularized evolution for image classifier architecture search. AAAI. — AmoebaNet.
- Liu, H., Simonyan, K., & Yang, Y. (2019). DARTS: Differentiable architecture search. ICLR.
- Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. ICML. — ENAS.
- Cai, H., Zhu, L., & Han, S. (2019). ProxylessNAS: Direct neural architecture search on target task and hardware. ICLR.
- Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. ACL.
- Lindauer, M., & Hutter, F. (2020). Best practices for scientific research on neural architecture search. JMLR, 21, 1–18.
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A next-generation hyperparameter optimization framework. KDD.
- Olson, R. S., & Moore, J. H. (2016). TPOT: A tree-based pipeline optimization tool for automating machine learning. Automated Machine Learning Workshop.