
RL in Scientific Discovery

Overview

Reinforcement learning has achieved breakthroughs not only in games; it also shows tremendous potential across scientific domains. Its core value in science lies in automating complex search and optimization processes and in discovering solutions that humans would struggle to conceive on their own.

Molecular Design and Drug Discovery

Problem Formulation

Molecular design can be modeled as a sequential decision problem:

  • State: Current molecular structure (graph or SMILES string)
  • Action: Add atom, add bond, modify functional group, etc.
  • Reward: Based on molecular properties (e.g., drug activity, synthesizability, toxicity)
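
As a concrete (and deliberately simplified) illustration, here is a minimal sketch of such an environment in Python; the token vocabulary and the `score_fn` property evaluator are hypothetical placeholders, not a specific published system:

```python
class MoleculeEnv:
    """Toy episodic MDP for molecular design with SMILES tokens as actions."""

    def __init__(self, vocab, score_fn, max_steps=40):
        self.vocab = vocab        # allowed tokens, e.g. ["C", "N", "O", "=", "(", ")", "<eos>"]
        self.score_fn = score_fn  # maps a finished SMILES string to a scalar reward
        self.max_steps = max_steps

    def reset(self):
        self.smiles, self.t = "", 0
        return self.smiles  # state: the molecule built so far

    def step(self, action):
        token = self.vocab[action]
        self.t += 1
        done = token == "<eos>" or self.t >= self.max_steps
        if token != "<eos>":
            self.smiles += token  # action: extend the molecule by one token
        # Reward is sparse: properties are only evaluated on the finished molecule.
        reward = self.score_fn(self.smiles) if done else 0.0
        return self.smiles, reward, done
```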

Methods

SMILES-based generation:

Represent molecules as SMILES strings, generate them token by token with an RNN or Transformer, and use RL to optimize target properties via a reward such as:

\[r = w_1 \cdot \text{Activity}(m) + w_2 \cdot \text{Synthesizability}(m) - w_3 \cdot \text{Toxicity}(m)\]
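
A direct transcription of this reward into code might look as follows; the three predictors are placeholders for trained QSAR models, synthetic-accessibility scores, and toxicity classifiers (all names here are illustrative):

```python
def predict_activity(smiles: str) -> float:
    return 0.0  # placeholder: a docking score or trained activity model

def predict_synthesizability(smiles: str) -> float:
    return 0.0  # placeholder: e.g. a synthetic-accessibility estimator

def predict_toxicity(smiles: str) -> float:
    return 0.0  # placeholder: a toxicity classifier

def molecule_reward(smiles: str, w1=1.0, w2=0.5, w3=1.0) -> float:
    """r = w1*Activity(m) + w2*Synthesizability(m) - w3*Toxicity(m)."""
    return (w1 * predict_activity(smiles)
            + w2 * predict_synthesizability(smiles)
            - w3 * predict_toxicity(smiles))
```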

Graph-based generation:

Treat molecules as graphs, using Graph Neural Networks (GNN) and RL to incrementally build molecular graphs.
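
To make the incremental-construction idea concrete, here is a minimal sketch using `networkx` graphs; the `(kind, args)` action encoding is an illustrative assumption, and the GNN policy that would choose the actions is omitted:

```python
import networkx as nx

def apply_action(mol: nx.Graph, action) -> nx.Graph:
    """Apply one graph-building action chosen by a (not shown) GNN policy."""
    kind, args = action
    if kind == "add_atom":
        (element,) = args
        mol.add_node(mol.number_of_nodes(), element=element)
    elif kind == "add_bond":
        u, v, order = args
        mol.add_edge(u, v, order=order)
    return mol

# Grow ethanol's heavy-atom skeleton (C-C-O) step by step.
mol = nx.Graph()
for a in [("add_atom", ("C",)), ("add_atom", ("C",)), ("add_atom", ("O",)),
          ("add_bond", (0, 1, 1)), ("add_bond", (1, 2, 1))]:
    mol = apply_action(mol, a)
```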

Multi-objective optimization:

Drug design typically requires simultaneous optimization of multiple properties (activity, selectivity, ADMET characteristics, etc.), using multi-objective RL or constrained optimization.
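
One simple (though not the only) way to encode this is weighted-sum scalarization with hard constraints, sketched below with illustrative property names and thresholds:

```python
def scalarized_reward(props: dict, weights: dict, minima: dict) -> float:
    """Weighted-sum scalarization; violating any hard constraint zeroes the reward."""
    if any(props[k] < lo for k, lo in minima.items()):
        return 0.0
    return sum(w * props[k] for k, w in weights.items())

r = scalarized_reward(
    props={"activity": 0.8, "selectivity": 0.6, "admet": 0.7},
    weights={"activity": 1.0, "selectivity": 0.5, "admet": 0.5},
    minima={"admet": 0.5},  # require a minimally acceptable ADMET profile
)
```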

Representative Work

  • REINVENT: Uses policy gradients to optimize molecular generators
  • MolDQN: Q-learning-based molecular optimization
  • ChemRL: Molecular design framework combining GNNs and RL

Challenges

  • Enormous chemical space (~\(10^{60}\) possible small molecules)
  • Reward functions depend on computational simulations or experimental validation (expensive)
  • Must ensure chemical validity of generated molecules
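
The validity requirement is commonly enforced with a cheminformatics toolkit such as RDKit, e.g. by rejecting (or zero-rewarding) any generated molecule that fails to parse and sanitize:

```python
from rdkit import Chem  # requires the `rdkit` package

def is_valid_smiles(smiles: str) -> bool:
    # MolFromSmiles returns None when parsing or valence sanitization fails.
    return Chem.MolFromSmiles(smiles) is not None

assert is_valid_smiles("CCO")                  # ethanol: valid
assert not is_valid_smiles("C(C)(C)(C)(C)C")   # pentavalent carbon: invalid
```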

Protein Science

AlphaFold's Connection to RL

Although AlphaFold relies primarily on supervised learning and attention mechanisms, several of its components have conceptual parallels with RL:

  • Structure search can be modeled as sequential decision-making
  • Sampling strategies resemble Monte Carlo methods
  • Iterative refinement resembles policy improvement

RL in Protein Design

  • Protein sequence design: Given a target structure, use RL to optimize amino acid sequences
  • Protein folding pathways: Dynamic decision-making simulating the folding process
  • Enzyme activity optimization: Optimizing catalytic efficiency through directed evolution simulation

Chip Design

Google's Chip Placement

Mirhoseini et al. (2021, Nature) used RL to optimize chip macro placement:

Problem formulation:

  • State: Current chip layout (placed macros on a grid)
  • Action: Place the next macro at a grid location
  • Reward: Combination of wirelength, congestion, and timing

\[r = -w_1 \cdot \text{Wirelength} - w_2 \cdot \text{Congestion} - w_3 \cdot \text{TimingViolation}\]
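
A minimal sketch of this reward, using half-perimeter wirelength (HPWL) as the standard fast wirelength proxy; the weights are illustrative, not the paper's values:

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin coordinates."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def placement_reward(nets, congestion: float, timing_violation: float,
                     w1: float = 1.0, w2: float = 0.5, w3: float = 2.0) -> float:
    """Negative weighted cost, matching the equation above.

    `congestion` and `timing_violation` would come from (approximate) EDA
    evaluators; here they are just scalar inputs.
    """
    wirelength = sum(hpwl(pins) for pins in nets)
    return -(w1 * wirelength + w2 * congestion + w3 * timing_violation)

# Example: two nets on a grid, plus dummy congestion/timing estimates.
r = placement_reward(nets=[[(0, 0), (3, 4)], [(1, 1), (2, 5), (4, 2)]],
                     congestion=0.2, timing_violation=0.0)
```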

Method:

  • Graph neural network encodes the chip netlist
  • Policy network uses attention mechanisms to handle chips of different scales
  • Transfer learning: Pretrain on multiple chip designs for rapid adaptation to new chips

Results:

  • Completed placement in 6 hours (human engineers need weeks)
  • Quality comparable to or better than human experts
  • Applied to Google TPU design

Controversy and Follow-up

  • Some researchers questioned whether RL truly outperforms traditional EDA tools
  • Subsequent work validated the method's effectiveness on larger-scale chips
  • Inspired further research applying RL to EDA workflows

Nuclear Fusion Plasma Control

DeepMind + EPFL Collaboration

Degrave et al. (2022, Nature) used RL to control plasma in a tokamak device:

Problem:

  • Sustaining nuclear fusion requires confining plasma in specific shapes
  • Plasma is extremely unstable, requiring real-time control of multiple magnetic field coils
  • Traditional control methods rely on extensive manual tuning

RL approach:

  • State: Plasma shape parameters, magnetic field measurements
  • Action: Voltages for 19 magnetic field coils
  • Reward: Plasma shape error + stability metrics
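
Schematically, the resulting control loop looks like the sketch below; the 19-coil action dimension comes from the paper, while the observation size and the linear placeholder policy are assumptions for illustration:

```python
import numpy as np

N_COILS = 19  # actively controlled coil voltages (action dimension in the paper)
OBS_DIM = 92  # illustrative assumption; the real observation vector differs

def control_step(policy, observation: np.ndarray) -> np.ndarray:
    """Map magnetic measurements + shape targets to clipped coil voltage commands."""
    voltages = policy(observation)       # shape: (N_COILS,)
    return np.clip(voltages, -1.0, 1.0)  # respect (normalized) actuator limits

# Placeholder linear policy standing in for the trained network.
rng = np.random.default_rng(0)
W = rng.normal(size=(N_COILS, OBS_DIM))
u = control_step(lambda obs: W @ obs, rng.normal(size=OBS_DIM))
```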

Training pipeline:

  1. Train in a physics simulator (a free-boundary plasma evolution model)
  2. Learn to produce multiple plasma shapes (elongated, snowflake, etc.)
  3. Validate on the TCV tokamak device

Results:

  • Successfully controlled multiple plasma configurations
  • Discovered novel control strategies not attempted by humans
  • Demonstrated RL's potential in complex physical system control

Mathematical Discovery

FunSearch

Romera-Paredes et al. (2024, Nature) used LLM + evolutionary search to discover new mathematical results:

Method:

  1. Encode mathematical problems as program search problems
  2. LLM generates candidate programs (solutions)
  3. Automatically evaluate program quality
  4. Evolutionary strategy selects and improves the best solutions
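
The loop can be sketched as follows; `llm_propose` and `evaluate` are placeholders for the paper's LLM sampler and automatic scorer, and the selection scheme here is plain truncation rather than FunSearch's island-based evolution:

```python
def program_search(llm_propose, evaluate, seed_programs, iters=1000, pool_size=20):
    """Schematic propose-evaluate-select loop in the spirit of FunSearch."""
    pool = [(evaluate(p), p) for p in seed_programs]
    for _ in range(iters):
        pool.sort(key=lambda sp: sp[0], reverse=True)
        pool = pool[:pool_size]                # truncation selection
        parents = [p for _, p in pool[:2]]     # exploit the current best programs
        child = llm_propose(parents)           # LLM mutation/crossover (exploration)
        pool.append((evaluate(child), child))  # automatic, exact scoring
    return max(pool, key=lambda sp: sp[0])     # best (score, program) found
```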

Achievements:

  • Discovered constructions surpassing the best known solutions for the cap set problem
  • Found new efficient heuristics for bin packing

Connection to RL:

  • The search process can be viewed as an exploration-exploitation tradeoff
  • The evaluation function resembles a reward signal
  • Evolutionary selection resembles policy improvement

Other Mathematical Applications

  • Using RL to discover new matrix multiplication algorithms (AlphaTensor)
  • Search strategies for assisting theorem proving
  • Heuristic discovery for combinatorial optimization problems

Robotics

RL applications in robotics constitute a vast independent field; see the dedicated sections for details.

Main directions:

  • Locomotion control
  • Dexterous manipulation
  • Navigation and planning
  • Human-robot interaction

Materials Science

Materials Discovery

  • New material search: Searching for materials with target properties in vast composition spaces
  • Synthesis pathway planning: Determining preparation steps for materials
  • Property optimization: Tuning process parameters to optimize material performance

Battery Materials

Using RL to optimize battery charging/discharging strategies:

  • Extending battery lifespan
  • Optimizing charging speed
  • Balancing performance and safety

Catalyst Design

RL-assisted search for optimal catalyst combinations:

  • State: Catalyst composition and structure
  • Action: Composition adjustments
  • Reward: Catalytic activity and selectivity
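
As a deliberately simple stand-in for this loop, the sketch below does greedy local search over a one-dimensional composition fraction; `evaluate` would in practice be a simulation or experiment returning an activity/selectivity score:

```python
import random

def catalyst_search(evaluate, n_steps=200, step=0.05, seed=0):
    """Greedy local search over a binary-catalyst composition fraction x in [0, 1]."""
    rng = random.Random(seed)
    x = 0.5
    best = evaluate(x)
    for _ in range(n_steps):
        cand = min(1.0, max(0.0, x + rng.uniform(-step, step)))  # action: adjust composition
        r = evaluate(cand)                                       # reward: activity/selectivity
        if r > best:                                             # keep improvements
            x, best = cand, r
    return x, best

# Toy evaluator whose optimum sits near x = 0.3 (purely illustrative).
best_x, best_r = catalyst_search(lambda x: -(x - 0.3) ** 2)
```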

Cross-Domain Commonalities

Common Patterns of RL in Scientific Applications

| Element      | Pattern                                                                 |
|--------------|-------------------------------------------------------------------------|
| State space  | Typically high-dimensional and structured (graphs, sequences, fields)   |
| Action space | Design/control parameters                                               |
| Reward       | Based on simulation or experimental evaluation                          |
| Challenges   | Sample efficiency, sparse rewards, high validation costs                |
| Advantages   | Automated search, novel strategy discovery, surpassing human intuition  |

Key Success Factors

  1. Good problem formulation: Correctly mapping the scientific problem to an MDP
  2. Efficient simulators: RL requires extensive interaction, needing fast and accurate simulation
  3. Domain knowledge integration: Reward design and state representation require domain expertise
  4. Transfer learning: Effective transfer from simulation to experiment
  5. Multi-objective optimization: Scientific problems typically involve multiple competing objectives

References

  • Mirhoseini et al., "A Graph Placement Methodology for Fast Chip Design" (Nature 2021)
  • Degrave et al., "Magnetic Control of Tokamak Plasmas through Deep Reinforcement Learning" (Nature 2022)
  • Romera-Paredes et al., "Mathematical Discoveries from Program Search with Large Language Models" (Nature 2024)
  • Fawzi et al., "Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning" (Nature 2022)
  • Zhou et al., "Optimization of Molecules via Deep Reinforcement Learning" (Scientific Reports 2019)
  • Jumper et al., "Highly Accurate Protein Structure Prediction with AlphaFold" (Nature 2021)
