RL in Scientific Discovery
Overview
Reinforcement learning has achieved breakthroughs not only in games; it has also shown substantial promise across scientific domains. Its core value in science lies in automating complex search and optimization processes, and in discovering solutions that humans would struggle to conceive.
Molecular Design and Drug Discovery
Problem Formulation
Molecular design can be modeled as a sequential decision problem:
- State: Current molecular structure (graph or SMILES string)
- Action: Add atom, add bond, modify functional group, etc.
- Reward: Based on molecular properties (e.g., drug activity, synthesizability, toxicity)
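The formulation above can be made concrete with a toy environment. Everything here is illustrative: the atom vocabulary, the length-based episode termination, and the "exactly one oxygen" reward are made-up stand-ins for real property oracles, and no valence checking or chemistry is performed.

```python
import random

random.seed(0)

class ToyMoleculeEnv:
    """Hypothetical molecular-editing MDP. States are SMILES-like
    strings, actions append an atom symbol, and the reward is a made-up
    terminal property check -- purely to show the state/action/reward
    structure, with no real chemistry."""

    ATOMS = ["C", "N", "O"]        # action space: which atom to append
    TARGET_LEN = 5                 # episode ends at this molecule size

    def reset(self):
        self.state = "C"           # start from a single carbon scaffold
        return self.state

    def step(self, action):
        self.state += self.ATOMS[action]
        done = len(self.state) >= self.TARGET_LEN
        # Sparse terminal reward based on the final "property".
        reward = 1.0 if done and self.state.count("O") == 1 else 0.0
        return self.state, reward, done

env = ToyMoleculeEnv()
state, done = env.reset(), False
while not done:
    state, reward, done = env.step(random.randrange(3))
print(state, reward)
```

A real system would replace the string state with a molecular graph or validated SMILES, and the terminal check with a property predictor or docking score.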
Methods
SMILES-based generation:
Represent molecules as SMILES strings, generate them character by character with an RNN or Transformer, and optimize target properties with RL.
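A minimal sketch of the RL fine-tuning idea: a position-independent softmax policy over a four-character vocabulary stands in for the RNN/Transformer generator, and a made-up "fraction of carbons" reward stands in for a property predictor. REINFORCE nudges the policy toward characters that appear in high-reward samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["C", "N", "O", "="]          # tiny SMILES-like vocabulary (toy)
SEQ_LEN = 6

# Position-independent softmax policy (a stand-in for a sequence model).
logits = np.zeros(len(VOCAB))

def sample_sequence():
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(VOCAB), size=SEQ_LEN, p=probs)
    return idx, probs

def reward(idx):
    # Hypothetical property proxy: fraction of carbons in the string.
    return np.mean(idx == 0)

# REINFORCE: grad log pi(i) = one_hot(i) - probs for a softmax policy.
lr = 0.5
for _ in range(200):
    idx, probs = sample_sequence()
    r = reward(idx)
    for i in idx:
        grad = -probs
        grad[i] += 1.0
        logits += lr * r * grad

print(VOCAB[int(np.argmax(logits))])
```

Because the reward favors carbons, the policy drifts toward carbon-rich strings; a real setup would add a baseline to reduce gradient variance and a prior-likelihood term to keep generated strings chemically valid.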
Graph-based generation:
Treat molecules as graphs, using Graph Neural Networks (GNN) and RL to incrementally build molecular graphs.
Multi-objective optimization:
Drug design typically requires simultaneous optimization of multiple properties (activity, selectivity, ADMET characteristics, etc.), using multi-objective RL or constrained optimization.
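One common approach is weighted-sum scalarization of the competing objectives. The weights and normalization below are illustrative, not taken from any specific paper:

```python
def drug_reward(activity, synthesizability, toxicity,
                weights=(0.5, 0.3, 0.2)):
    """Weighted-sum scalarization of competing objectives. The weights
    are illustrative; all inputs are assumed normalized to [0, 1], with
    toxicity entering as a penalty."""
    w_act, w_syn, w_tox = weights
    return w_act * activity + w_syn * synthesizability - w_tox * toxicity

# High activity, decent synthesizability, low toxicity -> good score.
print(drug_reward(0.9, 0.7, 0.2))
```

Constrained formulations instead treat some objectives (e.g., toxicity below a threshold) as hard constraints rather than penalty terms, which avoids hand-tuning the weights.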
Representative Work
- REINVENT: Uses policy gradients to optimize molecular generators
- MolDQN: Q-learning-based molecular optimization
- GCPN (Graph Convolutional Policy Network): molecular graph generation combining GNNs and RL
Challenges
- Enormous chemical space (~\(10^{60}\) possible small molecules)
- Reward functions depend on computational simulations or experimental validation (expensive)
- Must ensure chemical validity of generated molecules
Protein Science
AlphaFold's Connection to RL
Although AlphaFold primarily relies on supervised learning and attention mechanisms, it has deep connections to RL:
- Structure search can be modeled as sequential decision-making
- Sampling strategies resemble Monte Carlo methods
- Iterative refinement resembles policy improvement
RL in Protein Design
- Protein sequence design: Given a target structure, use RL to optimize amino acid sequences
- Protein folding pathways: Dynamic decision-making simulating the folding process
- Enzyme activity optimization: Optimizing catalytic efficiency through directed evolution simulation
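The directed-evolution idea in the last bullet can be sketched as mutate-and-select over sequences. The score function here is a hypothetical stand-in for a structure predictor or experimental assay (which is precisely what makes each real evaluation expensive):

```python
import random

rng = random.Random(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(seq, target="MKT"):
    # Hypothetical fitness oracle: similarity to a repeated motif. A
    # real oracle would be a folding model or a lab measurement.
    return sum(a == b for a, b in zip(seq, target * len(seq)))

def mutate_and_accept(seq):
    """One step of directed-evolution-style search: propose a point
    mutation and keep it if the score does not decrease."""
    pos = rng.randrange(len(seq))
    cand = seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]
    return cand if score(cand) >= score(seq) else seq

seq = "A" * 9
for _ in range(500):
    seq = mutate_and_accept(seq)
print(seq, score(seq))
```

RL-based sequence design replaces the blind mutation proposal with a learned policy, which matters when each oracle call is too costly for random search.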
Chip Design
Google's Chip Placement
Mirhoseini et al. (2021, Nature) used RL to optimize chip macro placement:
Problem formulation:
- State: Current chip layout (placed macros on a grid)
- Action: Place the next macro at a grid location
- Reward: Combination of wirelength, congestion, and timing
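The wirelength term of such a reward is typically measured as half-perimeter wirelength (HPWL). A toy sketch with made-up macros and nets, using exhaustive search in place of the learned policy on an instance this small:

```python
import itertools

def hpwl(positions, nets):
    """Half-perimeter wirelength (HPWL): for each net, the half-perimeter
    of the bounding box of its macros' grid positions."""
    total = 0
    for net in nets:
        xs = [positions[m][0] for m in net]
        ys = [positions[m][1] for m in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Made-up instance: three macros, two nets, a 2x2 grid.
NETS = [("A", "B"), ("B", "C")]
CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]

best = None
for spots in itertools.permutations(CELLS, 3):
    layout = dict(zip("ABC", spots))
    wl = hpwl(layout, NETS)
    if best is None or wl < best[0]:
        best = (wl, layout)
print(best)
```

Real placements involve thousands of macros on large grids, where exhaustive search is hopeless and the sequential place-one-macro-at-a-time policy becomes valuable.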
Method:
- Graph neural network encodes the chip netlist
- Policy network uses attention mechanisms to handle chips of different scales
- Transfer learning: Pretrain on multiple chip designs for rapid adaptation to new chips
Results:
- Completed placement in under 6 hours (human engineers need months)
- Quality comparable to or better than human experts
- Applied to Google TPU design
Controversy and Follow-up
- Some researchers questioned whether RL truly outperforms traditional EDA tools
- Independent replication attempts reported mixed results, and debate over benchmarks and baselines continued
- Inspired further research applying RL to EDA workflows
Nuclear Fusion Plasma Control
DeepMind + EPFL Collaboration
Degrave et al. (2022, Nature) used RL to control plasma in a tokamak device:
Problem:
- Sustaining nuclear fusion requires confining plasma in specific shapes
- Plasma is extremely unstable, requiring real-time control of multiple magnetic field coils
- Traditional control methods rely on extensive manual tuning
RL approach:
- State: Plasma shape parameters, magnetic field measurements
- Action: Voltages for 19 magnetic field coils
- Reward: Plasma shape error + stability metrics
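The control structure can be sketched on a one-dimensional toy surrogate (made up, not the paper's physics simulator): an unstable linear system stands in for one plasma degree of freedom, and a gain sweep over a proportional feedback "policy" stands in for the paper's policy optimization against cumulative squared shape error.

```python
import numpy as np

# Toy surrogate for a single unstable plasma degree of freedom:
# the state grows exponentially unless actively damped.
A, B = 1.2, 0.5                    # illustrative open-loop and actuation gains

def step(x, u):
    return A * x + B * u           # unstable open loop, since A > 1

def policy(x, k):
    return -k * x                  # proportional feedback on the shape error

def episode_cost(k, x0=1.0, horizon=20):
    """Cumulative squared error under gain k; the RL reward would be
    the negative of this quantity."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        x = step(x, policy(x, k))
        cost += x * x
    return cost

# Crude stand-in for policy optimization: sweep the feedback gain.
gains = np.linspace(0.0, 4.0, 81)
best_k = gains[np.argmin([episode_cost(k) for k in gains])]
print(best_k)
```

The real controller maps dozens of magnetic measurements to 19 coil voltages with a neural network, but the loop structure (observe state, act, penalize shape error) is the same.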
Training pipeline:
- Train in a free-boundary plasma simulator (FGE)
- Learn to produce multiple plasma shapes (elongated, snowflake, etc.)
- Validate on the TCV tokamak device
Results:
- Successfully controlled multiple plasma configurations
- Discovered novel control strategies not attempted by humans
- Demonstrated RL's potential in complex physical system control
Mathematical Discovery
FunSearch
Romera-Paredes et al. (2024, Nature) used LLM + evolutionary search to discover new mathematical results:
Method:
- Encode mathematical problems as program search problems
- LLM generates candidate programs (solutions)
- Automatically evaluate program quality
- Evolutionary strategy selects and improves the best solutions
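A toy sketch of this evolve-and-evaluate loop on online bin packing (one of FunSearch's actual problem domains): a mutated weight vector stands in for LLM-generated candidate programs, and the items, capacity, and linear priority rule are made up.

```python
import random

rng = random.Random(0)
CAPACITY = 1.0
ITEMS = [0.4, 0.7, 0.2, 0.5, 0.3, 0.6, 0.1, 0.8]

def pack(items, priority):
    """Online bin packing: each item goes to the feasible bin with the
    highest priority score; 'priority' is the heuristic being evolved."""
    bins = []
    for item in items:
        feasible = [b for b in range(len(bins)) if bins[b] + item <= CAPACITY]
        if feasible:
            b = max(feasible, key=lambda b: priority(item, CAPACITY - bins[b]))
            bins[b] += item
        else:
            bins.append(item)
    return len(bins)

# Candidate "programs" are weight vectors for a linear priority rule --
# a stand-in for the LLM-generated code that FunSearch evolves.
def make_priority(w):
    return lambda item, space: w[0] * space + w[1] * (space - item)

def mutate(w):
    return [wi + rng.gauss(0, 0.5) for wi in w]

# Evolutionary loop: keep the best candidate, propose variants.
best_w = [1.0, 0.0]
best_score = pack(ITEMS, make_priority(best_w))
for _ in range(100):
    cand = mutate(best_w)
    score = pack(ITEMS, make_priority(cand))
    if score <= best_score:           # fewer bins is better; ties may drift
        best_w, best_score = cand, score
print(best_score)
```

On this instance the search discovers best-fit-like weights that reach the lower bound of 4 bins; FunSearch does the same at scale, but mutates actual program text and evaluates on large benchmark suites.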
Achievements:
- Discovered constructions surpassing the best known solutions for the cap set problem
- Found new efficient heuristics for bin packing
Connection to RL:
- The search process can be viewed as an exploration-exploitation tradeoff
- The evaluation function resembles a reward signal
- Evolutionary selection resembles policy improvement
Other Mathematical Applications
- Using RL to discover new matrix multiplication algorithms (AlphaTensor)
- Search strategies for assisting theorem proving
- Heuristic discovery for combinatorial optimization problems
Robotics
RL applications in robotics constitute a vast independent field; see the dedicated sections for details.
Main directions:
- Locomotion control
- Dexterous manipulation
- Navigation and planning
- Human-robot interaction
Materials Science
Materials Discovery
- New material search: Searching for materials with target properties in vast composition spaces
- Synthesis pathway planning: Determining preparation steps for materials
- Property optimization: Tuning process parameters to optimize material performance
Battery Materials
Using RL to optimize battery charging/discharging strategies:
- Extending battery lifespan
- Optimizing charging speed
- Balancing performance and safety
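The tradeoff in these bullets can be sketched with tabular Q-learning on a made-up battery MDP, where fast charging saves time but incurs a degradation penalty that grows with state of charge (all constants are illustrative):

```python
import random

rng = random.Random(0)

# Toy battery-charging MDP: state = state of charge in deciles,
# actions = slow (+1) or fast (+2) charging. Each step costs time, and
# fast charging adds damage that grows with state of charge.
N_SOC = 10

def charge_step(soc, action):
    rate = 1 if action == 0 else 2            # 0 = slow, 1 = fast
    nxt = min(soc + rate, N_SOC)
    degradation = 0.25 * (rate - 1) * soc     # only fast charging hurts
    reward = -1.0 - degradation               # -1 per time step elapsed
    return nxt, reward, nxt == N_SOC

# Tabular Q-learning over 11 charge levels and 2 actions.
Q = [[0.0, 0.0] for _ in range(N_SOC + 1)]
alpha, eps = 0.2, 0.1
for _ in range(5000):
    soc, done = 0, False
    while not done:
        if rng.random() < eps:
            a = rng.randrange(2)              # explore
        else:
            a = 0 if Q[soc][0] >= Q[soc][1] else 1
        nxt, r, done = charge_step(soc, a)
        target = r if done else r + max(Q[nxt])
        Q[soc][a] += alpha * (target - Q[soc][a])
        soc = nxt

policy = ["fast" if Q[s][1] > Q[s][0] else "slow" for s in range(N_SOC)]
print(policy)
```

Under these made-up constants the learned policy charges fast at low state of charge and switches to slow charging near full, which is the qualitative speed-versus-lifespan balance described above.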
Catalyst Design
RL-assisted search for optimal catalyst combinations:
- State: Catalyst composition and structure
- Action: Composition adjustments
- Reward: Catalytic activity and selectivity
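A sketch of this search loop for a hypothetical two-component catalyst, with a made-up activity landscape standing in for the DFT calculation or lab experiment that would score each composition in practice:

```python
import random

rng = random.Random(0)

def activity(x):
    # Hypothetical activity landscape for a binary catalyst with mixing
    # fraction x in [0, 1]; the optimum at x = 0.3 is arbitrary. A real
    # evaluation would be a simulation or an experiment.
    return -(x - 0.3) ** 2

# Local-search "agent": actions are small composition adjustments,
# accepted whenever the evaluated activity improves.
x, best = 0.9, activity(0.9)
for _ in range(300):
    cand = min(max(x + rng.gauss(0, 0.05), 0.0), 1.0)
    if activity(cand) > best:
        x, best = cand, activity(cand)
print(round(x, 3))
```

With noisy, expensive evaluations, this blind local search is replaced by sample-efficient methods (learned policies, Bayesian optimization) that decide which composition to try next.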
Cross-Domain Commonalities
Common Patterns of RL in Scientific Applications
| Element | Pattern |
|---|---|
| State space | Typically high-dimensional and structured (graphs, sequences, fields) |
| Action space | Design/control parameters |
| Reward | Based on simulation or experimental evaluation |
| Challenges | Sample efficiency, sparse rewards, high validation costs |
| Advantages | Automated search, novel strategy discovery, surpassing human intuition |
Key Success Factors
- Good problem formulation: Correctly mapping the scientific problem to an MDP
- Efficient simulators: RL requires extensive interaction, so fast and accurate simulators are essential
- Domain knowledge integration: Reward design and state representation require domain expertise
- Transfer learning: Effective transfer from simulation to experiment
- Multi-objective optimization: Scientific problems typically involve multiple competing objectives
References
- Mirhoseini et al., "A Graph Placement Methodology for Fast Chip Design" (Nature 2021)
- Degrave et al., "Magnetic Control of Tokamak Plasmas through Deep Reinforcement Learning" (Nature 2022)
- Romera-Paredes et al., "Mathematical Discoveries from Program Search with Large Language Models" (Nature 2024)
- Fawzi et al., "Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning" (Nature 2022)
- Zhou et al., "Optimization of Molecules via Deep Reinforcement Learning" (Scientific Reports 2019)
- Jumper et al., "Highly Accurate Protein Structure Prediction with AlphaFold" (Nature 2021)