
RL in Scientific Discovery

Overview

Reinforcement learning has achieved breakthroughs not only in games; it also shows tremendous potential across scientific domains. Its core value in science lies in automating complex search and optimization processes and in discovering solutions that humans would struggle to conceive on their own.

Molecular Design and Drug Discovery

Problem Formulation

Molecular design can be modeled as a sequential decision problem:

  • State: Current molecular structure (graph or SMILES string)
  • Action: Add atom, add bond, modify functional group, etc.
  • Reward: Based on molecular properties (e.g., drug activity, synthesizability, toxicity)
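
As a concrete (and deliberately simplified) illustration, here is a minimal sketch of such an environment in Python; the token vocabulary and the `score_fn` property evaluator are hypothetical placeholders, not a specific published system:

```python
class MoleculeEnv:
    """Toy episodic MDP for molecular design with SMILES tokens as actions."""

    def __init__(self, vocab, score_fn, max_steps=40):
        self.vocab = vocab        # allowed tokens, e.g. ["C", "N", "O", "=", "(", ")", "<eos>"]
        self.score_fn = score_fn  # maps a finished SMILES string to a scalar reward
        self.max_steps = max_steps

    def reset(self):
        self.smiles, self.t = "", 0
        return self.smiles  # state: the molecule built so far

    def step(self, action):
        token = self.vocab[action]
        self.t += 1
        done = token == "<eos>" or self.t >= self.max_steps
        if token != "<eos>":
            self.smiles += token  # action: extend the molecule by one token
        # Reward is sparse: properties are only evaluated on the finished molecule.
        reward = self.score_fn(self.smiles) if done else 0.0
        return self.smiles, reward, done
```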

Methods

SMILES-based generation:

Represent molecules as SMILES strings, generate them token by token with an RNN or Transformer, and use RL to optimize target properties via a reward such as:

\[r = w_1 \cdot \text{Activity}(m) + w_2 \cdot \text{Synthesizability}(m) - w_3 \cdot \text{Toxicity}(m)\]
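
A direct transcription of this reward into code might look as follows; the three predictors are placeholders for trained QSAR models, synthetic-accessibility scores, and toxicity classifiers (all names here are illustrative):

```python
def predict_activity(smiles: str) -> float:
    return 0.0  # placeholder: a docking score or trained activity model

def predict_synthesizability(smiles: str) -> float:
    return 0.0  # placeholder: e.g. a synthetic-accessibility estimator

def predict_toxicity(smiles: str) -> float:
    return 0.0  # placeholder: a toxicity classifier

def molecule_reward(smiles: str, w1=1.0, w2=0.5, w3=1.0) -> float:
    """r = w1*Activity(m) + w2*Synthesizability(m) - w3*Toxicity(m)."""
    return (w1 * predict_activity(smiles)
            + w2 * predict_synthesizability(smiles)
            - w3 * predict_toxicity(smiles))
```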

Graph-based generation:

Treat molecules as graphs, using Graph Neural Networks (GNN) and RL to incrementally build molecular graphs.
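
To make the incremental-construction idea concrete, here is a minimal sketch using `networkx` graphs; the `(kind, args)` action encoding is an illustrative assumption, and the GNN policy that would choose the actions is omitted:

```python
import networkx as nx

def apply_action(mol: nx.Graph, action) -> nx.Graph:
    """Apply one graph-building action chosen by a (not shown) GNN policy."""
    kind, args = action
    if kind == "add_atom":
        (element,) = args
        mol.add_node(mol.number_of_nodes(), element=element)
    elif kind == "add_bond":
        u, v, order = args
        mol.add_edge(u, v, order=order)
    return mol

# Grow ethanol's heavy-atom skeleton (C-C-O) step by step.
mol = nx.Graph()
for a in [("add_atom", ("C",)), ("add_atom", ("C",)), ("add_atom", ("O",)),
          ("add_bond", (0, 1, 1)), ("add_bond", (1, 2, 1))]:
    mol = apply_action(mol, a)
```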

Multi-objective optimization:

Drug design typically requires simultaneous optimization of multiple properties (activity, selectivity, ADMET characteristics, etc.), using multi-objective RL or constrained optimization.
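
One simple (though not the only) way to encode this is weighted-sum scalarization with hard constraints, sketched below with illustrative property names and thresholds:

```python
def scalarized_reward(props: dict, weights: dict, minima: dict) -> float:
    """Weighted-sum scalarization; violating any hard constraint zeroes the reward."""
    if any(props[k] < lo for k, lo in minima.items()):
        return 0.0
    return sum(w * props[k] for k, w in weights.items())

r = scalarized_reward(
    props={"activity": 0.8, "selectivity": 0.6, "admet": 0.7},
    weights={"activity": 1.0, "selectivity": 0.5, "admet": 0.5},
    minima={"admet": 0.5},  # require a minimally acceptable ADMET profile
)
```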

Representative Work

  • REINVENT: Uses policy gradients to optimize molecular generators
  • MolDQN: Q-learning-based molecular optimization
  • ChemRL: Molecular design framework combining GNNs and RL

Challenges

  • Enormous chemical space (~\(10^{60}\) possible small molecules)
  • Reward functions depend on computational simulations or experimental validation (expensive)
  • Must ensure chemical validity of generated molecules
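
The validity requirement is commonly enforced with a cheminformatics toolkit such as RDKit, e.g. by rejecting (or zero-rewarding) any generated molecule that fails to parse and sanitize:

```python
from rdkit import Chem  # requires the `rdkit` package

def is_valid_smiles(smiles: str) -> bool:
    # MolFromSmiles returns None when parsing or valence sanitization fails.
    return Chem.MolFromSmiles(smiles) is not None

assert is_valid_smiles("CCO")                  # ethanol: valid
assert not is_valid_smiles("C(C)(C)(C)(C)C")   # pentavalent carbon: invalid
```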

Protein Science

AlphaFold's Connection to RL

Although AlphaFold relies primarily on supervised learning and attention mechanisms, several of its components have conceptual parallels with RL:

  • Structure search can be modeled as sequential decision-making
  • Sampling strategies resemble Monte Carlo methods
  • Iterative refinement resembles policy improvement

RL in Protein Design

  • Protein sequence design: Given a target structure, use RL to optimize amino acid sequences
  • Protein folding pathways: Dynamic decision-making simulating the folding process
  • Enzyme activity optimization: Optimizing catalytic efficiency through directed evolution simulation

Chip Design

Google's Chip Placement

Mirhoseini et al. (2021, Nature) used RL to optimize chip macro placement:

Problem formulation:

  • State: Current chip layout (placed macros on a grid)
  • Action: Place the next macro at a grid location
  • Reward: Combination of wirelength, congestion, and timing

\[r = -w_1 \cdot \text{Wirelength} - w_2 \cdot \text{Congestion} - w_3 \cdot \text{TimingViolation}\]
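
A minimal sketch of this reward, using half-perimeter wirelength (HPWL) as the standard fast wirelength proxy; the weights are illustrative, not the paper's values:

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin coordinates."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def placement_reward(nets, congestion: float, timing_violation: float,
                     w1: float = 1.0, w2: float = 0.5, w3: float = 2.0) -> float:
    """Negative weighted cost, matching the equation above.

    `congestion` and `timing_violation` would come from (approximate) EDA
    evaluators; here they are just scalar inputs.
    """
    wirelength = sum(hpwl(pins) for pins in nets)
    return -(w1 * wirelength + w2 * congestion + w3 * timing_violation)

# Example: two nets on a grid, plus dummy congestion/timing estimates.
r = placement_reward(nets=[[(0, 0), (3, 4)], [(1, 1), (2, 5), (4, 2)]],
                     congestion=0.2, timing_violation=0.0)
```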

Method:

  • Graph neural network encodes the chip netlist
  • Policy network uses attention mechanisms to handle chips of different scales
  • Transfer learning: Pretrain on multiple chip designs for rapid adaptation to new chips

Results:

  • Completed placement in 6 hours (human engineers need weeks)
  • Quality comparable to or better than human experts
  • Applied to Google TPU design

Controversy and Follow-up

  • Some researchers questioned whether RL truly outperforms traditional EDA tools
  • Subsequent work validated the method's effectiveness on larger-scale chips
  • Inspired further research applying RL to EDA workflows

Nuclear Fusion Plasma Control

DeepMind + EPFL Collaboration

Degrave et al. (2022, Nature) used RL to control plasma in a tokamak device:

Problem:

  • Sustaining nuclear fusion requires confining plasma in specific shapes
  • Plasma is extremely unstable, requiring real-time control of multiple magnetic field coils
  • Traditional control methods rely on extensive manual tuning

RL approach:

  • State: Plasma shape parameters, magnetic field measurements
  • Action: Voltages for 19 magnetic field coils
  • Reward: Plasma shape error + stability metrics
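
Schematically, the resulting control loop looks like the sketch below; the 19-coil action dimension comes from the paper, while the observation size and the linear placeholder policy are assumptions for illustration:

```python
import numpy as np

N_COILS = 19  # actively controlled coil voltages (action dimension in the paper)
OBS_DIM = 92  # illustrative assumption; the real observation vector differs

def control_step(policy, observation: np.ndarray) -> np.ndarray:
    """Map magnetic measurements + shape targets to clipped coil voltage commands."""
    voltages = policy(observation)       # shape: (N_COILS,)
    return np.clip(voltages, -1.0, 1.0)  # respect (normalized) actuator limits

# Placeholder linear policy standing in for the trained network.
rng = np.random.default_rng(0)
W = rng.normal(size=(N_COILS, OBS_DIM))
u = control_step(lambda obs: W @ obs, rng.normal(size=OBS_DIM))
```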

Training pipeline:

  1. Train in a physics simulator (a free-boundary plasma evolution model)
  2. Learn to produce multiple plasma shapes (elongated, snowflake, etc.)
  3. Validate on the TCV tokamak device

Results:

  • Successfully controlled multiple plasma configurations
  • Discovered novel control strategies not attempted by humans
  • Demonstrated RL's potential in complex physical system control

Mathematical Discovery

FunSearch

Romera-Paredes et al. (2024, Nature) used LLM + evolutionary search to discover new mathematical results:

Method:

  1. Encode mathematical problems as program search problems
  2. LLM generates candidate programs (solutions)
  3. Automatically evaluate program quality
  4. Evolutionary strategy selects and improves the best solutions
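
The loop can be sketched as follows; `llm_propose` and `evaluate` are placeholders for the paper's LLM sampler and automatic scorer, and the selection scheme here is plain truncation rather than FunSearch's island-based evolution:

```python
def program_search(llm_propose, evaluate, seed_programs, iters=1000, pool_size=20):
    """Schematic propose-evaluate-select loop in the spirit of FunSearch."""
    pool = [(evaluate(p), p) for p in seed_programs]
    for _ in range(iters):
        pool.sort(key=lambda sp: sp[0], reverse=True)
        pool = pool[:pool_size]                # truncation selection
        parents = [p for _, p in pool[:2]]     # exploit the current best programs
        child = llm_propose(parents)           # LLM mutation/crossover (exploration)
        pool.append((evaluate(child), child))  # automatic, exact scoring
    return max(pool, key=lambda sp: sp[0])     # best (score, program) found
```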

Achievements:

  • Discovered constructions surpassing the best known solutions for the cap set problem
  • Found new efficient heuristics for bin packing

Connection to RL:

  • The search process can be viewed as an exploration-exploitation tradeoff
  • The evaluation function resembles a reward signal
  • Evolutionary selection resembles policy improvement

Other Mathematical Applications

  • Using RL to discover new matrix multiplication algorithms (AlphaTensor)
  • Search strategies for assisting theorem proving
  • Heuristic discovery for combinatorial optimization problems

Robotics

RL applications in robotics constitute a vast independent field; see the dedicated sections for details.

Main directions:

  • Locomotion control
  • Dexterous manipulation
  • Navigation and planning
  • Human-robot interaction

Materials Science

Materials Discovery

  • New material search: Searching for materials with target properties in vast composition spaces
  • Synthesis pathway planning: Determining preparation steps for materials
  • Property optimization: Tuning process parameters to optimize material performance

Battery Materials

Using RL to optimize battery charging/discharging strategies:

  • Extending battery lifespan
  • Optimizing charging speed
  • Balancing performance and safety

Catalyst Design

RL-assisted search for optimal catalyst combinations:

  • State: Catalyst composition and structure
  • Action: Composition adjustments
  • Reward: Catalytic activity and selectivity
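
As a deliberately simple stand-in for this loop, the sketch below does greedy local search over a one-dimensional composition fraction; `evaluate` would in practice be a simulation or experiment returning an activity/selectivity score:

```python
import random

def catalyst_search(evaluate, n_steps=200, step=0.05, seed=0):
    """Greedy local search over a binary-catalyst composition fraction x in [0, 1]."""
    rng = random.Random(seed)
    x = 0.5
    best = evaluate(x)
    for _ in range(n_steps):
        cand = min(1.0, max(0.0, x + rng.uniform(-step, step)))  # action: adjust composition
        r = evaluate(cand)                                       # reward: activity/selectivity
        if r > best:                                             # keep improvements
            x, best = cand, r
    return x, best

# Toy evaluator whose optimum sits near x = 0.3 (purely illustrative).
best_x, best_r = catalyst_search(lambda x: -(x - 0.3) ** 2)
```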

Cross-Domain Commonalities

Common Patterns of RL in Scientific Applications

| Element      | Pattern                                                                 |
|--------------|-------------------------------------------------------------------------|
| State space  | Typically high-dimensional and structured (graphs, sequences, fields)   |
| Action space | Design/control parameters                                               |
| Reward       | Based on simulation or experimental evaluation                          |
| Challenges   | Sample efficiency, sparse rewards, high validation costs                |
| Advantages   | Automated search, novel strategy discovery, surpassing human intuition  |

Key Success Factors

  1. Good problem formulation: Correctly mapping the scientific problem to an MDP
  2. Efficient simulators: RL requires extensive interaction, needing fast and accurate simulation
  3. Domain knowledge integration: Reward design and state representation require domain expertise
  4. Transfer learning: Effective transfer from simulation to experiment
  5. Multi-objective optimization: Scientific problems typically involve multiple competing objectives

References

  • Mirhoseini et al., "A Graph Placement Methodology for Fast Chip Design" (Nature 2021)
  • Degrave et al., "Magnetic Control of Tokamak Plasmas through Deep Reinforcement Learning" (Nature 2022)
  • Romera-Paredes et al., "Mathematical Discoveries from Program Search with Large Language Models" (Nature 2024)
  • Fawzi et al., "Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning" (Nature 2022)
  • Zhou et al., "Optimization of Molecules via Deep Reinforcement Learning" (Scientific Reports 2019)
  • Jumper et al., "Highly Accurate Protein Structure Prediction with AlphaFold" (Nature 2021)
