
Frontier Advances in Reasoning

Overview

Between 2024 and 2025, LLM reasoning capabilities advanced dramatically. OpenAI's o1/o3 series, DeepSeek-R1, and the characterization of inference-time scaling laws mark AI reasoning's transition from prompt engineering to systematic optimization at both training time and inference time. This article surveys the latest advances in reasoning models and the trends ahead.


1. Paradigm Shift in Reasoning Models

1.1 From Prompting to Training

```mermaid
graph LR
    A[Phase 1<br/>Prompt Engineering<br/>2022-2023] --> B[Phase 2<br/>Reasoning Fine-Tuning<br/>2024]
    B --> C[Phase 3<br/>Reasoning-Native Models<br/>2024-2025]
    C --> D[Phase 4<br/>Reasoning Scaling<br/>2025-]

    A -.-> |CoT, ToT| A
    B -.-> |RL-trained reasoning| B
    C -.-> |o1, R1| C
    D -.-> |Test-time Compute| D
```

| Phase | Method | Source of Reasoning Ability | Representative |
|---|---|---|---|
| Prompt Engineering | Design prompt templates | Eliciting existing model capability | CoT, ToT |
| Reasoning Fine-Tuning | Fine-tune on reasoning data | Reasoning patterns in training data | WizardMath |
| Reasoning-Native Models | Optimize reasoning at training time | RL + process rewards | o1, R1 |
| Reasoning Scaling | Expand test-time compute | More test-time computation | o3, future models |

1.2 Core Formula: Two Scaling Dimensions of Reasoning

Training-time scaling (traditional scaling laws):

\[ L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \]

where \(N\) is parameter count and \(D\) is training data volume.

Inference-time scaling (recently characterized):

\[ \text{Performance}(C_{\text{test}}) \propto \log(C_{\text{test}}) \]

where \(C_{\text{test}}\) is inference-time compute (token count, search steps, etc.).

Key Insight: When training-time scaling yields diminishing returns, inference-time compute scaling provides a new dimension for performance improvement.
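
To make the two dimensions concrete, here is a minimal Python sketch that evaluates both curves. Every constant in it is an illustrative placeholder, not a fitted value from any paper.

```python
import math

def training_loss(N, D, N_c=8.8e13, D_c=5.4e13, alpha_N=0.34, alpha_D=0.28, L_inf=1.69):
    """Training-time scaling law L(N, D) from above.
    All constants are illustrative placeholders, not published fits."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

def test_time_accuracy(C_test, a=0.40, b=0.05):
    """Inference-time scaling: accuracy grows with log(C_test).
    a (base accuracy) and b (slope) are hypothetical fit parameters."""
    return a + b * math.log(C_test)

# Each doubling of test-time compute buys a constant b * ln(2) of accuracy:
print(test_time_accuracy(2_000) - test_time_accuracy(1_000))  # ~0.035
```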


2. OpenAI o1 Series

2.1 o1 (September 2024)

OpenAI's o1 was the first production-scale reasoning model, performing extended chain reasoning through an internal "thinking" phase before answering.

Core Features:

  • Hidden thinking process: The model reasons internally before providing answers; the thinking process is not fully visible to users
  • Long-chain reasoning: Can perform thousands of tokens of internal thinking
  • RL training: Uses reinforcement learning (not just SFT) to train reasoning capabilities
  • Process Reward Models (PRM): believed to provide reward signals for each reasoning step during training (see Section 2.3)
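
o1 keeps its thinking hidden, but open models such as R1 expose theirs between explicit delimiters. A minimal sketch of separating the two spans, assuming R1-style `<think>...</think>` tags:

```python
def split_reasoning(completion: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split a reasoning-model completion into (thinking, answer).
    Assumes R1-style delimiters; o1's API never exposes the thinking span."""
    start = completion.find(open_tag)
    end = completion.find(close_tag)
    if start == -1 or end == -1:
        return "", completion.strip()  # no visible thinking span
    thinking = completion[start + len(open_tag):end].strip()
    answer = completion[end + len(close_tag):].strip()
    return thinking, answer

thinking, answer = split_reasoning("<think>2 + 2 = 4.</think> The answer is 4.")
print(answer)  # "The answer is 4."
```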

Performance:

| Benchmark | GPT-4o | o1-preview | o1 |
|---|---|---|---|
| AIME 2024 (Math Competition) | 13.4% | 56.7% | 83.3% |
| GPQA Diamond (Graduate Science) | 53.6% | 73.3% | 78.0% |
| Codeforces (Programming Competition, percentile) | 11 | 62 | 89 |
| MATH | 60.3% | 85.5% | 94.8% |

2.2 o3 (December 2024 Preview)

o3 further improved upon o1:

  • Achieved 87.5% on the ARC-AGI benchmark in high-compute mode, a benchmark on which GPT-4o scored roughly 5%
  • Pushed test-time compute scaling further, trading substantially higher inference cost for accuracy

2.3 o1's Reasoning Mechanism (Speculated)

Although OpenAI has not disclosed full details, the community speculates on core mechanisms:

```mermaid
graph TD
    INPUT[User Question] --> THINK[Internal Thinking Process<br/>Chain of Internal Thoughts]
    THINK --> SEARCH[Search/Backtrack<br/>Explore Multiple Reasoning Paths]
    SEARCH --> VERIFY[Self-Verification<br/>Check Reasoning Steps]
    VERIFY --> |Uncertain| SEARCH
    VERIFY --> |Confident| OUTPUT[Final Answer]

    PRM[Process Reward Model] -.-> SEARCH
    PRM -.-> VERIFY
```

Training Pipeline (Speculated):

  1. Use SFT to train basic reasoning capabilities
  2. Use Process Reward Models (PRM) to provide dense rewards for each reasoning step
  3. Use RL (possibly PPO or similar) to optimize the reasoning policy
  4. Allow longer thinking chains and multiple attempts at inference time
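
A skeleton of that speculated recipe in Python, purely to show the data flow. Every function here is a hypothetical stand-in, not a known OpenAI interface.

```python
def sft_warmup(base_model, reasoning_traces):
    """Step 1: supervised fine-tuning on reasoning demonstrations."""
    return base_model  # placeholder: fine-tune and return updated weights

def prm_scores(prm, question, steps):
    """Step 2: dense per-step rewards from a Process Reward Model."""
    return [prm(question, steps[: t + 1]) for t in range(len(steps))]

def rl_optimize(model, prm, questions):
    """Step 3: policy optimization (PPO-like) against step-level rewards."""
    for q in questions:
        steps = model(q)                     # sample a reasoning chain
        rewards = prm_scores(prm, q, steps)  # one reward per step
        # a policy-gradient update on (steps, rewards) would go here
    return model

# Step 4 is purely inference-time: allow longer chains and multiple attempts.
```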

3. DeepSeek-R1

3.1 Core Contributions of R1

In January 2025, DeepSeek released R1, the first open-weight model to match o1-level reasoning performance, revealing the mechanism by which reasoning capability emerges:

Key Finding:

Pure RL training can spontaneously give rise to reasoning capabilities, without any human-annotated reasoning data.

3.2 Training Pipeline

```mermaid
graph TD
    BASE[DeepSeek-V3 Base Model] --> RL1[Pure RL Training<br/>GRPO Algorithm]
    RL1 --> R1_ZERO[R1-Zero<br/>Spontaneous Reasoning Emergence]
    R1_ZERO --> COLD[Cold-Start SFT<br/>Small Amount of High-Quality Reasoning Data]
    COLD --> RL2[RL Training<br/>Reasoning + General Tasks]
    RL2 --> R1[DeepSeek-R1<br/>Final Model]
    R1 --> DISTILL[Distillation<br/>R1 → Small Models]
    DISTILL --> R1_7B[R1-Distill-7B]
    DISTILL --> R1_32B[R1-Distill-32B]
```
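
The distillation stage at the bottom of the diagram is plain SFT on teacher traces. A minimal sketch, where `teacher_generate`, `is_correct`, and `sft_train` are hypothetical helpers rather than DeepSeek's released tooling:

```python
def build_distillation_set(teacher_generate, questions, is_correct):
    """Sample full reasoning traces from the teacher (R1) and keep only
    those whose final answer verifies, forming an SFT corpus for a student."""
    dataset = []
    for q in questions:
        trace = teacher_generate(q)  # includes the <think>...</think> span
        if is_correct(q, trace):
            dataset.append({"prompt": q, "completion": trace})
    return dataset

# student = sft_train(student_7b, build_distillation_set(r1_generate, train_qs, checker))
```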

3.3 GRPO Algorithm

DeepSeek's Group Relative Policy Optimization (GRPO) algorithm:

\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \text{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) A_i\right) - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right] \]

where \(G\) outputs \(\{o_1, \ldots, o_G\}\) are sampled from the old policy for the same question, and the advantage is computed by normalizing each output's reward within its group: \(A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}\).

Difference from PPO: No need to train a separate value function (Critic); instead, advantages are estimated through within-group comparison.
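
A minimal sketch of the group-relative advantage computation (the reward normalization follows the R1 paper; the clip helper mirrors the surrogate objective above):

```python
import statistics

def grpo_advantages(rewards):
    """Normalize each sampled output's reward by its group's mean and
    standard deviation; no learned critic needed, unlike PPO."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-output clipped term: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * advantage, clipped * advantage)

# G = 4 outputs for one question, rewarded 1 if the final answer is correct:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```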

3.4 Spontaneous Emergence of Reasoning Capabilities

R1-Zero (pure RL training, no reasoning data) exhibited surprising reasoning behaviors:

| Emergent Behavior | Description |
|---|---|
| Self-verification | "Let me check whether this answer is correct..." |
| Reflection | "Wait, I might have made a mistake..." |
| Problem decomposition | "This problem can be divided into three parts..." |
| Multi-path exploration | "Let me try another approach..." |
| Step-by-step derivation | Shows the complete mathematical derivation process |

Key Insight: These reasoning patterns were not learned from annotated data but spontaneously emerged during RL optimization.

3.5 Performance Comparison

| Benchmark | DeepSeek-V3 | DeepSeek-R1 | OpenAI o1 |
|---|---|---|---|
| AIME 2024 | 39.2% | 79.8% | 79.2% |
| MATH-500 | 90.2% | 97.3% | 96.4% |
| Codeforces (percentile) | 51.6 | 96.3 | 96.6 |
| GPQA Diamond | 59.1% | 71.5% | 78.0% |

4. Test-Time Compute Scaling

4.1 Core Concept

Test-time compute scaling means that giving the model more time and computation at inference continues to improve performance.

\[ \text{Performance} = f(C_{\text{train}}, C_{\text{test}}) \]

Traditional approaches focus only on \(C_{\text{train}}\); \(C_{\text{test}}\) has now become an equally important dimension.

4.2 Ways to Allocate Test-Time Compute

| Method | Description | Example |
|---|---|---|
| Longer thinking chains | Allow the model to generate more reasoning tokens | o1's long thinking process |
| Multiple sampling | Generate multiple candidate answers | Self-Consistency |
| Tree search | Systematically explore the reasoning space | ToT, MCTS |
| Verification + retry | Verify answers and retry on failure | Reflexion |
| Ensembling | Aggregate results from multiple models/strategies | Multi-model voting |
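
As a concrete instance of the multiple-sampling row, a minimal self-consistency sketch; `sample_answer` stands for any stochastic question-to-final-answer callable:

```python
from collections import Counter

def self_consistency(sample_answer, question, n=16):
    """Draw n independent reasoning chains and majority-vote their answers.
    Raising n spends more test-time compute for a lower-variance vote."""
    votes = Counter(sample_answer(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n  # winning answer plus empirical agreement rate
```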

4.3 Scaling Curves

Experiments (e.g., Snell et al., 2024) find an approximately logarithmic relationship between reasoning performance and test-time compute:

\[ \text{Accuracy} \approx a + b \cdot \log(C_{\text{test}}) \]

This means:

  • Initial increases in inference compute bring significant improvements
  • Diminishing marginal returns but continuous gains
  • Similar to training scaling power laws, but flatter
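
A toy least-squares fit of that log-linear form on fabricated points, just to show how \(a\) and \(b\) would be estimated:

```python
import math

compute = [1e3, 4e3, 1.6e4, 6.4e4]   # test-time compute budgets (fabricated)
accuracy = [0.42, 0.50, 0.57, 0.66]  # illustrative accuracies, not measured

xs = [math.log(c) for c in compute]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(accuracy) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, accuracy)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
print(f"a = {a:.3f}, b = {b:.3f}")  # b = accuracy gained per e-fold of compute
```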

5. Process Reward Models (PRM)

5.1 Outcome Rewards vs. Process Rewards

Outcome Reward Model (ORM): Only evaluates the final answer

\[ R_{\text{ORM}}(\tau) = \begin{cases} 1 & \text{if final answer is correct} \\ 0 & \text{otherwise} \end{cases} \]

Process Reward Model (PRM): Evaluates each reasoning step

\[ R_{\text{PRM}}(\tau) = \prod_{t=1}^{T} P(\text{step } t \text{ is correct} \mid s_1, \ldots, s_t) \]
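
The contrast in code, a minimal sketch where `step_probs` would come from a trained PRM:

```python
import math

def orm_reward(final_answer, gold):
    """Outcome reward: 1 only if the final answer matches, else 0."""
    return 1.0 if final_answer == gold else 0.0

def prm_reward(step_probs):
    """Process reward: product over steps of P(step t correct | s_1..s_t),
    matching the formula above."""
    return math.prod(step_probs)

# A chain with one shaky middle step is penalized even if the answer is right:
print(prm_reward([0.98, 0.55, 0.97]))  # ~0.52
print(orm_reward("4", "4"))            # 1.0, blind to the shaky step
```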

5.2 Advantages of PRM

  1. Dense rewards: Feedback at every step, not just at the end
  2. Error localization: Can precisely identify which step in the reasoning chain went wrong
  3. Better search guidance: Provides more accurate evaluation for MCTS/ToT
  4. Stronger training signal: Avoids the credit assignment problem of sparse rewards
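
Point 3 above in miniature: a greedy search over reasoning steps ranked by a PRM. Both callables (`propose_steps`, `prm_score`) are hypothetical stand-ins.

```python
def prm_guided_search(propose_steps, prm_score, question, max_depth=8, beam=4):
    """Grow a reasoning chain one step at a time, always keeping the
    candidate continuation that the PRM rates highest."""
    chain = []
    for _ in range(max_depth):
        candidates = propose_steps(question, chain, beam)  # sample `beam` next steps
        if not candidates:
            break  # proposer signals the chain is complete
        chain.append(max(candidates, key=lambda s: prm_score(question, chain + [s])))
    return chain
```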

5.3 PRM800K Dataset

PRM800K, released by Lightman et al. (2023), contains 800K step-level annotations:

  • Each reasoning step of each math problem is annotated as correct/incorrect/neutral
  • Human annotation ensures quality
  • Demonstrated that PRM significantly outperforms ORM on the MATH benchmark

6. Evolution Map of Reasoning Models

```mermaid
graph TD
    subgraph era1["2022-2023: Prompting Era"]
        COT[CoT<br/>Wei et al.] --> SC[Self-Consistency]
        COT --> TOT[Tree of Thoughts]
        COT --> REACT[ReAct]
    end

    subgraph era2["2024: Year of Reasoning Models"]
        O1[OpenAI o1<br/>2024.09] --> O1MINI[o1-mini]
        PRM[PRM Research<br/>Lightman et al.] --> O1
        QWEN[Qwen-QwQ<br/>2024.11]
    end

    subgraph era3["2025: Open-Source Reasoning Era"]
        R1[DeepSeek-R1<br/>2025.01]
        O3[OpenAI o3<br/>Preview]
        R1 --> R1D[R1-Distill Series]
        R1 --> OPEN[Open-Source Reasoning Model Ecosystem]
    end

    COT --> O1
    TOT --> O1
    O1 --> R1
    O1 --> O3
```

7. Key Open Questions

7.1 Theoretical Questions

  1. Upper bound of reasoning scaling: Is there a theoretical ceiling for test-time compute scaling?
  2. Emergence mechanism: Why does RL training spontaneously produce reasoning behavior?
  3. Optimal compute allocation: What is the optimal ratio between training-time and inference-time compute?
  4. Nature of reasoning: Is LLM reasoning genuine logical reasoning or pattern matching?

7.2 Engineering Questions

  1. Reasoning cost: Long thinking chains consume many tokens; how to optimize?
  2. Latency: Reasoning models have longer response times; how to meet real-time requirements?
  3. Controllability: How to control reasoning depth (simple questions don't need long thinking)?
  4. Interpretability: How to audit hidden thinking processes?

7.3 Application Questions

  1. Agent reasoning: How to combine reasoning models with tool use and multi-agent collaboration?
  2. Domain adaptation: How to adapt general reasoning models to specific domains?
  3. Distillation efficiency: How to efficiently distill reasoning capabilities into small models?

8. Impact on Agent Design

Breakthroughs in reasoning models have profound implications for agent architectures:

| Traditional Approach | Reasoning Model Approach | Impact |
|---|---|---|
| External CoT prompting | Built-in long-chain reasoning | Reduces prompt engineering |
| External ToT search | Internal search | Simplifies architecture |
| Self-Consistency sampling | Internal self-verification | Reduces API calls |
| External Reflexion loop | Built-in reflection mechanism | More compact agents |
| Plan-then-Execute | Automatic planning at inference time | End-to-end reasoning + action |

Core Trend: External reasoning augmentation mechanisms are being internalized into the model itself, making agent architectures simpler while reasoning capabilities become stronger.


References

  1. OpenAI. (2024). Learning to Reason with LLMs. openai.com.
  2. OpenAI. (2024). OpenAI o1 System Card. openai.com.
  3. DeepSeek. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
  4. Lightman, H. et al. (2023). Let's Verify Step by Step. ICLR 2024.
  5. Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314.
  6. Wang, P. et al. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. ACL 2024.
