
Frontier Advances in Reasoning

Overview

Between 2024 and 2025, LLM reasoning capabilities advanced dramatically. OpenAI's o1/o3 series, DeepSeek-R1, and the characterization of inference-time scaling laws mark AI reasoning's transition from prompt engineering to systematic optimization at both training time and inference time. This article surveys the latest advances in reasoning models and the trends ahead.


1. Paradigm Shift in Reasoning Models

1.1 From Prompting to Training

```mermaid
graph LR
    A[Phase 1<br/>Prompt Engineering<br/>2022-2023] --> B[Phase 2<br/>Reasoning Fine-Tuning<br/>2024]
    B --> C[Phase 3<br/>Reasoning-Native Models<br/>2024-2025]
    C --> D[Phase 4<br/>Reasoning Scaling<br/>2025-]

    A -.-> |CoT, ToT| A
    B -.-> |RL-trained reasoning| B
    C -.-> |o1, R1| C
    D -.-> |Test-time Compute| D
```

| Phase | Method | Source of Reasoning Ability | Representative |
|---|---|---|---|
| Prompt Engineering | Design prompt templates | Eliciting existing model capability | CoT, ToT |
| Reasoning Fine-Tuning | Fine-tune on reasoning data | Reasoning patterns in training data | WizardMath |
| Reasoning-Native Models | Optimize reasoning at training time | RL + process rewards | o1, R1 |
| Reasoning Scaling | Expand test-time compute | More test-time computation | o3, future models |

1.2 Core Formula: Two Scaling Dimensions of Reasoning

Training-time scaling (traditional scaling laws):

\[ L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \]

where \(N\) is parameter count and \(D\) is training data volume.

Inference-time scaling (recently characterized):

\[ \text{Performance}(C_{\text{test}}) \propto \log(C_{\text{test}}) \]

where \(C_{\text{test}}\) is inference-time compute (token count, search steps, etc.).

Key Insight: When training-time scaling yields diminishing returns, inference-time compute scaling provides a new dimension for performance improvement.
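
To make the two dimensions concrete, here is a minimal Python sketch that evaluates both curves. Every constant in it is an illustrative placeholder, not a fitted value from any paper.

```python
import math

def training_loss(N, D, N_c=8.8e13, D_c=5.4e13, alpha_N=0.34, alpha_D=0.28, L_inf=1.69):
    """Training-time scaling law L(N, D) from above.
    All constants are illustrative placeholders, not published fits."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

def test_time_accuracy(C_test, a=0.40, b=0.05):
    """Inference-time scaling: accuracy grows with log(C_test).
    a (base accuracy) and b (slope) are hypothetical fit parameters."""
    return a + b * math.log(C_test)

# Each doubling of test-time compute buys a constant b * ln(2) of accuracy:
print(test_time_accuracy(2_000) - test_time_accuracy(1_000))  # ~0.035
```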


2. OpenAI o1 Series

2.1 o1 (September 2024)

OpenAI's o1 was the first production-scale reasoning model, performing extended chain reasoning through an internal "thinking" phase before answering.

Core Features:

  • Hidden thinking process: The model reasons internally before providing answers; the thinking process is not fully visible to users
  • Long-chain reasoning: Can perform thousands of tokens of internal thinking
  • RL training: Uses reinforcement learning (not just SFT) to train reasoning capabilities
  • Process Reward Models (PRM): believed to provide reward signals for each reasoning step during training (see Section 2.3)
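
o1 keeps its thinking hidden, but open models such as R1 expose theirs between explicit delimiters. A minimal sketch of separating the two spans, assuming R1-style `<think>...</think>` tags:

```python
def split_reasoning(completion: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split a reasoning-model completion into (thinking, answer).
    Assumes R1-style delimiters; o1's API never exposes the thinking span."""
    start = completion.find(open_tag)
    end = completion.find(close_tag)
    if start == -1 or end == -1:
        return "", completion.strip()  # no visible thinking span
    thinking = completion[start + len(open_tag):end].strip()
    answer = completion[end + len(close_tag):].strip()
    return thinking, answer

thinking, answer = split_reasoning("<think>2 + 2 = 4.</think> The answer is 4.")
print(answer)  # "The answer is 4."
```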

Performance:

| Benchmark | GPT-4o | o1-preview | o1 |
|---|---|---|---|
| AIME 2024 (Math Competition) | 13.4% | 56.7% | 83.3% |
| GPQA Diamond (Graduate Science) | 53.6% | 73.3% | 78.0% |
| Codeforces (Programming Competition, percentile) | 11 | 62 | 89 |
| MATH | 60.3% | 85.5% | 94.8% |

2.2 o3 (December 2024 Preview)

o3 further improved upon o1:

  • Achieved 87.5% on the ARC-AGI benchmark in high-compute mode, a benchmark on which GPT-4o scored roughly 5%
  • Pushed test-time compute scaling further, trading substantially higher inference cost for accuracy

2.3 o1's Reasoning Mechanism (Speculated)

Although OpenAI has not disclosed full details, the community speculates on core mechanisms:

```mermaid
graph TD
    INPUT[User Question] --> THINK[Internal Thinking Process<br/>Chain of Internal Thoughts]
    THINK --> SEARCH[Search/Backtrack<br/>Explore Multiple Reasoning Paths]
    SEARCH --> VERIFY[Self-Verification<br/>Check Reasoning Steps]
    VERIFY --> |Uncertain| SEARCH
    VERIFY --> |Confident| OUTPUT[Final Answer]

    PRM[Process Reward Model] -.-> SEARCH
    PRM -.-> VERIFY
```

Training Pipeline (Speculated):

  1. Use SFT to train basic reasoning capabilities
  2. Use Process Reward Models (PRM) to provide dense rewards for each reasoning step
  3. Use RL (possibly PPO or similar) to optimize the reasoning policy
  4. Allow longer thinking chains and multiple attempts at inference time
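
A skeleton of that speculated recipe in Python, purely to show the data flow. Every function here is a hypothetical stand-in, not a known OpenAI interface.

```python
def sft_warmup(base_model, reasoning_traces):
    """Step 1: supervised fine-tuning on reasoning demonstrations."""
    return base_model  # placeholder: fine-tune and return updated weights

def prm_scores(prm, question, steps):
    """Step 2: dense per-step rewards from a Process Reward Model."""
    return [prm(question, steps[: t + 1]) for t in range(len(steps))]

def rl_optimize(model, prm, questions):
    """Step 3: policy optimization (PPO-like) against step-level rewards."""
    for q in questions:
        steps = model(q)                     # sample a reasoning chain
        rewards = prm_scores(prm, q, steps)  # one reward per step
        # a policy-gradient update on (steps, rewards) would go here
    return model

# Step 4 is purely inference-time: allow longer chains and multiple attempts.
```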

3. DeepSeek-R1

3.1 Core Contributions of R1

In January 2025, DeepSeek released R1, the first open-weight model to match o1-level reasoning performance, revealing the mechanism by which reasoning capability emerges:

Key Finding:

Pure RL training can spontaneously give rise to reasoning capabilities, without any human-annotated reasoning data.

3.2 Training Pipeline

```mermaid
graph TD
    BASE[DeepSeek-V3 Base Model] --> RL1[Pure RL Training<br/>GRPO Algorithm]
    RL1 --> R1_ZERO[R1-Zero<br/>Spontaneous Reasoning Emergence]
    R1_ZERO --> COLD[Cold-Start SFT<br/>Small Amount of High-Quality Reasoning Data]
    COLD --> RL2[RL Training<br/>Reasoning + General Tasks]
    RL2 --> R1[DeepSeek-R1<br/>Final Model]
    R1 --> DISTILL[Distillation<br/>R1 → Small Models]
    DISTILL --> R1_7B[R1-Distill-7B]
    DISTILL --> R1_32B[R1-Distill-32B]
```
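
The distillation stage at the bottom of the diagram is plain SFT on teacher traces. A minimal sketch, where `teacher_generate`, `is_correct`, and `sft_train` are hypothetical helpers rather than DeepSeek's released tooling:

```python
def build_distillation_set(teacher_generate, questions, is_correct):
    """Sample full reasoning traces from the teacher (R1) and keep only
    those whose final answer verifies, forming an SFT corpus for a student."""
    dataset = []
    for q in questions:
        trace = teacher_generate(q)  # includes the <think>...</think> span
        if is_correct(q, trace):
            dataset.append({"prompt": q, "completion": trace})
    return dataset

# student = sft_train(student_7b, build_distillation_set(r1_generate, train_qs, checker))
```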

3.3 GRPO Algorithm

DeepSeek's Group Relative Policy Optimization (GRPO) algorithm:

\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\ \text{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) A_i\right) - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right] \]

where \(G\) outputs \(\{o_1, \ldots, o_G\}\) are sampled from the old policy for the same question, and the advantage is computed by normalizing each output's reward within its group: \(A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}\).

Difference from PPO: No need to train a separate value function (Critic); instead, advantages are estimated through within-group comparison.
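
A minimal sketch of the group-relative advantage computation (the reward normalization follows the R1 paper; the clip helper mirrors the surrogate objective above):

```python
import statistics

def grpo_advantages(rewards):
    """Normalize each sampled output's reward by its group's mean and
    standard deviation; no learned critic needed, unlike PPO."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-output clipped term: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * advantage, clipped * advantage)

# G = 4 outputs for one question, rewarded 1 if the final answer is correct:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```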

3.4 Spontaneous Emergence of Reasoning Capabilities

R1-Zero (pure RL training, no reasoning data) exhibited surprising reasoning behaviors:

| Emergent Behavior | Description |
|---|---|
| Self-verification | "Let me check whether this answer is correct..." |
| Reflection | "Wait, I might have made a mistake..." |
| Problem decomposition | "This problem can be divided into three parts..." |
| Multi-path exploration | "Let me try another approach..." |
| Step-by-step derivation | Shows the complete mathematical derivation process |

Key Insight: These reasoning patterns were not learned from annotated data but spontaneously emerged during RL optimization.

3.5 Performance Comparison

| Benchmark | DeepSeek-V3 | DeepSeek-R1 | OpenAI o1 |
|---|---|---|---|
| AIME 2024 | 39.2% | 79.8% | 79.2% |
| MATH-500 | 90.2% | 97.3% | 96.4% |
| Codeforces (percentile) | 51.6 | 96.3 | 96.6 |
| GPQA Diamond | 59.1% | 71.5% | 78.0% |

4. Test-Time Compute Scaling

4.1 Core Concept

Test-time compute scaling means that giving the model more time and computation at inference continues to improve performance.

\[ \text{Performance} = f(C_{\text{train}}, C_{\text{test}}) \]

Traditional approaches focus only on \(C_{\text{train}}\); \(C_{\text{test}}\) has now become an equally important dimension.

4.2 Ways to Allocate Test-Time Compute

| Method | Description | Example |
|---|---|---|
| Longer thinking chains | Allow the model to generate more reasoning tokens | o1's long thinking process |
| Multiple sampling | Generate multiple candidate answers | Self-Consistency |
| Tree search | Systematically explore the reasoning space | ToT, MCTS |
| Verification + retry | Verify answers and retry on failure | Reflexion |
| Ensembling | Aggregate results from multiple models/strategies | Multi-model voting |
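
As a concrete instance of the multiple-sampling row, a minimal self-consistency sketch; `sample_answer` stands for any stochastic question-to-final-answer callable:

```python
from collections import Counter

def self_consistency(sample_answer, question, n=16):
    """Draw n independent reasoning chains and majority-vote their answers.
    Raising n spends more test-time compute for a lower-variance vote."""
    votes = Counter(sample_answer(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n  # winning answer plus empirical agreement rate
```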

4.3 Scaling Curves

Experiments (e.g., Snell et al., 2024) find an approximately logarithmic relationship between reasoning performance and test-time compute:

\[ \text{Accuracy} \approx a + b \cdot \log(C_{\text{test}}) \]

This means:

  • Initial increases in inference compute bring significant improvements
  • Diminishing marginal returns but continuous gains
  • Similar to training scaling power laws, but flatter
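
A toy least-squares fit of that log-linear form on fabricated points, just to show how \(a\) and \(b\) would be estimated:

```python
import math

compute = [1e3, 4e3, 1.6e4, 6.4e4]   # test-time compute budgets (fabricated)
accuracy = [0.42, 0.50, 0.57, 0.66]  # illustrative accuracies, not measured

xs = [math.log(c) for c in compute]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(accuracy) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, accuracy)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
print(f"a = {a:.3f}, b = {b:.3f}")  # b = accuracy gained per e-fold of compute
```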

5. Process Reward Models (PRM)

5.1 Outcome Rewards vs. Process Rewards

Outcome Reward Model (ORM): Only evaluates the final answer

\[ R_{\text{ORM}}(\tau) = \begin{cases} 1 & \text{if final answer is correct} \\ 0 & \text{otherwise} \end{cases} \]

Process Reward Model (PRM): Evaluates each reasoning step

\[ R_{\text{PRM}}(\tau) = \prod_{t=1}^{T} P(\text{step } t \text{ is correct} \mid s_1, \ldots, s_t) \]
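
The contrast in code, a minimal sketch where `step_probs` would come from a trained PRM:

```python
import math

def orm_reward(final_answer, gold):
    """Outcome reward: 1 only if the final answer matches, else 0."""
    return 1.0 if final_answer == gold else 0.0

def prm_reward(step_probs):
    """Process reward: product over steps of P(step t correct | s_1..s_t),
    matching the formula above."""
    return math.prod(step_probs)

# A chain with one shaky middle step is penalized even if the answer is right:
print(prm_reward([0.98, 0.55, 0.97]))  # ~0.52
print(orm_reward("4", "4"))            # 1.0, blind to the shaky step
```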

5.2 Advantages of PRM

  1. Dense rewards: Feedback at every step, not just at the end
  2. Error localization: Can precisely identify which step in the reasoning chain went wrong
  3. Better search guidance: Provides more accurate evaluation for MCTS/ToT
  4. Stronger training signal: Avoids the credit assignment problem of sparse rewards
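
Point 3 above in miniature: a greedy search over reasoning steps ranked by a PRM. Both callables (`propose_steps`, `prm_score`) are hypothetical stand-ins.

```python
def prm_guided_search(propose_steps, prm_score, question, max_depth=8, beam=4):
    """Grow a reasoning chain one step at a time, always keeping the
    candidate continuation that the PRM rates highest."""
    chain = []
    for _ in range(max_depth):
        candidates = propose_steps(question, chain, beam)  # sample `beam` next steps
        if not candidates:
            break  # proposer signals the chain is complete
        chain.append(max(candidates, key=lambda s: prm_score(question, chain + [s])))
    return chain
```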

5.3 PRM800K Dataset

PRM800K, released by Lightman et al. (2023), contains 800K step-level annotations:

  • Each reasoning step of each math problem is annotated as correct/incorrect/neutral
  • Human annotation ensures quality
  • Demonstrated that PRM significantly outperforms ORM on the MATH benchmark

6. Evolution Map of Reasoning Models

```mermaid
graph TD
    subgraph era1["2022-2023: Prompting Era"]
        COT[CoT<br/>Wei et al.] --> SC[Self-Consistency]
        COT --> TOT[Tree of Thoughts]
        COT --> REACT[ReAct]
    end

    subgraph era2["2024: Year of Reasoning Models"]
        O1[OpenAI o1<br/>2024.09] --> O1MINI[o1-mini]
        PRM[PRM Research<br/>Lightman et al.] --> O1
        QWEN[Qwen-QwQ<br/>2024.11]
    end

    subgraph era3["2025: Open-Source Reasoning Era"]
        R1[DeepSeek-R1<br/>2025.01]
        O3[OpenAI o3<br/>Preview]
        R1 --> R1D[R1-Distill Series]
        R1 --> OPEN[Open-Source Reasoning Model Ecosystem]
    end

    COT --> O1
    TOT --> O1
    O1 --> R1
    O1 --> O3
```

7. Key Open Questions

7.1 Theoretical Questions

  1. Upper bound of reasoning scaling: Is there a theoretical ceiling for test-time compute scaling?
  2. Emergence mechanism: Why does RL training spontaneously produce reasoning behavior?
  3. Optimal compute allocation: What is the optimal ratio between training-time and inference-time compute?
  4. Nature of reasoning: Is LLM reasoning genuine logical reasoning or pattern matching?

7.2 Engineering Questions

  1. Reasoning cost: Long thinking chains consume many tokens; how to optimize?
  2. Latency: Reasoning models have longer response times; how to meet real-time requirements?
  3. Controllability: How to control reasoning depth (simple questions don't need long thinking)?
  4. Interpretability: How to audit hidden thinking processes?

7.3 Application Questions

  1. Agent reasoning: How to combine reasoning models with tool use and multi-agent collaboration?
  2. Domain adaptation: How to adapt general reasoning models to specific domains?
  3. Distillation efficiency: How to efficiently distill reasoning capabilities into small models?

8. Impact on Agent Design

Breakthroughs in reasoning models have profound implications for agent architectures:

| Traditional Approach | Reasoning Model Approach | Impact |
|---|---|---|
| External CoT prompting | Built-in long-chain reasoning | Reduces prompt engineering |
| External ToT search | Internal search | Simplifies architecture |
| Self-Consistency sampling | Internal self-verification | Reduces API calls |
| External Reflexion loop | Built-in reflection mechanism | More compact agents |
| Plan-then-Execute | Automatic planning at inference time | End-to-end reasoning + action |

Core Trend: External reasoning augmentation mechanisms are being internalized into the model itself, making agent architectures simpler while reasoning capabilities become stronger.


References

  1. OpenAI. (2024). Learning to Reason with LLMs. openai.com.
  2. OpenAI. (2024). OpenAI o1 System Card. openai.com.
  3. DeepSeek. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
  4. Lightman, H. et al. (2023). Let's Verify Step by Step. ICLR 2024.
  5. Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314.
  6. Wang, P. et al. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. ACL 2024.
