Frontier Advances in Reasoning
Overview
In 2024-2025, LLM reasoning capabilities saw major breakthroughs. OpenAI's o1/o3 series, DeepSeek-R1, and the discovery of reasoning scaling laws mark a transition in AI reasoning from prompt engineering to systematic optimization at both training time and inference time. This article surveys the latest advances in reasoning models and future trends.
1. Paradigm Shift in Reasoning Models
1.1 From Prompting to Training
```mermaid
graph LR
    A[Phase 1<br/>Prompt Engineering<br/>2022-2023] --> B[Phase 2<br/>Reasoning Fine-Tuning<br/>2024]
    B --> C[Phase 3<br/>Reasoning-Native Models<br/>2024-2025]
    C --> D[Phase 4<br/>Reasoning Scaling<br/>2025-]
    A -.-> |CoT, ToT| A
    B -.-> |RL-trained reasoning| B
    C -.-> |o1, R1| C
    D -.-> |Test-time Compute| D
```
| Phase | Method | Source of Reasoning Ability | Representative |
|---|---|---|---|
| Prompt Engineering | Design prompt templates | Eliciting existing model capability | CoT, ToT |
| Reasoning Fine-Tuning | Fine-tune on reasoning data | Reasoning patterns in training data | WizardMath |
| Reasoning-Native Models | Optimize reasoning at training time | RL + Process rewards | o1, R1 |
| Reasoning Scaling | Expand test-time compute | More test-time computation | o3, future models |
1.2 Core Formula: Two Scaling Dimensions of Reasoning
Training-time scaling (traditional scaling laws):

\[\text{Performance} = f(C_{\text{train}}), \quad C_{\text{train}} \propto N \cdot D\]

where \(N\) is parameter count and \(D\) is training data volume.
Inference-time scaling (newly discovered):

\[\text{Performance} = g(C_{\text{train}}, C_{\text{test}})\]

where \(C_{\text{test}}\) is inference-time compute (generated token count, search steps, etc.).
Key Insight: When training-time scaling yields diminishing returns, inference-time compute scaling provides a new dimension for performance improvement.
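As a toy illustration of the two dimensions, the sketch below combines an assumed training-time power law with an assumed logarithmic inference-time gain. The functional forms and constants are illustrative assumptions, not fitted to any real model.

```python
import math

# Toy illustration of the two scaling dimensions. The functional forms and
# constants here are assumptions for illustration, not fitted to real models.

def performance(c_train: float, c_test: float,
                a: float = 0.5, alpha: float = 0.02, b: float = 0.04) -> float:
    """Hypothetical accuracy: power law in training compute, log in test compute."""
    base = 1.0 - a * c_train ** (-alpha)        # training-time power law
    gain = b * math.log10(max(c_test, 1.0))     # inference-time log gain
    return min(base + gain, 1.0)

for c_test in (1, 10, 100, 1000):
    print(f"C_test={c_test:>4}: accuracy ~ {performance(1e22, c_test):.3f}")
```

Holding training compute fixed, each 10x of inference compute buys a constant increment here, which is the qualitative behavior Section 4 describes.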
2. OpenAI o1 Series
2.1 o1 (September 2024)
OpenAI's o1 is the first large-scale reasoning model: it performs extended chain-of-thought reasoning in a hidden "thinking" phase before answering.
Core Features:
- Hidden thinking process: The model reasons internally before providing answers; the thinking process is not fully visible to users
- Long-chain reasoning: Can perform thousands of tokens of internal thinking
- RL training: Uses reinforcement learning (not just SFT) to train reasoning capabilities
- Process Reward Models (PRM): Believed to provide reward signals for each reasoning step (speculated; see 2.3)
Performance:
| Benchmark | GPT-4o | o1-preview | o1 |
|---|---|---|---|
| AIME 2024 (Math Competition) | 13.4% | 56.7% | 83.3% |
| GPQA Diamond (Graduate Science) | 53.6% | 73.3% | 78.0% |
| Codeforces (Programming Competition, percentile) | 11 | 62 | 89 |
| MATH | 60.3% | 85.5% | 94.8% |
2.2 o3 (December 2024 Preview)
o3 further improved upon o1:
- Achieved 87.5% on the ARC-AGI benchmark in high-compute mode (GPT-4o scored roughly 5% on the same evaluation)
- Pushed test-time compute scaling further, showing that additional inference compute keeps paying off
2.3 o1's Reasoning Mechanism (Speculated)
Although OpenAI has not disclosed full details, the community has speculated about the core mechanisms:
```mermaid
graph TD
    INPUT[User Question] --> THINK[Internal Thinking Process<br/>Chain of Internal Thoughts]
    THINK --> SEARCH[Search/Backtrack<br/>Explore Multiple Reasoning Paths]
    SEARCH --> VERIFY[Self-Verification<br/>Check Reasoning Steps]
    VERIFY --> |Uncertain| SEARCH
    VERIFY --> |Confident| OUTPUT[Final Answer]
    PRM[Process Reward Model] -.-> SEARCH
    PRM -.-> VERIFY
```
Training Pipeline (Speculated):
- Use SFT to train basic reasoning capabilities
- Use Process Reward Models (PRM) to provide dense rewards for each reasoning step
- Use RL (possibly PPO or similar) to optimize the reasoning policy
- Allow longer thinking chains and multiple attempts at inference time (a minimal sketch of this loop follows)
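A minimal sketch of the speculated think/verify/retry loop is below. All components (`think`, `verify`, the threshold) are hypothetical stand-ins for illustration, not OpenAI's actual implementation.

```python
import random

# Sketch of the speculated think -> verify -> retry loop. All components
# are hypothetical stand-ins, not OpenAI's actual implementation.

def think(question: str) -> str:
    """Stand-in for generating one hidden chain of thought."""
    return f"reasoning chain for: {question}"

def verify(chain: str) -> float:
    """Stand-in for a self-verification / process-reward score in [0, 1]."""
    return random.random()

def answer_with_reasoning(question: str, max_attempts: int = 4,
                          threshold: float = 0.7) -> str:
    best_chain, best_score = "", -1.0
    for _ in range(max_attempts):       # more attempts = more test-time compute
        chain = think(question)         # generate an internal chain
        score = verify(chain)           # score it step by step
        if score > best_score:
            best_chain, best_score = chain, score
        if score >= threshold:          # confident enough -> stop early
            break
    return f"answer derived from: {best_chain}"

print(answer_with_reasoning("What is 17 * 23?"))
```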
3. DeepSeek-R1
3.1 Core Contributions of R1
DeepSeek released R1 in January 2025, the first open-weight reasoning model to match o1-level performance, and documented how reasoning capability emerges:
Key Finding:
Pure RL training can spontaneously give rise to reasoning capabilities, without any human-annotated reasoning data.
3.2 Training Pipeline
```mermaid
graph TD
    BASE[DeepSeek-V3 Base Model] --> RL1[Pure RL Training<br/>GRPO Algorithm]
    RL1 --> R1_ZERO[R1-Zero<br/>Spontaneous Reasoning Emergence]
    R1_ZERO --> COLD[Cold-Start SFT<br/>Small Amount of High-Quality Reasoning Data]
    COLD --> RL2[RL Training<br/>Reasoning + General Tasks]
    RL2 --> R1[DeepSeek-R1<br/>Final Model]
    R1 --> DISTILL[Distillation<br/>R1 → Small Models]
    DISTILL --> R1_7B[R1-Distill-7B]
    DISTILL --> R1_32B[R1-Distill-32B]
```
3.3 GRPO Algorithm
DeepSeek's Group Relative Policy Optimization (GRPO) replaces a learned value baseline with a group-relative one:

\[A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}\]

where \(G\) outputs are sampled for the same question, \(r_i\) is the reward of output \(i\), and the advantage \(A_i\) is computed by within-group normalization.
Difference from PPO: No need to train a separate value function (Critic); instead, advantages are estimated through within-group comparison.
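The group-relative advantage above is straightforward to compute. A minimal sketch, following the formula from the DeepSeek-R1 report:

```python
import numpy as np

# Group-relative advantage as in the GRPO formula above: sample G outputs
# for one question, score them, and normalize rewards within the group
# instead of training a separate value network (Critic).

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mean(r)) / std(r), computed within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 4 sampled answers to one question, reward 1.0 if correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # [ 1. -1. -1.  1.]: correct answers reinforced
```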
3.4 Spontaneous Emergence of Reasoning Capabilities
R1-Zero (pure RL training, no reasoning data) exhibited surprising reasoning behaviors:
| Emergent Behavior | Description |
|---|---|
| Self-verification | "Let me check whether this answer is correct..." |
| Reflection | "Wait, I might have made a mistake..." |
| Problem decomposition | "This problem can be divided into three parts..." |
| Multi-path exploration | "Let me try another approach..." |
| Step-by-step derivation | Shows complete mathematical derivation process |
Key Insight: These reasoning patterns were not learned from annotated data but spontaneously emerged during RL optimization.
3.5 Performance Comparison
| Benchmark | DeepSeek-V3 | DeepSeek-R1 | OpenAI o1 |
|---|---|---|---|
| AIME 2024 | 39.2% | 79.8% | 79.2% |
| MATH-500 | 90.2% | 97.3% | 96.4% |
| Codeforces (percentile) | 51.6 | 96.3 | 96.6 |
| GPQA Diamond | 59.1% | 71.5% | 78.0% |
4. Test-Time Compute Scaling
4.1 Core Concept
Test-time compute scaling means that giving the model more inference-time computation continues to improve performance.
Traditional approaches only focus on \(C_{\text{train}}\); now \(C_{\text{test}}\) becomes an equally important dimension.
4.2 Ways to Allocate Test-Time Compute
| Method | Description | Example |
|---|---|---|
| Longer thinking chains | Allow model to generate more reasoning tokens | o1's long thinking process |
| Multiple sampling | Generate multiple candidate answers and vote (sketched after this table) | Self-Consistency |
| Tree search | Systematically explore reasoning space | ToT, MCTS |
| Verification + retry | Verify answers and retry on failure | Reflexion |
| Ensembling | Aggregate results from multiple models/strategies | Multi-model voting |
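A minimal sketch of the "multiple sampling" row, using Self-Consistency-style majority voting. `sample_answer` is a hypothetical stand-in for one stochastic model completion.

```python
import random
from collections import Counter

# Self-Consistency majority voting: spend test-time compute on N samples
# and return the most common final answer. sample_answer is a hypothetical
# stand-in for one stochastic model completion.

def sample_answer(question: str) -> str:
    return random.choice(["42", "42", "42", "41", "40"])  # noisy sampler

def self_consistency(question: str, n_samples: int = 16) -> str:
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]   # majority answer wins

print(self_consistency("What is 6 * 7?"))  # usually "42"
```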
4.3 Scaling Curves
Experiments (e.g., Snell et al., 2024) find that reasoning performance grows roughly logarithmically with test-time compute:

\[\text{Performance} \approx a + b \cdot \log C_{\text{test}}\]
This means:
- Initial increases in inference compute bring significant improvements
- Diminishing marginal returns but continuous gains
- Similar to training-time scaling power laws, but flatter (a toy curve fit follows this list)
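A toy curve fit illustrating this relationship; the data points below are invented for illustration, not measured results.

```python
import numpy as np

# Fit accuracy = a + b * log10(C_test) to invented data points, illustrating
# the (roughly) logarithmic scaling described above.
c_test = np.array([1, 4, 16, 64, 256, 1024])              # relative compute
acc = np.array([0.40, 0.49, 0.55, 0.62, 0.67, 0.73])      # invented accuracies

b, a = np.polyfit(np.log10(c_test), acc, deg=1)           # slope, intercept
print(f"accuracy ~ {a:.2f} + {b:.2f} * log10(C_test)")
# Each 10x increase in compute buys a roughly constant accuracy increment.
```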
5. Process Reward Models (PRM)
5.1 Outcome Rewards vs. Process Rewards
Outcome Reward Model (ORM): Evaluates only the final answer, i.e. a single score \(r = R_\phi(x, y)\) for question \(x\) and complete answer \(y\)
Process Reward Model (PRM): Evaluates each reasoning step, i.e. a score \(r_t = R_\phi(x, s_1, \dots, s_t)\) for every step \(s_t\)
5.2 Advantages of PRM
- Dense rewards: Feedback at every step, not just at the end
- Error localization: Can precisely identify which step in the reasoning chain went wrong
- Better search guidance: Provides more accurate evaluation for MCTS/ToT
- Stronger training signal: Avoids the credit-assignment problem of sparse rewards (a best-of-N reranking sketch follows this list)
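A minimal sketch of PRM-guided best-of-N reranking. `score_step` is a hypothetical stand-in for a trained process reward model; taking the minimum over steps is one common aggregation choice, with the product of step probabilities being another.

```python
# PRM-guided best-of-N reranking: score every step of each candidate chain
# and keep the chain whose weakest step scores best. score_step is a
# hypothetical stand-in for a trained process reward model.

def score_step(question: str, steps: list[str], t: int) -> float:
    """Hypothetical PRM: P(step t is correct | question, steps up to t)."""
    return 0.9 if "correct" in steps[t] else 0.4

def prm_rerank(question: str, candidates: list[list[str]]) -> list[str]:
    def chain_score(steps: list[str]) -> float:
        # min over steps: one bad step sinks the whole chain
        return min(score_step(question, steps, t) for t in range(len(steps)))
    return max(candidates, key=chain_score)

candidates = [
    ["correct step 1", "correct step 2"],
    ["correct step 1", "dubious step 2"],
]
print(prm_rerank("prove x", candidates))  # picks the all-correct chain
```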
5.3 PRM800K Dataset
PRM800K, released by Lightman et al. (2023), contains 800K step-level annotations:
- Each reasoning step of each math problem is labeled positive/negative/neutral (an illustrative record follows this list)
- Human annotation ensures quality
- Demonstrated that PRM significantly outperforms ORM on the MATH benchmark
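For concreteness, the record below shows the rough shape of a step-labeled example in the style of PRM800K; this is a simplified illustration, not the exact released schema.

```python
# Rough shape of a step-labeled example in the style of PRM800K
# (simplified illustration, not the exact released schema).
example = {
    "problem": "Show that the sum of two even numbers is even.",
    "steps": [
        {"text": "Let a = 2m and b = 2n.",           "rating": 1},   # positive
        {"text": "Then a + b = 2m + 2n = 2(m + n).", "rating": 1},   # positive
        {"text": "Therefore a + b is odd.",          "rating": -1},  # negative
    ],
}

# A PRM is trained to predict each step's rating given the problem and all
# steps up to and including that step.
for i, step in enumerate(example["steps"]):
    print(i, step["rating"], step["text"])
```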
6. Evolution Map of Reasoning Models
```mermaid
graph TD
    subgraph P1["2022-2023: Prompting Era"]
        COT[CoT<br/>Wei et al.] --> SC[Self-Consistency]
        COT --> TOT[Tree of Thoughts]
        COT --> REACT[ReAct]
    end
    subgraph P2["2024: Year of Reasoning Models"]
        O1[OpenAI o1<br/>2024.09] --> O1MINI[o1-mini]
        PRM[PRM Research<br/>Lightman et al.] --> O1
        QWEN[Qwen-QwQ<br/>2024.11]
    end
    subgraph P3["2025: Open-Source Reasoning Era"]
        R1[DeepSeek-R1<br/>2025.01]
        O3[OpenAI o3<br/>Preview]
        R1 --> R1D[R1-Distill Series]
        R1 --> OPEN[Open-Source Reasoning Model Ecosystem]
    end
    COT --> O1
    TOT --> O1
    O1 --> R1
    O1 --> O3
```
7. Key Open Questions
7.1 Theoretical Questions
- Upper bound of reasoning scaling: Is there a theoretical ceiling for test-time compute scaling?
- Emergence mechanism: Why does RL training spontaneously produce reasoning behavior?
- Optimal compute allocation: What is the optimal ratio between training-time and inference-time compute?
- Nature of reasoning: Is LLM reasoning genuine logical reasoning or pattern matching?
7.2 Engineering Questions
- Reasoning cost: Long thinking chains consume many tokens; how to optimize?
- Latency: Reasoning models have longer response times; how to meet real-time requirements?
- Controllability: How to control reasoning depth (simple questions don't need long thinking)?
- Interpretability: How to audit hidden thinking processes?
7.3 Application Questions
- Agent reasoning: How to combine reasoning models with tool use and multi-agent collaboration?
- Domain adaptation: How to adapt general reasoning models to specific domains?
- Distillation efficiency: How to efficiently distill reasoning capabilities into small models?
8. Impact on Agent Design
Breakthroughs in reasoning models have profound implications for agent architectures:
| Traditional Approach | Reasoning Model Approach | Impact |
|---|---|---|
| External CoT prompting | Built-in long-chain reasoning | Reduces prompt engineering |
| External ToT search | Internal search | Simplifies architecture |
| Self-Consistency sampling | Internal self-verification | Reduces API calls |
| External Reflexion loop | Built-in reflection mechanism | More compact agents |
| Plan-then-Execute | Automatic planning at inference time | End-to-end reasoning + action |
Core Trend: External reasoning augmentation mechanisms are being internalized into the model itself, making agent architectures simpler while reasoning capabilities become stronger.
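The table's last rows can be made concrete with a small contrast sketch: an external Reflexion-style loop versus a single call to a reasoning-native model. `llm` is a hypothetical text-completion function, not a specific vendor API.

```python
# llm() is a hypothetical text-completion call, not a specific vendor API.
def llm(prompt: str) -> str:
    return "..."  # placeholder completion

# Traditional agent: the framework orchestrates an external reflection loop.
def reflexion_agent(task: str, max_rounds: int = 3) -> str:
    answer = llm(f"Solve: {task}")
    for _ in range(max_rounds):
        critique = llm(f"Critique this answer: {answer}")
        if "no issues" in critique.lower():
            break                                   # verified, stop early
        answer = llm(f"Revise the answer.\nCritique: {critique}\nTask: {task}")
    return answer

# Reasoning-model agent: think/verify/revise happens inside the model,
# so the external scaffold collapses to a single call.
def reasoning_agent(task: str) -> str:
    return llm(f"Solve: {task}")
```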
References
- OpenAI. (2024). Learning to Reason with LLMs. openai.com.
- OpenAI. (2024). OpenAI o1 System Card. openai.com.
- DeepSeek. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
- Lightman, H. et al. (2023). Let's Verify Step by Step. ICLR 2024.
- Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314.
- Wang, P. et al. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. ACL 2024.