
Deep Learning Frontier Trends

Overview

Deep learning is evolving toward greater efficiency, broader generality, and closer approximation of human cognition. This chapter explores test-time compute, neural architecture search, KAN, Liquid Neural Networks, neuromorphic computing, and other cutting-edge directions.


1. Test-Time Compute

1.1 Core Idea

Traditional paradigm: More training compute → better performance.

New paradigm: More compute at inference time → better performance.

\[ \text{Performance} \propto f(\text{Training Compute}) + g(\text{Inference Compute}) \]

1.2 Main Approaches

Chain-of-Thought (CoT):

  • Let the model reason step by step
  • Explicitly generate intermediate steps
  • Simple prompting improves math/logic abilities
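
In practice, CoT can be as simple as appending a step-by-step instruction to the query; a minimal sketch (the question text is purely illustrative):

```python
# Chain-of-Thought is a prompting technique: the prompt below is illustrative.
question = "A train travels 60 km in 1.5 hours. What is its average speed?"
cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step."   # elicits intermediate reasoning steps
)
print(cot_prompt)
```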

Best-of-N Sampling:

  • Generate N candidate answers
  • Use a verifier to select the best
  • Simple but effective
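
The procedure above can be sketched as follows; `generate` and `score` are hypothetical stand-ins for sampling a model completion and scoring it with a learned verifier:

```python
import random

# Best-of-N sketch; `generate` and `score` are hypothetical stand-ins for
# sampling a model completion and scoring it with a verifier model.
def generate(prompt, rng):
    return f"{prompt} -> answer {rng.randint(0, 9)}"  # placeholder sampler

def score(candidate):
    return len(candidate)  # toy heuristic in place of a verifier model

def best_of_n(prompt, n=8, seed=0):
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("2+2=?"))
```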

Search-Based Reasoning (Tree/Graph of Thought):

  • Build reasoning trees
  • Search across multiple reasoning paths
  • Backtracking and exploration
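
A minimal sketch of search over reasoning paths, here as beam search; `expand` (propose next thoughts) and `value` (score partial paths) are toy stand-ins for LLM calls:

```python
import heapq

# Beam search over reasoning paths. `expand` and `value` are toy stand-ins
# for an LLM proposing next thoughts and a heuristic scoring partial paths.
def expand(path):
    # Propose candidate next reasoning steps (here: add 1, 2, or 3).
    return [path + [path[-1] + d] for d in (1, 2, 3)]

def value(path):
    # Prefer paths whose latest state is closest to the target 10.
    return -abs(10 - path[-1])

def tree_of_thought(start, depth=3, beam=2):
    frontier = [[start]]
    for _ in range(depth):
        children = [c for p in frontier for c in expand(p)]
        frontier = heapq.nlargest(beam, children, key=value)  # keep best paths
    return max(frontier, key=value)

print(tree_of_thought(0))
```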

1.3 DeepSeek-R1 and Reasoning Models

  • o1/o3: OpenAI's reasoning models, trained with RL for long-chain reasoning
  • DeepSeek-R1: Open-source reasoning model, trained with GRPO
  • Core Idea: Teach models to "think longer" at inference time

1.4 Inference Compute Scaling Law

\[ \text{Performance} \approx f(\text{Inference Tokens}) \]

More inference tokens → better performance (especially pronounced on math, coding, and reasoning tasks)


2. Neural Architecture Search (NAS)

2.1 Basic Concept

Automatically search for optimal neural network architectures, replacing manual design.

Search Space:

  • Layer types (convolution, attention, MLP, etc.)
  • Connection patterns
  • Hyperparameters (channels, kernel size, etc.)

2.2 Main Methods

| Method | Description | Representative |
| --- | --- | --- |
| Reinforcement learning | Use RL to search architectures | NASNet |
| Evolutionary | Mutation + selection | AmoebaNet |
| Differentiable | Relax discrete choices into continuous optimization | DARTS |
| Supernet | Train one network containing all sub-architectures | Once-for-All |

2.3 DARTS

\[ \min_{\alpha} \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \]
\[ \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha) \]

where \(\alpha\) are architecture parameters, jointly learning weights and architecture through bilevel optimization.
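
The continuous relaxation at the heart of DARTS can be sketched as a softmax-weighted mixture over candidate operations on an edge; the operation set here is a toy stand-in:

```python
import numpy as np

# DARTS-style continuous relaxation: the edge output is a softmax-weighted
# mixture of candidate operations; alpha are the architecture parameters.
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

ops = [
    lambda x: x,                   # identity / skip connection
    lambda x: np.maximum(x, 0.0),  # ReLU (toy stand-in for a conv op)
    lambda x: np.zeros_like(x),    # "zero" op (effectively prunes the edge)
]

def mixed_op(x, alpha):
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([-1.0, 2.0])
alpha = np.zeros(3)  # uniform mixture before any architecture training
print(mixed_op(x, alpha))
# After bilevel optimization, the discrete architecture is read off as
# argmax(alpha) on each edge.
```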

2.4 Current Status

  • NAS in the LLM era mainly used for searching efficient architectures (e.g., mobile models)
  • Hardware-aware NAS: Considers hardware constraints like latency and power consumption
  • Searching optimal configurations within Transformer architecture (layers, heads, FFN size, etc.)

3. KAN (Kolmogorov-Arnold Networks)

3.1 Theoretical Foundation

Kolmogorov-Arnold Representation Theorem:

Any multivariate continuous function \(f(x_1, \ldots, x_n)\) can be represented as:

\[ f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right) \]

where \(\phi_{q,p}\) and \(\Phi_q\) are univariate functions.

3.2 KAN vs MLP

| Feature | MLP | KAN |
| --- | --- | --- |
| Learnable component | Weights (scalars on edges) | Activation functions (functions on edges) |
| Fixed component | Activation functions | Summation nodes |
| Approximation theorem | Universal approximation | Kolmogorov-Arnold |
| Interpretability | Low | Higher |
| Parameter efficiency | Average | Potentially better (on certain tasks) |

3.3 KAN Implementation

Edge activation functions are parameterized with B-splines:

\[ \phi(x) = \sum_{i} c_i B_i(x) \]

Each edge learns a spline function rather than a scalar weight.
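
A minimal sketch of one KAN edge, using degree-1 (piecewise-linear "hat") B-splines as the basis; real implementations typically use cubic B-splines plus a base activation:

```python
import numpy as np

# One KAN edge: a learnable univariate function phi(x) = sum_i c_i * B_i(x),
# here with degree-1 ("hat") B-spline basis functions on a uniform grid.
def hat_basis(x, grid):
    # B_i(x): piecewise-linear hat centered at grid[i]; shape (len(x), len(grid)).
    h = grid[1] - grid[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - grid[None, :]) / h)

grid = np.linspace(-2.0, 2.0, 9)  # knot positions
coef = np.sin(grid)               # toy coefficients c_i (learned in practice)

def phi(x):
    return hat_basis(x, grid) @ coef  # evaluate the edge's spline

# phi reproduces sin exactly at the knots and interpolates linearly in between.
print(phi(np.array([0.0, 1.0])))
```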

3.4 Current Status

  • Strengths: Potential in scientific computing and interpretability
  • Limitations: No demonstrated advantage on large-scale tasks (LLMs) yet
  • Active Research: Integration with Transformers, Graph KAN, time series KAN

4. Liquid Neural Networks

4.1 Core Idea

Continuous-time neural networks inspired by biological nervous systems (C. elegans).

Liquid Time-Constant Networks (LTC):

\[ \frac{dx}{dt} = -\left[\frac{1}{\tau} + f(x, I, \theta)\right] \odot x + f(x, I, \theta) \odot A \]

where \(\tau\) is the time constant, \(f\) is a learned nonlinear function, and \(A\) is a learned bias vector.
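
A toy explicit-Euler integration of the LTC dynamics; `f` is a stand-in for the learned nonlinearity, `A` is the learned bias vector from the LTC formulation, and all dimensions and parameter values are illustrative:

```python
import numpy as np

# Explicit-Euler integration of the LTC dynamics; f is a toy stand-in for the
# learned nonlinearity f(x, I, theta), and A is the learned bias vector.
rng = np.random.default_rng(0)
n, m = 4, 2                      # state and input dimensions (illustrative)
W = rng.normal(size=(n, n + m))  # toy parameters theta
b = rng.normal(size=n)
A = rng.normal(size=n)
tau = 1.0                        # time constant

def f(x, I):
    return np.tanh(W @ np.concatenate([x, I]) + b) ** 2  # nonnegative rate

def ltc_step(x, I, dt=0.01):
    fx = f(x, I)
    return x + dt * (-(1.0 / tau + fx) * x + fx * A)

x = np.zeros(n)
for _ in range(100):             # integrate 1 time unit of constant input
    x = ltc_step(x, np.ones(m))
print(x)
```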

4.2 Characteristics

  • Causality: Naturally handles causal reasoning
  • Continuous time: Adapts to irregular sampling
  • Compact: Very few neurons suffice (a 19-neuron network learned to keep a car in its lane)
  • Interpretable: Simple structure, analyzable

4.3 Applications

  • Autonomous driving
  • Time series forecasting
  • Robot control
  • Weather prediction

4.4 CfC (Closed-form Continuous-time)

\[ x(t) = \sigma(-f(x_0, I, \theta) \cdot (t - t_0)) \odot g(x_0, I, \theta) + [1 - \sigma(-f \cdot (t-t_0))] \odot h(x_0, I, \theta) \]

The closed-form solution avoids the computational overhead of numerical ODE solvers.
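
The closed-form update can be evaluated directly; `f`, `g`, and `h` are toy stand-ins for small learned networks, with illustrative dimensions:

```python
import numpy as np

# Direct evaluation of the CfC closed-form state; f, g, h are toy stand-ins
# for small learned networks, and all dimensions are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 4, 2
Wf, Wg, Wh = (rng.normal(size=(n, n + m)) for _ in range(3))

def head(W, x0, I):
    return np.tanh(W @ np.concatenate([x0, I]))

def cfc_state(x0, I, t, t0=0.0):
    f, g, h = (head(W, x0, I) for W in (Wf, Wg, Wh))
    gate = sigmoid(-f * (t - t0))        # time-dependent gating
    return gate * g + (1.0 - gate) * h   # no ODE solver required

print(cfc_state(np.zeros(n), np.ones(m), t=0.5))
```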


5. Neuromorphic Computing

5.1 Spiking Neural Networks (SNN)

Simulate biological neurons' spike-firing mechanism:

LIF (Leaky Integrate-and-Fire) Model:

\[ \tau_m \frac{dV}{dt} = -(V - V_{\text{rest}}) + R \cdot I(t) \]

When \(V > V_{\text{thresh}}\), fire a spike and reset.
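
A minimal Euler simulation of a single LIF neuron under constant input current (parameter values illustrative):

```python
# Euler simulation of a single LIF neuron under constant input current.
tau_m, R = 10.0, 1.0                  # membrane time constant (ms), resistance
V_rest, V_thresh, V_reset = -70.0, -55.0, -70.0   # potentials (mV)
dt, T, I = 0.1, 100.0, 20.0           # step (ms), duration (ms), input current

V, spikes = V_rest, []
for step in range(int(T / dt)):
    V += dt / tau_m * (-(V - V_rest) + R * I)     # leaky integration
    if V > V_thresh:                  # threshold crossing: spike and reset
        spikes.append(step * dt)
        V = V_reset
print(f"{len(spikes)} spikes, first at t = {spikes[0]:.1f} ms")
```

Because the steady-state potential \(V_{\text{rest}} + R I = -50\) mV exceeds the threshold, the neuron fires periodically.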

5.2 SNNs vs. Artificial Neural Networks

| Feature | ANN | SNN |
| --- | --- | --- |
| Information encoding | Continuous values | Spike trains |
| Computation | Matrix multiplication | Event-driven |
| Energy consumption | High | Extremely low |
| Hardware | GPU | Neuromorphic chips |
| Training | Backpropagation | Surrogate gradient / STDP |
| Temporal modeling | Requires explicit design | Naturally temporal |

5.3 Neuromorphic Chips

| Chip | Developer | Features |
| --- | --- | --- |
| Loihi 2 | Intel | 128 cores, 1M neurons |
| TrueNorth | IBM | 1M neurons, ~70 mW power |
| SpiNNaker 2 | University of Manchester | Large-scale simulation |
| Akida | BrainChip | Edge AI |

5.4 Challenges and Prospects

  • Training difficulty: Spikes are non-differentiable, requiring surrogate gradients
  • Accuracy gap: Underperforms ANNs on most tasks
  • Use cases: Ultra-low-power edge devices, event cameras
  • Future directions: ANN-SNN hybrids, large-scale SNNs
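
The surrogate-gradient idea mentioned above can be sketched as: keep the non-differentiable Heaviside spike in the forward pass, but substitute a smooth function's derivative when backpropagating (the fast-sigmoid surrogate here is one common choice):

```python
import numpy as np

# Surrogate gradient: keep the non-differentiable Heaviside spike in the
# forward pass, but use a smooth derivative (fast sigmoid) for backprop.
def spike_forward(v, v_th=1.0):
    return (v > v_th).astype(float)               # hard threshold

def spike_surrogate_grad(v, v_th=1.0, k=10.0):
    s = 1.0 / (1.0 + np.exp(-k * (v - v_th)))     # smooth sigmoid around v_th
    return k * s * (1.0 - s)                      # stand-in for the true delta

v = np.array([0.5, 0.99, 1.01, 2.0])
print(spike_forward(v))         # spikes only above threshold
print(spike_surrogate_grad(v))  # large near v_th, vanishing far away
```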

6. Other Frontier Directions

6.1 World Models

  • Learn internal models of environments for prediction and planning
  • JEPA architecture (proposed by Yann LeCun)
  • Genie 2 (DeepMind): Interactive 3D world generation

6.2 Retrieval-Augmented Generation (RAG)

  • Combine external knowledge bases with LLMs
  • Reduce hallucinations, update knowledge
  • Vector retrieval + reranking + generation
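
A toy version of the retrieval step, using cosine similarity over bag-of-words vectors as a stand-in for a real embedding model and vector database; reranking is omitted:

```python
import numpy as np

# Toy retrieval step: cosine similarity over bag-of-words vectors, standing in
# for a real embedding model and vector database; reranking is omitted.
docs = [
    "KAN uses splines on edges",
    "SNNs are event-driven",
    "DARTS relaxes NAS",
]
VOCAB = ["kan", "splines", "snn", "event", "darts", "nas", "edges", "driven"]

def embed(text):
    t = text.lower()
    v = np.array([float(w in t) for w in VOCAB])
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def retrieve(query, k=1):
    q = embed(query)
    sims = [float(q @ embed(d)) for d in docs]
    order = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in order]

print(retrieve("how does KAN use splines?"))
# The retrieved passage(s) would then be prepended to the LLM prompt.
```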

6.3 AI Agents

  • Tool use (function calling)
  • Multi-step reasoning and planning
  • Code execution environments
  • Multi-agent collaboration
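
The tool-use loop can be sketched as follows; `llm` is a scripted placeholder standing in for a model that emits either a tool call or a final answer:

```python
# Minimal tool-use loop; `llm` is a scripted placeholder that emits either a
# tool call or a final answer (real systems use model function calling).
TOOLS = {"add": lambda a, b: a + b}

def llm(messages):
    # Placeholder policy: call the add tool once, then answer with its result.
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The sum is {tool_msgs[-1]['content']}"}

def agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(messages)
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "step limit reached"

print(agent("what is 2 + 3?"))
```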

6.4 Embodied Intelligence

  • Vision-Language-Action models (VLA)
  • RT-2, Octo, and other robot foundation models
  • Sim-to-real transfer

7. Summary

```mermaid
graph TD
    A[DL Frontiers] --> B[Inference Scaling]
    A --> C[Architecture Innovation]
    A --> D[Compute Paradigms]
    A --> E[Application Directions]

    B --> B1[Test-Time Compute]
    B --> B2[Reasoning Models o1/R1]

    C --> C1[KAN]
    C --> C2[Liquid Networks]
    C --> C3[NAS]

    D --> D1[Neuromorphic Computing]
    D --> D2[Analog Computing]

    E --> E1[World Models]
    E --> E2[AI Agents]
    E --> E3[Embodied Intelligence]
```

Future Trend Predictions:

  1. Inference compute will become as important as training compute
  2. Hybrid architectures will continue to develop (Transformer + SSM + MoE)
  3. Neuromorphic computing has opportunities in edge scenarios
  4. AI Agents will become the primary application paradigm
  5. Embodied intelligence is a long-term important direction

References

  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," NeurIPS 2022
  • Liu et al., "KAN: Kolmogorov-Arnold Networks," 2024
  • Hasani et al., "Liquid Time-constant Networks," AAAI 2021
  • Roy et al., "Towards Spike-Based Machine Intelligence with Neuromorphic Computing," Nature 2019
  • Liu et al., "DARTS: Differentiable Architecture Search," ICLR 2019
