
Deep Learning Frontier Trends

Overview

Deep learning is evolving toward greater efficiency, broader generality, and closer approximation of human cognition. This chapter explores test-time compute, neural architecture search, KAN, Liquid Neural Networks, neuromorphic computing, and other cutting-edge directions.


1. Test-Time Compute

1.1 Core Idea

Traditional paradigm: More training compute → better performance.

New paradigm: More compute at inference time → better performance.

\[ \text{Performance} \propto f(\text{Training Compute}) + g(\text{Inference Compute}) \]

1.2 Main Approaches

Chain-of-Thought (CoT):

  • Let the model reason step by step
  • Explicitly generate intermediate steps
  • Simple prompting improves math/logic abilities
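
In practice, CoT can be as simple as appending a step-by-step instruction to the query; a minimal sketch (the question text is purely illustrative):

```python
# Chain-of-Thought is a prompting technique: the prompt below is illustrative.
question = "A train travels 60 km in 1.5 hours. What is its average speed?"
cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step."   # elicits intermediate reasoning steps
)
print(cot_prompt)
```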

Best-of-N Sampling:

  • Generate N candidate answers
  • Use a verifier to select the best
  • Simple but effective
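
The procedure above can be sketched as follows; `generate` and `score` are hypothetical stand-ins for sampling a model completion and scoring it with a learned verifier:

```python
import random

# Best-of-N sketch; `generate` and `score` are hypothetical stand-ins for
# sampling a model completion and scoring it with a verifier model.
def generate(prompt, rng):
    return f"{prompt} -> answer {rng.randint(0, 9)}"  # placeholder sampler

def score(candidate):
    return len(candidate)  # toy heuristic in place of a verifier model

def best_of_n(prompt, n=8, seed=0):
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("2+2=?"))
```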

Search-Based Reasoning (Tree/Graph of Thought):

  • Build reasoning trees
  • Search across multiple reasoning paths
  • Backtracking and exploration
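
A minimal sketch of search over reasoning paths, here as beam search; `expand` (propose next thoughts) and `value` (score partial paths) are toy stand-ins for LLM calls:

```python
import heapq

# Beam search over reasoning paths. `expand` and `value` are toy stand-ins
# for an LLM proposing next thoughts and a heuristic scoring partial paths.
def expand(path):
    # Propose candidate next reasoning steps (here: add 1, 2, or 3).
    return [path + [path[-1] + d] for d in (1, 2, 3)]

def value(path):
    # Prefer paths whose latest state is closest to the target 10.
    return -abs(10 - path[-1])

def tree_of_thought(start, depth=3, beam=2):
    frontier = [[start]]
    for _ in range(depth):
        children = [c for p in frontier for c in expand(p)]
        frontier = heapq.nlargest(beam, children, key=value)  # keep best paths
    return max(frontier, key=value)

print(tree_of_thought(0))
```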

1.3 DeepSeek-R1 and Reasoning Models

  • o1/o3: OpenAI's reasoning models, trained with RL for long-chain reasoning
  • DeepSeek-R1: Open-source reasoning model, trained with GRPO
  • Core Idea: Teach models to "think longer" at inference time

1.4 Inference Compute Scaling Law

\[ \text{Performance} \approx f(\text{Inference Tokens}) \]

More inference tokens → better performance (especially pronounced on math, coding, and reasoning tasks)


2. Neural Architecture Search (NAS)

2.1 Basic Concept

Automatically search for optimal neural network architectures, replacing manual design.

Search Space:

  • Layer types (convolution, attention, MLP, etc.)
  • Connection patterns
  • Hyperparameters (channels, kernel size, etc.)

2.2 Main Methods

| Method | Description | Representative |
| --- | --- | --- |
| Reinforcement learning | Use RL to search architectures | NASNet |
| Evolutionary | Mutation + selection | AmoebaNet |
| Differentiable | Relax discrete choices into continuous optimization | DARTS |
| Supernet | Train one network containing all sub-architectures | Once-for-All |

2.3 DARTS

\[ \min_{\alpha} \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \]
\[ \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha) \]

where \(\alpha\) are architecture parameters, jointly learning weights and architecture through bilevel optimization.
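
The continuous relaxation at the heart of DARTS can be sketched as a softmax-weighted mixture over candidate operations on an edge; the operation set here is a toy stand-in:

```python
import numpy as np

# DARTS-style continuous relaxation: the edge output is a softmax-weighted
# mixture of candidate operations; alpha are the architecture parameters.
def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

ops = [
    lambda x: x,                   # identity / skip connection
    lambda x: np.maximum(x, 0.0),  # ReLU (toy stand-in for a conv op)
    lambda x: np.zeros_like(x),    # "zero" op (effectively prunes the edge)
]

def mixed_op(x, alpha):
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([-1.0, 2.0])
alpha = np.zeros(3)  # uniform mixture before any architecture training
print(mixed_op(x, alpha))
# After bilevel optimization, the discrete architecture is read off as
# argmax(alpha) on each edge.
```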

2.4 Current Status

  • NAS in the LLM era mainly used for searching efficient architectures (e.g., mobile models)
  • Hardware-aware NAS: Considers hardware constraints like latency and power consumption
  • Searching optimal configurations within Transformer architecture (layers, heads, FFN size, etc.)

3. KAN (Kolmogorov-Arnold Networks)

3.1 Theoretical Foundation

Kolmogorov-Arnold Representation Theorem:

Any multivariate continuous function \(f(x_1, \ldots, x_n)\) can be represented as:

\[ f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right) \]

where \(\phi_{q,p}\) and \(\Phi_q\) are univariate functions.

3.2 KAN vs MLP

| Feature | MLP | KAN |
| --- | --- | --- |
| Learnable component | Weights (scalars on edges) | Activation functions (functions on edges) |
| Fixed component | Activation functions | Summation nodes |
| Approximation theorem | Universal approximation | Kolmogorov-Arnold |
| Interpretability | Low | Higher |
| Parameter efficiency | Average | Potentially better (on certain tasks) |

3.3 KAN Implementation

Edge activation functions are parameterized with B-splines:

\[ \phi(x) = \sum_{i} c_i B_i(x) \]

Each edge learns a spline function rather than a scalar weight.
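
A minimal sketch of one KAN edge, using degree-1 (piecewise-linear "hat") B-splines as the basis; real implementations typically use cubic B-splines plus a base activation:

```python
import numpy as np

# One KAN edge: a learnable univariate function phi(x) = sum_i c_i * B_i(x),
# here with degree-1 ("hat") B-spline basis functions on a uniform grid.
def hat_basis(x, grid):
    # B_i(x): piecewise-linear hat centered at grid[i]; shape (len(x), len(grid)).
    h = grid[1] - grid[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - grid[None, :]) / h)

grid = np.linspace(-2.0, 2.0, 9)  # knot positions
coef = np.sin(grid)               # toy coefficients c_i (learned in practice)

def phi(x):
    return hat_basis(x, grid) @ coef  # evaluate the edge's spline

# phi reproduces sin exactly at the knots and interpolates linearly in between.
print(phi(np.array([0.0, 1.0])))
```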

3.4 Current Status

  • Strengths: Potential in scientific computing and interpretability
  • Limitations: No demonstrated advantage on large-scale tasks (LLMs) yet
  • Active Research: Integration with Transformers, Graph KAN, time series KAN

4. Liquid Neural Networks

4.1 Core Idea

Continuous-time neural networks inspired by biological nervous systems (C. elegans).

Liquid Time-Constant Networks (LTC):

\[ \frac{dx}{dt} = -\left[\frac{1}{\tau} + f(x, I, \theta)\right] \odot x + f(x, I, \theta) \odot A \]

where \(\tau\) is the time constant, \(f\) is a learned nonlinear function, and \(A\) is a learned bias vector.
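
A toy explicit-Euler integration of the LTC dynamics; `f` is a stand-in for the learned nonlinearity, `A` is the learned bias vector from the LTC formulation, and all dimensions and parameter values are illustrative:

```python
import numpy as np

# Explicit-Euler integration of the LTC dynamics; f is a toy stand-in for the
# learned nonlinearity f(x, I, theta), and A is the learned bias vector.
rng = np.random.default_rng(0)
n, m = 4, 2                      # state and input dimensions (illustrative)
W = rng.normal(size=(n, n + m))  # toy parameters theta
b = rng.normal(size=n)
A = rng.normal(size=n)
tau = 1.0                        # time constant

def f(x, I):
    return np.tanh(W @ np.concatenate([x, I]) + b) ** 2  # nonnegative rate

def ltc_step(x, I, dt=0.01):
    fx = f(x, I)
    return x + dt * (-(1.0 / tau + fx) * x + fx * A)

x = np.zeros(n)
for _ in range(100):             # integrate 1 time unit of constant input
    x = ltc_step(x, np.ones(m))
print(x)
```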

4.2 Characteristics

  • Causality: Naturally handles causal reasoning
  • Continuous time: Adapts to irregular sampling
  • Compact: Very few neurons suffice (a 19-neuron network learned to keep a car in its lane)
  • Interpretable: Simple structure, analyzable

4.3 Applications

  • Autonomous driving
  • Time series forecasting
  • Robot control
  • Weather prediction

4.4 CfC (Closed-form Continuous-time)

\[ x(t) = \sigma(-f(x_0, I, \theta) \cdot (t - t_0)) \odot g(x_0, I, \theta) + [1 - \sigma(-f \cdot (t-t_0))] \odot h(x_0, I, \theta) \]

The closed-form solution avoids the computational overhead of numerical ODE solvers.
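
The closed-form update can be evaluated directly; `f`, `g`, and `h` are toy stand-ins for small learned networks, with illustrative dimensions:

```python
import numpy as np

# Direct evaluation of the CfC closed-form state; f, g, h are toy stand-ins
# for small learned networks, and all dimensions are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 4, 2
Wf, Wg, Wh = (rng.normal(size=(n, n + m)) for _ in range(3))

def head(W, x0, I):
    return np.tanh(W @ np.concatenate([x0, I]))

def cfc_state(x0, I, t, t0=0.0):
    f, g, h = (head(W, x0, I) for W in (Wf, Wg, Wh))
    gate = sigmoid(-f * (t - t0))        # time-dependent gating
    return gate * g + (1.0 - gate) * h   # no ODE solver required

print(cfc_state(np.zeros(n), np.ones(m), t=0.5))
```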


5. Neuromorphic Computing

5.1 Spiking Neural Networks (SNN)

Simulate biological neurons' spike-firing mechanism:

LIF (Leaky Integrate-and-Fire) Model:

\[ \tau_m \frac{dV}{dt} = -(V - V_{\text{rest}}) + R \cdot I(t) \]

When \(V > V_{\text{thresh}}\), fire a spike and reset.
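
A minimal Euler simulation of a single LIF neuron under constant input current (parameter values illustrative):

```python
# Euler simulation of a single LIF neuron under constant input current.
tau_m, R = 10.0, 1.0                  # membrane time constant (ms), resistance
V_rest, V_thresh, V_reset = -70.0, -55.0, -70.0   # potentials (mV)
dt, T, I = 0.1, 100.0, 20.0           # step (ms), duration (ms), input current

V, spikes = V_rest, []
for step in range(int(T / dt)):
    V += dt / tau_m * (-(V - V_rest) + R * I)     # leaky integration
    if V > V_thresh:                  # threshold crossing: spike and reset
        spikes.append(step * dt)
        V = V_reset
print(f"{len(spikes)} spikes, first at t = {spikes[0]:.1f} ms")
```

Because the steady-state potential \(V_{\text{rest}} + R I = -50\) mV exceeds the threshold, the neuron fires periodically.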

5.2 SNNs vs. Artificial Neural Networks

| Feature | ANN | SNN |
| --- | --- | --- |
| Information encoding | Continuous values | Spike trains |
| Computation | Matrix multiplication | Event-driven |
| Energy consumption | High | Extremely low |
| Hardware | GPU | Neuromorphic chips |
| Training | Backpropagation | Surrogate gradient / STDP |
| Temporal modeling | Requires explicit design | Naturally temporal |

5.3 Neuromorphic Chips

| Chip | Developer | Features |
| --- | --- | --- |
| Loihi 2 | Intel | 128 cores, 1M neurons |
| TrueNorth | IBM | 1M neurons, ~70 mW power |
| SpiNNaker 2 | University of Manchester | Large-scale simulation |
| Akida | BrainChip | Edge AI |

5.4 Challenges and Prospects

  • Training difficulty: Spikes are non-differentiable, requiring surrogate gradients
  • Accuracy gap: Underperforms ANNs on most tasks
  • Use cases: Ultra-low-power edge devices, event cameras
  • Future directions: ANN-SNN hybrids, large-scale SNNs
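
The surrogate-gradient idea mentioned above can be sketched as: keep the non-differentiable Heaviside spike in the forward pass, but substitute a smooth function's derivative when backpropagating (the fast-sigmoid surrogate here is one common choice):

```python
import numpy as np

# Surrogate gradient: keep the non-differentiable Heaviside spike in the
# forward pass, but use a smooth derivative (fast sigmoid) for backprop.
def spike_forward(v, v_th=1.0):
    return (v > v_th).astype(float)               # hard threshold

def spike_surrogate_grad(v, v_th=1.0, k=10.0):
    s = 1.0 / (1.0 + np.exp(-k * (v - v_th)))     # smooth sigmoid around v_th
    return k * s * (1.0 - s)                      # stand-in for the true delta

v = np.array([0.5, 0.99, 1.01, 2.0])
print(spike_forward(v))         # spikes only above threshold
print(spike_surrogate_grad(v))  # large near v_th, vanishing far away
```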

6. Other Frontier Directions

6.1 World Models

  • Learn internal models of environments for prediction and planning
  • JEPA architecture (proposed by Yann LeCun)
  • Genie 2 (DeepMind): Interactive 3D world generation

6.2 Retrieval-Augmented Generation (RAG)

  • Combine external knowledge bases with LLMs
  • Reduce hallucinations, update knowledge
  • Vector retrieval + reranking + generation
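
A toy version of the retrieval step, using cosine similarity over bag-of-words vectors as a stand-in for a real embedding model and vector database; reranking is omitted:

```python
import numpy as np

# Toy retrieval step: cosine similarity over bag-of-words vectors, standing in
# for a real embedding model and vector database; reranking is omitted.
docs = [
    "KAN uses splines on edges",
    "SNNs are event-driven",
    "DARTS relaxes NAS",
]
VOCAB = ["kan", "splines", "snn", "event", "darts", "nas", "edges", "driven"]

def embed(text):
    t = text.lower()
    v = np.array([float(w in t) for w in VOCAB])
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def retrieve(query, k=1):
    q = embed(query)
    sims = [float(q @ embed(d)) for d in docs]
    order = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in order]

print(retrieve("how does KAN use splines?"))
# The retrieved passage(s) would then be prepended to the LLM prompt.
```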

6.3 AI Agents

  • Tool use (function calling)
  • Multi-step reasoning and planning
  • Code execution environments
  • Multi-agent collaboration
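
The tool-use loop can be sketched as follows; `llm` is a scripted placeholder standing in for a model that emits either a tool call or a final answer:

```python
# Minimal tool-use loop; `llm` is a scripted placeholder that emits either a
# tool call or a final answer (real systems use model function calling).
TOOLS = {"add": lambda a, b: a + b}

def llm(messages):
    # Placeholder policy: call the add tool once, then answer with its result.
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The sum is {tool_msgs[-1]['content']}"}

def agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(messages)
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "step limit reached"

print(agent("what is 2 + 3?"))
```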

6.4 Embodied Intelligence

  • Vision-Language-Action models (VLA)
  • RT-2, Octo, and other robot foundation models
  • Sim-to-real transfer

7. Summary

```mermaid
graph TD
    A[DL Frontiers] --> B[Inference Scaling]
    A --> C[Architecture Innovation]
    A --> D[Compute Paradigms]
    A --> E[Application Directions]

    B --> B1[Test-Time Compute]
    B --> B2[Reasoning Models o1/R1]

    C --> C1[KAN]
    C --> C2[Liquid Networks]
    C --> C3[NAS]

    D --> D1[Neuromorphic Computing]
    D --> D2[Analog Computing]

    E --> E1[World Models]
    E --> E2[AI Agents]
    E --> E3[Embodied Intelligence]
```

Future Trend Predictions:

  1. Inference compute will become as important as training compute
  2. Hybrid architectures will continue to develop (Transformer + SSM + MoE)
  3. Neuromorphic computing has opportunities in edge scenarios
  4. AI Agents will become the primary application paradigm
  5. Embodied intelligence is a long-term important direction

References

  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," NeurIPS 2022
  • Liu et al., "KAN: Kolmogorov-Arnold Networks," 2024
  • Hasani et al., "Liquid Time-constant Networks," AAAI 2021
  • Roy et al., "Towards Spike-Based Machine Intelligence with Neuromorphic Computing," Nature 2019
  • Liu et al., "DARTS: Differentiable Architecture Search," ICLR 2019
