
JEPA Architecture

1. Core Definition

JEPA (Joint Embedding Predictive Architecture) is an architecture for building world models, proposed by Yann LeCun. Its central idea is: instead of predicting raw signals (pixels, tokens), perform predictions in an abstract representation space.

This definition contains two key elements:

  1. Joint Embedding: Both input and target are mapped into the same representation space
  2. Predictive: The model's objective is to predict the target's representation from the input's representation

JEPA arose from LeCun's fundamental critique of the dominant AI paradigms:

Generative models (including LLMs) predict raw signals, which is fundamentally inefficient. Intelligence does not require predicting pixel-level details — it only needs to predict key changes at an abstract level.


2. Architecture Components

JEPA consists of three core components:

x-encoder (Input Encoder)

Encodes the input (e.g., a current video frame) into an abstract representation \(s_x\). This encoder compresses high-dimensional raw signals into a low-dimensional semantic representation.

y-encoder (Target Encoder)

Encodes the target (e.g., a future video frame) into an abstract representation \(s_y\). A critical design choice: the y-encoder is typically updated via Exponential Moving Average (EMA) rather than trained directly through gradients — this is the core mechanism for preventing representation collapse.

predictor (Predictor)

Predicts the target representation \(s_y\) from the input representation \(s_x\), optionally conditioned on an action \(a\):

\[\hat{s}_y = \text{Predictor}(s_x, a)\]

The predictor does not need to reconstruct the raw signal; it only needs to make accurate predictions in the representation space.

Information Flow

Input x → [x-encoder] → s_x → [predictor(a)] → ŝ_y ← compare → s_y ← [y-encoder] ← Target y

The training objective is to make \(\hat{s}_y\) as close to \(s_y\) as possible, while preventing all representations from collapsing to a single point.
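A minimal sketch of one training step in PyTorch (the `Encoder` and `Predictor` modules here are toy stand-ins invented for illustration, not from any released JEPA implementation):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy components; any backbone (ViT, CNN, ...) could stand in.
class Encoder(nn.Module):
    def __init__(self, in_dim=784, rep_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, rep_dim))
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    def __init__(self, rep_dim=64, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(rep_dim + act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, rep_dim))
    def forward(self, s_x, a):
        return self.net(torch.cat([s_x, a], dim=-1))

x_encoder = Encoder()                    # trained by gradients
y_encoder = copy.deepcopy(x_encoder)     # EMA target, never receives gradients
for p in y_encoder.parameters():
    p.requires_grad_(False)
predictor = Predictor()

def jepa_loss(x, y, a):
    s_x = x_encoder(x)                   # s_x: representation of the input
    with torch.no_grad():
        s_y = y_encoder(y)               # s_y: stable target representation
    s_y_hat = predictor(s_x, a)          # prediction in representation space
    return F.mse_loss(s_y_hat, s_y)      # compare latents, never pixels
```

The `torch.no_grad()` block keeps the target branch out of the gradient path; the EMA update that keeps `y_encoder` in sync with `x_encoder` is sketched in Section 5.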


3. Why Predict in Latent Space Rather Than Pixel Space

This is JEPA's most fundamental design philosophy and deserves thorough understanding.

The Problem with Pixel Prediction

Traditional generative models (VAEs, GANs, diffusion models, autoregressive models) all perform predictions at the pixel or token level. This introduces a deep problem:

Raw signals contain vast amounts of semantically irrelevant detail — precise textures, lighting angles, pixel noise. Predicting these details wastes enormous model capacity without contributing to understanding the world.

Consider a scenario: a ball rolls off a table. The semantically important information is "the ball's trajectory, velocity, and time of landing," not "the exact color value of every pixel on the ball's surface."

Advantages of Abstract Representations

Benefits of predicting in an abstract representation space:

| Dimension | Pixel-Space Prediction | Representation-Space Prediction (JEPA) |
| --- | --- | --- |
| Prediction target | Exact value of every pixel | Key changes at the semantic level |
| Capacity allocation | Large capacity spent on irrelevant details | Focused on semantic information |
| Uncertainty handling | Must model all possible pixel combinations | Only needs to model semantic uncertainty |
| Sample efficiency | Low: requires large amounts of data to learn pixel statistics | High: captures semantic patterns more quickly |

Analogy to the Human Brain

The human brain does not represent the world by "internally rendering a precise image," but rather by maintaining an abstract, structured representation. You know there is a red cup on the table, but you do not store the exact color of every pixel of that cup in your mind.

JEPA aims to emulate precisely this mode of abstract representation: retaining semantically important structure while discarding perceptual noise and detail.


4. The Energy-Based Model Perspective

JEPA can be understood from the perspective of Energy-Based Models (EBMs):

Compatible \((x, y)\) pairs have low energy; incompatible pairs have high energy.

Training adjusts the energy function so that genuine (input, target) pairs have small distances in representation space (low energy), while unrelated pairs have large distances (high energy).
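One common instantiation of this energy, an illustrative rather than canonical choice, is the squared prediction error in representation space:

\[E(x, y) = \left\| \text{Predictor}(\text{Enc}_x(x), a) - \text{Enc}_y(y) \right\|^2\]

Low energy then means precisely that \(y\) is predictable from \(x\) at the semantic level.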

The advantage of this perspective is that it does not require the model to generate anything — it only requires the model to learn to distinguish "what is compatible with what." This is a weaker requirement than what generative models demand, and is therefore theoretically easier to learn.


5. Avoiding Representation Collapse

The core technical challenge JEPA faces is representation collapse: if all inputs are mapped to a single point, the predictor's error is zero, but the representations are completely meaningless.

JEPA addresses this through an asymmetric architecture:

  • The y-encoder is updated via EMA and does not receive gradients directly; it changes slowly, providing a stable learning target (see the sketch after this list)
  • The x-encoder and predictor are trained normally through gradients
  • This asymmetry breaks the equilibrium that leads to collapse
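A minimal sketch of the EMA update, reusing the hypothetical encoders from Section 2 (the momentum value is illustrative; real implementations often schedule it toward 1.0):

```python
import torch

@torch.no_grad()
def ema_update(y_encoder, x_encoder, momentum=0.996):
    # Drag the target encoder slowly toward the online encoder.
    # The target branch never receives gradients, so the loss cannot
    # push both branches to a trivial constant representation at once.
    for p_y, p_x in zip(y_encoder.parameters(), x_encoder.parameters()):
        p_y.mul_(momentum).add_(p_x, alpha=1.0 - momentum)
```

Calling `ema_update` once after each optimizer step keeps the target nearly stationary between updates, which is what provides a stable, non-collapsing learning signal.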

This differs from the solution used in contrastive learning. Contrastive learning prevents collapse through explicit negative samples, but selecting good negative samples is often a challenging problem. JEPA requires no negative samples — its asymmetric architecture alone is sufficient to prevent collapse.

This is a key advantage of JEPA over contrastive learning: it is simpler and does not require carefully designed negative sampling strategies.


6. V-JEPA: A Breakthrough in Video Understanding

V-JEPA (Video JEPA) is the application of the JEPA architecture to video understanding.

Core idea: show the model part of a video (with certain spatiotemporal regions masked out) and have it predict the masked portions in representation space. This forces the model to learn spatiotemporal regularities in video — how objects move and how scenes change.

V-JEPA's training is entirely self-supervised and requires no labeled data. It learns the dynamics of motion and spatiotemporal structure from unlabeled video.
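A simplified sketch of this masked-prediction objective, treating a patchified video clip as a token sequence (the shapes, the masking scheme, and the `masked_positions` argument are illustrative assumptions, not V-JEPA's exact recipe):

```python
import torch
import torch.nn.functional as F

def masked_latent_loss(video_tokens, context_encoder, target_encoder,
                       predictor, mask_ratio=0.75):
    """video_tokens: (batch, num_patches, dim) spatiotemporal patch embeddings."""
    _, N, _ = video_tokens.shape
    perm = torch.randperm(N)
    num_masked = int(mask_ratio * N)
    masked, visible = perm[:num_masked], perm[num_masked:]

    # Encode only the visible context; targets are computed on the full clip.
    s_ctx = context_encoder(video_tokens[:, visible])
    with torch.no_grad():
        s_tgt = target_encoder(video_tokens)[:, masked]

    # Predict representations of the masked regions, never their pixels.
    s_pred = predictor(s_ctx, masked_positions=masked)
    return F.l1_loss(s_pred, s_tgt)
```

Because the loss lives in representation space, the model is never asked to hallucinate the exact pixels of occluded regions, only their semantic content.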


7. V-JEPA2: From Video Representations to Robot Planning

V-JEPA2 (released June 2025) is a major upgrade to V-JEPA, showcasing the most exciting experimental results from the JEPA research program.

Key figures:

  • Trained on over 1 million hours of unlabeled video
  • Achieves zero-shot robot planning with only 62 hours of robot manipulation data

The significance of this result:

By learning the dynamics of motion from massive amounts of unlabeled video, the model acquires a powerful world representation. When transferred to robotic tasks, it requires only minimal task-specific data to produce reasonable plans.

This is precisely the extreme sample efficiency that JEPA pursues — good representation learning can drastically reduce the data requirements of downstream tasks. 62 hours versus 1 million hours represents a data efficiency leverage ratio exceeding 16,000x.


8. LeJEPA: Theoretical Completion

LeJEPA (Balestriero and LeCun, 2025) supplies the theoretical grounding that earlier JEPAs lacked. Its central results: the isotropic Gaussian is shown to be the optimal distribution for JEPA embeddings, and a regularizer called SIGReg (Sketched Isotropic Gaussian Regularization) enforces it directly, making collapse prevention provable rather than a matter of heuristics such as stop-gradient and EMA target encoders.

Several questions from LeCun's broader proposal remain open beyond LeJEPA's scope:

  • How to better incorporate action-conditioning
  • How to handle hierarchical representations
  • How to make predictions across different time scales

These questions represent the frontier of thinking along the JEPA research trajectory; see Open Questions below.


9. AMI Labs: From Academia to Industry

In March 2026, AMI Labs, co-founded by LeCun, announced a $1.03 billion seed round — the largest seed round in European history.

AMI Labs is positioned as:

Pursuing world models as a core pathway to build an alternative AI paradigm beyond LLMs.

This funding reflects enormous industry confidence in the JEPA approach. LeCun's central arguments are:

  1. LLMs are fundamentally limited because they predict tokens, not world states
  2. JEPA predicts abstract representations of world states, which is closer to genuine world understanding
  3. World models are the necessary path to more general intelligence, rather than incremental improvements to LLMs

Whether this argument is correct remains an open question. However, the founding of AMI Labs marks the transition of world model research from academic discussion to large-scale industrial investment.


10. LeCun's Core Thesis

LeCun's thinking on JEPA and world models can be organized into a logical chain:

  1. Humans learn world models from video and interaction, not from text
  2. The brain's world model operates at the level of abstract representations, not at the level of raw perception
  3. LLMs process only text — text is an extremely compressed and lossy encoding of world information
  4. Even when LLMs exhibit "common sense," this is merely statistical co-occurrence, not genuine world dynamics
  5. True world understanding requires predicting the evolution of world states in an abstract representation space
  6. JEPA is the architecture designed precisely for this purpose

LLMs learn "how people describe the world," whereas JEPA aims to learn "how the world works."


11. Relationship to Self-Supervised Learning

JEPA can be viewed as a particular form of self-supervised learning:

  • Contrastive learning (e.g., SimCLR): pulls positive pairs closer and pushes negative pairs apart. Prevents collapse only through explicit negative samples, which are costly to select well.
  • Generative self-supervision (e.g., MAE): masks part of the input and predicts the original pixels. Operates in pixel space, wasting capacity on fine details.
  • JEPA: masks part of the input and predicts in representation space. Requires no negative samples and does not operate in pixel space.

JEPA can be said to have absorbed the lessons of both contrastive learning and generative self-supervision, arriving at a middle path between the two.
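Schematically, with \(z\) denoting encoder outputs and \(\hat{y}\) a pixel reconstruction, the three objectives are:

\[\begin{aligned}
\text{Contrastive (InfoNCE-style):}\quad & -\log \frac{e^{\mathrm{sim}(z_x, z_y)/\tau}}{\sum_{y^{-}} e^{\mathrm{sim}(z_x, z_{y^{-}})/\tau}} \\
\text{Generative (MAE-style):}\quad & \lVert \hat{y} - y \rVert^2 \quad \text{(pixel space)} \\
\text{JEPA:}\quad & \lVert \text{Predictor}(z_x) - z_y \rVert^2 \quad \text{(representation space)}
\end{aligned}\]

JEPA keeps the regression form of the generative objective but moves it into the space where contrastive methods operate, which is the middle path described above.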


12. Open Questions

Despite the remarkable progress along the JEPA research trajectory, several fundamental questions remain unresolved:

  1. How to scale to broader domains: V-JEPA2 performs impressively in video and robotics, but can JEPA extend to more abstract domains such as language, reasoning, and social interaction?
  2. Incorporating action conditioning: For JEPA to become a complete world model, it must effectively incorporate action conditioning — this component is still under development
  3. Hierarchical prediction: Changes in the real world occur across multiple time scales (millisecond-level physical motion, second-level actions, minute-level events). How can hierarchical abstract prediction be achieved?
  4. Integration with planning: What else is needed to bridge the gap from a world model to actual decision-making and planning?

JEPA represents a compelling alternative paradigm — it bets that "understanding the world" is closer to the essence of intelligence than "generating text." Whether this bet is correct will become clear in the years ahead.

See 世界模型 (World Models) for the overall world model framework, and 空间智能与学习式仿真 (Spatial Intelligence and Learned Simulation) for other competing approaches.
