Object-Centric Learning
1. The Core Problem: Why Pixels Are Not Enough
When you see a photo of a table with a cup of coffee and a book on it, you don't think "there's a brown region next to a white region." You immediately perceive: a table, with a cup of coffee on it, and a book beside it. What you perceive are objects, not pixels.
Object-Centric Learning aims to endow AI systems with this same ability: automatically decomposing a scene into individual objects, each with its own attributes (position, shape, color, texture), and with explicit relationships between them (on top of, next to, occluding).
Humans understand the world in terms of objects, which enables compositional generalization — recombining known objects and relations to understand scenes never encountered before.
This ability is essential for human-like intelligence. If you know what a "cup" and a "table" are, and what the relation "on top of" means, then even if you have never seen that particular cup on that particular table, you can immediately understand the scene. This is Compositional Generalization — understanding an infinite number of combinations from a finite set of elements.
A model that processes raw pixels holistically, with no notion of objects, can fail completely when it encounters such novel combinations.
2. Core Mechanism: Slot Attention
Slot Attention is an attention mechanism proposed by Locatello et al. in 2020 that provides a concise yet powerful paradigm for object-centric learning. The core idea is:
Use a set of learnable "slots" to competitively bind to input features, so that each slot ultimately captures the representation of one object.
Workflow
Slot Attention operates in four steps:
Step 1: Initialization. Randomly initialize \(K\) slot vectors, all with the same dimensionality (in the original paper, by sampling from a shared learnable Gaussian). At the start, these slots do not represent any specific object — they are "empty containers."
Step 2: Attention computation. Compute attention weights between each slot and all input features. Crucially, the attention is softmax-normalized across the slot dimension, meaning that multiple slots compete with each other — they vie for the "right to explain" the same region of input features.
Step 3: Slot update. Each slot aggregates the input features it "won" via a weighted mean, then updates its own representation through a GRU (optionally followed by an MLP).
Step 4: Iteration. Repeat Steps 2 and 3 for several rounds until the slot assignments stabilize.
The elegance of this process lies in the competition mechanism: because of the softmax normalization, if Slot A has high attention over a certain region, the attention of other slots over that region is suppressed. This naturally drives different slots to "take responsibility" for different objects.
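The four steps above can be sketched in a few lines of NumPy. This is a deliberately simplified version: dot-product attention with the softmax taken over the slot axis, weighted-mean aggregation, and a plain replacement update instead of the GRU used in the original paper; all dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, num_iters=3, seed=0):
    """Simplified Slot Attention over `inputs` of shape (N, D).

    Returns (num_slots, D) slot vectors. Learned projections and the
    GRU update are omitted for clarity.
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Step 1: randomly initialize K slots ("empty containers").
    slots = rng.normal(size=(num_slots, d))
    for _ in range(num_iters):  # Step 4: iterate until assignments stabilize
        # Step 2: attention logits between every slot and every input feature.
        logits = slots @ inputs.T / np.sqrt(d)      # (K, N)
        # Crucial detail: softmax over the *slot* axis, so slots compete
        # for the "right to explain" each input feature.
        attn = softmax(logits, axis=0)              # each column sums to 1
        # Step 3: each slot aggregates the features it "won"
        # via a weighted mean over the inputs.
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ inputs                    # (K, D)
    return slots
```

The softmax over the slot axis (rather than the input axis, as in standard attention) is what implements the competition described above.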
Training
Slot Attention is typically trained under a self-supervised framework — the model must reconstruct the input image from its slot representations. Each slot is independently decoded into an image patch and a mask, and the outputs of all slots are combined to reconstruct the full scene. This approach requires no object-level annotations.
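The decode-and-composite step can be illustrated as follows. This is a hedged sketch: each slot is assumed to decode into an RGB patch and a mask logit, the masks are softmax-normalized across slots, and the scene is their alpha-weighted sum; the `decode` argument stands in for a learned decoder and is not part of any specific library API.

```python
import numpy as np

def composite_scene(slots, decode):
    """Reconstruct a scene from a list of slot vectors.

    `decode(slot)` must return an (H, W, 3) RGB patch and an (H, W)
    mask logit; in a real model it is a learned spatial decoder.
    """
    rgbs, logits = [], []
    for slot in slots:                  # each slot is decoded independently
        rgb, logit = decode(slot)
        rgbs.append(rgb)
        logits.append(logit)
    rgbs = np.stack(rgbs)               # (K, H, W, 3)
    logits = np.stack(logits)           # (K, H, W)
    # Softmax across slots: each pixel is explained by a mixture of slots.
    m = np.exp(logits - logits.max(axis=0, keepdims=True))
    masks = m / m.sum(axis=0, keepdims=True)
    # Alpha-composite the per-slot patches into the full reconstruction.
    return (masks[..., None] * rgbs).sum(axis=0)    # (H, W, 3)
```

Training then minimizes a pixel-wise reconstruction loss between this composite and the input image, which is why no object-level annotations are required.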
3. Why Object-Centric Learning Is Extremely Difficult
Despite the elegant framework offered by Slot Attention, object-centric learning faces six fundamental challenges in practice:
Challenge 1: Object boundaries are not naturally given
"What counts as an object" has no unique answer.
Is a cloud an object? What about a shadow? Is a river one object or an aggregate of infinitely many water molecules? Is a stack of books on a table "one stack" or "five books"?
Object boundaries depend on cognitive goals and levels of abstraction — there is no objectively "correct segmentation." The model must learn to define objects at an appropriate granularity, and that granularity is itself task-dependent.
Challenge 2: Occlusion, deformation, merging, splitting
Objects in the real world are not neat geometric primitives:
- Occlusion: A cup is half-hidden behind a book — the model must understand it is still a complete cup
- Deformation: Ropes bend, fabrics wrinkle, body poses change
- Merging: Two people walk close together and visually "fuse" in the image
- Splitting: A drop of water falls and splashes into many droplets
The model must maintain continuity of object identity through all these drastic visual changes.
Challenge 3: The number of objects is not fixed
A scene may contain 1 cup, or 5 people and 200 leaves. Slot Attention uses a fixed number of slots, which means an upper bound must be set in advance, and surplus slots must learn to "stay empty." Handling variable-length object sets and the dynamic addition or removal of objects remains an open problem.
Challenge 4: Combinatorial explosion of relations
As the number of objects grows, the relations between them grow combinatorially: who is on top of whom, who touched whom, who occludes whom, which interactions matter. For \(n\) objects, the number of potential pairwise relations is \(O(n^2)\). The model must discover objects while maintaining its capacity for relational reasoning, and it must also learn to ignore unimportant interactions.
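The \(O(n^2)\) scaling and the need to prune unimportant interactions can be made concrete with a small sketch. Here `score` and `message` are hypothetical stand-ins for small learned networks that, respectively, rate how much one object affects another and compute that effect.

```python
import numpy as np
from itertools import permutations

def relation_messages(slots, score, message, threshold=0.5):
    """Naive O(n^2) relational pass over object slots.

    For n slots there are n*(n-1) ordered pairs; `score` decides which
    interactions matter and `message` computes an effect for each kept
    pair, accumulated on the receiving object.
    """
    n = len(slots)
    effects = [np.zeros_like(slots[0]) for _ in range(n)]
    for i, j in permutations(range(n), 2):         # every ordered pair
        if score(slots[i], slots[j]) > threshold:  # ignore weak relations
            effects[j] = effects[j] + message(slots[i], slots[j])
    return effects
```

Even this toy version makes the problem visible: the loop body runs \(n(n-1)\) times, so with 200 leaves in the scene a model that cannot prune irrelevant pairs is doing tens of thousands of relational evaluations per step.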
Challenge 5: Extremely weak supervision
In real-world data, it is nearly impossible to obtain precise object-level annotations — "which object does this pixel belong to," "which object in frame 2 corresponds to object A in frame 1." The model must discover "objectness" from unsupervised or very weakly supervised data. This demands strong inductive biases so that object-level decomposition can emerge from reconstruction objectives or other proxy tasks.
Challenge 6: The world is not made of objects alone
Not all phenomena are well-suited to object-based representations.
Lighting, fluids, temperature fields, wind, sound waves — these are all field-like phenomena that permeate space without clear boundaries. A complete world representation cannot consist only of objects; it must also handle continuous fields. Purely object-centric methods may fall short in these scenarios.
4. Frontier Advances as of 2025
Over the past few years, object-centric learning has made the leap from synthetic data to real-world scenes. Several works in 2025 mark major breakthroughs in the field.
GLASS (CVPR 2025)
GLASS is the first work to apply Slot Attention to compositional image generation in complex real-world scenes. It combines Slot Attention with diffusion models:
- Each slot controls the generation of one object in the scene
- By composing the contents of different slots, it can generate object combinations never seen during training
- It achieves unprecedented compositional control on complex real-world scenes
The significance of GLASS is that it demonstrates Slot Attention can be used not only for analysis (decomposing a scene into objects) but also for generation (composing objects into a scene).
SlotAdapt (ICLR 2025)
SlotAdapt proposes a method that integrates Slot Attention with pretrained diffusion models, surpassing all previous methods on object discovery tasks. Its core idea is to leverage the rich visual priors already learned by diffusion models to assist Slot Attention in object decomposition.
This "standing on the shoulders of giants" strategy is crucial: rather than learning object concepts from scratch, it harnesses the visual knowledge already captured by large-scale pretrained models.
Theoretical Breakthrough: Provable Compositional Generalization
Brendel's research group has theoretically proven that when Slot Attention is trained with a Compositional Consistency Loss, it can achieve provably compositional generalization.
This is a highly significant result because it is not merely empirical ("it works well on some benchmark") but a theoretical guarantee: if the training procedure satisfies certain conditions, the model is guaranteed to generalize to novel object combinations.
5. Slot Attention Compared with Other Methods
| Method | Core Idea | Strengths | Limitations |
|---|---|---|---|
| Traditional object detection (YOLO, Faster R-CNN) | Supervised learning; predict bounding boxes and classes | High accuracy; mature engineering | Requires extensive annotations; cannot discover novel categories |
| Semantic segmentation | Pixel-level classification | Fine spatial resolution | Still requires annotations; lacks object-level reasoning |
| Slot Attention | Competitive attention; self-supervised decomposition | No annotations needed; can discover objects | Struggles with complex scenes; fixed number of slots |
| VAE-based methods | Learn disentangled latent variables | Probabilistic framework; can quantify uncertainty | Decomposition granularity is hard to control |
The fundamental difference between Slot Attention and traditional methods is that it does not train a classifier on known categories; instead, it discovers the concept of "object" itself from data. This is closer to how human infants learn — infants do not begin distinguishing objects only after being told "this is a cup, this is a table"; rather, they are innately predisposed to decompose visual scenes into independent entities.
6. Connections to Other Directions in Human-Like Intelligence
Object-Centric Learning and World Models
Object-centric representations provide a structured vocabulary for world models. An object-centric world model does not predict \(s_{t+1}\) in a monolithic latent space; instead, it predicts how each object's state evolves over time:
- A cup slides off the edge of a table (change in position)
- A ball is struck and flies away (change in velocity)
- Two blocks collide and change direction (interaction effects)
This decomposition makes world models more interpretable, easier to generalize (adding a new object only requires adding a new slot), and more amenable to causal reasoning (change one object's attributes and observe the effect on others).
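The per-object transition described above can be sketched as follows. This is a hypothetical illustration, not any specific published architecture: `dynamics` and `interaction` stand in for learned networks, and each object's next state is its own evolution plus the summed effects of the other objects on it.

```python
import numpy as np

def step_world(states, dynamics, interaction):
    """One transition of an object-centric world model.

    `states` is (K, D): one state vector per object slot. Instead of
    predicting a monolithic s_{t+1} in one latent space, each object's
    next state is computed from its own dynamics plus pairwise
    interaction effects.
    """
    k = states.shape[0]
    next_states = np.empty_like(states)
    for i in range(k):
        self_term = dynamics(states[i])                     # own evolution
        pair_terms = sum(interaction(states[j], states[i])  # effects of all
                         for j in range(k) if j != i)       # other objects
        next_states[i] = self_term + pair_terms
    return next_states
```

Note how the generalization claim in the text shows up structurally: adding a new object just appends a row to `states`, and the same `dynamics` and `interaction` functions apply to it unchanged.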
Object-Centric Learning and Compositionality
Compositionality is one of the core features of human cognition. We use a finite set of concepts (object types, attributes, relations) to understand an infinite variety of scenes. Object-centric learning is the foundation for achieving this compositionality:
First decompose the world into objects and relations, then represent new scenes as combinations of objects and relations.
If a model lacks object-level representations, it will struggle to achieve true compositional generalization — because it does not know what the recombinable "parts" are.
Object-Centric Learning and Causal Learning
Object-centric representations are also a natural substrate for causal learning. Causal relationships typically operate at the object level — "the ball hit the cup, and the cup fell over" — not at the pixel level. If a model can first decompose a scene into objects and then learn the causal relationships among them, the entire causal learning problem becomes more structured and tractable.
7. Logical Chain
- Humans perceive the world in terms of objects, which makes compositional generalization possible.
- Slot Attention achieves self-supervised decomposition from pixels to object representations through a competitive attention mechanism.
- Object-centric learning faces six fundamental challenges: ambiguous boundaries, occlusion and deformation, variable object counts, relational explosion, extremely weak supervision, and the existence of fields.
- Frontier works in 2025 (GLASS, SlotAdapt) have pushed Slot Attention toward complex real-world scenes, and theoretical work has provided provable guarantees for compositional generalization.
- Object-centric representations serve as a structured vocabulary for world models, a foundation for compositional generalization, and a natural substrate for causal learning.
- Object-centric learning is not a panacea — field-like phenomena (lighting, fluids) are not well-suited to purely object-based representations, and a complete understanding of the world requires a unification of objects and fields.