
Causal Learning

1. The Core Insight: Correlation Is Not Causation

The vast majority of modern machine learning methods are, at their core, doing the same thing: discovering statistical correlations from data. Given enough data, a model can accurately estimate \(P(Y|X)\) — "what is the probability of Y given that we observe X?" But there is a fundamental problem:

Correlation does not imply causation.

This is far from a pedantic academic point. Ice cream sales and drowning deaths are highly correlated, but banning ice cream will not reduce drownings. A model that has only learned statistical correlations will systematically fail when confronted with distribution shifts, new environments, or scenarios that require interventional decision-making.

Causal Learning aims to enable AI systems to learn the causal relationships between variables, rather than merely statistical associations. The intellectual foundation of this field comes from Judea Pearl, who received the 2011 Turing Award for his formalization of causal inference.


2. The Causal Hierarchy: Seeing, Doing, Imagining

Pearl proposed a profound classification framework that divides our understanding of the world into three levels, known as the Causal Hierarchy:

| Level | Name | Typical Question | Mathematical Form |
|-------|------|------------------|-------------------|
| Level 1 | Association | If I see X, what happens to Y? | \(P(Y \mid X)\) |
| Level 2 | Intervention | If I actively change X, what happens to Y? | \(P(Y \mid do(X))\) |
| Level 3 | Counterfactual | If X had been different, what would Y have been? | \(P(Y_x \mid X = x', Y = y')\) |

Level 1: Association (Seeing)

This is the purely observational level. "A patient took the medication and recovered" — \(P(\text{recovery} \mid \text{medication})\) may be high. But this does not mean the medication is effective — perhaps only patients with mild symptoms chose to take the medication, while severe cases were hospitalized directly. The correlations in observational data are contaminated by confounders.

The vast majority of current machine learning models, including large language models, operate at this level. They are extraordinarily good at capturing statistical patterns in data, but cannot distinguish genuine causal mechanisms from spurious correlations.

Level 2: Intervention (Doing)

The interventional level answers the question: if I actively change a variable, what happens to the outcome? The key notation here is \(do(X)\), which does not mean "X was observed to happen," but rather "I force X to happen."

\(P(Y \mid X)\) and \(P(Y \mid do(X))\) can be entirely different.

A classic example: in observational data, people carrying umbrellas have a high probability of encountering rain — \(P(\text{rain} \mid \text{umbrella})\) is large. But if you force everyone to carry an umbrella (\(do(\text{umbrella})\)), the weather will not change. Carrying an umbrella and rain share a common cause (the weather forecast predicted rain), but carrying an umbrella does not cause rain.
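The gap between \(P(Y \mid X)\) and \(P(Y \mid do(X))\) can be seen in a few lines of simulation. In this sketch (all probabilities are made up for illustration), the forecast is a common cause of both umbrella-carrying and rain; observing an umbrella raises the probability of rain, while forcing everyone to carry one leaves it unchanged:

```python
import random

random.seed(0)

def sample(do_umbrella=None):
    """Draw one day from a toy SCM: forecast -> umbrella, forecast -> rain."""
    forecast_rain = random.random() < 0.3       # exogenous: forecast says rain
    if do_umbrella is None:
        umbrella = forecast_rain                # observational: follow forecast
    else:
        umbrella = do_umbrella                  # intervention: force umbrella
    rain = random.random() < (0.8 if forecast_rain else 0.1)
    return umbrella, rain

# Observational regime: condition on seeing an umbrella.
days = [sample() for _ in range(100_000)]
with_umbrella = [r for u, r in days if u]
p_rain_given_umbrella = sum(with_umbrella) / len(with_umbrella)

# Interventional regime: do(umbrella = True) for everyone.
forced = [sample(do_umbrella=True) for _ in range(100_000)]
p_rain_do_umbrella = sum(r for _, r in forced) / len(forced)

print(f"P(rain | umbrella)     ~ {p_rain_given_umbrella:.2f}")  # ~0.80
print(f"P(rain | do(umbrella)) ~ {p_rain_do_umbrella:.2f}")     # ~0.31
```

Conditioning selects the days when the forecast predicted rain; intervening breaks the edge from forecast to umbrella, so rain falls back to its marginal rate.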

Interventional reasoning is a core requirement in medicine, economic policy, engineering control, and other fields. Randomized Controlled Trials (RCTs) are the gold-standard method humans use to obtain interventional knowledge.

Level 3: Counterfactual (Imagining)

This is the highest level of causal reasoning. Counterfactuals ask: given what actually happened, what would have occurred if a certain condition had been different?

"This patient took the medication and recovered. Would he have recovered if he had not taken the medication?"

Counterfactual reasoning requires us not only to understand causal mechanisms, but also to mentally "rewind" to a certain point in time, change a condition, and then "replay" the entire process. This is an extremely common ability in everyday human thinking — regret, planning, attribution of responsibility, and learning from experience all rely on counterfactual reasoning.

Pearl's central thesis is:

Current machine learning is essentially stuck at Level 1. Human intelligence operates across all three levels. To achieve human-like intelligence, we must ascend the causal hierarchy.


3. Structural Causal Models: The Formal Language of Causal Reasoning

The Structural Causal Model (SCM) is the mathematical framework for causal reasoning proposed by Pearl. An SCM consists of three components:

  1. Directed Acyclic Graph (DAG): Nodes represent variables, and directed edges represent causal relationships (\(X \rightarrow Y\) means X is a direct cause of Y)
  2. Structural Equations: Each variable is determined by its parent nodes and a noise term, \(Y = f(X, U)\)
  3. Distribution of Exogenous Variables: The probability distribution of the noise terms \(U\)

The power of SCMs lies in their ability to uniformly handle all three levels of the causal hierarchy:

  • Association: Conditional independence relationships derived from the DAG
  • Intervention: Computing interventional effects via do-calculus and graph surgery
  • Counterfactual: The "abduction–intervention–prediction" three-step procedure using structural equations
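The abduction–intervention–prediction procedure can be sketched on a toy linear SCM (the two-variable structure and the coefficient 2 are assumptions made purely for illustration):

```python
# Toy SCM (illustrative):
#   X = U_x                (exogenous)
#   Y = 2*X + U_y          (structural equation; the coefficient 2 is assumed)

def counterfactual_y(x_obs, y_obs, x_new):
    """Answer 'given we saw (x_obs, y_obs), what would Y be if X were x_new?'"""
    # 1. Abduction: infer the exogenous noise consistent with the observation.
    u_y = y_obs - 2 * x_obs
    # 2. Intervention: replace the equation for X with the constant X := x_new.
    x = x_new
    # 3. Prediction: re-run the remaining equations with the inferred noise.
    return 2 * x + u_y

# "We observed X=1, Y=3. What would Y have been if X had been 0?"
print(counterfactual_y(x_obs=1, y_obs=3, x_new=0))  # -> 1
```

The key point is that the inferred noise \(u_y\) is carried over from the factual world into the hypothetical one — this is what distinguishes a counterfactual from a fresh interventional prediction.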

Key Technical Concepts

Confounders: Third-party variables that simultaneously influence both the cause variable and the effect variable. Confounders are the primary obstacle to causal inference — they make observed correlations differ from true causal effects.

Backdoor Criterion: A criterion given by Pearl — if we can find a set of variables Z that "blocks" all backdoor paths (i.e., confounding paths) from X to Y, then conditioning on Z allows us to recover the causal effect from observational data.

Frontdoor Criterion: An alternative strategy for when confounders cannot be directly measured. Through an intermediate variable M (where X affects M, M affects Y, and M is not directly influenced by the confounder), the causal effect can still be identified.

Do-calculus: A set of inference rules invented by Pearl that can systematically transform expressions containing the \(do\) operator into expressions that can be estimated from observational data. Do-calculus is complete — if a causal effect is theoretically identifiable from observational data, do-calculus is guaranteed to derive it.
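The backdoor adjustment formula, \(P(Y \mid do(X=x)) = \sum_z P(Y \mid X=x, Z=z)\,P(Z=z)\), can be checked numerically on a toy confounded system (all probabilities below are invented for the sketch; the true causal effect is built in as +0.30):

```python
import random

random.seed(1)

# Toy SCM with a binary confounder Z:  Z -> X,  Z -> Y,  X -> Y.
def draw():
    z = random.random() < 0.5
    x = random.random() < (0.8 if z else 0.2)      # Z makes treatment likelier
    y = random.random() < 0.2 + 0.3 * x + 0.4 * z  # true effect of X is +0.30
    return z, x, y

data = [draw() for _ in range(200_000)]

# Naive comparison P(Y | X=1) - P(Y | X=0): biased, because Z is ignored.
treated = [s for s in data if s[1]]
control = [s for s in data if not s[1]]
naive = (sum(s[2] for s in treated) / len(treated)
         - sum(s[2] for s in control) / len(control))

# Backdoor adjustment:  P(Y | do(X=x)) = sum_z P(Y | X=x, Z=z) * P(Z=z)
def p_y_do(x):
    total = 0.0
    for z in (False, True):
        p_z = sum(1 for s in data if s[0] == z) / len(data)
        cell = [s for s in data if s[1] == x and s[0] == z]
        total += p_z * sum(s[2] for s in cell) / len(cell)
    return total

adjusted = p_y_do(True) - p_y_do(False)
print(f"naive difference:  {naive:.2f}")     # inflated by confounding
print(f"backdoor adjusted: {adjusted:.2f}")  # close to the true +0.30
```

Conditioning on Z within each stratum and then averaging over the marginal \(P(Z)\) removes the bias that the naive conditional comparison inherits from the confounder.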


4. Why Causal Learning Is Essential for Human-Like Intelligence

Robust Generalization

Statistical models perform well on the training distribution but suffer sharp performance degradation under distribution shift. Causal models are different:

Causal mechanisms are modular and remain invariant across environments.

If a model has learned the causal mechanism "flipping the switch causes the light to turn on," this mechanism holds regardless of the room's color, the time of day, or the weather. A model that has only learned statistical correlations might associate "it is dark" with "the light is on," and fail in a daytime scenario where the switch is flipped.

Sample Efficiency

Knowing the causal structure drastically shrinks the hypothesis space. If you know that temperature depends only on heating power and heat dissipation conditions, you do not need to consider irrelevant variables such as room color or background music. This is one of the reasons human learning is so efficient — we are naturally inclined to seek causal explanations rather than memorize all statistical regularities.

Counterfactual Reasoning

Planning and imagination are, at their core, counterfactual reasoning: "What would happen if I took this route? What if I chose a different plan?" Without counterfactual reasoning ability, an agent can only learn by trial and error, unable to mentally simulate different courses of action.

Transferability

Causal mechanisms possess modularity: changing one mechanism does not affect the others. This means causal knowledge can be transferred and reused across different environments, whereas statistical models often need to be retrained in new settings.


5. Causal Representation Learning: Bridging Deep Learning and Causal Inference

Traditional causal inference operates under a default assumption: the causal variables are known and directly observable. In practice, however, we often face raw pixels, audio waveforms, or text sequences — high-level causal variables (such as "object position," "lighting conditions," or "user intent") are hidden within low-level data.

Causal Representation Learning is a direction championed by Yoshua Bengio, Bernhard Schölkopf, and others, aimed at solving this bridging problem:

How can we automatically discover high-level causal variables and their causal relationships from low-level observational data?

The core difficulty of this problem lies in identifiability: to what extent can we uniquely recover the true causal structure from observational data alone? In the unconstrained case, the answer is generally pessimistic — infinitely many equivalent causal models can explain the same data.

A key recent advance is the finding that, under appropriate assumptions (e.g., data from multiple distinct environments, the presence of temporal structure, or certain independence conditions), causal structure can be identified. This provides a theoretical foundation for "learning causal structure from data."


6. GFlowNets: Turning Causal Graph Search into a Generation Problem

GFlowNets (Generative Flow Networks) are a novel class of generative models proposed by Yoshua Bengio's team, playing a unique role in causal learning.

Traditional causal structure learning faces a combinatorial explosion: the number of possible DAGs over \(n\) variables grows super-exponentially with \(n\). Greedy search easily gets trapped in local optima, and MCMC sampling is inefficient.

The core idea of GFlowNets is to transform DAG search into a sequential generation problem:

  1. Treat constructing a DAG as a sequence of "add edge" actions
  2. Train a generative model that generates DAGs with probability proportional to the posterior probability (or reward)
  3. The generation process satisfies a "flow conservation" condition, ensuring correctness of sampling
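Step 1 can be made concrete with a minimal sketch of the action space: from any partial DAG, the valid actions are the edge additions that preserve acyclicity. (The trajectory-balance training objective that makes the learned sampler proportional to the reward is omitted here; this only enumerates the state space a GFlowNet would sample from.)

```python
def is_acyclic(edges, n):
    """Detect cycles with a depth-first search for a back edge."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
    seen, stack = set(), set()
    def dfs(u):
        seen.add(u); stack.add(u)
        for v in adj[u]:
            if v in stack or (v not in seen and dfs(v)):
                return True
        stack.discard(u)
        return False
    return not any(dfs(i) for i in range(n) if i not in seen)

def valid_actions(state, n):
    """'Add edge' actions from a partial DAG that keep it acyclic."""
    return [(u, v) for u in range(n) for v in range(n)
            if u != v and (u, v) not in state
            and is_acyclic(state | {(u, v)}, n)]

def all_dags(n):
    """Enumerate every DAG reachable by sequential edge additions."""
    seen, frontier = {frozenset()}, [frozenset()]
    while frontier:
        state = frontier.pop()
        for a in valid_actions(state, n):
            nxt = frozenset(state | {a})
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

print(len(all_dags(3)))  # 25 DAGs on 3 labeled nodes
```

Even at \(n = 3\) there are 25 DAGs, and the count grows super-exponentially — which is why a learned sampler over construction trajectories is attractive compared to exhaustive search.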

The distinctive feature of GFlowNets is that they simultaneously connect three fields:

| Field | Connection |
|-------|------------|
| Reinforcement Learning | DAG generation is treated as a sequential decision problem |
| Generative Models | A generative distribution over DAGs is learned |
| Energy-Based Probabilistic Modeling | The target distribution can be defined by an energy function |

The advantage of this approach is that it does not merely produce a single "optimal" causal graph, but yields a full posterior distribution over causal graphs — which is crucial for expressing uncertainty about causal structure.


7. Connections to Other Directions in Human-Like Intelligence

Causal Learning and World Models

A true world model must support interventional reasoning:

"If I do X, how will the world change?"

This is precisely the core capability of Level 2 in the causal hierarchy. Merely predicting "what will happen next" is insufficient — a world model must also be able to answer "what would happen next if I changed a certain condition." A world model without causal structure is essentially just a sophisticated pattern matcher, incapable of supporting planning and decision-making.

Causal Learning and the Spurious Causality Problem

In earlier notes, we discussed shortcut learning and spurious causality: models mistaking statistical correlations for causal mechanisms. Causal learning is the fundamental remedy for spurious causality — if a model can distinguish "X and Y co-occur" from "X causes Y," it will not be deceived by spurious correlations.

Causal Learning and Foundation Models

A frontier trend as of 2025 is the integration of causal methods with foundation models. Large language models possess rich world knowledge, but this knowledge is stored in the form of statistical correlations. How to inject causal reasoning capabilities into foundation models, or leverage the knowledge of foundation models to assist causal discovery, is becoming an active research direction.


8. Key Challenges and Open Problems

The Identifiability Problem

When can we uniquely recover causal structure from observational data alone?

This is the central theoretical question of causal representation learning. Causal learning methods without identifiability guarantees may learn incorrect causal structures. Important progress has been made in recent years within frameworks such as nonlinear ICA and multi-environment learning, but fully general identifiability conditions remain an open problem.

Scalability Challenges

Real-world causal systems may involve thousands of variables, and searching the DAG space is extremely costly. How to scale causal discovery algorithms to large-scale, high-dimensional data is a bottleneck for practical applications.

Deep Integration with Deep Learning

Causal inference and deep learning remain, to a large extent, two relatively independent communities. Causal methods typically assume variables are known and finite, while deep learning handles high-dimensional raw data. Truly bridging the two requires further breakthroughs in causal representation learning.


9. Summary of the Logical Chain

  1. Current ML primarily operates at Level 1 (association) of the causal hierarchy; human intelligence operates across all three levels.
  2. SCMs provide a formal framework for causal reasoning, uniformly handling association, intervention, and counterfactuals.
  3. Causal learning yields robust generalization, sample efficiency, counterfactual reasoning, and transferability — all key capabilities for human-like intelligence.
  4. Causal representation learning seeks to discover causal variables from raw data, serving as the bridge between deep learning and causal inference.
  5. GFlowNets transform causal graph search into a generation problem, simultaneously connecting RL, generative models, and energy-based modeling.
  6. A true world model must support interventional reasoning; causal structure is a core component of world models.
  7. Identifiability is the fundamental theoretical challenge of causal learning — without solving identifiability, causal discovery lacks guarantees.
