Causal Machine Learning

Introduction

Traditional machine learning captures correlations in data, while causal inference asks "why" -- identifying causal relationships between variables. Causal inference is the critical step from "prediction" to "intervention" and "counterfactual reasoning."

Related content: Causal Inference


1. Causation vs Correlation

1.1 Simpson's Paradox

A classic example showing how correlation can mislead causal judgment:

| Group | Drug Group Recovery | Control Group Recovery | Conclusion |
|---|---|---|---|
| Male | 93% | 87% | Drug effective |
| Female | 73% | 69% | Drug effective |
| Overall | 78% | 83% | Drug ineffective? |

Reason: women were more likely to take the drug, and women had lower baseline recovery rates -- gender is a confounder.
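The rates above do not pin down group sizes; one set of counts that reproduces them (the numbers from Pearl's well-known version of this example, used here as an assumed illustration) makes the reversal explicit:

```python
# (recovered, total) per (sex, group); counts reproduce the rates in the table
groups = {
    ("male",   "drug"):    (81,  87),
    ("male",   "control"): (234, 270),
    ("female", "drug"):    (192, 263),
    ("female", "control"): (55,  80),
}

def rate(pairs):
    recovered = sum(r for r, n in pairs)
    total = sum(n for r, n in pairs)
    return recovered / total

for sex in ("male", "female"):
    d = rate([groups[(sex, "drug")]])
    c = rate([groups[(sex, "control")]])
    print(f"{sex}: drug {d:.0%} vs control {c:.0%}")   # drug wins in each stratum

drug_overall = rate([v for k, v in groups.items() if k[1] == "drug"])
control_overall = rate([v for k, v in groups.items() if k[1] == "control"])
# aggregation reverses the comparison: 78% vs 83%
print(f"overall: drug {drug_overall:.0%} vs control {control_overall:.0%}")
```

Because women (lower baseline recovery) were over-represented in the drug group, pooling mixes the strata in different proportions and flips the sign.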

1.2 Pearl's Ladder of Causation

| Level | Question Type | Mathematical Tool | Example |
|---|---|---|---|
| 1. Association | What is observed? | \(p(Y \mid X)\) | Is recovery higher among those who took the drug? |
| 2. Intervention | What if I do it? | \(p(Y \mid do(X))\) | If everyone takes the drug, will recovery improve? |
| 3. Counterfactual | What if things had been different? | \(p(Y_x \mid X', Y')\) | Would this patient have recovered without the drug? |

2. Structural Causal Models (SCM)

2.1 Definition

A structural causal model consists of a triple \((U, V, F)\):

  • \(U\): exogenous variables (not explained by other variables in the model)
  • \(V\): endogenous variables (determined by structural equations)
  • \(F\): set of structural equations \(v_i = f_i(\text{pa}_i, u_i)\)

2.2 Example

U_X → X ──→ Y ←── U_Y
      │     ↑
      └──→ Z ←── U_Z

Structural equations:
  X = f_X(U_X)
  Z = f_Z(X, U_Z)
  Y = f_Y(X, Z, U_Y)

An SCM simultaneously encodes:

  • Observational distribution \(p(X, Y, Z)\)
  • Interventional distribution \(p(Y | do(X=x))\)
  • Counterfactuals \(Y_{X=x}\)
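A minimal simulation of this SCM, assuming linear structural equations and standard-normal exogenous noise (the coefficients 0.8, 1.0, 0.5 are arbitrary choices); intervening means replacing the equation for \(X\) with a constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_x=None):
    """Sample the SCM; do_x replaces X's structural equation (the do-operator)."""
    u_x, u_z, u_y = rng.normal(size=(3, n))
    x = u_x if do_x is None else np.full(n, do_x)
    z = 0.8 * x + u_z             # Z = f_Z(X, U_Z), assumed linear
    y = 1.0 * x + 0.5 * z + u_y   # Y = f_Y(X, Z, U_Y), assumed linear
    return x, z, y

x, z, y = sample()                # observational distribution p(X, Y, Z)
_, _, y_do1 = sample(do_x=1.0)    # interventional distribution under do(X = 1)
print(y_do1.mean())               # ≈ 1.0*1 + 0.5*0.8*1 = 1.4
```

The same three objects the SCM encodes correspond to the three calls: sampling gives the observational distribution, `do_x` gives the interventional one, and rerunning with the same exogenous draws but a different `do_x` would give counterfactuals.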

3. Causal Graphs (DAG)

3.1 Basic Causal Graph Structures

Chain:     X → Z → Y    (Z is a mediator)
Fork:      X ← Z → Y    (Z is a confounder)
Collider:  X → Z ← Y    (Z is a collider)

d-separation: determines whether two variables are conditionally independent given certain other variables.

Rules:

  • Chain \(X \to Z \to Y\): conditioning on \(Z\), \(X \perp Y | Z\)
  • Fork \(X \leftarrow Z \to Y\): conditioning on \(Z\), \(X \perp Y | Z\)
  • Collider \(X \to Z \leftarrow Y\): without conditioning on \(Z\), \(X \perp Y\); conditioning on \(Z\), \(X \not\perp Y | Z\)
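The collider rule is the least intuitive one; a small simulation (assumed linear collider with Gaussian noise) shows two independent variables becoming dependent once we condition on their common effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)              # X and Y are independent by construction
z = x + y + rng.normal(size=n)      # collider: X -> Z <- Y

corr_marginal = np.corrcoef(x, y)[0, 1]
selected = z > 1.0                  # conditioning on (selecting by) the collider
corr_selected = np.corrcoef(x[selected], y[selected])[0, 1]
print(corr_marginal)                # ≈ 0: marginally independent
print(corr_selected)                # clearly negative: dependence induced
```

This is the "explaining away" effect: within the selected group, a large \(X\) makes a large \(Y\) less necessary to account for the high \(Z\).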

4. do-Calculus

4.1 The do Operator

\(do(X = x)\) represents an intervention -- forcefully setting \(X = x\) and severing all causal arrows pointing into \(X\).

\[ p(Y | do(X = x)) \neq p(Y | X = x) \]

Observational conditioning \(p(Y|X=x)\): the distribution of \(Y\) when \(X=x\) is observed (may be confounded)

Intervention \(p(Y|do(X=x))\): the distribution of \(Y\) after actively setting \(X=x\) (confounding eliminated)

4.2 Backdoor Criterion

A set of variables \(Z\) satisfies the backdoor criterion (relative to \(X \to Y\)) if:

  1. \(Z\) blocks all backdoor paths from \(X\) to \(Y\) (paths that begin with an arrow pointing into \(X\))
  2. \(Z\) contains no descendant of \(X\)

When the backdoor criterion is satisfied:

\[ p(Y | do(X = x)) = \sum_z p(Y | X = x, Z = z) \, p(Z = z) \]
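The adjustment formula can be checked on synthetic binary data (an assumed toy model with a single binary confounder and a true effect of 0.2), comparing the confounded contrast with the backdoor-adjusted one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.binomial(1, 0.5, n)                          # confounder Z
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))      # Z -> X
y = rng.binomial(1, 0.3 + 0.2 * x + 0.4 * z)         # X -> Y, Z -> Y (true effect 0.2)

def p1(arr, mask):
    """Empirical P(arr = 1) within mask."""
    return arr[mask].mean()

naive = p1(y, x == 1) - p1(y, x == 0)                # confounded contrast
adjusted = sum(                                      # backdoor adjustment over Z
    (p1(y, (x == 1) & (z == v)) - p1(y, (x == 0) & (z == v))) * (z == v).mean()
    for v in (0, 1)
)
print(naive, adjusted)   # naive ≈ 0.44 (biased); adjusted ≈ 0.20 (true effect)
```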

4.3 Frontdoor Criterion

When confounders cannot be directly controlled, if a mediator variable \(M\) exists:

\[ X \to M \to Y, \quad U \to X, \quad U \to Y \]

Frontdoor adjustment formula:

\[ p(Y | do(X = x)) = \sum_m p(M = m | X = x) \sum_{x'} p(Y | X = x', M = m) p(X = x') \]
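On a toy binary model with an *unobserved* confounder \(U\) (all parameters below are assumed for illustration; the true interventional effect works out to 0.21), the frontdoor formula recovers the effect from purely observational quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
u = rng.binomial(1, 0.5, n)                    # unobserved confounder
x = rng.binomial(1, 0.2 + 0.6 * u)             # U -> X
m = rng.binomial(1, 0.1 + 0.7 * x)             # X -> M (the only path X -> Y)
y = rng.binomial(1, 0.2 + 0.3 * m + 0.4 * u)   # M -> Y, U -> Y

def pr(mask):
    return mask.mean()

def front_door(x_val):
    total = 0.0
    for m_val in (0, 1):
        p_m = pr(m[x == x_val] == m_val)                 # p(M = m | X = x)
        inner = sum(                                     # sum over x'
            pr(y[(x == xp) & (m == m_val)] == 1) * pr(x == xp)
            for xp in (0, 1)
        )
        total += p_m * inner
    return total

naive = pr(y[x == 1] == 1) - pr(y[x == 0] == 1)
effect = front_door(1) - front_door(0)
print(naive, effect)   # naive ≈ 0.45 (confounded); frontdoor ≈ 0.21 (true effect)
```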

5. Causal Effect Estimation

5.1 Treatment Effects

| Concept | Definition | Meaning |
|---|---|---|
| ATE | \(\mathbb{E}[Y(1) - Y(0)]\) | Average Treatment Effect |
| ATT | \(\mathbb{E}[Y(1) - Y(0) \mid T=1]\) | Average Treatment Effect on the Treated |
| CATE | \(\mathbb{E}[Y(1) - Y(0) \mid X=x]\) | Conditional Average Treatment Effect |
| ITE | \(Y_i(1) - Y_i(0)\) | Individual Treatment Effect (unobservable) |

The fundamental problem: counterfactuals are unobservable -- we cannot simultaneously observe the same individual with and without treatment.

5.2 Estimation Methods

| Method | Applicable Conditions | Approach |
|---|---|---|
| Randomized Experiment (RCT) | Random assignment possible | Gold standard; direct comparison |
| Propensity Score Matching | Ignorability assumption | Match treated and untreated individuals with similar covariates |
| Inverse Probability Weighting (IPW) | Ignorability assumption | Weight by inverse treatment probability |
| Instrumental Variables (IV) | A valid instrument exists | Use exogenous variation for identification |
| Regression Discontinuity (RDD) | Treatment assigned at a threshold | Near-random assignment at the cutoff |
| Difference-in-Differences (DID) | Panel data | Combine pre-post and between-group differences |
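Of these, IPW is the easiest to sketch. A minimal synthetic example with a *known* propensity score (an assumed simplification; in practice the propensity is estimated, e.g. by logistic regression of \(T\) on the confounders):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                          # observed confounder
p_t = 1.0 / (1.0 + np.exp(-z))                  # true propensity score (logistic in Z)
t = rng.binomial(1, p_t)
y = 2.0 * t + 1.5 * z + rng.normal(size=n)      # true ATE = 2.0

naive = y[t == 1].mean() - y[t == 0].mean()     # confounded by Z
ate_ipw = np.mean(t * y / p_t) - np.mean((1 - t) * y / (1 - p_t))
print(naive, ate_ipw)   # naive is biased upward; IPW recovers ≈ 2.0
```

Reweighting by \(1/p(T \mid Z)\) creates a pseudo-population in which treatment is independent of the confounder, which is exactly what randomization buys in an RCT.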

5.3 Double Machine Learning

Proposed by Chernozhukov et al. (2018), combining machine learning with causal inference:

\[ \hat{\tau} = \frac{\sum_{i=1}^{n} \hat{T}_i^{res} \, \hat{Y}_i^{res}}{\sum_{i=1}^{n} \bigl(\hat{T}_i^{res}\bigr)^2} \]

Steps (here \(T\) is the treatment and \(X\) the covariates/controls):

  1. Fit an ML model \(Y \sim X\), obtain residuals \(\hat{Y}^{res}\)
  2. Fit an ML model \(T \sim X\), obtain residuals \(\hat{T}^{res}\)
  3. Regress \(\hat{Y}^{res}\) on \(\hat{T}^{res}\) to estimate the causal effect
  4. Use cross-fitting to avoid overfitting bias

from econml.dml import DML
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

dml = DML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingRegressor(),
    model_final=LinearRegression()
)
dml.fit(Y, T, X=X, W=W)
ate = dml.ate(X)
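The four steps can also be sketched from scratch. The binned-mean regressor below is an assumed stand-in for an arbitrary ML model; the final estimate is the residual-on-residual regression coefficient, computed with two-fold cross-fitting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                      # covariate
g = np.sin(3 * x)                           # nonlinear confounding signal
t = g + rng.normal(size=n)                  # treatment depends on X
y = 2.0 * t + g + rng.normal(size=n)        # true effect tau = 2.0

def fit_predict(x_tr, v_tr, x_te, bins=50):
    """Binned-mean regressor: a crude stand-in for any ML model."""
    edges = np.quantile(x_tr, np.linspace(0, 1, bins + 1))
    b_tr = np.clip(np.searchsorted(edges, x_tr) - 1, 0, bins - 1)
    b_te = np.clip(np.searchsorted(edges, x_te) - 1, 0, bins - 1)
    means = np.array([v_tr[b_tr == b].mean() for b in range(bins)])
    return means[b_te]

# Cross-fitting: residualize each half using models fit on the other half.
half = n // 2
y_res = np.empty(n)
t_res = np.empty(n)
for tr, te in [(slice(0, half), slice(half, n)), (slice(half, n), slice(0, half))]:
    y_res[te] = y[te] - fit_predict(x[tr], y[tr], x[te])
    t_res[te] = t[te] - fit_predict(x[tr], t[tr], x[te])

tau_hat = (t_res @ y_res) / (t_res @ t_res)   # residual-on-residual OLS
print(tau_hat)   # ≈ 2.0
```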

6. Causal Discovery

Automatically discovering causal structure (learning causal graphs) from data.

6.1 Constraint-Based Methods

| Algorithm | Approach | Output |
|---|---|---|
| PC | Conditional independence tests | CPDAG (partial DAG) |
| FCI | CI tests; allows latent confounders | PAG |
| GES | Greedy score-based search (see 6.2; not constraint-based) | CPDAG |

PC algorithm flow:

1. Start with a fully connected undirected graph
2. Conditional independence test: if X ⊥ Y | Z, remove X-Y edge
3. Orient collider structures: X → Z ← Y
4. Propagate orientations (avoid new colliders and cycles)
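Step 2 can be sketched on a simulated chain, using partial correlation as the conditional-independence test (an assumed choice for illustration; real implementations use proper CI tests such as Fisher's z or kernel tests):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
z = x + rng.normal(size=n)      # X -> Z
y = z + rng.normal(size=n)      # Z -> Y  (chain; no direct X -> Y edge)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyfit(c, a, 1)[0] * c
    rb = b - np.polyfit(c, b, 1)[0] * c
    return np.corrcoef(ra, rb)[0, 1]

corr_xy = np.corrcoef(x, y)[0, 1]   # nonzero: marginal test keeps the X-Y edge
pcorr_xy = partial_corr(x, y, z)    # ≈ 0: X ⊥ Y | Z, so PC removes the X-Y edge
print(corr_xy, pcorr_xy)
```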

6.2 Score-Based Methods

Search for the DAG that maximizes a scoring function:

\[ \text{Score}(G) = \sum_{i} \text{Score}(X_i | \text{Pa}_G(X_i)) \]

Common scores: BIC, BGe, MDL. GES (Greedy Equivalence Search) is the canonical algorithm in this family.

6.3 Functional Model-Based Methods

Assume specific functional forms to identify causal direction:

  • LiNGAM (Linear Non-Gaussian Acyclic Model): linear equations with non-Gaussian noise make the direction identifiable
  • ANM (Additive Noise Model): \(Y = f(X) + \epsilon\); if the residual is independent of the input in the direction \(X \to Y\) but not in the reverse fit, conclude \(X \to Y\)
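A rough sketch of the ANM idea on synthetic data. The squared-residual correlation below is an assumed crude proxy for a proper independence test such as HSIC, and the polynomial regressor is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(-2, 2, n)
y = x ** 3 + rng.uniform(-1, 1, n)   # true model: X -> Y with additive noise

def noise_dependence(cause, effect, deg=5):
    """Fit effect ~ poly(cause); measure how strongly the squared residuals
    vary with |cause| (a crude proxy for an independence test like HSIC)."""
    resid = effect - np.polyval(np.polyfit(cause, effect, deg), cause)
    return abs(np.corrcoef(resid ** 2, np.abs(cause))[0, 1])

fwd = noise_dependence(x, y)   # X -> Y: residual is independent of X
bwd = noise_dependence(y, x)   # Y -> X: residual clearly depends on Y
print(fwd, bwd)                # fwd ≈ 0, bwd substantially larger: pick X -> Y
```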

7. Integration of Causal Inference and Machine Learning

| Application | Method | Role |
|---|---|---|
| Fairness | Causal fairness | Identify discriminatory pathways, not just statistical bias |
| Explainability | Counterfactual explanations | "If feature X were different, would the prediction change?" |
| Distribution shift | Causal invariance | Keep predictions stable across changing environments |
| Recommender systems | Deconfounding | Separate user preference from exposure bias |
| Reinforcement learning | Causal world models | More efficient planning and transfer |

References

  • "Causality" - Judea Pearl
  • "The Book of Why" - Judea Pearl & Dana Mackenzie
  • "Elements of Causal Inference" - Peters, Janzing, Schölkopf
  • "Causal Inference: What If" - Hernán & Robins
  • DoWhy / EconML Documentation
