Causal Machine Learning

Introduction

Traditional machine learning captures correlations in data, while causal inference asks "why" -- identifying causal relationships between variables. Causal inference is the critical step from "prediction" to "intervention" and "counterfactual reasoning."

Related content: Causal Inference


1. Causation vs Correlation

1.1 Simpson's Paradox

A classic example showing how correlation can mislead causal judgment:

| Group | Drug Group Recovery | Control Group Recovery | Conclusion |
|---|---|---|---|
| Male | 93% | 87% | Drug effective |
| Female | 73% | 69% | Drug effective |
| Overall | 78% | 83% | Drug ineffective? |

Reason: women were more likely to take the drug, and women had lower baseline recovery rates -- gender is a confounder.
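The rates above do not pin down group sizes; one set of counts that reproduces them (the numbers from Pearl's well-known version of this example, used here as an assumed illustration) makes the reversal explicit:

```python
# (recovered, total) per (sex, group); counts reproduce the rates in the table
groups = {
    ("male",   "drug"):    (81,  87),
    ("male",   "control"): (234, 270),
    ("female", "drug"):    (192, 263),
    ("female", "control"): (55,  80),
}

def rate(pairs):
    recovered = sum(r for r, n in pairs)
    total = sum(n for r, n in pairs)
    return recovered / total

for sex in ("male", "female"):
    d = rate([groups[(sex, "drug")]])
    c = rate([groups[(sex, "control")]])
    print(f"{sex}: drug {d:.0%} vs control {c:.0%}")   # drug wins in each stratum

drug_overall = rate([v for k, v in groups.items() if k[1] == "drug"])
control_overall = rate([v for k, v in groups.items() if k[1] == "control"])
# aggregation reverses the comparison: 78% vs 83%
print(f"overall: drug {drug_overall:.0%} vs control {control_overall:.0%}")
```

Because women (lower baseline recovery) were over-represented in the drug group, pooling mixes the strata in different proportions and flips the sign.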

1.2 Pearl's Ladder of Causation

| Level | Question Type | Mathematical Tool | Example |
|---|---|---|---|
| 1. Association | What is observed? | \(p(Y \mid X)\) | Is recovery higher among those who took the drug? |
| 2. Intervention | What if I do it? | \(p(Y \mid do(X))\) | If everyone takes the drug, will recovery improve? |
| 3. Counterfactual | What if things had been different? | \(p(Y_x \mid X', Y')\) | Would this patient have recovered without the drug? |

2. Structural Causal Models (SCM)

2.1 Definition

A structural causal model consists of a triple \((U, V, F)\):

  • \(U\): exogenous variables (not explained by other variables in the model)
  • \(V\): endogenous variables (determined by structural equations)
  • \(F\): set of structural equations \(v_i = f_i(\text{pa}_i, u_i)\)

2.2 Example

U_X → X ──→ Y ←── U_Y
      │     ↑
      └──→ Z ←── U_Z

Structural equations:
  X = f_X(U_X)
  Z = f_Z(X, U_Z)
  Y = f_Y(X, Z, U_Y)

An SCM simultaneously encodes:

  • Observational distribution \(p(X, Y, Z)\)
  • Interventional distribution \(p(Y | do(X=x))\)
  • Counterfactuals \(Y_{X=x}\)
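A minimal simulation of this SCM, assuming linear structural equations and standard-normal exogenous noise (the coefficients 0.8, 1.0, 0.5 are arbitrary choices); intervening means replacing the equation for \(X\) with a constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_x=None):
    """Sample the SCM; do_x replaces X's structural equation (the do-operator)."""
    u_x, u_z, u_y = rng.normal(size=(3, n))
    x = u_x if do_x is None else np.full(n, do_x)
    z = 0.8 * x + u_z             # Z = f_Z(X, U_Z), assumed linear
    y = 1.0 * x + 0.5 * z + u_y   # Y = f_Y(X, Z, U_Y), assumed linear
    return x, z, y

x, z, y = sample()                # observational distribution p(X, Y, Z)
_, _, y_do1 = sample(do_x=1.0)    # interventional distribution under do(X = 1)
print(y_do1.mean())               # ≈ 1.0*1 + 0.5*0.8*1 = 1.4
```

The same three objects the SCM encodes correspond to the three calls: sampling gives the observational distribution, `do_x` gives the interventional one, and rerunning with the same exogenous draws but a different `do_x` would give counterfactuals.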

3. Causal Graphs (DAG)

3.1 Basic Causal Graph Structures

Chain:     X → Z → Y    (Z is a mediator)
Fork:      X ← Z → Y    (Z is a confounder)
Collider:  X → Z ← Y    (Z is a collider)

d-separation: determines whether two variables are conditionally independent given certain other variables.

Rules:

  • Chain \(X \to Z \to Y\): conditioning on \(Z\), \(X \perp Y | Z\)
  • Fork \(X \leftarrow Z \to Y\): conditioning on \(Z\), \(X \perp Y | Z\)
  • Collider \(X \to Z \leftarrow Y\): without conditioning on \(Z\), \(X \perp Y\); conditioning on \(Z\), \(X \not\perp Y | Z\)
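The collider rule is the least intuitive one; a small simulation (assumed linear collider with Gaussian noise) shows two independent variables becoming dependent once we condition on their common effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)              # X and Y are independent by construction
z = x + y + rng.normal(size=n)      # collider: X -> Z <- Y

corr_marginal = np.corrcoef(x, y)[0, 1]
selected = z > 1.0                  # conditioning on (selecting by) the collider
corr_selected = np.corrcoef(x[selected], y[selected])[0, 1]
print(corr_marginal)                # ≈ 0: marginally independent
print(corr_selected)                # clearly negative: dependence induced
```

This is the "explaining away" effect: within the selected group, a large \(X\) makes a large \(Y\) less necessary to account for the high \(Z\).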

4. do-Calculus

4.1 The do Operator

\(do(X = x)\) represents an intervention -- forcefully setting \(X = x\) and severing all causal arrows pointing into \(X\).

\[ p(Y | do(X = x)) \neq p(Y | X = x) \]

Observational conditioning \(p(Y|X=x)\): the distribution of \(Y\) when \(X=x\) is observed (may be confounded)

Intervention \(p(Y|do(X=x))\): the distribution of \(Y\) after actively setting \(X=x\) (confounding eliminated)

4.2 Backdoor Criterion

A set of variables \(Z\) satisfies the backdoor criterion (relative to \(X \to Y\)) if:

  1. \(Z\) blocks all backdoor paths from \(X\) to \(Y\) (paths that begin with an arrow pointing into \(X\))
  2. \(Z\) contains no descendant of \(X\)

When the backdoor criterion is satisfied:

\[ p(Y | do(X = x)) = \sum_z p(Y | X = x, Z = z) \, p(Z = z) \]
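The adjustment formula can be checked on synthetic binary data (an assumed toy model with a single binary confounder and a true effect of 0.2), comparing the confounded contrast with the backdoor-adjusted one:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.binomial(1, 0.5, n)                          # confounder Z
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))      # Z -> X
y = rng.binomial(1, 0.3 + 0.2 * x + 0.4 * z)         # X -> Y, Z -> Y (true effect 0.2)

def p1(arr, mask):
    """Empirical P(arr = 1) within mask."""
    return arr[mask].mean()

naive = p1(y, x == 1) - p1(y, x == 0)                # confounded contrast
adjusted = sum(                                      # backdoor adjustment over Z
    (p1(y, (x == 1) & (z == v)) - p1(y, (x == 0) & (z == v))) * (z == v).mean()
    for v in (0, 1)
)
print(naive, adjusted)   # naive ≈ 0.44 (biased); adjusted ≈ 0.20 (true effect)
```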

4.3 Frontdoor Criterion

When confounders cannot be directly controlled, if a mediator variable \(M\) exists:

\[ X \to M \to Y, \quad U \to X, \quad U \to Y \]

Frontdoor adjustment formula:

\[ p(Y | do(X = x)) = \sum_m p(M = m | X = x) \sum_{x'} p(Y | X = x', M = m) p(X = x') \]
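On a toy binary model with an *unobserved* confounder \(U\) (all parameters below are assumed for illustration; the true interventional effect works out to 0.21), the frontdoor formula recovers the effect from purely observational quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
u = rng.binomial(1, 0.5, n)                    # unobserved confounder
x = rng.binomial(1, 0.2 + 0.6 * u)             # U -> X
m = rng.binomial(1, 0.1 + 0.7 * x)             # X -> M (the only path X -> Y)
y = rng.binomial(1, 0.2 + 0.3 * m + 0.4 * u)   # M -> Y, U -> Y

def pr(mask):
    return mask.mean()

def front_door(x_val):
    total = 0.0
    for m_val in (0, 1):
        p_m = pr(m[x == x_val] == m_val)                 # p(M = m | X = x)
        inner = sum(                                     # sum over x'
            pr(y[(x == xp) & (m == m_val)] == 1) * pr(x == xp)
            for xp in (0, 1)
        )
        total += p_m * inner
    return total

naive = pr(y[x == 1] == 1) - pr(y[x == 0] == 1)
effect = front_door(1) - front_door(0)
print(naive, effect)   # naive ≈ 0.45 (confounded); frontdoor ≈ 0.21 (true effect)
```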

5. Causal Effect Estimation

5.1 Treatment Effects

| Concept | Definition | Meaning |
|---|---|---|
| ATE | \(\mathbb{E}[Y(1) - Y(0)]\) | Average Treatment Effect |
| ATT | \(\mathbb{E}[Y(1) - Y(0) \mid T=1]\) | Average Treatment Effect on the Treated |
| CATE | \(\mathbb{E}[Y(1) - Y(0) \mid X=x]\) | Conditional Average Treatment Effect |
| ITE | \(Y_i(1) - Y_i(0)\) | Individual Treatment Effect (unobservable) |

The fundamental problem: counterfactuals are unobservable -- we cannot simultaneously observe the same individual with and without treatment.

5.2 Estimation Methods

| Method | Applicable Conditions | Approach |
|---|---|---|
| Randomized Experiment (RCT) | Random assignment possible | Gold standard; direct comparison |
| Propensity Score Matching | Ignorability assumption | Match treated and untreated individuals with similar covariates |
| Inverse Probability Weighting (IPW) | Ignorability assumption | Weight by inverse treatment probability |
| Instrumental Variables (IV) | A valid instrument exists | Use exogenous variation for identification |
| Regression Discontinuity (RDD) | Treatment assigned at a threshold | Near-random assignment at the cutoff |
| Difference-in-Differences (DID) | Panel data | Combine pre-post and between-group differences |
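Of these, IPW is the easiest to sketch. A minimal synthetic example with a *known* propensity score (an assumed simplification; in practice the propensity is estimated, e.g. by logistic regression of \(T\) on the confounders):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                          # observed confounder
p_t = 1.0 / (1.0 + np.exp(-z))                  # true propensity score (logistic in Z)
t = rng.binomial(1, p_t)
y = 2.0 * t + 1.5 * z + rng.normal(size=n)      # true ATE = 2.0

naive = y[t == 1].mean() - y[t == 0].mean()     # confounded by Z
ate_ipw = np.mean(t * y / p_t) - np.mean((1 - t) * y / (1 - p_t))
print(naive, ate_ipw)   # naive is biased upward; IPW recovers ≈ 2.0
```

Reweighting by \(1/p(T \mid Z)\) creates a pseudo-population in which treatment is independent of the confounder, which is exactly what randomization buys in an RCT.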

5.3 Double Machine Learning

Proposed by Chernozhukov et al. (2018), combining machine learning with causal inference:

\[ \hat{\tau} = \frac{\sum_{i=1}^{n} \hat{T}_i^{res} \, \hat{Y}_i^{res}}{\sum_{i=1}^{n} \bigl(\hat{T}_i^{res}\bigr)^2} \]

Steps (here \(T\) is the treatment and \(X\) the covariates/controls):

  1. Fit an ML model \(Y \sim X\), obtain residuals \(\hat{Y}^{res}\)
  2. Fit an ML model \(T \sim X\), obtain residuals \(\hat{T}^{res}\)
  3. Regress \(\hat{Y}^{res}\) on \(\hat{T}^{res}\) to estimate the causal effect
  4. Use cross-fitting to avoid overfitting bias

from econml.dml import DML
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

dml = DML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingRegressor(),
    model_final=LinearRegression()
)
dml.fit(Y, T, X=X, W=W)
ate = dml.ate(X)
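The four steps can also be sketched from scratch. The binned-mean regressor below is an assumed stand-in for an arbitrary ML model; the final estimate is the residual-on-residual regression coefficient, computed with two-fold cross-fitting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                      # covariate
g = np.sin(3 * x)                           # nonlinear confounding signal
t = g + rng.normal(size=n)                  # treatment depends on X
y = 2.0 * t + g + rng.normal(size=n)        # true effect tau = 2.0

def fit_predict(x_tr, v_tr, x_te, bins=50):
    """Binned-mean regressor: a crude stand-in for any ML model."""
    edges = np.quantile(x_tr, np.linspace(0, 1, bins + 1))
    b_tr = np.clip(np.searchsorted(edges, x_tr) - 1, 0, bins - 1)
    b_te = np.clip(np.searchsorted(edges, x_te) - 1, 0, bins - 1)
    means = np.array([v_tr[b_tr == b].mean() for b in range(bins)])
    return means[b_te]

# Cross-fitting: residualize each half using models fit on the other half.
half = n // 2
y_res = np.empty(n)
t_res = np.empty(n)
for tr, te in [(slice(0, half), slice(half, n)), (slice(half, n), slice(0, half))]:
    y_res[te] = y[te] - fit_predict(x[tr], y[tr], x[te])
    t_res[te] = t[te] - fit_predict(x[tr], t[tr], x[te])

tau_hat = (t_res @ y_res) / (t_res @ t_res)   # residual-on-residual OLS
print(tau_hat)   # ≈ 2.0
```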

6. Causal Discovery

Automatically discovering causal structure (learning causal graphs) from data.

6.1 Constraint-Based Methods

| Algorithm | Approach | Output |
|---|---|---|
| PC | Conditional independence tests | CPDAG (partial DAG) |
| FCI | CI tests; allows latent confounders | PAG |
| GES | Greedy score-based search (see 6.2; not constraint-based) | CPDAG |

PC algorithm flow:

1. Start with a fully connected undirected graph
2. Conditional independence test: if X ⊥ Y | Z, remove X-Y edge
3. Orient collider structures: X → Z ← Y
4. Propagate orientations (avoid new colliders and cycles)
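Step 2 can be sketched on a simulated chain, using partial correlation as the conditional-independence test (an assumed choice for illustration; real implementations use proper CI tests such as Fisher's z or kernel tests):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
z = x + rng.normal(size=n)      # X -> Z
y = z + rng.normal(size=n)      # Z -> Y  (chain; no direct X -> Y edge)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyfit(c, a, 1)[0] * c
    rb = b - np.polyfit(c, b, 1)[0] * c
    return np.corrcoef(ra, rb)[0, 1]

corr_xy = np.corrcoef(x, y)[0, 1]   # nonzero: marginal test keeps the X-Y edge
pcorr_xy = partial_corr(x, y, z)    # ≈ 0: X ⊥ Y | Z, so PC removes the X-Y edge
print(corr_xy, pcorr_xy)
```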

6.2 Score-Based Methods

Search for the DAG that maximizes a scoring function:

\[ \text{Score}(G) = \sum_{i} \text{Score}(X_i | \text{Pa}_G(X_i)) \]

Common scores: BIC, BGe, MDL. GES (Greedy Equivalence Search) is the canonical algorithm in this family.

6.3 Functional Model-Based Methods

Assume specific functional forms to identify causal direction:

  • LiNGAM (Linear Non-Gaussian Acyclic Model): linear equations with non-Gaussian noise make the direction identifiable
  • ANM (Additive Noise Model): \(Y = f(X) + \epsilon\); if the residual is independent of the input in the direction \(X \to Y\) but not in the reverse fit, conclude \(X \to Y\)
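A rough sketch of the ANM idea on synthetic data. The squared-residual correlation below is an assumed crude proxy for a proper independence test such as HSIC, and the polynomial regressor is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(-2, 2, n)
y = x ** 3 + rng.uniform(-1, 1, n)   # true model: X -> Y with additive noise

def noise_dependence(cause, effect, deg=5):
    """Fit effect ~ poly(cause); measure how strongly the squared residuals
    vary with |cause| (a crude proxy for an independence test like HSIC)."""
    resid = effect - np.polyval(np.polyfit(cause, effect, deg), cause)
    return abs(np.corrcoef(resid ** 2, np.abs(cause))[0, 1])

fwd = noise_dependence(x, y)   # X -> Y: residual is independent of X
bwd = noise_dependence(y, x)   # Y -> X: residual clearly depends on Y
print(fwd, bwd)                # fwd ≈ 0, bwd substantially larger: pick X -> Y
```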

7. Integration of Causal Inference and Machine Learning

| Application | Method | Role |
|---|---|---|
| Fairness | Causal fairness | Identify discriminatory pathways, not just statistical bias |
| Explainability | Counterfactual explanations | "If feature X were different, would the prediction change?" |
| Distribution shift | Causal invariance | Keep predictions stable across changing environments |
| Recommender systems | Deconfounding | Separate user preference from exposure bias |
| Reinforcement learning | Causal world models | More efficient planning and transfer |

References

  • "Causality" - Judea Pearl
  • "The Book of Why" - Judea Pearl & Dana Mackenzie
  • "Elements of Causal Inference" - Peters, Janzing, Schölkopf
  • "Causal Inference: What If" - Hernán & Robins
  • DoWhy / EconML Documentation
