Causal Machine Learning
Introduction
Traditional machine learning captures correlations in data, while causal inference asks "why" -- identifying cause-and-effect relationships between variables. It is the critical step from prediction to intervention and counterfactual reasoning.
Related content: Causal Inference
1. Causation vs Correlation
1.1 Simpson's Paradox
A classic example showing how correlation can mislead causal judgment:
| Group | Drug Group Recovery | Control Group Recovery | Conclusion |
|---|---|---|---|
| Male | 93% | 87% | Drug effective |
| Female | 73% | 69% | Drug effective |
| Overall | 78% | 83% | Drug ineffective? |
Reason: women were more likely to take the drug, and women had lower baseline recovery rates -- gender is a confounder.
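The reversal is easy to reproduce numerically. A minimal pandas sketch, using hypothetical counts chosen to match the table's percentages:

```python
import pandas as pd

# Hypothetical counts consistent with the table above.
rows = [
    # group,  n_drug, rec_drug, n_ctrl, rec_ctrl
    ("male",      87,       81,    270,      234),
    ("female",   263,      192,     80,       55),
]
df = pd.DataFrame(rows, columns=["group", "n_drug", "rec_drug", "n_ctrl", "rec_ctrl"])

# Per-group recovery rates: the drug looks better in BOTH groups.
df["rate_drug"] = df["rec_drug"] / df["n_drug"]   # 0.93, 0.73
df["rate_ctrl"] = df["rec_ctrl"] / df["n_ctrl"]   # 0.87, 0.69
print(df[["group", "rate_drug", "rate_ctrl"]])

# Pooled rates reverse the ordering: women (lower baseline recovery)
# are over-represented in the drug group -- gender confounds.
pooled_drug = df["rec_drug"].sum() / df["n_drug"].sum()   # ~0.78
pooled_ctrl = df["rec_ctrl"].sum() / df["n_ctrl"].sum()   # ~0.83
print(f"pooled: drug={pooled_drug:.2f}, control={pooled_ctrl:.2f}")
```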
1.2 Pearl's Ladder of Causation
| Level | Question Type | Mathematical Tool | Example |
|---|---|---|---|
| 1. Association | What is observed? | \(p(Y\|X)\) | Is recovery higher among those who took the drug? |
| 2. Intervention | What if I do it? | \(p(Y\|do(X))\) | If everyone takes the drug, will recovery improve? |
| 3. Counterfactual | What if things had been different? | \(p(Y_x\|X=x', Y=y')\) | Would this patient have recovered without the drug? |
2. Structural Causal Models (SCM)
2.1 Definition
A structural causal model consists of a triple \((U, V, F)\):
- \(U\): exogenous variables (not explained by other variables in the model)
- \(V\): endogenous variables (determined by structural equations)
- \(F\): set of structural equations \(v_i = f_i(\text{pa}_i, u_i)\)
2.2 Example
U_X → X → Z ← U_Z
       ↘  ↓
          Y ← U_Y
Structural equations:
- \(X = f_X(U_X)\)
- \(Z = f_Z(X, U_Z)\)
- \(Y = f_Y(X, Z, U_Y)\)
An SCM simultaneously encodes:
- Observational distribution \(p(X, Y, Z)\)
- Interventional distribution \(p(Y | do(X=x))\)
- Counterfactuals \(Y_{X=x}\)
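A minimal NumPy sketch of this SCM, with illustrative linear equations and standard-normal exogenous noise; the \(do\) operation is implemented by replacing \(X\)'s structural equation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate(do_x=None):
    """Sample from the SCM; do_x replaces X's structural equation with X := do_x."""
    u_x, u_z, u_y = rng.normal(size=(3, n))
    x = u_x if do_x is None else np.full(n, float(do_x))  # X = f_X(U_X)
    z = 0.8 * x + u_z                                     # Z = f_Z(X, U_Z)
    y = 0.5 * x + 0.7 * z + u_y                           # Y = f_Y(X, Z, U_Y)
    return x, z, y

x, z, y = simulate()
print("E[Y | X≈1]     :", y[np.abs(x - 1) < 0.05].mean())  # observational slice
_, _, y_do = simulate(do_x=1)
print("E[Y | do(X=1)] :", y_do.mean())                     # ≈ 0.5 + 0.7*0.8 = 1.06
# The two agree here because X has no causes besides U_X (no backdoor path);
# Section 4 shows them diverging once a confounder is present.
```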
3. Causal Graphs (DAG)
3.1 Basic Causal Graph Structures
Chain: X → Z → Y (Z is a mediator)
Fork: X ← Z → Y (Z is a confounder)
Collider: X → Z ← Y (Z is a collider)
d-separation: determines whether two variables are conditionally independent given certain other variables.
Rules:
- Chain \(X \to Z \to Y\): conditioning on \(Z\), \(X \perp Y | Z\)
- Fork \(X \leftarrow Z \to Y\): conditioning on \(Z\), \(X \perp Y | Z\)
- Collider \(X \to Z \leftarrow Y\): without conditioning on \(Z\), \(X \perp Y\); conditioning on \(Z\) (or any descendant of \(Z\)) opens the path: \(X \not\perp Y | Z\)
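The collider rule is the least intuitive one, so a short simulation helps. In this sketch (coefficients illustrative), \(X\) and \(Y\) are generated independently, yet selecting on their common effect \(Z\) induces a strong association:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = rng.normal(size=n)                    # X and Y generated independently
z = x + y + 0.1 * rng.normal(size=n)      # collider: X -> Z <- Y

print(np.corrcoef(x, y)[0, 1])            # ≈ 0: X ⊥ Y marginally

sel = np.abs(z) < 0.1                     # "conditioning" by slicing Z near 0
print(np.corrcoef(x[sel], y[sel])[0, 1])  # strongly negative: X ⊥̸ Y | Z
```

Conditioning on \(z \approx 0\) forces \(x \approx -y\); this selection effect is exactly the mechanism behind collider (Berkson) bias.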
4. do-Calculus
4.1 The do Operator
\(do(X = x)\) represents an intervention -- forcefully setting \(X = x\) and severing all causal arrows pointing into \(X\).
- Observational conditioning \(p(Y|X=x)\): the distribution of \(Y\) among units where \(X=x\) happens to be observed (may be confounded)
- Intervention \(p(Y|do(X=x))\): the distribution of \(Y\) after actively setting \(X=x\) (confounding through the parents of \(X\) is eliminated)
4.2 Backdoor Criterion
A set of variables \(Z\) satisfies the backdoor criterion (relative to \(X \to Y\)) if:
- \(Z\) blocks every backdoor path from \(X\) to \(Y\), i.e., every path that starts with an arrow pointing into \(X\) (of the form \(X \leftarrow \dots\))
- \(Z\) contains no descendant of \(X\)
When the backdoor criterion is satisfied, the effect is identified by the backdoor adjustment formula: \(p(Y | do(X=x)) = \sum_z p(Y | X=x, Z=z)\, p(Z=z)\)
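A simulation sketch with an illustrative binary data-generating process: the naive contrast is confounded, while the backdoor-adjusted estimate recovers the true effect of +0.2:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500_000

# Illustrative DGP: Z confounds X -> Y; the true effect of X on Y is +0.2.
z = rng.binomial(1, 0.5, n)                    # confounder
x = rng.binomial(1, 0.2 + 0.6 * z)             # Z -> X
y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)   # X -> Y <- Z
df = pd.DataFrame({"x": x, "y": y, "z": z})

def p_y_do(xv):
    # Backdoor adjustment: p(y=1 | do(X=xv)) = sum_z p(y=1 | xv, z) p(z)
    return sum(
        df[(df.x == xv) & (df.z == zv)].y.mean() * (df.z == zv).mean()
        for zv in (0, 1)
    )

naive = df[df.x == 1].y.mean() - df[df.x == 0].y.mean()
print(f"naive contrast: {naive:.3f}")                  # biased upward by Z
print(f"adjusted ATE  : {p_y_do(1) - p_y_do(0):.3f}")  # ~0.20
```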
4.3 Frontdoor Criterion
When confounders cannot be directly controlled, the effect is still identifiable if a mediator variable \(M\) satisfies: (1) \(M\) intercepts all directed paths from \(X\) to \(Y\); (2) there is no unblocked backdoor path from \(X\) to \(M\); (3) all backdoor paths from \(M\) to \(Y\) are blocked by \(X\).
Frontdoor adjustment formula: \(p(y | do(x)) = \sum_m p(m | x) \sum_{x'} p(y | x', m)\, p(x')\)
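A sketch of the frontdoor estimate on simulated binary data, where the confounder \(U\) is deliberately left out of the observed data frame (all coefficients illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500_000

u = rng.binomial(1, 0.5, n)                    # UNOBSERVED confounder
x = rng.binomial(1, 0.2 + 0.6 * u)             # U -> X
m = rng.binomial(1, 0.1 + 0.8 * x)             # X -> M   (mediator)
y = rng.binomial(1, 0.1 + 0.5 * m + 0.3 * u)   # M -> Y <- U
df = pd.DataFrame({"x": x, "m": m, "y": y})    # note: u is not observed

p_x = df.x.value_counts(normalize=True)

def p_y_do(xv):
    # Frontdoor: p(y=1 | do(x)) = sum_m p(m|x) * sum_x' p(y=1 | x', m) p(x')
    total = 0.0
    for mv in (0, 1):
        p_m_given_x = df[df.x == xv].m.eq(mv).mean()
        inner = sum(df[(df.x == xp) & (df.m == mv)].y.mean() * p_x[xp]
                    for xp in (0, 1))
        total += p_m_given_x * inner
    return total

# Ground truth from the DGP: p(y=1 | do(x)) = 0.1 + 0.5*(0.1 + 0.8x) + 0.3*0.5
print(f"do(X=0): {p_y_do(0):.3f} (truth 0.30)")
print(f"do(X=1): {p_y_do(1):.3f} (truth 0.70)")
```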
5. Causal Effect Estimation
5.1 Treatment Effects
| Concept | Definition | Meaning |
|---|---|---|
| ATE | \(\mathbb{E}[Y(1) - Y(0)]\) | Average Treatment Effect |
| ATT | \(\mathbb{E}[Y(1) - Y(0) \| T=1]\) | Average Treatment Effect on the Treated |
| CATE | \(\mathbb{E}[Y(1) - Y(0) \| X=x]\) | Conditional Average Treatment Effect |
| ITE | \(Y_i(1) - Y_i(0)\) | Individual Treatment Effect (unobservable) |
The fundamental problem: counterfactuals are unobservable -- we cannot simultaneously observe the same individual with and without treatment.
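A short simulation makes the problem concrete: both potential outcomes can be generated only because we control the data-generating process; real data reveal one per unit:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Both potential outcomes exist only because we control the simulation.
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + rng.normal(1.0, 0.5, n)        # heterogeneous ITEs, mean 1.0

t = rng.binomial(1, 0.5, n)              # randomized treatment assignment
y_obs = np.where(t == 1, y1, y0)         # only one outcome is ever observed

print("true ATE      :", round((y1 - y0).mean(), 3))
print("RCT difference:", round(y_obs[t == 1].mean() - y_obs[t == 0].mean(), 3))
# The ITEs y1[i] - y0[i] cannot be recovered from (t, y_obs) for any individual.
```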
5.2 Estimation Methods
| Method | Applicable Conditions | Approach |
|---|---|---|
| Randomized Experiment (RCT) | Random assignment possible | Gold standard, direct comparison |
| Propensity Score Matching | Ignorability assumption | Match similar individuals |
| Inverse Probability Weighting (IPW) | Ignorability assumption | Weight by treatment probability |
| Instrumental Variables (IV) | Instrumental variable exists | Use exogenous variable for identification |
| Regression Discontinuity (RDD) | Treatment has a threshold | Near-random at the threshold |
| Difference-in-Differences (DID) | Panel data | Pre-post and between-group differences |
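As a concrete instance of one row of this table, a minimal IPW sketch (illustrative data-generating process, propensity model fitted with scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=(n, 2))                                  # observed confounders
t = rng.binomial(1, 1 / (1 + np.exp(-(x[:, 0] - x[:, 1]))))  # confounded treatment
y = 2.0 * t + x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # true ATE = 2

# Step 1: fit the propensity score e(x) = p(T=1 | X)
e = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# Step 2: inverse-probability-weighted ATE estimator
ate_ipw = np.mean(t * y / e - (1 - t) * y / (1 - e))
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"naive: {naive:.2f}, IPW: {ate_ipw:.2f}")             # IPW ≈ 2.0
```

Weighting each unit by the inverse probability of the treatment it actually received creates a pseudo-population in which treatment is independent of the observed confounders.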
5.3 Double Machine Learning
Double machine learning, proposed by Chernozhukov et al. (2018), combines flexible machine-learning models with valid causal effect estimation:
Steps:
- Fit ML model \(Y \sim X\), obtain residuals \(\hat{Y}^{res}\)
- Fit ML model \(T \sim X\), obtain residuals \(\hat{T}^{res}\)
- Estimate causal effect on the residuals
- Use cross-fitting to avoid overfitting bias
A sketch with EconML (the arrays Y, T, X, W are assumed to be defined elsewhere):

```python
from econml.dml import DML
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression  # missing import in many snippets

dml = DML(
    model_y=GradientBoostingRegressor(),   # nuisance model for E[Y | X, W]
    model_t=GradientBoostingRegressor(),   # nuisance model for E[T | X, W]
    model_final=LinearRegression(fit_intercept=False),  # effect model on residuals
)
dml.fit(Y, T, X=X, W=W)   # Y: outcome, T: treatment, X: effect modifiers, W: controls
ate = dml.ate(X)          # average treatment effect over the sample
```
6. Causal Discovery
Automatically discovering causal structure (learning causal graphs) from data.
6.1 Constraint-Based Methods
| Algorithm | Approach | Output |
|---|---|---|
| PC Algorithm | Conditional independence tests | CPDAG (completed partially directed acyclic graph) |
| FCI | Independence tests allowing latent confounders | PAG |
PC algorithm flow:
1. Start with a fully connected undirected graph
2. Conditional independence tests: if X ⊥ Y | Z for some conditioning set Z, remove the X-Y edge
3. Orient collider structures: X → Z ← Y
4. Propagate orientations (avoid new colliders and cycles)
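A minimal sketch of the skeleton phase (steps 1-2) on linear-Gaussian data, using Fisher-z partial-correlation tests; full implementations such as causal-learn or pcalg add the orientation phases:

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(6)
n = 5_000
x = rng.normal(size=n)
z = x + rng.normal(size=n)          # true graph is the chain X -> Z -> Y
y = z + rng.normal(size=n)
data = np.column_stack([x, z, y])
names = ["X", "Z", "Y"]

def independent(i, j, cond, alpha=0.01):
    """Fisher-z test of data[:,i] ⊥ data[:,j] | data[:,cond] via partial correlation."""
    prec = np.linalg.inv(np.corrcoef(data[:, [i, j, *cond]].T))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    fz = np.arctanh(r) * np.sqrt(n - len(cond) - 3)
    return 2 * stats.norm.sf(abs(fz)) > alpha   # True -> cannot reject independence

# Skeleton phase: start fully connected, drop an edge whenever some
# conditioning set makes the endpoints independent.
edges = set(combinations(range(3), 2))
for i, j in sorted(edges):
    others = [k for k in range(3) if k not in (i, j)]
    if any(independent(i, j, c) for c in [()] + [(k,) for k in others]):
        edges.discard((i, j))
        print(f"removed edge {names[i]}-{names[j]}")
print("skeleton:", [f"{names[i]}-{names[j]}" for i, j in edges])  # X-Z, Z-Y
```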
6.2 Score-Based Methods
Search for the DAG (or its Markov equivalence class) that maximizes a scoring function: \(\hat{G} = \arg\max_G \text{Score}(G; D)\)
Common scores: BIC, BGe, MDL. The classic algorithm here is GES (Greedy Equivalence Search), which greedily inserts and deletes edges over equivalence classes to maximize the score.
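A toy sketch of scoring, assuming linear-Gaussian node models. Data are generated from the collider \(X \to Z \leftarrow Y\), which is not Markov-equivalent to the chain \(X \to Z \to Y\), so the BIC score separates them:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)      # true DAG: collider X -> Z <- Y
data = {"X": x, "Y": y, "Z": z}

def node_bic(child, parents):
    """BIC term for one node under a linear-Gaussian conditional model."""
    target = data[child]
    A = np.column_stack([data[p] for p in parents] + [np.ones(n)])
    resid = target - A @ np.linalg.lstsq(A, target, rcond=None)[0]
    loglik = -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)
    return loglik - 0.5 * (A.shape[1] + 1) * np.log(n)  # penalize coefs + variance

def dag_bic(dag):                    # dag maps node -> list of parents
    return sum(node_bic(c, ps) for c, ps in dag.items())

collider = {"X": [], "Y": [], "Z": ["X", "Y"]}   # X -> Z <- Y
chain    = {"X": [], "Z": ["X"], "Y": ["Z"]}     # X -> Z -> Y
print(f"BIC collider: {dag_bic(collider):.1f}")  # higher score wins
print(f"BIC chain   : {dag_bic(chain):.1f}")
```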
6.3 Functional Model-Based Methods
Assume specific functional forms to identify causal direction:
- LiNGAM: Linear Non-Gaussian Acyclic Model
- ANM (Additive Noise Model): if \(Y = f(X) + \epsilon\) fits with \(\epsilon \perp X\), while no additive-noise model fits in the reverse direction, infer \(X \to Y\)
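A minimal sketch of the ANM direction test: fit a flexible regression in each direction and compare the dependence between residual and regressor (mutual information here stands in for the HSIC independence test used in the literature):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(8)
n = 3_000
x = rng.uniform(-2, 2, n)
y = np.tanh(2 * x) + 0.2 * rng.normal(size=n)   # ANM: Y = f(X) + noise

def residual_dependence(cause, effect):
    """Fit effect ≈ g(cause), return MI between the residual and the cause."""
    g = GradientBoostingRegressor().fit(cause.reshape(-1, 1), effect)
    resid = effect - g.predict(cause.reshape(-1, 1))
    return mutual_info_regression(cause.reshape(-1, 1), resid, random_state=0)[0]

# The causal direction leaves (near-)independent residuals.
print("X -> Y:", residual_dependence(x, y))   # ≈ 0
print("Y -> X:", residual_dependence(y, x))   # clearly > 0, so infer X -> Y
```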
7. Integration of Causal Inference and Machine Learning
| Application | Method | Role |
|---|---|---|
| Fairness | Causal fairness | Identify discriminatory pathways, not just statistical bias |
| Explainability | Counterfactual explanations | "If feature X were different, would the prediction change?" |
| Distribution shift | Causal invariance | Maintain prediction stability across changing environments |
| Recommender systems | Deconfounding | Distinguish user preferences from exposure bias |
| Reinforcement learning | Causal world models | More efficient planning and transfer |
References
- "Causality" - Judea Pearl
- "The Book of Why" - Judea Pearl & Dana Mackenzie
- "Elements of Causal Inference" - Peters, Janzing, Schölkopf
- "Causal Inference: What If" - Hernan & Robins
- DoWhy / EconML Documentation