Causal Report
Instructor
Prof. M. R. Srinivasan
Professor, Chennai Mathematical Institute
mrsvasan@[Link]
This work is based on a “Guided Project” in which I read existing and recent literature
in the field of causal inference, modelling, and learning.
My most sincere gratitude goes to Professor M. R. Srinivasan for his unwavering guidance
and unparalleled support throughout the course.
Contents
2 Randomized Experiments 15
2.1 Selection Bias: Conceptual Foundations . . . . . . . . . . . . . . . . . . . 16
2.2 RCT: Assumptions and Theorem . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Sampling Framework for an RCT . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Point Estimation of Group Means . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Inference for the Average Treatment Effect . . . . . . . . . . . . . . . . . 18
2.6 The Delta Method and Relative Effectiveness . . . . . . . . . . . . . . . . 19
2.7 Covariate Adjustment and Treatment Effect . . . . . . . . . . . . . . . . 20
2.8 Limitations of RCTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 Conclusion & Suggested Readings . . . . . . . . . . . . . . . . . . . . . . 22
3 Double ML 23
3.1 Frisch–Waugh–Lovell (FWL) Partialling-Out . . . . . . . . . . . . . . . . 24
3.2 Neyman Orthogonality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Asymptotic Theory: √n-Normality of α̂ . . . . . . . . . . . . . . . . . . 27
3.4 Conditional Convergence in Growth Economics . . . . . . . . . . . . . . 28
3.5 Beyond a Single Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Alternative Orthogonal Methods . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Practical Checklist for Empirical Implementation . . . . . . . . . . . . . 31
3.8 Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.9 Suggested Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Graphical Models 41
5.1 General DAG and SEM via an Example . . . . . . . . . . . . . . . . . . 41
5.2 Conditional Ignorability and Exogeneity . . . . . . . . . . . . . . . . . . 43
5.3 DAGs, SEMs, and d-Separation . . . . . . . . . . . . . . . . . . . . . . . 45
5.4 Intervention, Counterfactual DAGs, and SWIGs . . . . . . . . . . . . . . 48
5.5 The Backdoor Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.6 Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.7 Suggested Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Chapter 1
Introduction to Causal Inference
1. These potential outcomes are well-defined only under the Stable Unit Treatment
Value Assumption (SUTVA):
8 CHAPTER 1. INTRODUCTION TO CAUSAL INFERENCE
Y_i = Y_i^{A_i}.
Because, for each individual, we can only ever observe at most one of the two potential
outcomes, the pair (Y_i^0, Y_i^1) is never jointly observed, making ∆_i intrinsically
unidentifiable. This is not merely a statistical limitation but a logical one, often
referred to as the fundamental problem of causal inference.
Example. Zeus undergoes a heart transplant (A = 1) and dies on the fifth post-operative
day (Y = 1). Consequently we observe the treated potential outcome
Y_Zeus^1 = 1,
while the untreated potential outcome Y_Zeus^0 is fundamentally unobservable. Any statement
about what would have happened to Zeus without transplantation must rely on data
from other patients who did not receive a transplant, together with assumptions (e.g.
conditional ignorability) that make Zeus “exchangeable” with those patients. In this way,
information is borrowed from comparable untreated individuals to draw causal conclusions
about Zeus’s counterfactual outcome.
which answers, “What is the average effect among those who actually received
treatment?”
RD = P (Y 1 = 1) − P (Y 0 = 1).
This absolute scale aligns with notions such as the “number needed to treat” (NNT =
1/RD).
P(Y = 1 | A = a), the observed conditional risk, versus P(Y^a = 1), the counterfactual (marginal) risk.
P(Y = 1 | A = a) = P(Y^a = 1)
holds only under strong conditions, notably unconfoundedness, also known as condi-
tional exchangeability, where treatment assignment is independent of potential outcomes.
Such conditions are naturally satisfied in randomized trials but are rarely guaranteed
in observational studies, making direct equivalence of these risks uncommon in practice.
Careful methods of adjustment, conditioning, or weighting are typically required to make
valid causal inferences from observational data.
Aspirin: Let’s take the example of aspirin treatment (A) and the chance of dying (Y )
among heart patients. Patients who are already at high risk are more likely to be given
aspirin. Because these patients are already more likely to die, we might see a higher death
rate among those who got aspirin: P (Y = 1 | A = 1) > P (Y = 1 | A = 0). At first glance,
this could wrongly make it seem like aspirin increases the risk of death. But that’s not
true—it’s just that sicker patients were more likely to get the drug. If we adjust for how
sick the patients were at the start (using something like a "risk score"), or if we randomly
assign who gets aspirin and who doesn’t, we find the true story: aspirin actually helps.
After accounting for this confounding, we see that aspirin lowers the risk of death—it has
a protective, causal effect.
Thus, raw associations in data are often misleading. Without appropriate analytical
methods or experimental designs that explicitly control for confounding, we risk drawing
incorrect conclusions. Robust causal inference requires thoughtfully severing all spurious
pathways and carefully distinguishing genuine causal effects from mere correlations.
1.4. RCT, EXCHANGEABILITY, AND CONSISTENCY 11
Y^a ⊥⊥ A,
meaning that the potential outcomes are independent of treatment assignment. This
condition, combined with two other key assumptions—consistency,
Y = Y^A,
which links the observed outcome to the potential outcome corresponding to the treatment
actually received, and positivity, 0 < P(A = a) < 1 for every treatment level a,
forms the standard identification triad. These assumptions allow us to estimate causal
effects from observed data in an RCT setting.
Blinding and Concealment: Blinding protects the study from differential behavior
or measurement that could arise if participants or researchers know the treatment assign-
ment. For example, caregivers might unconsciously deliver co-interventions differently, or
outcome assessors might introduce bias. Allocation concealment ensures that treatment
assignment is not known at the point of enrollment, preventing selection bias.
Thus, randomization is not just about flipping a fair coin—it involves a bundle of
practices designed to maintain the integrity of causal claims.
This indicates that sicker patients are more likely to be assigned to the treatment group,
making the overall (crude) comparison misleading. For instance, the crude mortality rates
are:
Treated: 7/13 vs Untreated: 3/7,
suggesting a higher risk of death in the treated group. However, when we stratify by L,
the picture changes.
Within each stratum, the treatment effect disappears—mortality risk is identical. After
standardization (i.e., averaging the stratum-specific risks according to the distribution
of L), we find that the marginal (overall) risk ratio is actually 1. This illustrates how
conditioning on L can reveal the true causal story hidden beneath the crude associations.
1.6. CROSSOVER EXPERIMENTS AND PERIOD EFFECTS 13
Standard Error Implications. Ignoring the blocking structure in analysis can lead to:
1. Inflated standard error estimates, as the precision gains from stratification are lost;
2. Missed detection of effect modification, where the treatment effect might vary across
strata;
Positivity and Truncation. The success of IPW depends crucially on the positivity
assumption, which requires that 0 < π(l) = P(A = 1 | L = l) < 1 for all covariate values l.
That is, every individual must have a non-zero probability of receiving either treatment
or control, regardless of their covariates L. In practice, some values of π(Li ) may be
very close to 0 or 1, leading to extremely large weights and inflated standard errors. To
address this, analysts often perform weight truncation or weight stabilization, for
example, by capping weights at the 1st and 99th percentiles. This introduces a small
amount of bias but substantially reduces the variance and leads to more reliable inference.
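As a concrete illustration, the weighting-plus-truncation recipe can be sketched as follows. The simulated data, the logistic propensity, and the helper `ipw_ate` are all invented for this example; only the 1st/99th-percentile capping mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: L confounds both treatment A and outcome Y.
L = rng.normal(size=n)
pi = 1 / (1 + np.exp(-1.5 * L))             # true propensity P(A = 1 | L)
A = rng.binomial(1, pi)
Y = 1.0 * A + 2.0 * L + rng.normal(size=n)  # true causal effect = 1

w = A / pi + (1 - A) / (1 - pi)             # raw inverse-probability weights

# Truncation: cap weights at the 1st and 99th percentiles.
lo, hi = np.percentile(w, [1, 99])
w_trunc = np.clip(w, lo, hi)

def ipw_ate(y, a, weights):
    """Weighted (normalized) difference in means between arms."""
    treated = np.sum(weights * a * y) / np.sum(weights * a)
    control = np.sum(weights * (1 - a) * y) / np.sum(weights * (1 - a))
    return treated - control

print(ipw_ate(Y, A, w))        # unbiased but potentially noisy
print(ipw_ate(Y, A, w_trunc))  # slightly biased, lower variance
```

Comparing the two estimates on repeated simulations makes the bias-variance trade-off of truncation visible.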
A more serious problem arises with structural positivity violations, where certain
strata of L have no treated or no control observations. In such cases, it is fundamentally
impossible to estimate treatment effects in those regions of the covariate space. The
solution is to redefine the causal question and restrict attention to the population where
treatment comparisons are possible—“where the data support the effect.” We will see
more in Chapter 4.
Sandwich Variance and Bootstrap. Since the propensity score π(L) is typically
estimated from data (e.g., via logistic regression or machine learning), the resulting
weights wi are random variables and introduce additional uncertainty. To account for this,
standard variance formulas must be adjusted. Two common methods for robust inference
are the sandwich (robust) variance estimator and the nonparametric bootstrap.
Chapter 2
Randomized Experiments
Y (d) = η0 + η1 d (2.1)
Here, η0 is a person-specific intercept capturing their baseline health endowment (e.g.,
genetics, early childhood environment), and η1 reflects the biological effect of marijuana
smoking on longevity. For simplicity and didactic purposes, we assume η1 = 0, implying
that marijuana smoking has no causal effect on lifespan. In other words, the only differences
in observed lifespans will arise due to differences in η0 , not due to the treatment itself.
Now suppose that smoking behavior is not randomly assigned but arises from an
individual’s latent propensity or predisposition to smoke. This is modeled as:
Y (1) = Y (0) = η0 ,
and hence the observed outcome is simply:
Y = Y (D) = η0 .
However, in observational data, we do not observe η0 or ν directly. The only observed
data are the realized pairs (Y, D), where D is the observed treatment (smoking status),
and Y is the observed lifespan.
Apparent Health Gap. Even though smoking has no causal effect, selection into
smoking based on unobserved health characteristics distorts the observed comparison
of average lifespans between smokers and non-smokers. We can compute the expected
outcome (lifespan) for each group:
This toy example illustrates the fundamental problem of causal inference from obser-
vational data. The observed difference in means:
The inequality reflects the non-independence of D and Y (d). Let ∆bias denote
Among these, RCTs are the cleanest because they modify the data-generating process
itself rather than adjusting after the fact.
D ⊥⊥ (Y(1), Y(0)).
Additionally, the positivity assumption ensures that both treatment arms are repre-
sented in the data:
0 < P (D = 1) < 1.
π = E[Y | D = 1] − E[Y | D = 0] = δ,
where π is the observed average difference between treated and control groups, and δ is
the average treatment effect (ATE). Thus, randomization ensures that selection bias
is eliminated by design.
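The claim can be checked in a quick simulation of the chapter's toy model with η1 = 0, so the true effect is zero; the lifespan distribution and the selection mechanism below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy model from the text: Y(1) = Y(0) = eta0, so the true effect is zero.
eta0 = rng.normal(70, 5, size=n)          # baseline lifespan endowment

# Observational world: less healthy people (low eta0) smoke more often.
p_smoke = 1 / (1 + np.exp(0.3 * (eta0 - 70)))
D_obs = rng.binomial(1, p_smoke)
gap_obs = eta0[D_obs == 1].mean() - eta0[D_obs == 0].mean()

# Experimental world: D assigned by a fair coin, independent of eta0.
D_rct = rng.binomial(1, 0.5, size=n)
gap_rct = eta0[D_rct == 1].mean() - eta0[D_rct == 0].mean()

print(gap_obs)  # clearly negative: apparent "harm" despite no causal effect
print(gap_rct)  # near zero: randomization removes the selection bias
```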
δ = θ1 − θ0 ,
δ̂ = θ̂1 − θ̂0 ,
where θ̂d is the sample mean of Y in treatment group d, as discussed previously. Assuming
finite variances within each treatment arm, the multivariate Central Limit Theorem (CLT)
applies:
√n ( θ̂0 − θ0 , θ̂1 − θ1 )⊤ →d N( (0, 0)⊤ , diag( σ0²/p0 , σ1²/p1 ) ),
where:
1. σd² = Var(Y | D = d) is the conditional variance,
2. pd = P(D = d) is the marginal probability of treatment arm d.
Applying the delta method (specifically for linear combinations), we derive the asymptotic
distribution of δ̂:
√n ( δ̂ − δ ) →d N( 0, σ0²/p0 + σ1²/p1 ).
A large-sample (asymptotic) 100(1 − α)% Wald confidence interval for δ is given by:
δ̂ ± zα/2 √( σ̂0²/n0 + σ̂1²/n1 ),
where zα/2 is the standard normal quantile corresponding to confidence level 1 − α, and
nd is the number of observations in group d.
This framework provides valid inference under random sampling and large-sample
approximations. In small samples or complex designs, bootstrap or robust methods may
be preferred.
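A minimal sketch of this Wald interval on simulated two-arm data (arm sizes, means, and variances are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-arm trial; all numbers are illustrative.
n0, n1 = 400, 600
y0 = rng.normal(10.0, 3.0, size=n0)   # control arm outcomes
y1 = rng.normal(12.0, 4.0, size=n1)   # treated arm outcomes

delta_hat = y1.mean() - y0.mean()
se = np.sqrt(y0.var(ddof=1) / n0 + y1.var(ddof=1) / n1)

z = 1.96                              # z_{alpha/2} for a 95% interval
ci = (delta_hat - z * se, delta_hat + z * se)
print(delta_hat, ci)
```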
Use Cases: This approach allows us to construct standard errors and confidence intervals
for multiplicative effect measures such as:
1. Benefit-Cost Ratios in economics,
This quantity captures how the causal effect of treatment varies across individuals or
subgroups characterized by different values of W . In essence, it generalizes the average
treatment effect (ATE) to a conditional level, allowing for treatment effect heterogeneity.
Under the assumption of random treatment assignment (e.g., in a randomized controlled
trial), the treatment indicator D ∈ {0, 1} is independent of potential outcomes conditional
on W . That is,
Y (1), Y (0) ⊥⊥ D | W.
This implies the equality:
π(W ) = δ(W ).
That is, the difference in conditional means from observed data correctly identifies the
causal effect for covariate strata.
Covariate adjustment is a powerful tool for improving efficiency and exploring het-
erogeneity in treatment effects. However, proper handling of missing covariate data is
essential to maintain the validity of causal conclusions.
Cost and power. Field trials can be logistically daunting and expensive, especially for
rare outcomes requiring huge n for adequate power. Cluster randomization eases logistics
but inflates variance via intraclass correlation.
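The variance inflation from intraclass correlation is commonly summarized by the design effect 1 + (m − 1)ρ for clusters of size m; a tiny helper makes the cost concrete (the m and ρ values below are illustrative):

```python
def design_effect(m, rho):
    """Variance inflation from randomizing clusters of size m
    with intraclass correlation rho (standard design-effect formula)."""
    return 1 + (m - 1) * rho

# Even a small ICC is costly with large clusters:
print(design_effect(50, 0.02))  # 1.98: effective sample size nearly halved
print(design_effect(1, 0.02))   # 1.0: individual randomization, no inflation
```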
External validity. An RCT estimates the ATE for its sampling frame, yet effects may
attenuate or amplify when scaled up. Transportability analysis confronts such “all else
equal” fallacies.
3. Imbens & Rubin (2015). Causal Inference for Statistics, Social, and Biomedical
Sciences. Cambridge UP.
Chapter 3
Double ML
However, Lasso introduces a critical trade-off. The shrinkage bias it induces pulls estimated
coefficients toward zero, which invalidates standard inferential tools such as t-tests and
confidence intervals derived from OLS theory. This creates a fundamental tension: how
can we retain the predictive power and sparsity benefits of Lasso while also conducting
valid statistical inference on key parameters—such as the effect of a treatment or policy
variable—amid high-dimensional confounding? Answering this question leads us toward
double selection methods, debiased Lasso, and double machine learning, which
build on Lasso’s strengths while correcting for its limitations.
Y = αD + β ⊤ W + ε, (3.1)
where:
1. Y is the outcome variable of interest (e.g., economic growth, wage, blood pressure,
etc.),
Ỹ = αD̃ + ε.
By removing the linear effects of W from both the outcome Y and the regressor of interest
D, we isolate the component of D that is uncorrelated with the confounders W , allowing
us to estimate its effect on Y without bias from omitted variables. When the number of
controls p becomes large relative to the sample size n, classical OLS techniques break down.
The matrix (W ⊤ W )−1 becomes ill-conditioned or non-invertible, and even if inversion
is numerically feasible, the variance of α̂ can explode due to overfitting. As a result,
standard OLS-based residualization becomes unreliable or impossible.
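Before turning to the high-dimensional case, the FWL identity itself is easy to verify numerically. The sketch below (simulated data, plain OLS, no penalization) checks that the coefficient on D from the full regression equals the residual-on-residual coefficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 5

W = rng.normal(size=(n, p))
D = W @ rng.normal(size=p) + rng.normal(size=n)
Y = 1.5 * D + W @ rng.normal(size=p) + rng.normal(size=n)  # true alpha = 1.5

# Full regression of Y on (D, W, intercept): coefficient on D.
X = np.column_stack([D, W, np.ones(n)])
alpha_full = np.linalg.lstsq(X, Y, rcond=None)[0][0]

# Partialling out: residualize Y and D on (W, intercept), regress residuals.
Wc = np.column_stack([W, np.ones(n)])
def resid(v):
    return v - Wc @ np.linalg.lstsq(Wc, v, rcond=None)[0]
Y_t, D_t = resid(Y), resid(D)
alpha_fwl = (D_t @ Y_t) / (D_t @ D_t)

print(alpha_full, alpha_fwl)  # identical up to floating-point error
```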
The penalty loadings ψY j (often ψY j = 1/σ̂j ) scale the ℓ1 constraint so that units of
different regressors are comparable. Independently we fit a Lasso of D on W :
γ̂D = arg min_γ { Σi (Di − γ⊤Wi)² + λ2 Σj ψDj |γj| }.
Both λ1 and λ2 can be chosen via modified cross-validation or the theoretically motivated
formula λ ≍ σ √(2 log p / n). We then compute the residuals Y̌ and Ď by subtracting the
fitted Lasso predictions from Y and D.
These are the pieces of Y and D orthogonal, up to Lasso error, to the entire W space.
Regressing the residualized outcome on residualized treatment gives
α̂ = ( Σi Ďi Y̌i ) / ( Σi Ďi² ) = (Ď⊤Ď)⁻¹ (Ď⊤Y̌).
Intuition. The union of variables selected in Step 1 or Step 2 acts like double selection.
Any regressor important for predicting either Y or D is implicitly adjusted for. Conse-
quently the omitted variable bias of missing “weak but critical” controls is dramatically
reduced compared with “single” Lasso selection that screens only on Y .
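A minimal sketch of this residualize-then-regress procedure with Lasso first stages; the sparse design and the penalty value are invented for illustration (in practice, choose λ by cross-validation or the plug-in rule):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 500, 200                       # high-dimensional controls

W = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.0    # sparse truth: 5 relevant controls
gamma = np.zeros(p); gamma[:5] = 1.0
D = W @ gamma + rng.normal(size=n)
Y = 2.0 * D + W @ beta + rng.normal(size=n)   # true alpha = 2

lam = 0.15                            # illustrative penalty; tune in practice
Y_check = Y - Lasso(alpha=lam).fit(W, Y).predict(W)   # residualize Y on W
D_check = D - Lasso(alpha=lam).fit(W, D).predict(W)   # residualize D on W

alpha_hat = (D_check @ Y_check) / (D_check @ D_check)
print(alpha_hat)
```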
where
Ỹ (η) = Y − η1⊤ W, D̃(η) = D − η2⊤ W.
Here, η = (η1 , η2 ) represents a collection of nuisance parameters, and η o denotes their true
values. The property of Neyman orthogonality requires that the estimator α(η) be
locally insensitive to small perturbations in η, i.e.,
∂η α(η o ) = 0.
Proof. Since α(η) is implicitly defined via the moment equation M(α, η) = 0, the
implicit function theorem reduces the orthogonality claim ∂η α(η^o) = 0 to showing that
∂η M(α, η^o) = 0.
Recall that
M(α, η) = E[ (Ỹ(η) − αD̃(η)) D̃(η) ],
with
Ỹ (η) = Y − η1⊤ W, D̃(η) = D − η2⊤ W.
Differentiating with respect to η1, we get ∂η1 Ỹ(η) = −W, and hence
∂η1 M(α, η^o) = E[ −W · D̃(η^o) ] = 0,
since the population residual D̃(η^o) = D − η2^{o⊤} W is orthogonal to W. The derivative
with respect to η2 vanishes by an analogous argument.
3.3 Asymptotic Theory: √n-Normality of α̂
Under appropriate sparsity and moment assumptions, such as the condition that the
sparsity level s satisfies s log p ≪ n (Belloni, Chernozhukov, and Hansen, 2014), a
central limit theorem for the double machine learning estimator α̂ can be established.
Let D̃ denote the residual from the population-level projection of D on W , and let ε
denote the structural error from the partially linear model
Y = αD + β ⊤ W + ε, E[ε | D, W ] = 0.
A consistent plug-in estimator V̂ replaces the population expectations with their sample
analogues:
V̂ = ( (1/n) Σi D̃i² )^{-2} · ( (1/n) Σi D̃i² ε̂i² ),
where ε̂i = Yi − α̂ · Di − β̂ ⊤ Wi is the estimated residual from the final stage. Using this
variance estimator, a Wald-type 95% confidence interval for α is given by:
α̂ ± 1.96 √( V̂ / n ).
This interval has asymptotically correct coverage, with error rate converging to zero at
the standard O(n−1/2 ) rate.
A key strength of this double machine learning procedure is its robustness to imperfect
model selection. The validity of inference for α does not require perfect variable selection
in the high-dimensional regressions of Y and D on W . Instead, it suffices that the
Lasso-based nuisance estimators achieve a sufficiently accurate approximation of the best
s-sparse linear predictors. This flexibility is due to the Neyman orthogonality property,
which guarantees that small estimation errors in the first stage do not translate into
first-order biases in the second-stage estimate of the target parameter.
Growthi = α InitialGDPi + β⊤ Wi + εi ,
where:
1. Growthi is the GDP growth rate for country i over a fixed period (e.g., 10 years),
In low-dimensional settings, ordinary least squares (OLS) would be the default estima-
tor. However, when the number of controls p is comparable to or exceeds the sample size n,
OLS becomes unstable and prone to overfitting. Moreover, the inclusion of many irrelevant
controls inflates variance, resulting in wide confidence intervals and reduced statistical
power. Double Lasso, on the other hand, reduces variance inflation and improves inference
validity in high-dimensional settings.
The OLS estimate is close to zero and statistically insignificant, as its confidence interval
includes zero. This reflects the high noise-to-signal ratio in the presence of many controls
relative to sample size. In contrast, the Double Lasso estimate is larger in magnitude and
statistically significant at conventional levels. Its tighter confidence interval excludes zero,
offering empirical support for the hypothesis of conditional convergence.
The negative and significant estimate of α under Double Lasso suggests that, holding
constant various institutional and structural characteristics, countries with lower initial
GDP indeed tend to grow faster. This aligns with the predictions of the Solow growth
model.
1. Regress the outcome Y on the full set of covariates (D, W) using Lasso. Let
ŜY ⊆ {1, . . . , p} denote the selected variables (excluding D).
2. Regress the treatment variable D on the covariates W using Lasso. Let ŜD ⊆
{1, . . . , p} denote the selected variables.
The final model includes the treatment D and the union of selected controls Ŝ = ŜY ∪ ŜD.
An ordinary least squares (OLS) regression of Y on D and the selected variables W_Ŝ is
then performed to estimate the treatment effect.
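The two screening steps and the final OLS can be sketched as follows; the data-generating process is invented, and LassoCV is one plausible way to pick the penalties:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(5)
n, p = 300, 100
W = rng.normal(size=(n, p))
D = W[:, 0] + W[:, 1] + rng.normal(size=n)            # W0, W1 drive treatment
Y = 1.0 * D + W[:, 1] + W[:, 2] + rng.normal(size=n)  # W1, W2 drive outcome

# Step 1: Lasso of Y on (D, W); keep selected controls, excluding D itself.
coef_y = LassoCV(cv=5).fit(np.column_stack([D, W]), Y).coef_
S_Y = np.flatnonzero(coef_y[1:])
# Step 2: Lasso of D on W.
S_D = np.flatnonzero(LassoCV(cv=5).fit(W, D).coef_)

# Final OLS of Y on D and the union of selected controls.
S = np.union1d(S_Y, S_D)
alpha_hat = LinearRegression().fit(np.column_stack([D, W[:, S]]), Y).coef_[0]
print(sorted(S.tolist()), alpha_hat)
```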
Another method, the Debiased Lasso (also known as the Desparsified Lasso),
addresses the regularization bias inherent in standard Lasso estimates. It constructs an
estimator with an asymptotic linear expansion by correcting the shrinkage in the Lasso
estimate. The procedure is outlined below.
D̃(γ̂) = D − γ̂ ′ W.
where:
1. β̂Lasso is the Lasso estimate from regressing Y on the full covariate vector X,
2. Θ̂ is an estimate of the relevant row of the (pseudo-)inverse of the empirical Gram
matrix Σ̂ = X⊤X/n.
This adjustment removes the bias from penalization and restores asymptotic normality:
√n ( α̂ − α ) →d N(0, σ²),
even in high-dimensional settings where p ≫ n. Confidence intervals and hypothesis tests
constructed from this estimator are asymptotically valid.
In the special case of a pure randomized controlled trial (RCT), the treatment D is
independent of the covariates W by design:
D ⊥⊥ W.
Therefore, the naive OLS regression of Y on D yields an unbiased estimate of the average
treatment effect (ATE), even without adjusting for W . While orthogonal methods like
Double Lasso or Debiased Lasso are not necessary to ensure unbiasedness in this setting,
they can still be employed to increase estimation efficiency (i.e., reduce variance) and
improve precision of inference when W explains a substantial amount of variation in Y .
While randomization solves the identification problem, orthogonal adjustments still play
a role in optimizing statistical performance.
1. Run a Lasso regression of Y on D and W (with an appropriate penalty λ) to obtain
β̂.
2. Run a Lasso regression of D on W (with a suitable λ) to obtain γ̂.
3. Construct the residualized treatment:
D̃(γ̂) = D − γ̂ ′ W.
Equivalently, the debiased estimator for a target coefficient α has the form:
α̂ = α̂_Lasso + Θ̂⊤ (1/n) Σi=1..n Xi ( Yi − Xi⊤ β̂_Lasso ),
where:
1. β̂Lasso is the Lasso estimate from regressing Y on the full covariate vector X,
This correction term serves to "undo" the bias introduced by Lasso's penalization.
(a) 10-fold Cross-Validation: Splits the data into 10 parts, trains on 9 and validates
on 1, cycling through all folds.
(b) Theory-Driven Plug-in Method: Based on theoretical guarantees that balance
false inclusion and exclusion probabilities (e.g., Belloni et al. plug-in).
4. Diagnostic Plots.
Visual inspection of residuals and influential observations can prevent misinterpreta-
tion. Recommended diagnostics include:
5. Sensitivity Analysis.
Examine robustness of the estimated treatment effect α̂ by varying the penalty
parameter λ by ±25%. If results change dramatically, your model may be fragile or
overfit. Stable estimates across a penalty range provide greater empirical confidence.
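A sketch of this ±25% sensitivity loop around a baseline penalty (the data and the baseline λ = 0.1 are invented):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 200, 80
W = rng.normal(size=(n, p))
D = W[:, 0] + rng.normal(size=n)
Y = 1.0 * D + W[:, 0] + rng.normal(size=n)   # true alpha = 1

def partial_out_alpha(lam):
    """Double-Lasso point estimate of alpha at penalty lam."""
    Yt = Y - Lasso(alpha=lam).fit(W, Y).predict(W)
    Dt = D - Lasso(alpha=lam).fit(W, D).predict(W)
    return (Dt @ Yt) / (Dt @ Dt)

base = 0.1  # baseline penalty (hypothetical; use CV or plug-in in practice)
for lam in (0.75 * base, base, 1.25 * base):
    print(lam, partial_out_alpha(lam))
```

Estimates that barely move across the penalty range are the reassuring outcome described in the text.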
3.8 Notebooks
The Jupyter notebooks for the referenced experimentation are available here on GitHub.
2. Overlap (or positivity): This ensures that, for every covariate profile X, there
is a positive probability of observing both treated and untreated units. Formally,
0 < P (D = 1 | X) < 1 almost surely. Without this condition, comparisons across
treatment groups become ill-posed or undefined for certain strata.
When both of these conditions hold, researchers can recover the Average Treatment Effect
(ATE) using standard regression techniques or propensity-score reweighting—even without
the benefit of randomized experiments. What follows is a self-contained exposition of how
these tools work in practice.
34 CHAPTER 4. METRICS TO RECOVER CAUSAL EFFECTS
3. Y (1), Y (0): Potential outcomes—Y (1) is the outcome that would be observed if
the unit were treated, and Y (0) is the outcome if the unit were not treated. These
together define the response function d 7→ Y (d) for d ∈ {0, 1}.
D ⊥⊥ Y (d) | X.
and averaging over the covariate distribution yields the Average Treatment Effect
(ATE):
δ = E[δ(X)] = E[π(X)].
So under the assumptions of ignorability and overlap, causal effects can be identified
purely from observational data using regression or weighting methods that condition on
X, which we check next.
Suppose we posit the following linear model for the conditional expectation of the
outcome:
E[Y | D, X] = αD + β⊤X, (4.1)
where α is the treatment coefficient of interest. We then estimate the following linear
regression model via ordinary least squares (OLS):
Y = αD + β⊤ X + ε,
To account for heterogeneous treatment effects, we can extend the linear model to
allow the treatment effect to vary with covariates by introducing interaction terms between
D and a centered version of the covariates X. The modified model is:
Y = α0 + α1 D + β⊤X + γ⊤ D (X − E[X]) + ε,
where:
2. α1 captures the average treatment effect across all covariate strata (ATE),
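On simulated data, such an interacted specification can be sketched as follows (coefficients invented; covariates are centered before interacting, so the coefficient on D estimates the ATE):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 2))
D = rng.binomial(1, 0.5, size=n)       # randomized treatment for simplicity
# Heterogeneous effect: delta(X) = 1 + 0.5 * X[:, 0], so the ATE is about 1.
Y = (1 + 0.5 * X[:, 0]) * D + X @ np.array([1.0, -1.0]) + rng.normal(size=n)

Xc = X - X.mean(axis=0)                # centre covariates before interacting
design = np.column_stack([D, Xc, D[:, None] * Xc])
fit = LinearRegression().fit(design, Y)

ate_hat = fit.coef_[0]   # coefficient on D: the ATE
mod_hat = fit.coef_[3]   # coefficient on D * Xc[:, 0]: the effect modifier
print(ate_hat, mod_hat)
```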
The validity of this linear regression approach depends on the correctness of the
specified functional form. If the linear model in Equation (4.1) does not represent the
true conditional expectation function (CEF), then the resulting estimate α̂ may be biased.
When the functional form is in doubt or the relationship between Y , D, and X is highly
non-linear, it is advisable to turn to more flexible estimation techniques such as non-
parametric regression or modern machine learning methods (e.g., random forests, boosted
trees, neural networks). These approaches can better capture complex patterns in the
data and help avoid misspecification bias.
Averaging both sides over X yields the identification of the average potential outcome
E[Y (d)].
δ = E[Y · H].
This formulation ensures that each individual’s observed outcome Y is weighted inversely
by their probability of receiving the treatment they actually received, thus correcting for
the non-random assignment. Similarly, we define the Conditional Average Treatment
Effect (CATE) as
δ(X) = E[Y · H | X].
This formulation highlights how causal effects can be recovered both at the population
level and at the covariate level, solely via reweighting based on estimated or known
treatment probabilities.
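With a known propensity p(X), the reweighting identity can be sketched directly (the logistic propensity and effect size below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20_000
X = rng.normal(size=n)
pX = 1 / (1 + np.exp(-X))              # known propensity p(X)
D = rng.binomial(1, pX)
Y = 2.0 * D + X + rng.normal(size=n)   # true ATE = 2

# Horvitz–Thompson transform H; the identity delta = E[Y * H].
H = D / pX - (1 - D) / (1 - pX)
delta_hat = np.mean(Y * H)
print(delta_hat)
```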
1. One individual may have substantial work experience and little formal education.
Even though their propensity scores are the same, their background differences might
influence the outcome variable (e.g., wages) in distinct ways. This situation highlights
a key limitation of relying solely on propensity scores for identification. A solution lies in
Double Machine Learning. By combining both the treatment selection model and the outcome model,
we can more effectively de-noise the outcome variable. That is, we reduce residual variance
by accounting for detailed covariate patterns. This hybrid approach—central to double
machine learning—produces more efficient and precise estimates of the true treatment
effect by leveraging both the treatment selection mechanism and the outcome-generating
process.
E[H | X] = 0,
where H = 1{D=1}/p(X) − 1{D=0}/(1 − p(X)) is the Horvitz–Thompson transform. If this expectation
does not hold—i.e., if some function of X is predictive of H—then systematic differences
remain between the treatment and control groups even after reweighting. This implies a
violation of ignorability, as treatment assignment remains confounded.
2. Regress H on W .
If the regression shows predictive power (i.e., W significantly predicts H), then covariate
imbalance exists—suggesting that the treatment assignment was not successfully random-
ized or the propensity score model was misspecified. Covariate balance diagnostics are
essential validity checks for any causal analysis relying on ignorability. They are especially
critical in observational studies where treatment is not randomly assigned.
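The diagnostic can be sketched as follows: under the correct propensity model, W should have essentially no predictive power for H, while a misspecified model lets imbalance leak into H (both propensities below are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
n = 10_000
W = rng.normal(size=(n, 3))

# Correctly specified propensity: balance holds, R^2 near zero.
p_good = 1 / (1 + np.exp(-W[:, 0]))
D = rng.binomial(1, p_good)
H_good = D / p_good - (1 - D) / (1 - p_good)

# Misspecified propensity (a constant, ignoring W[:, 0]): imbalance remains.
p_bad = np.full(n, D.mean())
H_bad = D / p_bad - (1 - D) / (1 - p_bad)

r2_good = LinearRegression().fit(W, H_good).score(W, H_good)
r2_bad = LinearRegression().fit(W, H_bad).score(W, H_bad)
print(r2_good, r2_bad)  # near zero vs. clearly positive
```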
(Counterfactual diagram with nodes X, D, d, and Y(d).)
In this DAG:
1. The node X represents observed pre-treatment covariates (e.g., age, education).
2. The node D is the actual treatment assignment (e.g., whether the subject receives a
drug).
6. The edge X → Y (d) encodes that these same characteristics may also affect the
outcome.
7. The edge d → Y (d) reflects that the potential outcome depends on the treatment
level.
(DAG with nodes X, D, and Y.)
3. The outcome Y is then determined by both the assigned treatment D and baseline
covariates X.
Recommended Readings
1. Rosenbaum, P. R., & Rubin, D. B. (1983). “The Central Role of the Propensity
Score in Observational Studies for Causal Effects.” Biometrika, 70(1), 41–55.
2. Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and
Biomedical Sciences: An Introduction. Cambridge University Press.
4. Chernozhukov, V., Wüthrich, K., Kumar, M., Semenova, V., Yadlowsky, S., & Zhu,
Y. (2023). Applied Causal Inference Powered by Machine Learning and AI. Available
at: [Link]
Chapter 5
Graphical Models
Statistical data alone inform us about associations, but causal science asks what would
happen if we acted differently. Directed acyclic graphs (DAGs) encode qualitative
subject-matter knowledge—who can influence whom—while structural equation mod-
els (SEMs) supply quantitative functional relationships that generate the joint distribution.
Pearl (2000) showed that every (acyclic) SEM induces a DAG and, conversely, that the
graph together with independent disturbances suffices to reconstruct counterfactual out-
comes. We begin by exploring both linear and nonlinear, nonparametric formulations of
causal diagrams and their associated structural equation models (SEMs). These models
serve as powerful and flexible tools for uncovering the underlying structure necessary for
causal identification, enabling us to move beyond the confines of purely linear assumptions.
Within this framework, we define counterfactuals in a formal way, adhering to what
Judea Pearl describes as the "First Law of Causal Inference": every structural equation
model naturally induces a system of counterfactual outcomes.
(Causal diagram with nodes D, Y, X, and F.)
We now illustrate a causal diagram and its associated structural equation model (SEM)
using a real-world scenario: the impact of 401(k) eligibility on financial wealth. The
directed acyclic graph (DAG) below encodes the causal relationships among various
observed and unobserved variables involved in this context.
In the United States, a 401(k) plan is an employer-sponsored, defined-contribution,
personal pension (savings) account, as defined in section 401(k) of the U.S. Internal
Revenue Code. This causal graph represents the possible channels through which 401(k)
eligibility (D) may affect an individual’s net financial wealth (Y ). The interpretation of
each node is as follows:
6. U : General latent factors, which may include unmeasured personality traits or risk
preferences.
This DAG can be translated into a structural system of equations where each node is
generated as a function of its parent nodes and an associated noise term.
X = fX (U, εX ),
F = fF (U, εF ),
D = fD (X, F, εD ),
M = fM (D, X, F, εM ),
Y = fY (D, M, X, εY ).
Each equation specifies how a variable is generated as a function of its direct causes
(or "parents" in the DAG) and an associated idiosyncratic error term. For example,
the covariates X are influenced by a latent factor U , which could represent unobserved
personal attributes such as financial literacy or long-term planning ability, along with a
noise component εX . Similarly, the firm-level variable F is also influenced by U and its
own noise εF , indicating that some unobserved traits may jointly affect both worker- and
firm-level characteristics. The treatment assignment D, indicating 401(k) eligibility, is
determined by observed covariates X, firm-level factors F , and a residual εD capturing
individual-level randomness in eligibility. The employer’s matching contribution M
depends on D, X, and F , suggesting that employer contributions may vary based on both
employee and firm characteristics. Finally, the outcome of interest Y, net financial wealth,
is generated from the treatment D, the match M, the covariates X, and the noise εY.
5.2 Conditional Ignorability and Exogeneity
The connection between structural equation models (SEMs) and potential outcomes
is foundational in causal inference. The fact that an SEM implies the existence of
potential outcomes is sometimes called the First Law of Causal Inference. SEMs, or their
graphical representations as directed acyclic graphs (DAGs), encapsulate the contextual
and substantive knowledge about the causal relationships in a given problem. As a result,
they allow us to derive, rather than merely assume, important identification conditions
such as conditional ignorability.
For example, suppose we are interested in the effect of a binary treatment D on
an outcome Y , and we have observed covariates F and X. The SEM or DAG for the
problem may suggest that, after conditioning on F and X, the treatment assignment D is
independent of the potential outcome Y (d) for each value of d. Formally, this is written
as:
Y (d) ⊥⊥ D | F, X,
which implies the following equality of conditional expectations:
E[Y (d) | D = d, F, X] = E[Y (d) | F, X] = E[Y | D = d, F, X].
This property allows us to identify average causal (or treatment) effects by adjusting for
(i.e., conditioning on) F and X. Conditional ignorability (or exogeneity) can be justified
using both functional (structural) arguments based on SEMs and graphical arguments
based on d-separation and the backdoor criterion.
To provide a functional argument, consider the structural equations that define the
data-generating process. In the counterfactual (or potential outcome) setting where we
fix D = d, the relevant structural equations are:
Y (d) = fY (d, M (d), X, ϵY ),
M (d) = fM (d, F, X, ϵM ),
where M is a mediator and ϵY , ϵM are exogenous noise terms. The actual treatment
assignment is generated by
D = fD (F, X, U, ϵD ),
where U and ϵD are additional exogenous variables. Once we condition on F and X, the
distribution of Y (d) is determined solely by d, M (d), X, and their associated noise terms.
The realized value of D is not relevant for the distribution of Y (d) once F and X are
given. In other words, knowing D provides no additional information about Y (d) beyond
what is already known from F and X:
Y (d) ⊥⊥ D | F, X.
This structural argument demonstrates how the ignorability condition can be justified by
the functional relationships in the SEM.
The same conclusion can be reached using graphical criteria. In the counterfactual
DAG, the node Y (d) receives inputs from M (d), X, and the fixed value d. The treatment
variable D is still present in the graph and is generated by its usual parents (F , X, and
U ), but there is no direct arrow from D to Y (d).
Any path from D to Y (d) must pass through either F or X. For instance, typical
paths include:
1. D ← X → Y (d),
2. D ← F → M (d) → Y (d),
3. D ← F ← U → X → Y (d).
By conditioning on F and X, we block all such paths. In graphical terms, this is known
as d-separation: conditioning on a node severs the flow of information along any path
passing through that node. The Global Markov property tells us that d-separation implies
conditional independence in the induced distribution, allowing us to conclude that
Y (d) ⊥⊥ D | F, X.
The same conclusion also follows from the backdoor criterion: a set of variables Z is a
valid adjustment set for the effect of D on Y if
1. No element of Z is a descendant of D, and
2. Z blocks every backdoor path from D to Y (that is, every path that starts with an
arrow into D).
The first rule prevents us from blocking the causal effect of D on Y , while the second
ensures that all confounding paths are eliminated. In the context of the 401(k) example,
the relevant backdoor paths from D to Y run through F and X, such as:
1. D ← X → Y ,
2. D ← F → M → Y ,
3. D ← F ← U → X → Y .
By conditioning on both F and X, we block all such paths, ensuring that the observed
association between D and Y reflects only the causal effect of D on Y .
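The path taxonomy can be made concrete in a few lines of code. The sketch below enumerates all simple paths from D to Y in the 401(k) DAG (the edge list is reconstructed from the discussion above) and flags those that start with an arrow into D as backdoor paths; paths starting with an arrow out of D are not confounding paths.

```python
# Edge list for the 401(k) DAG, reconstructed from the surrounding text.
edges = [("U", "X"), ("U", "F"), ("X", "D"), ("F", "D"),
         ("D", "M"), ("X", "M"), ("F", "M"),
         ("D", "Y"), ("M", "Y"), ("X", "Y")]

def neighbors(node):
    """Yield (neighbor, direction); 'in' means the edge points into `node`."""
    for a, b in edges:
        if a == node:
            yield b, "out"
        if b == node:
            yield a, "in"

def simple_paths(node, target, visited):
    """Enumerate simple paths in the undirected skeleton of the DAG."""
    if node == target:
        yield [node]
        return
    for nxt, _ in neighbors(node):
        if nxt not in visited:
            for rest in simple_paths(nxt, target, visited | {nxt}):
                yield [node] + rest

all_paths = []
for p in simple_paths("D", "Y", {"D"}):
    first_dir = next(d for n, d in neighbors("D") if n == p[1])
    all_paths.append((p, "backdoor" if first_dir == "in" else "out of D"))

for p, tag in all_paths:
    print(" -> ".join(p), "|", tag)
```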
5.3 DAGs, SEMs, and d-Separation
We now formalize the graph-theoretic notions used informally above. For the nodes of a
DAG with vertex set V :
1. The parents of a node Xj , denoted P aj , are all nodes with directed edges pointing
into Xj :
P aj := {Xk : Xk → Xj }.
2. The children of Xj , denoted Chj , are all nodes that Xj points to:
Chj := {Xk : Xj → Xk }.
3. The ancestors of Xj , denoted Anj , are all nodes from which there exists a directed
path to Xj , including Xj itself:
Anj := {Xk : Xk → · · · → Xj } ∪ {Xj }.
4. The descendants of Xj , denoted Dsj , are all nodes that can be reached by a
directed path starting from Xj :
Dsj := {Xk : Xj → · · · → Xk }.
A DAG can be associated with an acyclic structural equation model (ASEM), which
formalizes how each variable is generated as a function of its parents and some exogenous
noise. For each node j ∈ V , the structural equation is:
Xj := fj (P aj , ϵj ),
where each ϵj is a random disturbance (exogenous variable), and the collection (ϵj )j∈V is
assumed to be jointly independent. A linear ASEM is a special case where the structural
equations are linear in the parents:
fj (P aj , ϵj ) := fj′ P aj + ϵj ,
with fj′ being a vector of coefficients. In linear ASEMs, the independence assumption
on the errors can be weakened to mere uncorrelatedness. The structural potential
response process for each variable describes how the value of Xj would respond to
arbitrary assignments of its parents:
Xj (paj ) := fj (paj , ϵj ).
A key property of ASEMs associated with DAGs is the Markov factorization, which
states that the joint distribution of all variables factorizes as:
p({xℓ }ℓ∈V ) = ∏ℓ∈V p(xℓ | paℓ ).
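The Markov factorization is exactly what ancestral (topological-order) sampling implements: drawing each node given its already-sampled parents produces a draw from the joint ∏ℓ p(xℓ | paℓ ). A minimal linear ASEM sampler, with illustrative coefficients over the chain-plus-edge DAG Z → X → Y , Z → Y , is:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear ASEM: parents and coefficients are assumptions.
parents = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}
coef    = {"Z": {}, "X": {"Z": 0.7}, "Y": {"X": 0.5, "Z": 0.3}}
order   = ["Z", "X", "Y"]   # a topological order of the DAG

def sample(n):
    vals = {}
    for node in order:
        eps = rng.normal(size=n)   # jointly independent disturbances
        vals[node] = eps + sum(coef[node][p] * vals[p] for p in parents[node])
    return vals

v = sample(100_000)
print(np.corrcoef(v["Z"], v["Y"])[0, 1])  # nonzero: Z is a parent of Y
```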
1. A directed path is a sequence of nodes connected by edges all pointing in the same
direction:
Xv1 → Xv2 → · · · → Xvm .
2. A collider on a path is a node whose two adjacent edges both point into it:
→ Xj ← .
A path π between two nodes is blocked by a set S if either:
1. π contains a non-collider (a chain i → m → j or a fork i ← m → j) whose middle
node m is in S, or
2. π contains a collider (i → m ← j), and neither m nor any of its descendants are in
S.
[Figure: two small example graphs, (a) and (b), over nodes Z, D, Y (with X in (a) and
C in (b)), illustrating blocked paths and colliders.]
If S blocks every path between X and Y , we say that X and Y are d-separated by S in
the graph G, written
(X ⊥⊥d Y |S)G .
By the foundational result of Pearl and Verma, d-separation implies conditional indepen-
dence in the probability distribution generated by the DAG:
X ⊥⊥ Y | S.
[Figure: (a) a graph in which Z and U are parents of X, U is a parent of Y , and Z → Y ;
(b) a graph in which Z → X, Z → Y , and X → U ← Y .]
To make these concepts concrete, consider the following two graphical structures.
In the first example, suppose the graph contains nodes Z, U , X, and Y , with edges such
that Z and U are parents of X, U is also a parent of Y , and there is a direct edge from Z
to Y . Here, the set S = {Z, U } d-separates X and Y , blocking all paths between them.
The Markov factorization yields:
p(z, u, x, y) = p(z) p(u) p(x | z, u) p(y | z, u),
implying that X ⊥⊥ Y | Z, U . In the second example, suppose the structure is such that Z
points to X, X and Y both point to U , and Z also points to Y . In this case, conditioning
on Z alone d-separates X and Y , and the factorization becomes:
p(z, x, y, u) = p(z) p(x | z) p(y | z) p(u | x, y),
which, marginalizing over the collider U , implies X ⊥⊥ Y | Z.
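d-separation can also be checked mechanically. The sketch below uses the standard moral-ancestral-graph equivalence: X and Y are d-separated by S iff, after deleting S, they are disconnected in the moralized subgraph over the ancestors of X ∪ Y ∪ S. It reproduces the conclusions of the two examples above.

```python
def d_separated(edges, x, y, s):
    """Check (X ⊥⊥_d Y | S) via the moral ancestral graph."""
    # 1. Ancestors of {x, y} ∪ S (including those nodes themselves).
    need, anc = {x, y} | set(s), set()
    while need:
        n = need.pop()
        if n not in anc:
            anc.add(n)
            need |= {a for a, b in edges if b == n}
    sub = [(a, b) for a, b in edges if a in anc and b in anc]
    # 2. Moralize: marry the parents of every node, then drop directions.
    und = {frozenset(e) for e in sub}
    for n in anc:
        pars = [a for a, b in sub if b == n]
        und |= {frozenset((p, q)) for p in pars for q in pars if p != q}
    # 3. Delete S and test whether y is still reachable from x.
    seen, stack = {x}, [x]
    while stack:
        cur = stack.pop()
        for e in und:
            if cur in e:
                (other,) = e - {cur}
                if other not in seen and other not in s:
                    seen.add(other)
                    stack.append(other)
    return y not in seen

# Example (a): Z and U are parents of X; Z and U are parents of Y.
g_a = [("Z", "X"), ("U", "X"), ("Z", "Y"), ("U", "Y")]
print(d_separated(g_a, "X", "Y", {"Z", "U"}))   # True

# Example (b): Z -> X, Z -> Y, and U is a collider: X -> U <- Y.
g_b = [("Z", "X"), ("Z", "Y"), ("X", "U"), ("Y", "U")]
print(d_separated(g_b, "X", "Y", {"Z"}))        # True
print(d_separated(g_b, "X", "Y", {"Z", "U"}))   # False: conditioning on the collider opens the path
```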
5.4 Intervention, Counterfactual DAGs, and SWIGs
Consider a causal system represented by a directed acyclic graph (DAG) and its
associated structural equation model (SEM). Each node in the DAG corresponds to a
variable, and the directed edges represent causal relationships. To analyze the effect of
an intervention—such as setting a treatment variable Xj to a specific value xj —we must
construct a new graphical object that reflects this hypothetical scenario. The intervention
fix(Xj = xj ) transforms the original DAG into a counterfactual DAG, known as a Single
World Intervention Graph (SWIG). The SWIG is constructed by a node-splitting operation:
1. The treatment node Xj is split into two nodes: a random node Xj∗ and a fixed
intervention node Xa∗ , which carries the assigned value xj .
2. The intervention node Xa∗ inherits only the outgoing edges from Xj (i.e., ẽai = eji
for all i) and has no incoming edges (ẽia = 0 for all i), reflecting that it is fixed by
the intervention.
3. The node Xj∗ inherits only the incoming edges from Xj (i.e., ẽij = eij for all i) and
has no outgoing edges (ẽji = 0 for all i), preserving its dependence on its original
causes.
4. All remaining edges are preserved: ẽik = eik for all i and for all k ̸= j, k ̸= a, ensuring
the rest of the graph structure remains intact.
The resulting SWIG, denoted G̃(xj ), contains a set of counterfactual variables {Xk∗ }k∈V ∪
{Xa∗ }, where each variable is now interpreted as a function of the intervention. The
counterfactual SEM (CF-ASEM) associated with the SWIG defines the structural equations
for the counterfactual variables:
Xk∗ := fk (P a∗k , ϵk ), Xa∗ := xj ,
where P a∗k denotes the parents of Xk∗ in the modified edge set, and ϵk are the exogenous
noise terms. This construction ensures that the counterfactual variables are generated
consistently with the intervention. To illustrate, consider the following diagrams. The left
figure shows the original DAG with variables and their causal relationships. The right
figure shows the corresponding SWIG after intervening to set variable Xj to value xj :
[Figure: the original DAG over X1 , . . . , X6 with treatment node Xj (left), and the SWIG
after fix(Xj = xj ), in which Xj is split into Xj∗ and the fixed node Xa∗ = xj (right).]
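The node-splitting operation is mechanical enough to code directly. The sketch below rewires an edge list according to steps 1–4; the node labels "D*" and "D_a*" are naming conventions chosen here, and the example DAG is a small assumed fragment.

```python
def swig_edges(edges, j):
    """Split node j into a random node j* (keeps incoming edges) and a
    fixed intervention node j_a* (keeps outgoing edges)."""
    out = []
    for a, b in edges:
        if b == j:
            out.append((a, j + "*"))        # incoming edges now point to Xj*
        elif a == j:
            out.append((j + "_a*", b))      # outgoing edges leave Xa* = xj
        else:
            out.append((a, b))              # everything else is preserved
    return out

dag = [("Z1", "X1"), ("X1", "D"), ("D", "M"), ("M", "Y"),
       ("X2", "D"), ("X2", "Y")]
print(swig_edges(dag, "D"))
```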
A central result is that the SWIG encodes the conditional independence relations
among counterfactual variables under the intervention. Specifically, suppose we relabel the
treatment node as D, and let Y be any descendant of D. Construct the SWIG induced
by fix(D = d), and let S be any subset of nodes common to both the original DAG and
the SWIG such that Y (d) is d-separated from D by S in the SWIG. Then:
Y (d) ⊥⊥ D | S,
or, equivalently,
E[g(Y (d)) | S = s, D = d] = E[g(Y (d)) | S = s]
for all s with p(s, d) > 0 and for all bounded functions g.
To further clarify, consider the following example inspired by Pearl’s work. The left
diagram shows the original DAG, and the right shows the SWIG after intervening to set
D = d:
[Figure: (a) the original DAG over Z1 , Z2 , X1 , X2 , X3 , M , D, and Y ; (b) the SWIG after
fix(D = d), in which D is split and its descendants M and Y are replaced by M (d) and
Y (d).]
In this example, the goal is to estimate the causal effect of D on Y , i.e., the mapping
d 7→ Y (d). The sets of variables
{X2 , X1 }, {X2 , X3 }, {X2 , Z1 }, {X2 , Z2 }
are valid adjustment sets that, when conditioned upon, block all backdoor paths between
D and Y (d) in the SWIG. Notably, conditioning on X2 alone is insufficient, as it opens a
path where X2 acts as a collider; adding X1 , X3 , Z1 , or Z2 blocks this path, ensuring the
required conditional ignorability.
As a further illustration, consider a simple DAG in which the treatment D causes both Z
and Y , together with its counterfactual version under the intervention fix(D = d):
Z ← D → Y,
Z(d) ← d → Y (d).
Here, both Z(d) and Y (d) are potential outcomes under the intervention D = d, and d
is a fixed value. In this counterfactual graph, there are no open paths from d to either
Z(d) or Y (d) except the direct arrows, and crucially, there is no path from Z(d) to Y (d)
that could induce spurious association. The key insight is that, under this counterfactual
representation, no adjustment is required to identify the causal effect of D on Y . Formally,
the empty set is a valid adjustment set:
Y (d) ⊥⊥ D.
This means that the average causal effect of D on Y can be identified without controlling
for any other variables—adjustment is unnecessary because there is no confounding to
remove. However, it is also true that Z is a valid control variable in the sense that
adjusting for Z does not introduce bias. This can be seen by considering a "cross-world"
DAG (below) that combines both factual and counterfactual variables from the respective
structural equation models.
[Figure: cross-world DAG combining the factual variables Z, D, Y (with exogenous errors
ϵz , ϵy ) and their counterfactual counterparts Z(d), d, Y (d), which share the same errors.]
In this cross-world graph, one can verify that
Y (d) ⊥⊥ D | Z
holds as well. Nevertheless, while Z is a valid control in the sense that it blocks no
necessary paths and does not introduce bias, it is also superfluous: adjusting for Z does
not improve identification and can in fact reduce the precision of our estimates. Including
unnecessary variables in the adjustment set can lead to less efficient estimators, as it
increases variance without reducing bias. This underscores the practical value of the
counterfactual DAG approach: it not only identifies all valid adjustment sets but also
helps to avoid overadjustment by highlighting when adjustment is unnecessary.
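The efficiency point can be seen in a small Monte Carlo (all coefficients below are assumptions). In the graph Z ← D → Y with no confounding, the estimator that needlessly adjusts for Z remains unbiased but is noticeably noisier, because Z is correlated with D.

```python
import numpy as np

rng = np.random.default_rng(3)

def one_draw(n=500, tau=1.0):
    """One simulated dataset from Z <- D -> Y; return both estimates of tau."""
    D = rng.integers(0, 2, size=n).astype(float)
    Z = 2.0 * D + rng.normal(size=n)
    Y = tau * D + rng.normal(size=n)
    unadj = np.column_stack([np.ones(n), D])
    adj = np.column_stack([np.ones(n), D, Z])
    b1 = np.linalg.lstsq(unadj, Y, rcond=None)[0][1]
    b2 = np.linalg.lstsq(adj, Y, rcond=None)[0][1]
    return b1, b2

draws = np.array([one_draw() for _ in range(2000)])
print("means:", draws.mean(axis=0))  # both close to tau = 1 (no bias)
print("stds: ", draws.std(axis=0))   # adjusting for Z gives the larger spread
```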
5.5 The Backdoor Criterion
As discussed before, a backdoor path is any path from the treatment D to the
outcome Y that starts with an arrow into D, indicating a potential confounding influence.
The backdoor criterion provides a graphical condition for identifying a set of variables
(an adjustment set) that, when conditioned upon, ensures the identification of the causal
effect of D on Y .
1. No element of S is a descendant of D.
2. S blocks every backdoor path from D to Y .
(i) D ← X2 → Y
(ii) D ← X1 ← Z1 → X2 ← Z2 → X3 → Y
In this DAG, there are two notable backdoor paths from D to Y as shown above.
To apply the backdoor criterion, we must block all such paths by conditioning on an
appropriate set S. Conditioning on X2 alone blocks the inner backdoor path (i), since
X2 is a non-collider on this path and conditioning on it blocks the flow of association.
However, conditioning on X2 alone actually opens the outer backdoor path (ii), because X2
acts as a collider along that path. In general, conditioning on a collider (or its descendant)
opens a path that would otherwise be blocked, potentially introducing bias. Therefore, to
block the outer backdoor path as well, we must also condition on an additional variable
Figure 5.1: DAG illustrating backdoor paths from D to Y . Red: direct/inner backdoor
path. Blue: longer/outer backdoor path.
that lies on that path but is not a descendant of D—for example, X1 , X3 , Z1 , or Z2 . Thus,
valid adjustment sets include:
{X2 , X1 }, {X2 , X3 }, {X2 , Z1 }, {X2 , Z2 }.
Each of these sets blocks all backdoor paths from D to Y and contains no descendants of
D.
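A linear-Gaussian simulation of this DAG (coefficients are assumptions) illustrates both the collider problem and the validity of the larger adjustment sets: adjusting for X2 alone leaves the D coefficient biased, while adding Z1 or Z2 recovers the true total effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

# Linear-Gaussian version of the Figure 5.1 DAG (all coefficients assumed).
Z1, Z2 = rng.normal(size=n), rng.normal(size=n)
X1 = Z1 + rng.normal(size=n)
X2 = Z1 + Z2 + rng.normal(size=n)        # X2 is a collider on the outer path
X3 = Z2 + rng.normal(size=n)
D  = X1 + X2 + rng.normal(size=n)
M  = D + rng.normal(size=n)
Y  = M + X2 + X3 + rng.normal(size=n)    # true total effect of D on Y is 1.0

def coef_on_D(*controls):
    """OLS coefficient on D when regressing Y on D and the given controls."""
    design = np.column_stack([np.ones(n), D, *controls])
    return np.linalg.lstsq(design, Y, rcond=None)[0][1]

print(round(coef_on_D(X2), 3))       # biased: X2 alone opens the outer path
print(round(coef_on_D(X2, Z1), 3))   # ~1.0: {X2, Z1} is a valid adjustment set
print(round(coef_on_D(X2, Z2), 3))   # ~1.0: {X2, Z2} is a valid adjustment set
```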
5.6 Notebook
The Jupyter notebook is available here on GitHub.