


13 Feb 2026
Person 1: I was late to work today because my alarm didn’t ring.
Person 2: Today, the 7:45 am bus did not come. Perhaps there was a driver shortage.
Person 3: My uncle got diagnosed with lung cancer because he was a chain smoker.
| Concept | Meaning |
|---|---|
| Association | A general connection between two variables, often based on presumption or informal observation. |
| Correlation | A statistical measure of the degree to which two variables move together. |
| Causation | A change in one variable directly produces a change in another; a cause-and-effect relationship. |
For every individual \(i\) there are two Potential Outcomes:
\[ Y_{1i} = \text{outcome if treated} \]
\[ Y_{0i} = \text{outcome if not treated} \]
where \(d_i\) is the treatment indicator (1 = treated, 0 = control).
But we only ever observe one of them:
\[ Y_i = d_i Y_{1i} + (1 - d_i) Y_{0i} \]
Fundamental Problem of Causal Inference:
We never observe both \(Y_1\) and \(Y_0\) for the same person at the same time.
| Name | \(d\) | \(Y_0\) | \(Y_1\) |
|---|---|---|---|
| Andy | 1 | . | 10 |
| Ben | 1 | . | 5 |
| Chad | 1 | . | 16 |
| Daniel | 1 | . | 3 |
| Edith | 0 | 5 | . |
| Frank | 0 | 7 | . |
| George | 0 | 8 | . |
| Hank | 0 | 10 | . |
Source: Cunningham (2021), Causal Inference: The Mixtape
The individual treatment effect is:
\[ \tau_i = Y_{1i} - Y_{0i} \]
But it is never observable for any single individual.
Therefore: causal inference focuses on averages:
\[ ATE = E[Y_1 - Y_0] \]
In data, we estimate it with the difference in means,
\[ \widehat{ATE} = E_n[Y \mid d = 1] - E_n[Y \mid d = 0], \]
which identifies the ATE when treatment is independent of the potential outcomes (as under randomization).
Causal inference compares the outcome we observe with an outcome we cannot observe, and statistical methods try to approximate the missing counterfactual.
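As a quick illustration, the difference in means can be computed directly from the table above. A minimal Python sketch, using the hypothetical outcomes from the table:

```python
import numpy as np

# Observed outcomes from the table: treated (d=1) and control (d=0)
y_treated = np.array([10, 5, 16, 3])   # Andy, Ben, Chad, Daniel
y_control = np.array([5, 7, 8, 10])    # Edith, Frank, George, Hank

# Difference-in-means estimate of E[Y | d=1] - E[Y | d=0]
ate_hat = y_treated.mean() - y_control.mean()
print(ate_hat)  # 8.5 - 7.5 = 1.0
```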
How to find the causal effect of treatment on the outcome?
“The issue of identification stemmed from the quest to know the attainability of economically meaningful relationships from statistical analysis of economic data” (Duo Qin, 1993).
Can we just compare two individuals (one treated and one non-treated)? No!
Confounding (Omitted Variable Bias): When variables affect both treatment and outcome, we can’t separate causation from correlation.
Selection Bias: The sample or treatment groups are non-randomly selected.
Reverse Causality: Outcome affects treatment rather than treatment affecting outcome.
Measurement Error: When treatment, outcome, or confounders are measured incorrectly.
Simultaneity: It occurs when treatment and outcome influence each other at the same time.
Model Misspecification: Even if the causal structure is right, the statistical model can fail due to a wrong functional form, omitted interactions, and/or ignored nonlinearity.
Sorting and Endogeneity: Sorting happens when individuals self-select into treatment in a way that is related to their potential outcomes. We say treatment assignment is endogenous because it is not independent of potential outcomes.
Unobservable Heterogeneity: When unit-specific unobserved characteristics drive both treatment and outcome.
Violation of SUTVA (Stable Unit Treatment Value Assumption): SUTVA requires two conditions: (A) No spillovers: a unit's outcome should not change because someone else got (or didn't get) the treatment; and (B) Consistency: the treatment we define is exactly the treatment units actually received.
Potential outcomes framework: \[Y_i(1), Y_i(0)\]
Treatment assignment: \[ T_i = \begin{cases} 1 & \text{if unit } i \text{ receives treatment} \\ 0 & \text{if unit } i \text{ is control} \end{cases} \]
Observed outcome: \[ Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0) \]
RCT key feature: Random assignment \[ T_i \perp (Y_i(1), Y_i(0)) \]
Identification: \[ ATE = E[Y_i(1) - Y_i(0)] = E[Y_i | T_i = 1] - E[Y_i | T_i = 0] \]
In a randomized controlled trial, treatment \(T_i\) is randomly assigned.
\[ Y_i = \alpha + \beta T_i + \gamma X_i + \varepsilon_i \]
\[ \text{Cov}(T_i, \varepsilon_i) = 0 \]
because randomization ensures \(T_i\) is independent of unobserved factors.
Key idea: random assignment makes the treated and control groups comparable, so the coefficient on \(T_i\) can be interpreted causally.
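A minimal simulation sketch of this point (the data-generating process and the true effect of 2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)                             # observed covariate
T = rng.integers(0, 2, size=n)                     # randomized treatment, independent of everything
Y = 1.0 + 2.0 * T + 0.5 * X + rng.normal(size=n)   # true ATE = 2

# OLS of Y on (1, T, X) via least squares
Z = np.column_stack([np.ones(n), T, X])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(beta[1])  # close to 2: randomization makes T exogenous
```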
Internal validity: can we estimate a causal effect for the experimental sample? Randomization solves confounding and selection bias within the sample.
External validity: does the effect generalize beyond the experimental sample?
Remaining practical threats:
Non-compliance: people do not follow their assignment.
Attrition: some outcomes are missing.
Identify causal effects when RCTs are infeasible.
\[ T_i \perp (Y_i(1), Y_i(0)) \text{ within some subpopulation} \]
Natural experiments motivate econometric tools: IV, DiD, Matching, Regression discontinuity.
Find a variable that affects treatment, but does not directly affect the outcome.
A policy, a rule, or an eligibility threshold that changes who receives the treatment, but does not change outcomes except through that treatment.
IV solves endogeneity when we cannot observe or adjust for all confounders.
Suppose treatment is endogenous in the outcome equation:
\[ Y_i = \alpha + \beta T_i + \gamma X_i + \varepsilon_i \]
\[ \text{Cov}(T_i, \varepsilon_i) \neq 0 \]
We introduce an instrument \(Z_i\) (continuous or multi-valued).
First stage: \[ T_i = \pi_0 + \pi_1 Z_i + \pi_2 X_i + u_i \]
\[ \pi_1 \neq 0 \text{ (instrument relevance)} \]
Reduced form: \[ Y_i = \rho_0 + \rho_1 Z_i + \rho_2 X_i + v_i \]
Second stage: \[ Y_i = \alpha + \beta \hat{T}_i + \gamma X_i + \varepsilon_i \]
where \(\hat{T}_i\) is the fitted value from the first stage.
Relevance \[ \text{Cov}(Z_i, T_i) \neq 0 \]
Exogeneity \[ Z_i \perp \varepsilon_i \]
(no direct effect of \(Z\) on \(Y\))
\[ \beta = \frac{\text{Cov}(Z_i, Y_i)}{\text{Cov}(Z_i, T_i)} \]
(the continuous-instrument analog of the Wald estimator): the causal effect of \(T\) on \(Y\) for compliers.
Does education increase wages?
\[ T = \text{years of education} \]
\[ Z = \text{distance to nearest college} \]
proximity affects education but not wages directly.
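A sketch of two-stage least squares on simulated data in the spirit of this example (the data-generating process and all coefficients are illustrative assumptions, not estimates from real data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
ability = rng.normal(size=n)                       # unobserved confounder
Z = rng.normal(size=n)                             # instrument: proximity to a college
T = 12 + 0.5 * Z + ability + rng.normal(size=n)    # years of education
Y = 2 + 0.1 * T + ability + rng.normal(size=n)     # log wage; true beta = 0.1

ones = np.ones(n)
# Naive OLS of Y on T is biased because ability drives both T and Y
b_ols, *_ = np.linalg.lstsq(np.column_stack([ones, T]), Y, rcond=None)

# 2SLS: first stage of T on Z, then regress Y on the fitted values
pi, *_ = np.linalg.lstsq(np.column_stack([ones, Z]), T, rcond=None)
T_hat = pi[0] + pi[1] * Z
b_iv, *_ = np.linalg.lstsq(np.column_stack([ones, T_hat]), Y, rcond=None)

print(b_ols[1], b_iv[1])   # OLS ~ 0.54 (biased upward), IV ~ 0.10 (consistent)
```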
When we use DiD
Idea
Compare how outcomes evolve over time in a treated group versus a similar control group.
Why it works
If both groups were on similar trends before the intervention, the change in the treatment group relative to the control group reflects the causal effect.
What DiD does well:
- Handles unobserved factors that do not change over time.
- Captures the impact of large reforms, laws, or shocks.
- Works with repeated cross-sections or panel data.
Key message
DiD isolates the effect of a treatment by comparing changes over time across groups.
Two groups (Treated/Control) and two periods (Before/After).
\[ Y_{it} = \alpha + \beta(T_{treat_i} \times P_{post_t}) + \lambda T_{treat_i} + \delta P_{post_t} + \varepsilon_{it} \]
The DiD estimate is the coefficient on the interaction term:
\[ \beta = \text{DiD} \]
This captures the causal effect of treatment.
Parallel trends: in the absence of treatment, the treated-control gap would have stayed constant, \((Y_{T, pre} - Y_{C, pre}) = (Y_{T, post} - Y_{C, post})\); equivalently, both groups would have experienced the same change over time.
\[ Y_{it} = \alpha + \beta(T_{treat_i} \times P_{post_t}) + \gamma X_{it} + \varepsilon_{it} \]
Minimum wage increases only in Region A.
\[ Y_{it} = \alpha + \beta(\text{RegionA}_i \times \text{After}_t) + \varepsilon_{it} \]
Suppose employment grows by 5pp in Region A and by 1pp in Region B (the control): \[ \text{DiD} = 5 - 1 = 4\text{pp} \]
The reform increased employment by 4pp.
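A minimal sketch reproducing these numbers with the two-by-two DiD regression (the cell means and noise level are illustrative):

```python
import numpy as np

# Cell means for employment (%): Region A is treated, Region B is control
means = {("A", "pre"): 60.0, ("A", "post"): 65.0,   # +5pp in the treated region
         ("B", "pre"): 58.0, ("B", "post"): 59.0}   # +1pp in the control region

rng = np.random.default_rng(2)
rows = []
for (region, period), mu in means.items():
    for _ in range(500):
        treat = 1.0 if region == "A" else 0.0
        post = 1.0 if period == "post" else 0.0
        rows.append([mu + rng.normal(scale=2.0), treat, post])
data = np.array(rows)
Y, treat, post = data[:, 0], data[:, 1], data[:, 2]

# Y = a + beta*(treat*post) + lambda*treat + delta*post + e
Z = np.column_stack([np.ones(len(Y)), treat * post, treat, post])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(coef[1])  # ~4: the DiD estimate (5 - 1)
```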
We consider \(Y\) as the outcome variable, \(X = (X_1, X_2, \ldots, X_p)'\) as the vector of covariates, and \(\beta\) as the vector of parameters in the linear model
\[ Y = \beta^{'} X + \epsilon, \]
where \(\mathbb{E}[\epsilon] = 0\).
Our goal is to construct the best linear predictor of \(Y\) given \(X\), which is the linear function of \(X\) that minimizes the mean squared error (MSE):
\[ \beta = \arg\min_{\beta \in \mathbb{R}^p} E[(Y - \beta^{'} X)^2] \]
The mean squared error (MSE) of the best linear predictor is given by: \[ \text{MSE} = E[(Y - \beta^{'} X)^2] \]
The \(R^2\) of the best linear predictor is defined as: \[ R^2 = \frac{E[(\beta^{'} X)^2]}{E[Y^2]} = 1 - \frac{E[\epsilon^2]}{E[Y^2]} \in [0, 1] \]
Interpretation: \(R^2\) is the fraction of the variation in \(Y\) that is explained by the best linear predictor.
When the number of predictors \(p\) is large relative to the number of observations \(n\), models can become overly complex and fit the noise in the training data rather than the underlying signal.
Consider an example where \(p = n\) and all \(X\) variables are independent standard normal random variables. In this case, we have
\[ \text{MSE}_{sample} = 0 \quad \text{and} \quad R^2_{sample} = 1 \]
WHY? With \(p = n\), the \(n \times n\) matrix of regressors is invertible (with probability one), so least squares solves \(Y_i = b^{'} X_i\) exactly for every observation: the in-sample residuals are all zero even though \(X\) has no true relationship to \(Y\).
Overfitting Example
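A small simulation sketch of this phenomenon (pure-noise outcome with \(p = n\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = p = 100
X = rng.normal(size=(n, p))        # p = n independent standard normal regressors
Y = rng.normal(size=n)             # Y is pure noise: no true relationship with X

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
r2_sample = 1 - (resid**2).sum() / (Y**2).sum()
print(r2_sample)                   # ~1: X has n linearly independent columns, so the fit is exact
```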


Predictive effects describe how our (population best linear) predictions change when the value of the target regressor changes, holding all other regressors constant.
Specifically, we partition the vector of regressors \(X\) into two parts: the target regressor of interest \(D\) and the remaining regressors \(W\) (also called control variables or covariates).
\[ X = (D, W^{'})^{'}, \]
\[ Y = \beta_1 D + \beta^{'}_2 W + \epsilon \]
How does the predicted value of \(Y\) change if \(D\) increases by a unit while \(W\) remains unchanged?
The answer is \(\beta_1\). WHY? The partialling-out (Frisch-Waugh-Lovell) argument makes this explicit:
\[ Y = \gamma^{'}_{YW} W + \tilde{Y} \quad \Rightarrow \quad \tilde{Y} = Y - \hat{Y} = Y - \gamma^{'}_{YW} W \]
\[ D = \gamma^{'}_{DW} W + \tilde{D} \quad \Rightarrow \quad \tilde{D} = D - \hat{D} = D - \gamma^{'}_{DW} W \]
\[ \tilde{Y} = \beta_1 \tilde{D} + \tilde{\epsilon} \]
We can also derive this by applying the partialling-out operation to both sides of the regression equation
\[ Y = \beta_1 D + \beta^{'}_2 W + \epsilon \]
to get
\[ \tilde{Y} = \beta_1 \tilde{D} + \beta^{'}_2 \tilde{W} + \tilde{\epsilon} \]
which simplifies to
\[ \tilde{Y} = \beta_1 \tilde{D} + \epsilon \]
Why does \(\tilde{W}\) disappear in the partialled-out regression? Because partialling \(W\) out of itself leaves nothing: \(W\) predicts itself perfectly, so \(\tilde{W} = 0\).
Why is \(\tilde{\epsilon} = \epsilon\)? Because \(\epsilon\) is uncorrelated with \(W\), its linear projection on \(W\) is zero, so the residual is \(\epsilon\) itself.
Interpretation of \(\beta_1\) in Partialling-Out
\(\beta_1\) can be interpreted as the effect of \(D\) on \(Y\) after removing the linear influence of \(W\): it is the coefficient in a univariate linear regression of residualized \(Y\) on residualized \(D\).
Residuals are defined by partialling-out the linear effects of \(W\) from both \(Y\) and \(D\).
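A numerical sketch of this equivalence (the simulated design is illustrative): the coefficient on \(D\) from the full regression matches the residual-on-residual coefficient.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 1_000, 5
W = rng.normal(size=(n, p))
D = W @ rng.normal(size=p) + rng.normal(size=n)
Y = 1.5 * D + W @ rng.normal(size=p) + rng.normal(size=n)   # true beta_1 = 1.5

# Full regression of Y on (D, W)
full, *_ = np.linalg.lstsq(np.column_stack([D, W]), Y, rcond=None)

# Partialling-out: residualize Y and D on W, then a univariate regression
g_yw, *_ = np.linalg.lstsq(W, Y, rcond=None)
g_dw, *_ = np.linalg.lstsq(W, D, rcond=None)
Y_t, D_t = Y - W @ g_yw, D - W @ g_dw
beta1 = (D_t @ Y_t) / (D_t @ D_t)

print(full[0], beta1)   # identical up to floating-point error
```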
Can we use linear regression for partialling-out?
When \(p/n\) is large, using linear regression for partialling-out can lead to overfitting issues, resulting in biased estimates of \(\beta_1\).
To address this, we can use dimension reduction or regularization techniques, such as Lasso or Ridge regression, during the partialling-out steps.
\[ Y = \beta^{'} X + \epsilon, \quad \epsilon \perp X, \]
where \(\beta^{'} X\) is the population best linear predictor of \(Y\) given \(X\).
\(p\) is large, possibly larger than \(n\)

\[ X = P(W) = (P_1(W), P_2(W), \ldots, P_p(W))' \]
where the set of transformations \(P(W)\) can be very large, leading to a high-dimensional feature space
Why Do We Need Constructed Regressors?
\(\beta^{'} P(W)\) are nonlinear in \(W\) but still linear in parameters \(\beta\).


In the population, the best predictor of \(Y\) given \(W\) is
\[ g(W) = E[Y|W] \]
the conditional expectation of \(Y\) given \(W\). The function \(g(W)\) is called the regression function of \(Y\) on \(W\).
The conditional expectation function \(g(W)\) solves the best prediction problem
\[ \min_{m(W)} E[(Y - m(W))^2]. \]
Here we minimize the mean squared prediction error over all prediction rules \(m(W)\).
\(\beta^{'} P(W)\) is an approximation to the best predictor \(g(W)\).
Using richer and more complex constructed regressors \(P(W)\) allows us to better approximate the true regression function \(g(W)\).
Classical linear regression can perform poorly in high-dimensional settings due to overfitting.
This is especially apparent when \(p \geq n\).
Regularization techniques, such as Lasso and Ridge regression, help mitigate overfitting by adding a penalty term to the loss function.
Lasso (Least Absolute Shrinkage and Selection Operator) adds a penalty to the least squares loss. It constructs the estimator \(\hat{\beta}\) by solving the following penalized least squares problem:
\[ \min_{b \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - b^{'} X_i)^2 + \lambda \sum_{j=1}^{p} |b_j| \right\} \]
The first term is the usual mean squared error, while the second term is called a penalty term.
The tuning parameter \(\lambda \geq 0\) controls the strength of the penalty.
Lasso performs both variable selection and regularization, shrinking some coefficients to exactly zero, effectively selecting a simpler model.
As long as \(\lambda > 0\), the introduction of the penalty term leads to a prediction rule that is less complex and less prone to overfitting compared to ordinary least squares.
The tuning parameter \(\lambda\) in Lasso regression controls the trade-off between fitting the training data well and keeping the model simple.
A larger \(\lambda\) increases the penalty for large coefficients, leading to a sparser model with more coefficients set to zero.
A smaller \(\lambda\) allows the model to fit the training data more closely, potentially leading to overfitting if \(p\) is large relative to \(n\).
Common methods for selecting \(\lambda\) include cross-validation and the plug-in rule
\[ \lambda = 2 c \hat{\sigma} \sqrt{n} \Phi^{-1}(1 - \alpha / (2p)) \]
where \(\hat{\sigma}\) is an estimate of the standard deviation of the error term, \(\Phi^{-1}\) is the inverse CDF of the standard normal distribution, and \(c > 1\) is a constant, and \(\alpha\) is a small significance level (e.g., 0.05).
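A minimal sklearn sketch of Lasso in a sparse high-dimensional design (note: sklearn's `Lasso` minimizes \(\frac{1}{2n}\sum_i (Y_i - b^{'}X_i)^2 + \alpha \sum_j |b_j|\), so its `alpha` corresponds to \(\lambda/2\) in the notation above; the design and penalty level are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, s = 200, 500, 5                       # p >> n, only s truly relevant regressors
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:s] = np.array([3.0, -2.0, 1.5, 1.0, -1.0])
Y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, Y)   # alpha plays the role of lambda/2 here
selected = np.flatnonzero(lasso.coef_)
print(len(selected), selected[:10])          # a sparse model: most coefficients are exactly zero
```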
Lasso shrinks the coefficients of relevant regressors toward zero and thus “underestimates” their absolute values.
Therefore, Lasso may not be ideal for inference about predictive effects or causal effects.

To mitigate the bias introduced by Lasso, we can use a two-step procedure called Post-Lasso estimation.
The Post-Lasso estimator is obtained by:
First, use Lasso regression to select a subset of relevant regressors (those with non-zero coefficients).
Then, fit an ordinary least squares regression using only the selected regressors from the first step.
Does Post-Lasso, \(\hat{\beta}^{'} X\), provide a good approximation to the best linear prediction rule \(\beta^{'} X\)?
Lasso selects \(s\) regressors; \(s\) is also called the effective dimension.
To estimate the Post-Lasso estimator, we need \(n/s\) to be sufficiently large to avoid overfitting in the second step.
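A minimal sketch of the two steps (a simulated sparse design; the penalty level is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(6)
n, p = 200, 500
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [3.0, -2.0, 1.5]
Y = X @ beta_true + rng.normal(size=n)

# Step 1: Lasso selects the regressors with non-zero coefficients
selected = np.flatnonzero(Lasso(alpha=0.1, max_iter=10_000).fit(X, Y).coef_)
s = len(selected)                            # effective dimension

# Step 2: OLS on the selected regressors removes the Lasso shrinkage bias
post_lasso = LinearRegression().fit(X[:, selected], Y)
print(s, post_lasso.coef_[:3])               # close to the true values 3.0, -2.0, 1.5
```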
Ridge regression instead penalizes the sum of squared coefficients:
\[ \min_{b \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - b^{'} X_i)^2 + \lambda \sum_{j=1}^{p} b_j^2 \right\} \]
Elastic Net combines the two penalties:
\[ \min_{b \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - b^{'} X_i)^2 + \lambda_1 \sum_{j=1}^{p} |b_j| + \lambda_2 \sum_{j=1}^{p} b_j^2 \right\} \]
\(\lambda_1\) controls the Lasso penalty, while \(\lambda_2\) controls the Ridge penalty.
Two tuning parameters could be selected via cross-validation or other hyperparameter optimization methods in machine learning, such as grid search or Bayesian optimization.
Choice of Regression Methods in Practice
The choice between Lasso, Ridge, and Elastic Net regression depends on the specific characteristics of the data and the goals of the analysis.
If we are interested in building the best prediction, we can tune each method via cross-validation and select the one with the lowest prediction error on test data.
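For example, a sketch of such a comparison with scikit-learn's built-in cross-validated estimators (the sparse design here favors Lasso; all settings are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, p = 300, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:10] = rng.normal(size=10)   # sparse truth
Y = X @ beta + rng.normal(size=n)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)
models = {
    "lasso": LassoCV(cv=5).fit(X_tr, Y_tr),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, Y_tr),
    "enet":  ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_tr, Y_tr),
}
for name, m in models.items():
    mse = np.mean((Y_te - m.predict(X_te)) ** 2)      # out-of-sample prediction error
    print(name, round(mse, 3))
```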
\[ Y = \beta_1 D + \beta^{'}_2 W + \epsilon \]
If conditioning on \(W\) is sufficient to control for confounding between \(D\) and \(Y\), then \(\beta_1\) can be interpreted as the average causal effect of \(D\) on \(Y\).
Then the predictive effect of \(D\) on \(Y\) can answer the causal question:
What is the average change in \(Y\) when we intervene to increase \(D\) by one unit, holding \(W\) constant?
The key step is application of Lasso regression for partialling-out in the presence of high-dimensional covariates. Consider the following regression model:
\[ Y = \alpha D + \beta^{'} W + \epsilon, \]
where \(D\) is the target regressor of interest and \(W\) is a vector of \(p\) control variables. We partial out \(W\) from both \(Y\) and \(D\) by Lasso:
\[ \hat{\gamma}_{YW} = \arg\min_{\gamma \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \gamma^{'} W_i)^2 + \lambda \sum_{j=1}^{p} |\gamma_j| \right\}, \]
\[ \hat{\gamma}_{DW} = \arg\min_{\gamma \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} (D_i - \gamma^{'} W_i)^2 + \lambda \sum_{j=1}^{p} |\gamma_j| \right\}. \]
\[ \tilde{Y} = Y - \hat{\gamma}_{YW}^{'} W, \quad \tilde{D} = D - \hat{\gamma}_{DW}^{'} W. \]
\[ \begin{align} \hat{\alpha} &= \arg\min_{\alpha \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} ({\tilde{Y}}_i - \alpha {\tilde{D}}_i)^2 \\ &= (E_n \tilde{D}^2)^{-1} E_n\tilde{D} \tilde{Y}. \end{align} \]
The robust variance estimator is
\[ V = (E_n \tilde{D}^2)^{-1} E_n (\tilde{D}^2 \hat{\epsilon}^2) (E_n \tilde{D}^2)^{-1}, \]
where \(\hat{\epsilon}_i = \tilde{Y}_i - \hat{\alpha} \tilde{D}_i\), and
\[ \text{SE}(\hat{\alpha}) = \sqrt{V/n}. \]
An approximate 95% confidence interval is \[ \left[\hat{\alpha} \pm 2 \times \sqrt{V/n} \right]. \]
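A compact sketch of these steps, using cross-validated Lasso for the two partialling-out regressions (the simulated design with true \(\alpha = 1\) is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
n, p = 500, 200
W = rng.normal(size=(n, p))
D = W[:, :5].sum(axis=1) + rng.normal(size=n)
Y = 1.0 * D + W[:, :5] @ np.array([1, -1, 1, -1, 1.0]) + rng.normal(size=n)  # alpha = 1

# Partial W out of Y and out of D by Lasso, keep the residuals
Y_t = Y - LassoCV(cv=5).fit(W, Y).predict(W)
D_t = D - LassoCV(cv=5).fit(W, D).predict(W)

# Final univariate regression of residualized Y on residualized D
alpha_hat = (D_t @ Y_t) / (D_t @ D_t)
eps_hat = Y_t - alpha_hat * D_t
V = np.mean(D_t**2 * eps_hat**2) / np.mean(D_t**2) ** 2      # robust variance
se = np.sqrt(V / n)
print(alpha_hat, [alpha_hat - 2 * se, alpha_hat + 2 * se])   # ~1 with a valid CI
```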
Practical Example: A comparison of OLS and Double Lasso
For the relevant code and data, go to here
Here we consider the model: \[ Y = \sum_{j=1}^{p_1} \alpha_j D_j + \sum_{k=1}^{p_2} \beta_k W_k + \epsilon, \]
where the number of target regressors \(p_1\) can also be large, and the number of control variables \(p_2\) can be large as well.
There can be multiple policy variables of interest that we want to analyze simultaneously, such as the effects of different education policies on student outcomes.
We can be interested in heterogeneous treatment effects across different subgroups, which requires estimating multiple coefficients for each subgroup.
We can be interested in nonlinear effects of policies
For each \(j = 1, \ldots, p_1\), we can apply the one-by-one Double Lasso procedure to estimate \(\alpha_j\) while treating the other \(D_{-j}\) as part of the control variables.
\[ Y = \alpha_j D_j + \gamma_j^{'} W_j + \epsilon, \quad W_j = ((D_{-j})^{'}, W^{'})^{'} \]
The Double Lasso provides a high-quality estimate \(\hat{\alpha}_j\) of each \(\alpha_j\), and we can construct confidence intervals for each \(\alpha_j\).
This allows us to make inference on multiple coefficients simultaneously, even in high-dimensional settings where \(p_1\) and \(p_2\) are large relative to \(n\).
A partially linear regression model:
\[ Y = \alpha D + {\color{#707C36}{g(W)}} + \epsilon, \quad E[\epsilon|D,W] = 0, \tag{1}\]
where \(Y\) is the outcome variable, \(D\) is the treatment variable, \(W\) is a vector of control variables.
The model allows a part of the regression function, \(\color{#707C36}{g(W)}\), to be fully nonlinear
However, the model is not fully general, because it imposes additivity in \(\color{#707C36}{g(W)}\) and \(D\)
Applying partialling-out to Equation 1, we obtain:
\[ \tilde{Y} = \alpha \tilde{D} + \epsilon, \]
where \(\tilde{Y}\) and \(\tilde{D}\) are the residuals left after predicting \(Y\) and \(D\) using \(W\).
\[ \tilde{Y} := Y - \ell(W) \quad \text{and} \quad \tilde{D} := D - m(W), \]
where \(\ell(W)\) and \(m(W)\) are conditional expectation functions of \(Y\) and \(D\) given \(W\), respectively.
\[ \ell(W) = E[Y|W], \quad m(W) = E[D|W]. \]
Split the data into random folds: \(\{1, \ldots, n\} = \cup_{k=1}^K I_k\). Compute ML estimators \(\hat{\ell}_{k}\) and \(\hat{m}_k\), leaving out the \(k\)-th fold of the data. Obtain the cross-fitted residuals for each fold \(i \in I_k\):
\[ \tilde{Y}_i = Y_i - \hat{\ell}_k(W_i), \quad \tilde{D}_i = D_i - \hat{m}_k(W_i). \]
Apply ordinary least squares of \(\tilde{Y}_i\) on \(\tilde{D}_i\) to obtain the DML estimator \(\hat{\alpha}\), which solves
\[ E_n[(\tilde{Y}_i - \hat{\alpha} \tilde{D}_i) \tilde{D}_i] = 0 \quad \Longrightarrow \quad \hat{\alpha} = (E_n \tilde{D}_i^2)^{-1} E_n[\tilde{D}_i \tilde{Y}_i]. \]
Construct confidence intervals for \(\alpha\):
\[ \left[\hat{\alpha} \pm 2 \times \sqrt{V/n} \right], \]
covers \(\alpha\) in approximately 95% of repeated samples, where \(V\) is the variance of the DML estimator \(\hat{\alpha}\).
Validity requires the nuisance estimators to converge fast enough: the average estimation errors
\[ \frac{1}{K} \sum_{k=1}^K \|\hat{\ell}_k - \ell\|_{L^2}^2 \quad \text{and} \quad \frac{1}{K} \sum_{k=1}^K \|\hat{m}_k - m\|_{L^2}^2 \]
must be small, with the product of the two root mean squared errors vanishing faster than \(1/\sqrt{n}\).
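A sketch of this recipe with random forests as the ML learners for \(\hat{\ell}\) and \(\hat{m}\) (the learner choice and the simulated nonlinear design are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(9)
n, p = 2_000, 10
W = rng.normal(size=(n, p))
g = np.sin(W[:, 0]) + W[:, 1] ** 2            # nonlinear g(W)
D = np.cos(W[:, 0]) + rng.normal(size=n)      # m(W) is also nonlinear
Y = 1.0 * D + g + rng.normal(size=n)          # true alpha = 1

Y_t, D_t = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(W):
    # Fit the nuisances ell(W) = E[Y|W] and m(W) = E[D|W] on the other folds
    ell = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
    m = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
    ell.fit(W[train], Y[train]); m.fit(W[train], D[train])
    # Cross-fitted residuals on the held-out fold
    Y_t[test] = Y[test] - ell.predict(W[test])
    D_t[test] = D[test] - m.predict(W[test])

alpha_hat = (D_t @ Y_t) / (D_t @ D_t)         # final OLS of residual on residual
eps = Y_t - alpha_hat * D_t
V = np.mean(D_t**2 * eps**2) / np.mean(D_t**2) ** 2
print(alpha_hat, 2 * np.sqrt(V / n))          # estimate and half-width of the 95% CI
```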
\[ Y = g(D, X) + \epsilon, \quad E[\epsilon|D,X] = 0, \tag{2}\]
\[ D = m(X) + \tilde{D}, \quad E[\tilde{D}|X] = 0, \tag{3}\]
Since \(D\) is not additively separable in Equation 2, this model is more general than the partially linear regression model.
The average predictive effect (APE) of the binary treatment \(D\) on the outcome \(Y\) is defined as:
\[ \theta_0 = E[g(1, X) - g(0, X)] \]
which represents the average predictive effect of switching the treatment from 0 to 1, averaging over the distribution of covariates \(X\).
The confounding factors, \(X\), affect the policy variable via the propensity score \(m(X) = E[D|X]\) and affect the outcome variable via the regression function \(g(D, X)\).
Estimation of the ATE is based on the relation:
\[ \theta_{0} = \mathbb{E}[\varphi_{0}(W)], \tag{4}\]
where
\[ \varphi_{0}(W) = g_{0}(1, X) - g_{0}(0, X) + \bigl(Y - g_{0}(D, X)\bigr) H_{0} \]
and
\[ H_{0} = \frac{\mathbf{1}(D = 1)}{m_{0}(X)} - \frac{\mathbf{1}(D = 0)}{1 - m_{0}(X)} \]
is the Horvitz-Thompson transformation.
Split the data into random folds: \(\{1, \ldots, n\} = \cup_{k=1}^K I_k\). Compute ML estimators \(\hat{g}_{k}\) and \(\hat{m}_k\), leaving out the \(k\)-th fold of the data, such that \(\epsilon \le \hat{m}_k \le 1-\epsilon\) for some small \(\epsilon > 0\). For each fold \(i \in I_k\), compute:
\[ \hat{\varphi}(W_i) = \hat{g}_{k}(1, X_i) - \hat{g}_{k}(0, X_i) + \bigl(Y_i - \hat{g}_{k}(D_i, X_i)\bigr) \hat{H}_i, \]
where \(\hat{H}_i = \frac{\mathbf{1}(D_i = 1)}{\hat{m}_k(X_i)} - \frac{\mathbf{1}(D_i = 0)}{1 - \hat{m}_k(X_i)}\).
Compute the estimator \(\hat{\theta} = E_n[\hat{\varphi}(W_i)]\).
Construct confidence intervals for \(\theta_0\):
\[ \left[\hat{\theta} \pm 2 \times \sqrt{\hat{V}/n} \right], \]
where \(\hat{V} = E_n[(\hat{\varphi}(W_i) - \hat{\theta})^2]\).
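A sketch of this algorithm (a random forest for \(g\) and a logistic regression for \(m\) are illustrative choices; note the trimming of \(\hat{m}\) away from 0 and 1):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(10)
n, p = 2_000, 5
X = rng.normal(size=(n, p))
propensity = 1 / (1 + np.exp(-X[:, 0]))
D = (rng.uniform(size=n) < propensity).astype(float)
Y = D * (1 + X[:, 1]) + X[:, 0] + rng.normal(size=n)   # true ATE = 1

phi = np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Outcome regression g(D, X), fit with D as an extra feature
    g = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
    g.fit(np.column_stack([D[train], X[train]]), Y[train])
    # Propensity score m(X) = E[D|X], trimmed away from 0 and 1
    m = LogisticRegression(max_iter=1000).fit(X[train], D[train])
    m_hat = np.clip(m.predict_proba(X[test])[:, 1], 0.01, 0.99)

    g1 = g.predict(np.column_stack([np.ones(len(test)), X[test]]))
    g0 = g.predict(np.column_stack([np.zeros(len(test)), X[test]]))
    gD = g.predict(np.column_stack([D[test], X[test]]))
    H = D[test] / m_hat - (1 - D[test]) / (1 - m_hat)   # Horvitz-Thompson weights
    phi[test] = g1 - g0 + (Y[test] - gD) * H

theta_hat = phi.mean()
se = phi.std() / np.sqrt(n)                              # sqrt(V_hat / n)
print(theta_hat, [theta_hat - 2 * se, theta_hat + 2 * se])   # ~1
```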
Consider a structural equation model (SEM):
\[ \begin{aligned} Y &:= f_{Y}(D, X, A, \varepsilon_{Y}) \\ D &:= f_{D}(Z, X, A, \varepsilon_{D}) \in \{0,1\}, \\ Z &:= f_{Z}(X, \varepsilon_{Z}) \in \{0,1\} \end{aligned} \]
where the disturbances \(\varepsilon_Y, \varepsilon_D, \varepsilon_Z\) are mutually independent.
Suppose the instrument \(Z\) is an offer to participate in a training program and the treatment \(D\) is actual, endogenous participation in the program. Participation may depend on unobservables, \(A\), such as motivation, which also affect the outcome \(Y\). The variable \(X\) captures observed covariates such as age, education, and work experience.
The model allows us to identify the local average treatment effect (LATE), defined as:
\[ \theta = E[Y(1) - Y(0)|D(1) > D(0)], \]
where \(\{D(1) > D(0)\}\) is the complier event: switching the instrument \(Z\) from 0 to 1 switches treatment from 0 to 1.
In the LATE model, \(\theta\) can be identified by the ratio of two statistical parameters,
\[ \theta_0 = \theta_1 / \theta_2, \tag{5}\]
where \[ \theta_1 = E[E[Y|Z=1, X] - E[Y|Z=0, X]], \]
and \[ \theta_2 = E[E[D|Z=1, X] - E[D|Z=0, X]]. \]
Equation 5 can be written explicitly as:
\[ \theta_0 = \frac{E[E[Y|Z=1, X] - E[Y|Z=0, X]]}{E[E[D|Z=1, X] - E[D|Z=0, X]]}. \]
This parameter is the ratio of the average predictive effect of \(Z\) on \(Y\) to the average predictive effect of \(Z\) on \(D\).
Define regression functions:
\[ \begin{aligned} \mu_0(Z, X) &= E[Y|Z, X] \\ m_0(Z, X) &= E[D|Z, X] \\ p_0(X) &= E[Z|X]. \end{aligned} \]
Therefore, the nuisance parameters are \(\eta = (\mu, m, p)\).
The DML estimator \(\hat{\theta}\) solves the empirical moment condition \(E_n[\psi(W; \hat{\theta}, \hat{\eta})] = 0\), with the Neyman-orthogonal score:
\[ \psi(W; \theta, \eta) := \mu(1, X) - \mu(0, X) + H(p)\bigl(Y - \mu(Z, X)\bigr) - \bigl(m(1, X) - m(0, X) + H(p)\bigl(D - m(Z, X)\bigr)\bigr)\theta, \]
for \(W = (Y, D, X, Z)\) and
\[ H(p) := \frac{Z}{p(X)} - \frac{1 - Z}{1 - p(X)} . \]
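In the simplest case with no covariates \(X\), the estimand reduces to the unconditional Wald ratio of Equation 5. A minimal simulated sketch (the design, with a homogeneous effect of 2 and monotone take-up, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20_000
A = rng.normal(size=n)                         # unobserved confounder (e.g., motivation)
Z = rng.integers(0, 2, size=n)                 # randomized offer
# Participation depends on the offer and on A (endogenous, monotone take-up)
D = ((0.9 * Z + 0.3 * A + rng.normal(size=n)) > 0.5).astype(float)
Y = 2.0 * D + A + rng.normal(size=n)           # true effect for compliers is 2

# Wald / LATE: effect of Z on Y divided by effect of Z on D
theta_1 = Y[Z == 1].mean() - Y[Z == 0].mean()
theta_2 = D[Z == 1].mean() - D[Z == 0].mean()
print(theta_1 / theta_2)                        # ~2, unlike the confounded OLS of Y on D
```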
Practical Example: Application of PLM, IRM and LATE to the 401(K) data
For the relevant code and data, go to here
The key to the success of the Double Lasso procedure is the property of Neyman orthogonality, which ensures that the estimation error from the first step (Lasso) does not bias the estimation of \(\alpha\) in the second step.
Neyman orthogonality means that the moment condition used to estimate \(\alpha\) is insensitive to small errors in the estimation of the nuisance parameters \(\gamma_{YW}\) and \(\gamma_{DW}\).
Collecting the nuisance parameters as
\[ \eta^{\circ} = (\gamma_{DW}^{'}, \gamma_{YW}^{'})^{'}, \]
orthogonality means that the derivative of the target parameter with respect to the nuisance parameters vanishes at the true values:
\[ \partial_{\eta} \alpha(\eta^{\circ}) = 0. \]
Simulation example (true effect \(\alpha = 1\)):
\[ Y = 1 \cdot D + \beta^{'} W + \epsilon_Y, \quad W \sim N(0, I), \quad \epsilon_Y \sim N(0, 1) \]
\[ D = \gamma_{DW}^{'} W + \tilde{D}, \quad \tilde{D} \sim N(0, 1)/4 \]
We run 1000 simulations and compare the bias and standard deviation of the naive and orthogonal estimators.
The naive estimator performs poorly because it selects only the controls \(W_j\) that are strongly predictive of \(Y\); omitting weak predictors of \(Y\) that are strongly predictive of \(D\) induces omitted variable bias in the estimate of \(\alpha\).
In contrast, the orthogonal estimator is designed to be robust to such selection mistakes, which is why it performs much better in terms of bias and standard deviation.
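A scaled-down sketch of this Monte Carlo (fewer replications than the 1000 in the text, and the decaying coefficient sequences for \(\beta\) and \(\gamma_{DW}\) are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(12)
n, p, n_sim = 100, 50, 200                 # scaled down from 1000 replications
beta_w = 1.0 / np.arange(1, p + 1) ** 2    # W is mostly weakly predictive of Y given D ...
gamma_dw = 1.0 / np.arange(1, p + 1) ** 2  # ... but strongly predictive of D

naive, ortho = [], []
for _ in range(n_sim):
    W = rng.normal(size=(n, p))
    D = W @ gamma_dw + rng.normal(size=n) / 4          # D-tilde ~ N(0,1)/4 as in the text
    Y = 1.0 * D + W @ beta_w + rng.normal(size=n)      # true alpha = 1

    # Naive: select controls via one Lasso of Y on (D, W), then refit OLS
    coef = LassoCV(cv=5).fit(np.column_stack([D, W]), Y).coef_
    sel = np.flatnonzero(coef[1:])                     # controls kept by the Y-equation Lasso
    Xn = np.column_stack([D, W[:, sel]]) if sel.size else D[:, None]
    naive.append(LinearRegression().fit(Xn, Y).coef_[0])

    # Orthogonal: partial W out of both Y and D, then residual-on-residual OLS
    Y_t = Y - LassoCV(cv=5).fit(W, Y).predict(W)
    D_t = D - LassoCV(cv=5).fit(W, D).predict(W)
    ortho.append((D_t @ Y_t) / (D_t @ D_t))

for name, est in (("naive", np.array(naive)), ("orthogonal", np.array(ortho))):
    print(name, "bias:", round(est.mean() - 1, 3), "sd:", round(est.std(), 3))
```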
An alternative to the Double Lasso (partialling-out) procedure is the Double Selection method: run a Lasso of \(Y\) on \(W\) and a Lasso of \(D\) on \(W\), then regress \(Y\) on \(D\) and the union of the controls selected in either step.
This procedure is approximately equivalent to the partialling out approach and also relies on the principle of Neyman orthogonality to achieve valid inference on \(\alpha\).
We may also be interested in estimating group average treatment effects (GATEs) for a subgroup:
\[ \theta_0(G) = E[g(1, X) - g(0, X) \mid G = 1], \]
where \(G\) is a group indicator defined in terms of the covariates \(X\).
For example, we might be interested in the impact of a vaccine on teenagers, so \(G\) would be an indicator for \(13 \leq \text{age} \leq 19\).
DML estimation of GATEs can be done by modifying the estimation procedure for APEs/ATEs to focus on the subgroup defined by \(G=1\).
\[ \hat{\theta}_G = E_n[\hat{\varphi}(W) \mid G = 1] = \frac{E_n[\hat{\varphi}(W)\, G]}{E_n[G]}. \]
Figure: Three causal DAGs for the 401(k) example in which adjusting for covariates \(X\) is sufficient to control for confounding between \(D\) and \(Y\).