Empirical Economics

Lecture 7: Potential Outcomes and Difference-in-differences

Outline

Course Overview

Linear Model I
Linear Model II
Time Series and Prediction
Panel Data I
Panel Data II
Binary Outcome Data
Potential Outcomes and Difference-in-differences
Instrumental Variables

What do we do today?

Introduce the concept of potential outcomes and the fundamental problem of causal inference.
Look at difference-in-differences (DiD) as a key method to solve that problem.
Theoretically analyze the DiD estimator and link it to the average treatment effect on the treated, a key parameter of interest, through potential outcomes.
Introduce the implementation of DiD through linear regression.
Consider extensions of the basic DiD design.

Potential Outcomes

Correlation vs. Causality

The fundamental challenge in empirical work.
Correlation: Two variables move together.
- Example: Ice cream sales are positively correlated with crime rates.
Causation: A change in one variable causes a change in another.
- Does eating ice cream cause crime? Unlikely.
Confounding Variable: A third variable affects both.
- Hot weather increases both ice cream sales and the number of people outside (leading to more opportunities for crime).
Our goal is to isolate the causal effect, not just the correlation.

The Potential Outcomes Framework

Also known as the Rubin Causal Model.
Let’s think about the effect of a treatment (e.g., a job training program) on an individual i.
$Y_i(1)$: The potential outcome for unit i if they receive the treatment.
$Y_i(0)$: The potential outcome for unit i if they do NOT receive the treatment.

Example: Potential Outcomes

In the context where the treatment is a job training program:

$Y_i(1)$: A person’s earnings if they attend the program.

$Y_i(0)$: A person’s earnings if they do not attend the program.

The Individual Causal Effect

For any single individual $i$, the true causal effect of the treatment is the difference between their two potential outcomes:

Definition: Individual Causal Effect

\[\tau_i = Y_i(1) - Y_i(0)\]

This is the pure, unadulterated effect of the treatment on that one person.
Example: The increase in Person i’s earnings caused only by the training program.

From Individuals to Populations

The Average Treatment Effect (ATE)

Since we usually can’t measure the effect for every single individual, we focus on averages.
The Average Treatment Effect (ATE) is the average of the individual causal effects over the entire population.
- This tells us, “On average, what is the effect of this treatment for a person randomly drawn from the population?”
- For future reference, consider also the Average Treatment Effect on the treated population:

Definition: Average Treatment Effect

\[\text{ATE} = E[\tau_i] = E[Y_i(1) - Y_i(0)]\]

\[\text{ATT} = E[\tau_i | T_i=1] = E[Y_i(1) - Y_i(0) | T_i=1]\]

The Fundamental Problem of Causal Inference

This is the core challenge that all causal methods try to solve.
For any given unit i, we can only ever observe one of their potential outcomes.
- If person i takes the training program, we see $Y_i(1)$. We will never know what their earnings would have been without it, $Y_i(0)$.
- If person i does not take the program, we see $Y_i(0)$. We will never know $Y_i(1)$.
Causal inference is a missing data problem. The $Y_i(0)$ for the treated and the $Y_i(1)$ for the untreated are called counterfactuals.

Illustrating the Fundamental Problem

The following illustrates the data we have at our disposal.

Unit (i)	Attends Program?	Observed Earnings	$Y_i(1)$	$Y_i(0)$
Philippe	Yes (T=1)	$50,000	$50,000	???
Peter	No (T=0)	$40,000	???	$40,000
Joel	Yes (T=1)	$45,000	$45,000	???
Daron	No (T=0)	$60,000	???	$60,000

We can’t calculate $Y_i(1) - Y_i(0)$ for anyone!

Motivation for DiD

Why Simple Comparisons Fail

A naive approach might be to just compare the average earnings of those who attended the program to those who didn’t.

\[ \text{Difference-in-means} = E[Y_i | T_i=1] - E[Y_i | T_i=0] \]
This is almost always wrong. Why?
Because the people who choose to get treatment might be different from those who don’t in ways that also affect the outcome.
This is referred to as selection bias.

Difference in Means Example

In this setting:

Unit (i)	Attends Program?	Observed Earnings	$Y_i(1)$	$Y_i(0)$
Philippe	Yes (T=1)	$50,000	$50,000	???
Peter	No (T=0)	$40,000	???	$40,000
Joel	Yes (T=1)	$45,000	$45,000	???
Daron	No (T=0)	$60,000	???	$60,000

The difference-in-means is:

\[ \begin{align*} \text{Difference-in-means} &= E[Y_i | T_i = 1] - E[Y_i | T_i = 0] \\ &= \frac{50,000 + 45,000}{2} - \frac{40,000 + 60,000}{2} \\ &= 47,500 - 50,000 = -2,500 \end{align*} \]

Selection Bias: The Hidden Difference

The simple difference-in-means can be decomposed:

\[\begin{align*} \text{Difference-in-means} &= E[Y_i|T_i=1] - E[Y_i|T_i=0] \\ &= E[Y_i(1)|T_i=1] - E[Y_i(0)|T_i=0] \\ &= E[Y_i(1)|T_i=1] - \overbrace{E[Y_i(0)|T_i=1] + E[Y_i(0)|T_i=1]}^{\text{Subtract and add the same term.}} - E[Y_i(0)|T_i=0] \\ &= \underbrace{\left( E[Y_i(1)|T_i=1] - E[Y_i(0)|T_i=1] \right)}_{\text{Average Treatment Effect on the Treated (ATT)}} \\ &\quad \quad \hspace{3em} + \underbrace{\left( E[Y_i(0)|T=1] - E[Y_i(0)|T=0] \right)}_{\text{Selection Bias}} \end{align*}\]

The selection bias tells you the difference in the untreated potential outcomes between the treatment and control groups.

Selection Bias: Decomposition

Hence, we can make the following decomposition:

Theorem: Decomposition of Sample Average

\[E[Y_i|T_i=1] - E[Y_i|T_i=0] = ATT + \text{Selection Bias},\]

where $\text{Selection Bias} = E[Y_i(0)|T_i=1] - E[Y_i(0)|T_i=0].$

In words, selection bias is the difference in the no-treatment outcome between the treated and untreated groups.
Job Program Example: people who sign up for training ($T_i=1$) might be more motivated. Even without the program, their earnings $Y_i(0)$ might have been higher than the less motivated group ($T_i=0$).

Intro to DiD

Introduction to Differences-in-Differences (DiD)

The core idea in DiD is to use data from a pre-treatment period to account for selection bias.
- We assume that the “selection bias” (the difference between the groups) is constant over time.
We compare the change in the outcome over time for the treatment group to the change over time for a control group.
- The “difference in the differences” isolates the treatment effect.

The $2 \times 2$ DiD Setup

The classic setup involves two groups and two time periods.

	Before Period (Pre)	After Period (Post)
Treatment Group	$\hat{Y}_{T, Pre}$	$\hat{Y}_{T, Post}$
Control Group	$\hat{Y}_{C, Pre}$	$\hat{Y}_{C, Post}$

Treatment Group: A group that is exposed to the policy/treatment in the “After” period.
Control Group: A similar group that is not exposed to the treatment in either period.

Calculating the Simple DiD Estimator

We calculate two differences, then take the difference between them.

Manual Calculation of DiD Estimator

First Difference (Treatment Group): The change over time for the treated.

$\Delta_T = \hat{Y}_{T,Post} - \hat{Y}_{T, Pre}$
First Difference (Control Group): The change over time for the controls. This represents the “secular trend” – what would have happened without the treatment.

$\Delta_C = \hat{Y}_{C, Post} - \hat{Y}_{C, Pre}$
The Difference-in-Differences:

$\tau_{DiD} = \Delta_T - \Delta_C$

Example: DiD in a $2 \times 2$ Set-Up

Example: Card and Krueger (1994)

The study by Card and Alan Krueger (AER, 1994), titled “Minimum Wages and Employment: A Case Study of the Fast Food Industry in New Jersey and Pennsylvania,” is a landmark paper in labor economics.

It challenged the conventional wisdom that raising the minimum wage necessarily reduces employment.

The authors analyzed the impact of New Jersey’s 1992 minimum wage increase (from $4.25 to $5.05 per hour) by comparing employment changes at fast-food restaurants in New Jersey (where the wage rose) to those in Pennsylvania (where it remained unchanged).

Surprisingly, they found that employment in New Jersey increased by 13% relative to Pennsylvania, contradicting traditional economic predictions

Download the Card & Krueger (1994) data:

Code

library(tidyverse); library(fixest)
file_url <- "https://github.com/mca91/EconometricsWithR/blob/master/data/fastfood.dta?raw=true"
card_krueger_data <- foreign::read.dta(file_url) |> as_tibble()
head(card_krueger_data)
## # A tibble: 6 × 46
##   sheet chain co_owned state southj centralj northj   pa1   pa2 shore ncalls
##   <int> <int>    <int> <int>  <int>    <int>  <int> <int> <int> <int>  <int>
## 1    46     1        0     0      0        0      0     1     0     0      0
## 2    49     2        0     0      0        0      0     1     0     0      0
## 3   506     2        1     0      0        0      0     1     0     0      0
## 4    56     4        1     0      0        0      0     1     0     0      0
## 5    61     4        1     0      0        0      0     1     0     0      0
## 6    62     4        1     0      0        0      0     1     0     0      2
## # ℹ 35 more variables: empft <dbl>, emppt <dbl>, nmgrs <dbl>, wage_st <dbl>,
## #   inctime <dbl>, firstinc <dbl>, bonus <int>, pctaff <dbl>, meals <int>,
## #   open <dbl>, hrsopen <dbl>, psoda <dbl>, pfry <dbl>, pentree <dbl>,
## #   nregs <int>, nregs11 <int>, type2 <int>, status2 <int>, date2 <int>,
## #   ncalls2 <int>, empft2 <dbl>, emppt2 <dbl>, nmgrs2 <dbl>, wage_st2 <dbl>,
## #   inctime2 <int>, firstin2 <dbl>, special2 <int>, meals2 <int>, open2r <dbl>,
## #   hrsopen2 <dbl>, psoda2 <dbl>, pfry2 <dbl>, pentree2 <dbl>, nregs2 <int>, …

Compute $\tau_{DID}$

Code

# Treatment group before and after
treated <- card_krueger_data |>
  filter(state==1)|>
  summarize(treated_emp_before = mean(employment[time=='before'], na.rm=T),
            treated_emp_after = mean(employment[time=='after'], na.rm=T))

treated
## # A tibble: 1 × 2
##   treated_emp_before treated_emp_after
##                <dbl>             <dbl>
## 1               20.4              21.0

control <- card_krueger_data |>
  filter(state==0)|>
  summarize(control_emp_before = mean(employment[time=='before'], na.rm=T),
            control_emp_after = mean(employment[time=='after'], na.rm=T))

control
## # A tibble: 1 × 2
##   control_emp_before control_emp_after
##                <dbl>             <dbl>
## 1               23.3              21.2

tau_did <- treated$treated_emp_after - treated$treated_emp_before - (control$control_emp_after - control$control_emp_before)

tau_did
## [1] 2.753606

pre_treatment_mean <- card_krueger_data |> 
  filter(state==1, time=='before') |> 
  summarize(m = mean(employment, na.rm=T)) |>
  pull(m)

tau_did/pre_treatment_mean
## [1] 0.1347204

Download the Card & Krueger (1994) data

Code

import pandas as pd
import numpy as np

# Load data
file_url = "https://github.com/mca91/EconometricsWithR/blob/master/data/fastfood.dta?raw=true"
card_krueger_data = pd.read_stata(file_url)

# Preview data
print(card_krueger_data.head())
##    sheet  chain  co_owned  state  ...  pfry2  pentree2  nregs2  nregs112
## 0     46      1         0      0  ...    NaN      0.94     4.0       4.0
## 1     49      2         0      0  ...   0.89      2.35     4.0       4.0
## 2    506      2         1      0  ...   0.74      2.33     4.0       3.0
## 3     56      4         1      0  ...   0.79      0.87     2.0       2.0
## 4     61      4         1      0  ...   0.84      0.95     2.0       2.0
## 
## [5 rows x 46 columns]

Compute $\tau_{DID}$

Code

# Treatment group before and after
treated = card_krueger_data[card_krueger_data['state'] == 1]
treated_before = treated[treated['time'] == 'before']['employment'].mean()
treated_after = treated[treated['time'] == 'after']['employment'].mean()

print("\nTreated group:")
## 
## Treated group:
print(f"Before: {treated_before}")
## Before: 20.439407348632812
print(f"After: {treated_after}")
## After: 21.027429580688477

# Control group before and after
control = card_krueger_data[card_krueger_data['state'] == 0]
control_before = control[control['time'] == 'before']['employment'].mean()
control_after = control[control['time'] == 'after']['employment'].mean()

print("\nControl group:")
## 
## Control group:
print(f"Before: {control_before}")
## Before: 23.33116912841797
print(f"After: {control_after}")
## After: 21.165584564208984

# Calculate difference-in-differences
tau_did = (treated_after - treated_before) - (control_after - control_before)

print("\nDifference-in-differences estimate:")
## 
## Difference-in-differences estimate:
print(tau_did)
## 2.7536068

print("\nRelative to pre-treatment mean:")
## 
## Relative to pre-treatment mean:
tau_did / treated_before
## np.float32(0.13472047)

Download the Card & Krueger (1994) data

Code

* Load data from URL
copy "https://github.com/mca91/EconometricsWithR/blob/master/data/fastfood.dta?raw=true" fastfood.dta, replace
use fastfood.dta, clear

* Preview data
list in 1/5

Compute $\tau_{DID}$

Code

* Calculate difference-in-differences
sum fte if state == 1 & time == 1  // treated before
local treated_before = r(mean)
sum fte if state == 1 & time == 2  // treated after
local treated_after = r(mean)
sum fte if state == 0 & time == 1  // control before
local control_before = r(mean)
sum fte if state == 0 & time == 2  // control after
local control_after = r(mean)

local did = (`treated_after' - `treated_before') - (`control_after' - `control_before')
display "Difference-in-differences estimate: " `did'

display "Relative to pre-treatment mean:" `did / treated_before'

Visualization

Deconstructing the DiD Graph

The solid blue line is the observed trend for the control group.
The solid green line is the observed outcome for the treatment group.
The dotted orange line is the counterfactual for the treatment group, constructed by assuming its trend would have been parallel to the control group’s trend.
The DiD effect is the vertical distance between the actual outcome for the treatment group and its counterfactual outcome in the post-period.

DiD Under Potential Outcomes

The observed outcome, $Y_{it}$, is determined by the unit’s group status and the time period.
For the treated group ($D_i=1$): They are untreated at $t=0$ and treated at $t=1$.
- $Y_{i0} = Y_{i0}(0)$ for $t=0$ (pre-treatment)
- $Y_{i1} = Y_{i1}(1)$ for $t=1$ (post-treatment)
For the control group ($D_i=0$): They are never treated.
- $Y_{i0} = Y_{i0}(0)$ for $t=0$
- $Y_{i1} = Y_{i1}(0)$ for $t=1$

Example Interpretation

Example: DiD Under Potential Outcomes

Let the treatment $D_i$ be a job training program. Let the outcome $Y_i$ be monthly income. Let the treated group $D_i=1$ consist of Philippe, and the control group $D_i=0$ consist of Peter.

For the treated group:
- Pre-treatment ($t=0$): $Y_{i0} = Y_{i0} (0)$
- Philippe’s observed income before training is his potential income without training.
- Post-treatment ($t=1$): $Y_{i1} = Y_{i1} (1)$
- Philippe’s observed income after training is his potential income with training.
For the control group:
- Pre-treatment ($t=0$): $Y_{i0} = Y_{i0} (0)$
- Peter’s observed income is his potential income without training.
- Post-treatment ($t=1$): $Y_{i1} = Y_{i1} (0)$
- Peter’s observed income is his potential income without training, as he remains untreated.

Estimand: ATT

Formally, the ATT is the difference between the treated group’s outcome at $t=1$ and what their outcome would have been at $t=1$ if they had not been treated.

\[ \text{ATT} = E[Y_{i1}(1) | D_i=1] - E[Y_{i1}(0) | D_i=1] \]
The first term, $E[Y_{i1}(1) | D_i=1]$, is observed as the average outcome for the treated group in the post-period, $E[Y_{i1} | D_i=1]$.
The second term, $E[Y_{i1}(0) | D_i=1]$, is the counterfactual.
- It is the unobservable average outcome for the treated group had they not received the treatment.
- The entire goal of the DiD strategy is to find a way to estimate this term.

Parallel Trends

The DiD estimator is valid under the parallel trends assumption.
- This assumption states that, in the absence of treatment, the average outcome for the treated group would have changed over time by the same amount as the average outcome for the control group.
- Mathematically, this is expressed in terms of the potential outcome under no-treatment, $Y_{it}(0)$:

Definition: Parallel Trends

\[ E[Y_{i1}(0) - Y_{i0}(0) | D_i=1] = E[Y_{i1}(0) - Y_{i0}(0) | D_i=0] \qquad(1)\]

The left side is the counterfactual change for the treated group.
The right side is the observed change for the control group, since for them $Y_{it} = Y_{it}(0)$.
- This assumption allows us to use the control group to identify the counterfactual trend for the treated group.

Derivation of the DiD Estimator

We rearrange the parallel trends assumption from Equation 1 to solve for our unobserved counterfactual:

\[ \begin{align} \label{eq:cf} E[Y_{i1}(0) | D_i=1] = &\underbrace{E[Y_{i0}(0) | D_i=1]}_{\text{Treated pre-treatment}} + \\ &\underbrace{\left( E[Y_{i1}(0) | D_i=0] - E[Y_{i0}(0) | D_i=0] \right)}_{\text{Change in control group}} \end{align} \]

This equation shows how we construct the counterfactual: we take the treated group’s initial level and add the change experienced by the control group.

Derivation of the DiD Estimator

Substitute this expression for the counterfactual back into the definition of ATT:

\[ \begin{align} \text{ATT} &= E[Y_{i1}(1) | D_i=1] - E[Y_{i1}(0) | D_i=1] \\ &= E[Y_{i1}(1) | D_i = 1] - \\ &\quad (\underbrace{E[Y_{i0}(0) | D_i=1]}_{\text{Treated pre-treatment}} + \underbrace{\left( E[Y_{i1}(0) | D_i=0] - E[Y_{i0}(0) | D_i=0] \right)}_{\text{Change in control group}}) \\ &= E[Y_{i1}(1) | D_i=1] - \\ &\quad \underbrace{\left( E[Y_{i0}(0) | D_i=1] + E[Y_{i1}(0) | D_i=0] - E[Y_{i0}(0) | D_i=0] \right)}_{\text{Expression for counterfactual}} \\ \end{align} \]

Replacing Potential Outcomes with Observables

Finally, we replace the potential outcomes with their observable counterparts to obtain $\widehat{\tau_{\text{DID}}}$:

\[ \begin{align} \text{ATT} &= \color{red}{E[Y_{i1}(1) | D_i=1]} - \\ &\quad \biggl( \color{blue}{E[Y_{i0}(0) | D_i=1]} + \color{orange}{E[Y_{i1}(0) | D_i=0]} - \color{green}{E[Y_{i0}(0) | D_i=0]} \biggr) \\ &= \color{red}{E[Y_{i1} | D_i = 1 ]} - \\ &\quad \biggl( \color{blue}{E[Y_{i0} | D_i = 1]} + \color{orange}{E[Y_{i1} | D_i = 0]} - \color{green}{E[Y_{i0} | D_i = 0]}\biggr) \\ &= \widehat{\tau_{DID}} \end{align} \]
This is the famous “difference-in-differences” formula. It identifies the ATT under the crucial assumption of parallel trends.

DiD in Regression

DiD using a Regression Framework

We can estimate the exact same 2x2 DiD using a simple OLS regression. This is more powerful and flexible.

Definition: DiD in a Regression Framework

$Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 Post_t + \beta_3(D_i × Post_t) + u_{it}$

$Y_{it}$: Outcome for unit i at time t.
$D_i$: A dummy variable = 1 if unit i is in the treatment group, 0 otherwise.
$Post_t$: A dummy variable = 1 if the period is “Post”, 0 otherwise.
$D_i \times Post_t$: An interaction term.

Interpretation of Coefficients (1/2)

Let’s break down what each $\beta$ represents:

$Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 Post_t + \beta_3(D_i \times Post_t) + u_{it}$

$\beta_0$: The average outcome for the Control Group in the Pre-Period ($\text{Post}_i=0, D_i=0$).
- $E[Y_i | D_i=0, Post_t=0] = \beta_0$
$\beta_1$: The average pre-existing difference between the treatment and control groups in the pre-period. This is the selection bias.
- $E[Y_i | D_i=1, Post_t=0] = \beta_0 + \beta_1$
- So, $\beta_1 = E[Y | D_i=1, Post_t=0] - E[Y_i | D_i=0, Post_t=0]$

Interpretation of Coefficients (2/2)

Let’s break down what each $\beta$ represents:

$Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 Post_t + \beta_3(D_i \times Post_t) + u_{it}$

$\beta_2$: The average change in the outcome for the Control Group from the pre- to the post-period. This is the secular trend.
- $E[Y_i | D_i=0, Post_t=1] = \beta_0 + \beta_2$
- So, $\beta_2 = E[Y_i | D_i=0, Post_t=1] - E[Y_i | D_i=0, Post_t=0]$
$\beta_3$: This is the DiD estimator. It’s the additional change in the outcome for the treatment group, above and beyond the secular trend.
- It is the causal effect of interest.
- $\beta_3 = (E[Y_i|D_i=1,Post_t=1] - E[Y_i|D_i=1,Post_t=0]) - (E[Y_i|D_i=0,Post_t=1] - E[Y_i|D_i=0,Post_t=0])$
- Hence $\beta_3$ is $\tau_{DID}$

Example: DiD in a Regression Framework

Example: Card and Krueger (1994)

The Card and Krueger (1994) estimate can also be recovered using the specification we have seen above. Recall that $\tau_{DID}=2.75$.

R
Python
Stata

Code

model <- lm(employment ~ state*time, data = card_krueger_data)
summary(model)
## 
## Call:
## lm(formula = employment ~ state * time, data = card_krueger_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.166  -6.439  -1.027   4.473  64.561 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       23.331      1.072  21.767   <2e-16 ***
## state             -2.892      1.194  -2.423   0.0156 *  
## timeafter         -2.166      1.516  -1.429   0.1535    
## state:timeafter    2.754      1.688   1.631   0.1033    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.406 on 790 degrees of freedom
##   (26 observations deleted due to missingness)
## Multiple R-squared:  0.007401,   Adjusted R-squared:  0.003632 
## F-statistic: 1.964 on 3 and 790 DF,  p-value: 0.118

Code

import pyfixest as pf
model = pf.feols("employment ~ time + state + time*state", data=r.card_krueger_data)
model.summary()
## ###
## 
## Estimation:  OLS
## Dep. var.: employment, Fixed effects: 0
## Inference:  iid
## Observations:  794
## 
## | Coefficient         |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
## |:--------------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
## | Intercept           |     23.331 |        1.072 |    21.767 |      0.000 | 21.227 |  25.435 |
## | time[T.after]       |     -2.166 |        1.516 |    -1.429 |      0.154 | -5.141 |   0.810 |
## | state               |     -2.892 |        1.194 |    -2.423 |      0.016 | -5.235 |  -0.549 |
## | time[T.after]:state |      2.754 |        1.688 |     1.631 |      0.103 | -0.561 |   6.068 |
## ---
## RMSE: 9.382 R2: 0.007

Advantages of the Regression Framework

Why bother with regression instead of just calculating the four means?

Standard Errors: Regression automatically provides standard errors, t-statistics, and p-values for your DiD estimate ($\beta_3$), allowing for statistical inference.
Adding Covariates: It is easy to add control variables to the model to increase precision and make the parallel trends assumption more plausible.
Flexibility: The framework is easily extended to more complex scenarios (more groups, more time periods, etc.).

Adding Covariates to the DiD Model

We can add a vector of control variables, $X$, to the regression.

\[ Y_{it} = \beta_0 + \beta_1Treat_i + \beta_2Post_t + \beta_3(Treat_i \times Post_t) + \gamma'X_{it} + u_{it} \]
The purpose is to control for observable characteristics that might differ between the groups and affect trends in the outcome.
This helps strengthen the parallel trends assumption. It becomes “parallel trends conditional on X”.
Example: When studying a state-level policy, you might control for state GDP, population size, etc.

Testing the Parallel Trends Assumption

We can’t prove the assumption, but we can build evidence for it. This requires data from multiple pre-treatment periods.
Method 1: Visual Inspection (Most Common)
- Plot the average outcomes for the treatment and control groups over time.
- Visually check if their trends were parallel in the periods before the treatment was introduced.

Statistical Tests for Parallel Trends

If you have multiple pre-treatment periods, you can run a “placebo” test.
- The idea is to run a DiD analysis using only pre-treatment data. For instance, define a fake “treatment” in period $t-2$ and use $t-3$ as the “pre” period.
In a regression: interact the treatment group dummy with time-period dummies for each pre-treatment period:

DiD Placebo Test

\[ \begin{align} Y_{it} = &\beta_0 + \beta_1 \text{Treat}_i + \beta_2 \text{Post}_t + \dots + \\ &\delta_{-2} (\text{Treat}_i \times \text{PrePeriod}_{-2}) + \delta_{-1} (\text{Treat}_i \times \text{PrePeriod}_{-1}) + \\ &\beta_3(\text{Treat}_i \times \text{Post}_t) + u_{it} \end{align} \]

Result: The coefficients $\delta_{-2}$ and $\delta_{-1}$ should be small and statistically insignificant (not different from zero). This shows there was no pre-existing differential trend.

Event Studies

What is an Event Study?

A more general case of the difference-in-difference model is an event study
To measure the economic impact of a specific, identifiable event on an outcome of interest.

Example: Event Studies

The “Event” could be:
- A company-specific event: Merger announcement, CEO change, earnings surprise.
- A policy change: A new tax law, a change in minimum wage.
- A natural event: A major hurricane, a pandemic.
The “Outcome” is often:
- A firm’s stock price (most common in finance).
- A firm’s accounting performance (sales, profits).
- A macroeconomic variable (unemployment, inflation).

The central challenge is to determine what the outcome would have been if the event had not occurred.

Panel Event Study Model With Controls

The canonical event study model uses firm and time fixed effects and replaces the single interaction term with a series of dummies for time relative to the event.

Definition: Panel Event Study Model (with Control Subjects)

\[ y_{it} = \alpha_i + \lambda_t + \sum_{k=-K}^{L} \delta_k D_{it}^k + u_{it} \]

$\alpha_i$: Firm Fixed Effects. These absorb all time-invariant differences between firms.
$\lambda_t$: Time Fixed Effects. These absorb all shocks or trends common to all firms in a given year t.
$D_{it}^k$: A dummy variable equal to 1 if firm i in year t is k periods away from its event date. k is the relative time or event time.
$\delta_k$ are the key coefficients. They measure the average change in the outcome for treated firms k periods away from the event, relative to the control group.

Example Dataset

A dataset for an event study would look as follows:
- Note: For simplicity, only dummy columns for $k = -2, -1, 0, 1$ are included - a full model would include dummies for all relevant pre/post periods (e.g., $k \leq 3$ and $k \geq 2$).

Firm ID (`i`)	Year (`t`)	Event Year (`E_i`)	Relative Time (`k = t - E_i`)	Outcome (`y_it`)	Dummy `k=-2`	Dummy `k=-1`	Dummy `k=0`	Dummy `k=1`
1 (Control)	2021	NA	NA	2.0	0	0	0	0
1 (Control)	2022	NA	NA	2.1	0	0	0	0
1 (Control)	2023	NA	NA	2.2	0	0	0	0
1 (Control)	2024	NA	NA	2.3	0	0	0	0
1 (Control)	2025	NA	NA	2.4	0	0	0	0
2 (Treated)	2021	2023	-2	2.5	1	0	0	0
2 (Treated)	2022	2023	-1	2.6	0	1	0	0
2 (Treated)	2023	2023	0	5.5	0	0	1	0
2 (Treated)	2024	2023	1	5.8	0	0	0	1
2 (Treated)	2025	2023	2	6.0	0	0	0	0
3 (Treated)	2021	2024	-3	3.1	0	0	0	0
3 (Treated)	2022	2024	-2	3.3	1	0	0	0
3 (Treated)	2023	2024	-1	3.4	0	1	0	0
3 (Treated)	2024	2024	0	7.1	0	0	1	0
3 (Treated)	2025	2024	1	7.5	0	0	0	1
4 (Treated)	2021	2024	-3	2.9	0	0	0	0
4 (Treated)	2022	2024	-2	3.0	1	0	0	0
4 (Treated)	2023	2024	-1	3.2	0	1	0	0
4 (Treated)	2024	2024	0	6.8	0	0	1	0
4 (Treated)	2025	2024	1	7.2	0	0	0	1

Interpretation of the Results

After running the regression, you will get a set of coefficients $\delta_k$ which are typically plotted on a graph:
Testing Pre-Trends ($k < 0$): The coefficients $\delta_{-K}, \dots, \delta_{-2}$ test the crucial parallel trends assumption. If the model is valid, these coefficients should be close to zero and not statistically significant.¹
The Effect at Impact ($k = 0$): $\delta_0$ shows the immediate effect of the treatment in the event period itself.
Dynamic Post-Treatment Effects ($k > 0$): $\delta_1, \delta_2, \dots, \delta_L$ show how the treatment effect evolves over time after the event. It might grow, shrink, or stay constant.

Normalizing Coefficients

The standard two-way fixed effects (TWFE) regression model for an event study, where the period immediately preceding the event ($t=-1$) is the normalized baseline, is given by:

\[ Y_{it} = \alpha_i + \lambda_t + \sum_{k=m, k \neq -1}^{M} \beta_k D_{it}^k + \varepsilon_{it} \]
The condition $k \neq -1$ means that we include a dummy variable for every period in the window except for the period right before the event. The model implicitly sets $\beta_{-1} = 0$.
Therefore, every other coefficient $\beta_k$ is interpreted as the effect in period $k$ relative to the effect in period $t=-1$.¹

Event Study Visualization

What Happens Without a Control Group?

Imagine you removed the control firm (Firm 1) from the dataset.
- Now, in the year 2023, the only firms you have are Firm 2 (which just got treated, k=0) and Firms 3 & 4 (which are pre-treatment, k=-1).
The regression would see that Firm 2’s outcome went up in 2023. But it has no way to know:
- Was it because of the treatment ($\delta_0$)?
- Or was 2023 just a great year for everyone ($\lambda_{2023}$)?
Without the control group, the effect of being in the year 2023 ($\lambda_{2023}$) is perfectly collinear with the effect of being treated in 2023 ($\delta_0$).
The model cannot distinguish between them, and the regression will fail or produce meaningless results.¹

Event Studies Without Control Subjects

Event Study Without Control Subjects

There are, however, also event studies that do not need control firms.
These tend to make use of an estimation window: a “clean” period before the event.
This is used to establish a baseline or “normal” behavior for the outcome variable. Typically, ~120 to ~250 days before the event window.
The period immediately surrounding (or after) the event date where we expect to see an impact. For example, from 5 days before to 5 days after the announcement ([-5, +5]).

Specification

Definition: Event Study (Without Control Group)

The model you can estimate with only treated units is:

\[ y_{it} = \alpha_i + \sum_{k=-K}^{L} \delta_k D_{it}^k + u_{it} \]
$y_{it}$: The outcome for firm i at time t.
$\alpha_i$: Firm Fixed Effects. These absorb all stable, time-invariant differences between the treated firms.
$D_{it}^k$: A dummy variable = 1 if firm i in year t is k periods away from its event date.
$\delta_k$ are the key coefficients. They measure the average outcome for a firm k periods from its event, relative to the outcome in the omitted baseline period (usually k=-1).
There are no $\lambda_t$ (time fixed effects). You cannot include them.
- If you did, the model could not be estimated because your event-time dummies ($D^k$) would be perfectly predicted by the combination of firm and time fixed effects.

Example Required Dataset (No Control Group)

Let’s use the same staggered adoption scenario as before, but we’ll remove the control firm (Firm 1).

Firm ID (`i`)	Year (`t`)	Event Year (`E_i`)	Relative Time (`k = t - E_i`)	Outcome (`y_it`)	Dummy `k=-2`	Dummy `k=-1` (Baseline)	Dummy `k=0`	Dummy `k=1`
2 (Treated)	2021	2023	-2	2.5	1	0	0	0
2 (Treated)	2022	2023	-1	2.6	0	1	0	0
2 (Treated)	2023	2023	0	5.5	0	0	1	0
2 (Treated)	2024	2023	1	5.8	0	0	0	1
2 (Treated)	2025	2023	2	6.0	0	0	0	0
3 (Treated)	2021	2024	-3	3.1	0	0	0	0
3 (Treated)	2022	2024	-2	3.3	1	0	0	0
3 (Treated)	2023	2024	-1	3.4	0	1	0	0
3 (Treated)	2024	2024	0	7.1	0	0	1	0
3 (Treated)	2025	2024	1	7.5	0	0	0	1
4 (Treated)	2021	2024	-3	2.9	0	0	0	0
4 (Treated)	2022	2024	-2	3.0	1	0	0	0
4 (Treated)	2023	2024	-1	3.2	0	1	0	0
4 (Treated)	2024	2024	0	6.8	0	0	1	0
4 (Treated)	2025	2024	1	7.2	0	0	0	1

The Problem with This Approach

Estimating this model means your results for $\delta_k$ are highly susceptible to bias from confounding factors.
The coefficient $\delta_k$ now measures the sum of two things:
- The true causal effect of the treatment at relative time k.
- Any and all other unobserved shocks or trends that happened to occur at relative time k.

Example: Confounding Factor

Let’s say a major economic boom started in 2023.

For Firm 2, its outcome y jumps in 2023. The model will attribute this entire jump to the treatment ($\delta_0$) because it has no control group to learn that all firms (even untreated ones) would have seen a jump in 2023.
Your estimate of the treatment effect will be severely biased upwards.

The Core Idea: Abnormal Returns

An event study without control group works by isolating the “abnormal” part of an outcome’s movement over time.
In Finance, this is usually done in the context of firms and stock returns.

Definition: Abnormal Return

\[ \text{Abnormal Return}_{it} = \text{Actual Return}_{it} - \text{Normal Return}_{it} \]

Actual Return ($R_{it}$): The observed outcome for firm $i$ on day $t$. This is the raw data.
Normal Return ($E[R_{it}]$): The expected return for firm $i$ on day $t$, conditional on the event not happening. This is our counterfactual.

We need a model to estimate the Normal Return.

Estimating Normal Returns

The Normal Return is estimated using data from the estimation window.

Examples of Normal Return Models

Constant Mean Return Model: Assumes the normal return is just the firm’s average return during the estimation period: $E[R_{it}] = \bar{R}_i$

Market Model: Assumes the firm’s return is related to the overall market return (e.g., S&P 500).

We run an OLS regression (one for each firm) using data only from the estimation window: $R_{it} = \alpha_i + \beta_i R_{mt} + e_{it}$
Here, $R_{mt}$ is the return on the market index. We get estimates for $\hat{\alpha}_i$ and $\hat{\beta}_i$.
The Normal Return for any day $t$ in the event window is then predicted as: $\widehat{E[R_{it}]} = \hat{\alpha}_i + \hat{\beta}_i R_{mt}$.

Calculating Abnormal Returns (AR)

Once we have our estimate of the Normal Return, we can calculate the Abnormal Return (AR) for each firm $i$ on each day $t$ in the event window.

\[ AR_{it} = R_{it} - (\hat{\alpha}_i + \hat{\beta}_i R_{mt}) \]

$AR_{it} > 0$ suggests positive news or impact.
$AR_{it} < 0$ suggests negative news or impact.
$AR_{it} \approx 0$ suggests no impact.

Aggregating Abnormal Returns (CAR)

We are usually interested in the overall effect, not just one firm on one day. We are interested in two things:

Aggregation of Abnormal Returns

1. Average Abnormal Return (AAR): Average the abnormal returns across all $N$ firms for a single day $t$ in the event window. \[ AAR_t = \frac{1}{N} \sum_{i=1}^{N} AR_{it} \]

This gives us the average effect on a specific day relative to the event (e.g., the effect on day t=+1).

Aggregating Abnormal Returns (CAR) (Cont.)

Aggregation of Abnormal Returns

2. Cumulative Average Abnormal Return (CAAR or CAR): Sum the AARs over a period of time within the event window (from $t_1$ to $t_2$). \[ CAR(t_1, t_2) = \sum_{t=t_1}^{t_2} AAR_t \]

This tells us the total cumulative impact of the event over a specific window. For example, $CAR(-1, +1)$ measures the total effect from the day before to the day after the event.

Visualizing Event Study Results

The standard way to present results is a plot of the Average Abnormal Return (AAR) over the event window.

Hypothesis Testing

Standard $t$-statistics from regression estimates can be used to test hypotheses on AAR’s on particular days.
We might also be interested in hypothesis testing of CAR’s.
- Is the observed CAR just random noise, or is there a real effect?
- The null hypothesis is typically $H_0: CAR(t_1, t_2) = 0$.
- Standard errors are calculated based on the variance of the returns in the estimation window.
- If the t-statistic is large enough (and p-value is small), we conclude the event had a statistically significant impact.

Assumptions and Caveats in Event Studies

Event studies without a control group have various caveats and assumptions:
- Efficient Markets (for stock studies): The model assumes prices react quickly and rationally to new information.
- No Confounding Events: The event window must be “clean” of other major, contemporaneous events that could also affect the outcome.
- Correct Event Date: The analysis is sensitive to using the right date of the information release.
- Model Specification: The results can depend on the model chosen for normal returns (e.g., Market Model vs. Fama-French 3-Factor Model).
- Stable Estimation Window: The relationship between the firm and the market (the $\beta$) must be stable between the estimation and event windows.

Potential Pitfalls

Potential Pitfalls (1/3): Ashenfelter’s Dip

A famous issue where the parallel trends assumption is violated in a specific way.
The Ashenfelter’s dip is when there is a notable drop in the outcome for the treatment group just before treatment.

Example: Ashenfelter’s Dip

Individuals’ earnings often drop right before they enter a job training program (e.g., due to job loss).

This makes it look like the program had a huge effect, but it’s really just a recovery to a normal level.

Potential Pitfalls (2/3): Policy Anticipation & Spillovers

Anticipation Effects: If people know a policy is coming, they may change their behavior before it’s officially implemented. This contaminates the “pre” period and violates parallel trends.
- Example: A firm hires fewer people in anticipation of a minimum wage hike.
Spillover Effects: The treatment “spills over” and affects the control group.
- SUTVA (Stable Unit Treatment Value Assumption) states that the outcome of any unit depends only on its own treatment status.
- Example: A job fair in one city (treatment) draws workers from a neighboring city (control), affecting the control city’s labor market. Your control group is no longer a valid counterfactual.

Potential Pitfalls (3/3): Other Limitations

Functional Form: The basic DiD model assumes the treatment effect is a constant, additive shift.
Data Requirements: Requires panel data (tracking the same units over time) or repeated cross-sectional data.
Bad Control Group: The entire method relies on finding a credible control group that satisfies the parallel trends assumption. This is often the hardest part.

Summary

What did we do today?

Potential Outcomes:
- We have seen an introduction to potential outcomes notation. Potential outcomes define outcomes under the treatment and under the control at the individual level.
The ATE and the ATT:
- We have seen two estimands – things we would like to estimate – of interest: the ATT and the ATE. We have discovered assumptions leading to identification of one of those in a cross-sectional context
Difference-in-difference:
- We switched to a two-period setting, which offers a more credible assumption of identifying the ATT: parallel trends.
- DiD provides a powerful and intuitive way to estimate causal effects by controlling for time-invariant unobserved differences (selection bias).

What did we do? (Cont.)

DiD in Regression:
- We then saw that the DID estimator can be implemented in a regression framework. A regression framework is useful for S.E.s and inclusion of covariates.
Testing parallel trends:
- In DiD, this assumption is everything. You must convince yourself and your audience that it is plausible.
Multi-period and staggered adoption:
- We have seen extensions of the DiD framework to a multi-period design closely resembling an event study, and we have focused on getting rid of “bad comparisons” in a staggered adoption setting.

	Before Period (Pre)	After Period (Post)
Treatment Group	\(\hat{Y}_{T, Pre}\)	\(\hat{Y}_{T, Post}\)
Control Group	\(\hat{Y}_{C, Pre}\)	\(\hat{Y}_{C, Post}\)

Empirical Economics

Outline

Course Overview

What do we do today?

Potential Outcomes

Correlation vs. Causality

The Potential Outcomes Framework

The Individual Causal Effect

From Individuals to Populations

The Average Treatment Effect (ATE)

The Fundamental Problem of Causal Inference

Illustrating the Fundamental Problem

Motivation for DiD

Why Simple Comparisons Fail

Difference in Means Example

Selection Bias: The Hidden Difference

Selection Bias: Decomposition

Intro to DiD

Introduction to Differences-in-Differences (DiD)

The \(2 \times 2\) DiD Setup

Calculating the Simple DiD Estimator

Example: DiD in a \(2 \times 2\) Set-Up

Example: DiD in a \(2 \times 2\) Set-Up

Visualization

Deconstructing the DiD Graph

DiD Under Potential Outcomes

DiD Under Potential Outcomes

Example Interpretation

Estimand: ATT

Parallel Trends

Derivation of the DiD Estimator

Derivation of the DiD Estimator

Replacing Potential Outcomes with Observables

DiD in Regression

DiD using a Regression Framework

Interpretation of Coefficients (1/2)

Interpretation of Coefficients (2/2)

Example: DiD in a Regression Framework

Advantages of the Regression Framework

Adding Covariates to the DiD Model

Testing the Parallel Trends Assumption

Statistical Tests for Parallel Trends

Event Studies

What is an Event Study?

Panel Event Study Model With Controls

Example Dataset

Interpretation of the Results

Normalizing Coefficients

Event Study Visualization

What Happens Without a Control Group?

Event Studies Without Control Subjects

Event Study Without Control Subjects

Specification

Example Required Dataset (No Control Group)

The Problem with This Approach

The Core Idea: Abnormal Returns

Estimating Normal Returns

Calculating Abnormal Returns (AR)

Aggregating Abnormal Returns (CAR)

Aggregating Abnormal Returns (CAR) (Cont.)

Visualizing Event Study Results

Hypothesis Testing

Assumptions and Caveats in Event Studies

Potential Pitfalls

Potential Pitfalls (1/3): Ashenfelter’s Dip

Potential Pitfalls (2/3): Policy Anticipation & Spillovers

Potential Pitfalls (3/3): Other Limitations

Summary

What did we do today?

What did we do? (Cont.)

The End