Lecture 8: Instrumental Variables
We would really like you to take the time to fill out the course evaluation form. Your feedback is very important to us and helps us improve the course for future students.
You can fill it in anonymously on https://entry.caracal.uu.nl/44007 under “Course Evaluations” or scan the QR code:
\[ \newcommand{\Yobs}{Y^{\text{obs}}} \newcommand{\Ycf}{Y^{\text{cf}}} \newcommand{\E}{\mathbb{E}} \newcommand{\indep}{\perp \!\!\! \perp} \newcommand{\Cov}{\text{Cov}} \newcommand{\Var}{\text{Var}} \newcommand{\Prp}{\text{Pr}} \]
Our fundamental goal is often to estimate the causal effect of a variable \(X\) on an outcome \(Y\).
Consider the simple linear model:
\[ Y_i = \beta_0 + \beta_1 X_i + u_i \]
We want our estimate of \(\beta_1\), which we call \(\hat{\beta}_1\), to be an unbiased and consistent estimate of the true causal effect.
For Ordinary Least Squares (OLS) to be unbiased, we need the regressor \(X\) to be uncorrelated with the error term \(u\): \(E[X_i u_i] = 0\), or equivalently \(Cov(X_i, u_i) = 0\).
Endogeneity occurs when the exogeneity assumption is violated:
\[ Cov(X_i, u_i) \neq 0 \]
When \(X\) is endogenous, OLS is biased and inconsistent. Our \(\hat{\beta}_1\) does not converge to the true \(\beta_1\), even with infinite data.
Why does this happen?
Omitted Variable Bias (OVB)
Simultaneity / Reverse Causality
Measurement Error
Let the true model be:
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i \]
The omitted variable \(W_i\) is an unobserved variable (e.g., ability).
Suppose we estimate the simple model:
\[ Y_i = \alpha_0 + \alpha_1 X_i + \epsilon_i \quad (\text{where } \epsilon_i = \beta_2 W_i + u_i) \]
Then, the OLS estimate for \(\alpha_1\) will be:
\[ \hat{\alpha}_1 = \beta_1 + \beta_2 \cdot \frac{Cov(X_i, W_i)}{Var(X_i)} \]
The bias is \(\beta_2 \cdot \frac{Cov(X_i, W_i)}{Var(X_i)}\).
It is non-zero if \(W\) affects \(Y\) (\(\beta_2 \neq 0\)) and is correlated with \(X\) (\(Cov(X,W) \neq 0\)).
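To see the formula in action, here is a small simulation sketch (all parameter values are illustrative, not from the lecture): the "short" regression of \(Y\) on \(X\) alone recovers \(\beta_1\) plus exactly the bias term \(\beta_2 \cdot Cov(X, W)/Var(X)\).

```python
import numpy as np

# Illustrative simulation of omitted variable bias (numbers are made up).
rng = np.random.default_rng(0)
n = 200_000
beta1, beta2 = 1.0, 2.0           # true effects of X and of the omitted W
W = rng.normal(size=n)
X = 0.5 * W + rng.normal(size=n)  # X is correlated with the omitted W
Y = beta1 * X + beta2 * W + rng.normal(size=n)

# "Short" OLS of Y on X alone: slope = Cov(X, Y) / Var(X)
alpha1_hat = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# OVB formula: alpha1 = beta1 + beta2 * Cov(X, W) / Var(X)
predicted = beta1 + beta2 * np.cov(X, W)[0, 1] / np.var(X, ddof=1)
print(alpha1_hat, predicted)  # both near 1 + 2 * 0.5 / 1.25 = 1.8
```

With these parameters the bias is \(2 \cdot 0.5 / 1.25 = 0.8\), so the short regression reports a slope near 1.8 instead of the true 1.0.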
Example: Wages, Education and Ability
Let:
\(Y_i\): the wage of individual \(i\)
\(X_i\): the years of education
\(W_i\): the innate ability (e.g., intelligence, work ethic), which is impossible for the researcher to observe
\(u_i\): the random unobserved error term, representing other factors affecting wages (e.g., luck, measurement error)
The true model for an individual’s wage is:
\[ \text{Wage}_i = \beta_0 + \beta_1 \text{Education}_i + \beta_2 \text{Ability}_i + u_i \]
Here, \(\beta_1\) is the true causal effect of an additional year of education on wages, holding ability constant. However, since we cannot observe ability (\(W_i\)), we are forced to estimate a simpler model:
\[ \text{Wage}_i = \alpha_0 + \alpha_1 \text{Education}_i + \epsilon_i \]
In this estimated model, the new error term \(\epsilon_i\) now contains both the original error and the effect of the omitted ability variable (\(\epsilon_i = \beta_2 \text{Ability}_i + u_i\)). The OLS estimate for the effect of education, \(\hat{\alpha}_1\), will be biased as shown by the formula:
\[ \hat{\alpha}_1 = \beta_1 + \beta_2 \cdot \frac{Cov(\text{Education}_i, \text{Ability}_i)}{Var(\text{Education}_i)} \]
The bias is the second term, \(\beta_2 \cdot \frac{Cov(\text{Education}_i, \text{Ability}_i)}{Var(\text{Education}_i)}\). Here it is non-zero, and in fact positive: ability raises wages (\(\beta_2 > 0\)), and higher-ability individuals tend to acquire more education (\(Cov(\text{Education}_i, \text{Ability}_i) > 0\)).
Because both \(\beta_2\) and the covariance term are positive, our OLS estimate \(\hat{\alpha}_1\) will overestimate the true return to education (\(\beta_1\)). The model mistakenly attributes the wage premium from higher ability to the additional years of education.
If \(X\) is endogenous, we can’t use its correlation with \(Y\) to identify \(\beta_1\).
The idea in IV is to find a third variable \(Z\), the instrument, that can isolate the “good” part of the variation in \(X\).
An instrument is a variable that is correlated with the endogenous regressor \(X\), but is not correlated with the error term \(\epsilon\).
For a variable \(Z\) to be a valid instrument for \(X\) in the model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), it must satisfy two conditions:
Relevance: \(Cov(Z_i, X_i) \neq 0\). The instrument is correlated with the endogenous regressor.
Exogeneity (exclusion): \(Cov(Z_i, \epsilon_i) = 0\). The instrument affects \(Y\) only through \(X\).
Let’s consider a binary treatment \(X_i \in \{0, 1\}\).
The individual causal effect is \(\tau_i = Y_i(1) - Y_i(0)\).
For each individual \(i\), we can only observe one of the two potential outcomes:
\[ Y_i^{\text{obs}} = X_i Y_i(1) + (1-X_i) Y_i(0) \]
We never observe \(Y_i(1)\) and \(Y_i(0)\) for the same person.
Now let’s introduce a binary instrument \(Z_i \in \{0, 1\}\).
We need to define potential outcomes for the treatment itself, based on the instrument: \(X_i(1)\) is the treatment status individual \(i\) would have if \(Z_i = 1\), and \(X_i(0)\) is the treatment status if \(Z_i = 0\). The observed treatment is \(X_i = Z_i X_i(1) + (1 - Z_i) X_i(0)\).
And we also have potential outcomes for \(Y\): \(Y_i(x)\), which depend on the treatment, which in turn depends on the instrument.
| Type | If \(Z_i=0\), \(X_i(0)\) is | If \(Z_i=1\), \(X_i(1)\) is | Description |
|---|---|---|---|
| Complier | 0 | 1 | (Does what they are encouraged to do) |
| Always-Taker | 1 | 1 | (Takes treatment regardless of encouragement) |
| Never-Taker | 0 | 0 | (Never takes treatment regardless) |
| Defier | 1 | 0 | (Does the opposite of encouragement) |
Assumption: Treatment Assignment in IV
Independence: the instrument is as good as randomly assigned, \(Z_i \indep (Y_i(1), Y_i(0), X_i(1), X_i(0))\).
Exclusion: \(Z_i\) affects \(Y_i\) only through its effect on \(X_i\).
Relevance (first stage): \(Z_i\) changes treatment status for at least some individuals.
Monotonicity: \(X_i(1) \geq X_i(0)\) for everyone, so there are no Defiers.
With these four assumptions, IV does not estimate the Average Treatment Effect (\(ATE = E[Y_i(1) - Y_i(0)]\)) for the whole population.
Instead, IV estimates the Local Average Treatment Effect (LATE).
Theorem: LATE (Imbens & Angrist, 1994)
\[ \beta_{IV} = E[Y_i(1) - Y_i(0) \mid X_i(1) > X_i(0)] \]
This is the average treatment effect only for the subpopulation of compliers.
This is a crucial insight: The causal effect we get from IV is specific to the group of people who are induced into treatment by the instrument. It may not generalize to Always-Takers or Never-Takers.
The simplest IV estimator is the Wald estimator, used when both the instrument \(Z\) and the treatment \(X\) are binary.
\[ \hat{\beta}_{\text{Wald}} = \frac{E[Y | Z=1] - E[Y | Z=0]}{E[X | Z=1] - E[X | Z=0]} \]
Interpretation: the numerator is the reduced form (the effect of \(Z\) on \(Y\)), and the denominator is the first stage (the effect of \(Z\) on \(X\)). The Wald estimator is simply the sample analogue of this ratio.
Why does the Wald formula give us the LATE?
The numerator is \(E[Y | Z=1] - E[Y | Z=0]\)
This difference in outcomes is only driven by the Compliers switching from \(X=0\) to \(X=1\).
For everyone else (Always/Never-Takers), their treatment status doesn’t change. So the numerator can be rewritten as:
\[ \text{Numerator} = \Pr(\text{Complier}) \cdot \E[Y(1) - Y(0) \mid \text{Complier}]=\Pr(\text{Complier})\cdot LATE. \]
The denominator is \(\E[X | Z=1] - \E[X | Z=0] = \Pr(\text{Complier})\)
Together:
\[ \hat{\beta}_{\text{Wald}} = \frac{\Pr(\text{Complier}) \cdot \text{LATE}}{\Pr(\text{Complier})} = \text{LATE} \]
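The argument above can be checked in a small simulation sketch (the population shares and effect sizes below are illustrative): we build a population of Compliers, Always-Takers and Never-Takers, give Compliers a different treatment effect than everyone else, and verify that the Wald ratio recovers the Complier effect, not the population ATE.

```python
import numpy as np

# Illustrative Wald/LATE simulation: 50% Compliers, 25% Always-Takers,
# 25% Never-Takers (no Defiers, i.e., Monotonicity holds).
rng = np.random.default_rng(1)
n = 200_000
Z = rng.integers(0, 2, size=n)          # binary instrument
u = rng.uniform(size=n)
complier = u < 0.5
always = (u >= 0.5) & (u < 0.75)
# Treatment X(z) by type: Compliers follow Z, Always-Takers always take, rest never
X = np.where(always, 1, np.where(complier, Z, 0))
# Treatment effect: 2.0 for Compliers, 0.5 for everyone else (ATE = 1.25)
tau = np.where(complier, 2.0, 0.5)
Y = tau * X + rng.normal(size=n)

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (X[Z == 1].mean() - X[Z == 0].mean())
print(wald)  # close to 2.0, the Complier effect, not the ATE of 1.25
```

The denominator of the Wald ratio also estimates the Complier share (0.5 here), exactly as in the derivation above.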
Let’s prove that the Wald estimator identifies the LATE under the four key IV assumptions (Independence, Exclusion, Relevance, Monotonicity).
\[ \text{Wald} = \frac{\E[Y | Z=1] - \E[Y | Z=0]}{\E[X | Z=1] - \E[X | Z=0]} \]
We will show that \(\E[X | Z=1] - \E[X | Z=0] = \Prp(\text{Complier})\).
\[ \begin{align*} \E[X | Z=z] &= \E[X_i(z) | Z=z] && \text{(Def. of observed X)} \\ &= \E[X_i(z)] && \text{(by Independence assumption)} \end{align*} \]
So the denominator is \(\E[X_i(1)] - \E[X_i(0)]\). Let’s expand this using the Law of Total Expectation over the population types (C=Complier, A=Always-Taker, N=Never-Taker). We assume Monotonicity, so there are no Defiers.
\[ \begin{align*} \E[X_i(1)] &= \E[X_i(1)|C]\Prp(C) + \E[X_i(1)|A]\Prp(A) + \E[X_i(1)|N]\Prp(N) \\ &= (1)\Prp(C) + (1)\Prp(A) + (0)\Prp(N) = \Prp(C) + \Prp(A) \\ \E[X_i(0)] &= \E[X_i(0)|C]\Prp(C) + \E[X_i(0)|A]\Prp(A) + \E[X_i(0)|N]\Prp(N) \\ &= (0)\Prp(C) + (1)\Prp(A) + (0)\Prp(N) = \Prp(A) \end{align*} \]
Therefore, the denominator is:
\[ (\Prp(C) + \Prp(A)) - \Prp(A) = \Prp(C) \]
The numerator is the intent-to-treat (ITT) effect of the instrument on the outcome, and it is driven entirely by the Compliers.
The numerator is \(\E[Y | Z=1] - \E[Y | Z=0]\). Again, we use Independence to state that:
\[ \E[Y | Z = z] = \E[Y_i(X_i(z)) | Z = z] = \E[Y_i(X_i(z))] \]
Dissecting these two terms gives:
\[ \begin{align*} \E[Y_i(X_i(1))] &= \E[Y_i(X_i(1))|C]\Prp(C) + \E[Y_i(X_i(1))|A]\Prp(A) + \E[Y_i(X_i(1))|N]\Prp(N) \\ &= \E[Y_i(1)|C]\Prp(C) + \E[Y_i(1)|A]\Prp(A) + \E[Y_i(0)|N]\Prp(N) \\ \E[Y_i(X_i(0))] &= \E[Y_i(X_i(0))|C]\Prp(C) + \E[Y_i(X_i(0))|A]\Prp(A) + \E[Y_i(X_i(0))|N]\Prp(N) \\ &= \E[Y_i(0)|C]\Prp(C) + \E[Y_i(1)|A]\Prp(A) + \E[Y_i(0)|N]\Prp(N) \end{align*} \]
Now, we take the difference.
\[ \begin{align*} \text{Numerator} &= \left( \E[Y_i(1)|C]\Prp(C) + \dots \right) - \left( \E[Y_i(0)|C]\Prp(C) + \dots \right) \\ &= \E[Y_i(1)|C]\Prp(C) - \E[Y_i(0)|C]\Prp(C) \\ &= \Prp(C) \cdot \left( \E[Y_i(1)|C] - \E[Y_i(0)|C] \right) \\ &= \Prp(\text{Complier}) \cdot \E[Y_i(1) - Y_i(0) | \text{Complier}] \\ &= \Prp(\text{Complier}) \cdot \text{LATE} \end{align*} \]
The Exclusion Restriction is implicitly used because \(Y\) only depends on \(X\), not \(Z\).
Therefore, we have shown:
\[ \begin{align*} \textbf{Numerator} &= \Prp(\text{Complier}) \times \text{LATE} \\ \textbf{Denominator} &= \Prp(\text{Complier}) \end{align*} \]
And the Wald estimator is:
\[ \hat{\beta}_{\text{Wald}} = \frac{\E[Y | Z=1] - \E[Y | Z=0]}{\E[X | Z=1] - \E[X | Z=0]} = \frac{\Prp(\text{Complier}) \times \text{LATE}}{\Prp(\text{Complier})} = \text{LATE} \]
This shows that the simple ratio of the reduced form effect to the first-stage effect precisely isolates the average causal effect for the specific subpopulation whose treatment status is actually manipulated by the instrument.
Example: AJR (2001)
\(Y\): Economic growth (Log GDP per capita)
\(X\): Institutional quality (average protection against expropriation risk)
\(Z\): Log of Settler Mortality Rates
Endogeneity: A simple OLS regression of Log GDP per capita on a measure of Institutional Quality is likely biased.
Their theory: Settler Mortality \(\rightarrow\) European Settlements \(\rightarrow\) Early Institutions \(\rightarrow\) Current Institutions \(\rightarrow\) Current Economic Performance
Persistence of Institutions: these initial institutions, whether inclusive or extractive, have persisted to the present day.
|  | OLS | IV |
|---|---|---|
| Institutional Quality | 0.522*** | 0.944*** |
|  | (0.061) | (0.157) |
| Constant | 4.660*** | 1.910* |
|  | (0.409) | (1.027) |
| Num. Obs. | 64 | 64 |
| R2 Adj. | 0.533 | 0.469 |
| * p < 0.1, ** p < 0.05, *** p < 0.01 | | |
Let’s return to our linear model \(Y_i = \beta_0 + \beta_1 X_i + u_i\) where \(\Cov(X_i, u_i) \neq 0\). We have an instrument \(Z\) such that \(\Cov(Z_i, X_i) \neq 0\) and \(\Cov(Z_i, u_i) = 0\).
The IV estimator for \(\beta_1\) is given by:
\[ \hat{\beta}_1^{\text{IV}} = \frac{\Cov(Z, Y)}{\Cov(Z, X)} \]
Notice the similarity to the Wald estimator. The numerator is the covariance of the instrument and outcome (reduced form), and the denominator is the covariance of the instrument and the endogenous variable (first stage).
This formula is very instructive. Let’s compare it to the OLS estimator in a simple regression.
OLS Estimator: \(\hat{\beta}_1^{\text{OLS}} = \frac{\Cov(X, Y)}{\Cov(X, X)} = \frac{\Cov(X, Y)}{\Var(X)}\).
IV Estimator: \(\hat{\beta}_1^{\text{IV}} = \frac{\Cov(Z, Y)}{\Cov(Z, X)}\) IV replaces the “bad” variation in \(X\) with the “good” variation in \(Z\).
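A short simulation sketch makes the contrast concrete (the data-generating process below is illustrative): \(X\) is endogenous because it shares a component with the error, and the IV ratio \(\Cov(Z,Y)/\Cov(Z,X)\) recovers the true coefficient while OLS does not.

```python
import numpy as np

# Illustrative comparison of OLS and IV under endogeneity.
rng = np.random.default_rng(2)
n = 200_000
Z = rng.normal(size=n)                  # instrument: relevant and exogenous
e = rng.normal(size=n)                  # structural error
X = Z + e + rng.normal(size=n)          # endogenous: X contains e
Y = 0.5 * X + e                         # true beta1 = 0.5

beta_ols = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
beta_iv = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]
print(beta_ols, beta_iv)  # OLS is biased upward; IV is close to 0.5
```

Here OLS converges to roughly 0.83 rather than 0.5, because \(\Cov(X, e) > 0\); the instrument removes exactly that contaminated part of the variation.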
What if we have multiple instruments or other exogenous control variables (\(W\))?
Let the structural model be \(Y_i = \beta_0 + \beta_1 X_i + \gamma' W_i + u_i\) (\(X_i\) is endogenous, \(W_i\) are exogenous controls).
Let the instruments be \(Z_1, Z_2, ..., Z_k\).
The 2SLS procedure works in two steps…
First Stage Regression: Regress the endogenous variable \(X\) on all the instruments \(Z\) and all other exogenous controls \(W\).
\[ X_i = \pi_0 + \pi_1 Z_{1i} + ... + \pi_k Z_{ki} + \delta' W_i + \nu_i \]
From this regression, we obtain the fitted values for \(X\), which we call \(\hat{X}_i\):
\[ \hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_{1i} + ... + \hat{\pi}_k Z_{ki} + \hat{\delta}' W_i \]
This \(\hat{X}_i\) is the part of the variation in \(X\) that is explained by our exogenous variables.
Second Stage Regression: Regress the outcome variable \(Y\) on the fitted values \(\hat{X}\) and the other exogenous controls \(W\).
\[ Y_i = \beta_0 + \beta_1 \hat{X}_i + \gamma' W_i + \zeta_i \]
The coefficient \(\hat{\beta}_1\) from this second stage regression is our 2SLS estimate.
Since we used \(\hat{X}_i\) instead of \(X_i\), we have purged the endogeneity, and our estimate of \(\beta_1\) is now consistent. In practice, do not run the two stages by hand: the standard errors from the second-stage regression are incorrect, so use dedicated IV software, which computes them properly.
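The two-step logic can be sketched by hand with plain least squares (simulated data, illustrative parameters; this is for intuition only, since manual 2SLS gives the right point estimate but wrong standard errors):

```python
import numpy as np

# By-hand 2SLS on simulated data (for real work, use a dedicated IV routine).
rng = np.random.default_rng(3)
n = 100_000
W = rng.normal(size=n)                     # exogenous control
Z = rng.normal(size=n)                     # instrument
e = rng.normal(size=n)                     # structural error
X = Z + 0.5 * W + e + rng.normal(size=n)   # endogenous regressor
Y = 0.5 * X + 1.0 * W + e                  # true beta1 = 0.5, gamma = 1.0

ones = np.ones(n)
# First stage: regress X on (1, Z, W), keep fitted values X_hat
F = np.column_stack([ones, Z, W])
X_hat = F @ np.linalg.lstsq(F, X, rcond=None)[0]
# Second stage: regress Y on (1, X_hat, W)
S = np.column_stack([ones, X_hat, W])
b = np.linalg.lstsq(S, Y, rcond=None)[0]
print(b[1])  # close to the true beta1 = 0.5
```

Note that regressing \(Y\) directly on \(X\) and \(W\) would be biased here, because \(X\) and \(Y\) share the error component \(e\).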
The credibility of any IV analysis rests entirely on the quality of the instrument.
A good instrument must be both relevant and valid (exogenous).
The exclusion restriction is a very strong assumption. You must provide a convincing story for why your instrument \(Z\) could not possibly affect \(Y\) except through its effect on \(X\).
Often, instruments come from:
Natural experiments (e.g., lotteries, random assignment)
Institutional rules and policy changes (e.g., compulsory schooling laws)
Nature and geography (e.g., weather, rainfall, distance)
What happens if the relevance condition is only barely met? In other words, if \(\Cov(Z,X)\) is very close to zero?
If an instrument is weak, even a tiny violation of the exclusion restriction (a tiny \(\Cov(Z, \epsilon)\)) can lead to a very large bias in the IV estimate.
\[ \text{Bias}(\hat{\beta}_{IV}) \approx \frac{\Cov(Z, \epsilon)}{\Cov(Z, X)} \]
A small denominator leads to a large bias! Furthermore, the variance of the IV estimator will be very large.
We can and should test for instrument relevance.
\[ X_i = \pi_0 + \pi_1 Z_{1i} + ... + \pi_k Z_{ki} + (\text{controls}) + \nu_i \]
We perform an F-test on the joint significance of all instruments:
\[ H_0: \pi_1 = \pi_2 = ... = \pi_k = 0 \]
A first-stage F-statistic greater than 10 is often used as a benchmark to indicate that instruments are not weak (Stock, Wright, & Yogo, 2002). \(F < 10\) is a serious red flag.
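With a single instrument, the first-stage F-statistic is just the squared t-statistic on \(\pi_1\). The following sketch computes it from scratch with numpy on simulated data (the first-stage coefficient of 0.1 is illustrative):

```python
import numpy as np

# First-stage F statistic for a single instrument, computed by hand.
rng = np.random.default_rng(4)
n = 10_000
Z = rng.normal(size=n)
X = 0.1 * Z + rng.normal(size=n)   # first stage: pi_1 = 0.1

# OLS of X on (1, Z)
F_mat = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(F_mat, X, rcond=None)
resid = X - F_mat @ coef
sigma2 = resid @ resid / (n - 2)                      # error variance estimate
var_pi1 = sigma2 * np.linalg.inv(F_mat.T @ F_mat)[1, 1]
F_stat = coef[1] ** 2 / var_pi1                        # squared t-stat on pi_1
print(F_stat)  # typically well above the rule-of-thumb threshold of 10 here
```

Shrinking the first-stage coefficient toward zero drives this F-statistic down toward the weak-instrument danger zone.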
Research Question: What is the causal effect of an additional year of schooling on wages?
Outcome (\(Y\)): Log weekly wages.
The Instrument (\(Z\)): Quarter of Birth.
Why is Quarter of Birth a valid instrument?
Institutional Detail: In the US, compulsory schooling laws required students to attend school until they turned 16 or 17.
IV Assumptions:
Relevance: quarter of birth shifts completed schooling, because students born earlier in the year reach the legal dropout age having completed less schooling.
Exclusion: quarter of birth affects wages only through schooling, since birth timing is plausibly unrelated to ability, family background, or other wage determinants.
OLS Result: Found that an extra year of school was associated with about a 7.5% increase in wages.
2SLS Result: Using quarter of birth as an instrument, they found a very similar return to schooling, also around 7.5%.
Interpretation (LATE): the IV estimate is the return to schooling for the compliers, the students who stayed in school longer only because compulsory schooling laws required it. It need not generalize to students who would have continued their education regardless.
Let’s say we want to estimate the price elasticity of demand for avocados.
The starting point is a so-called structural model, consisting of a demand curve and a supply curve:
\[ Q^d_i = \alpha_0 + \alpha_1 P_i + u_i \qquad Q^s_i = \beta_0 + \beta_1 P_i + v_i \]
In market equilibrium, \(Q_i^d = Q_i^s = Q_i\), and the price \(P_i\) adjusts to make this happen. This means price \(P_i\) is determined by both supply and demand factors.
If there is a positive demand shock (\(u_i > 0\)), the demand curve shifts right. This leads to a higher equilibrium price and a higher quantity.
Therefore, \(P_i\) is positively correlated with the demand error term \(u_i\).
This violates the core OLS assumption that regressors are uncorrelated with the error term: \(Cov(P_i, u_i) \neq 0\).
Running OLS of \(Q\) on \(P\) will give a biased estimate of \(\alpha_1\).
Example: Good Instrument
Let \(Z_i\) be a measure of growing conditions (e.g., rainfall in avocado-growing regions). Good weather increases supply.
Demand Curve (Unchanged): Weather doesn’t directly affect how many avocados a person wants to buy at a given price. \[ Q^d_i = \alpha_0 + \alpha_1 P_i + u_i \]
Supply Curve (with Instrument): Weather directly affects the quantity supplied. \[ Q^s_i = \beta_0 + \beta_1 P_i + \delta Z_i + v_i \]
Remember that an instrument needs to satisfy two key conditions:
Relevance: \(Cov(Z_i, P_i) \neq 0\). Weather shifts the supply curve and therefore moves the equilibrium price.
Exclusion: \(Cov(Z_i, u_i) = 0\). Weather does not appear in the demand equation, so it affects quantity demanded only through the price.
How do we use our instrument \(Z\) and the exclusion restriction to find \(\alpha_1\)?
Our goal is to estimate \(\alpha_1\) from the demand equation: \[ Q^d_i = \alpha_0 + \alpha_1 P_i + u_i \]
The problem is that \(Cov(P_i, u_i) \neq 0\). But we have assumed \(Cov(Z_i, u_i) = 0\). Let’s use that!
Step 1: Take the covariance of the demand equation with our instrument \(Z_i\).
\[ Cov(Z_i, Q^d_i) = Cov(Z_i, \alpha_0 + \alpha_1 P_i + u_i) \]
Step 2: Use the properties of covariance to expand the right side.
\[ Cov(Z_i, Q^d_i) = Cov(Z_i, \alpha_0) + Cov(Z_i, \alpha_1 P_i) + Cov(Z_i, u_i) \]
Step 3: Simplify. \(Cov(Z_i, \alpha_0) = 0\) because \(\alpha_0\) is a constant, \(Cov(Z_i, \alpha_1 P_i) = \alpha_1 Cov(Z_i, P_i)\), and \(Cov(Z_i, u_i) = 0\) by the exclusion restriction.
Step 4: See what’s left. \[ Cov(Z_i, Q_i) = \alpha_1 Cov(Z_i, P_i) \]
Step 5: Solve for our parameter of interest, \(\alpha_1\). \[ \alpha_1 = \frac{Cov(Z_i, Q^d_i)}{Cov(Z_i, P_i)} \]
This is the instrumental variables estimator for \(\alpha_1\).
The expression \(\frac{Cov(Z_i, Q_i)}{Cov(Z_i, P_i)}\) has a beautiful and intuitive interpretation. Let’s look at two simple regressions: the reduced form, \(Q_i = \pi_{q0} + \pi_{q1} Z_i + e_{qi}\), and the first stage, \(P_i = \pi_{p0} + \pi_{p1} Z_i + e_{pi}\).
Let’s look at the ratio of these two coefficients:
\[ \frac{\hat{\pi}_{q1}}{\hat{\pi}_{p1}} = \frac{ \frac{Cov(Z_i, Q_i)}{Var(Z_i)} }{ \frac{Cov(Z_i, P_i)}{Var(Z_i)} } = \frac{Cov(Z_i, Q_i)}{Cov(Z_i, P_i)} \]
This is exactly our IV estimator!
\[ \hat{\alpha}_{1, IV} = \frac{\text{Reduced Form Effect}}{\text{First Stage Effect}} = \frac{\text{Effect of Z on Q}}{\text{Effect of Z on P}} \]
The structural parameter \(\alpha_1\) is identified because it can be expressed as the ratio of two estimable parameters from simple regressions. We have used the exogenous variation from the instrument to isolate the causal effect of price on quantity demanded.
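A simulation sketch of the avocado example ties this together (parameter values are illustrative): a supply shifter \(Z\) moves the equilibrium price, and the ratio of reduced form to first stage recovers the demand slope \(\alpha_1\) even though OLS on equilibrium data is badly biased.

```python
import numpy as np

# Illustrative supply-and-demand simulation: Z shifts supply only.
rng = np.random.default_rng(5)
n = 200_000
alpha1, beta1, delta = -1.0, 1.0, 1.0   # demand slope, supply slope, weather effect
Z = rng.normal(size=n)                   # growing conditions (supply shifter)
u = rng.normal(size=n)                   # demand shock
v = rng.normal(size=n)                   # supply shock
# Equilibrium: alpha1*P + u = beta1*P + delta*Z + v  =>  solve for P
P = (u - v - delta * Z) / (beta1 - alpha1)
Q = alpha1 * P + u                       # equilibrium quantity (on the demand curve)

ols = np.cov(P, Q)[0, 1] / np.var(P, ddof=1)
iv = np.cov(Z, Q)[0, 1] / np.cov(Z, P)[0, 1]
print(ols, iv)  # OLS is biased toward zero; IV recovers alpha_1 = -1
```

Because \(P\) is positively correlated with the demand shock \(u\), OLS here converges to about \(-1/3\) instead of \(-1\); the instrument isolates the supply-driven price variation and identifies the true demand slope.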
In R (using fixest):

```r
# 1. Install and load necessary packages
# install.packages(c("fixest", "AER"))
library(fixest)
library(AER)

# 2. Estimate the IV model
# The formula is read as: y regressed on controls,
# with x being instrumented by z.
iv_model_r <- feols(y ~ controls | x ~ z, data = df)

# 3. Display the results
summary(iv_model_r)
```

In Python (using pyfixest):

```python
# 1. Install necessary packages
# !pip install pyfixest pandas statsmodels

# 2. Import libraries and load data
import numpy as np
import pandas as pd
from pyfixest.estimation import feols
import statsmodels.api as sm  # Used to easily load the R dataset

# 3. Estimate the IV model using the R-style formula
# The formula is identical to the one used in R
iv_model_py = feols('y ~ controls | x ~ z', data=df)

# 4. Display the results
iv_model_py.summary()
```

In Stata:

```stata
* 1. Load the data (a .csv is read with import delimited; use is for .dta files)
import delimited "dataset.csv", clear

* 2. Estimate the IV model
* Syntax: ivregress 2sls depvar [exog_vars] (endog_var = instrument_vars)
ivregress 2sls y controls (x = z)

* To get robust standard errors, which is often the default in R/Python packages
ivregress 2sls y controls (x = z), robust
```

Checking the first-stage F statistic in R:

```r
# 1. Estimate the IV model
iv_model_r <- feols(y ~ controls | x ~ z, data = df)

# 2. Use the fitstat() function to extract various versions of
#    first-stage F statistics:
fitstat(iv_model_r, "cd")   # Cragg-Donald F statistic
fitstat(iv_model_r, "kp")   # Kleibergen-Paap F statistic
fitstat(iv_model_r, "ivf")  # Standard first-stage F statistic
```

And in Python:

```python
from pyfixest.estimation import feols
import statsmodels.api as sm  # Used to easily load the R dataset

# 1. Estimate the IV model using the R-style formula
iv_model_py = feols('y ~ controls | x ~ z', data=df)

# 2. You can access the F-statistic of the first stage
#    via the _f_stat_1st_stage attribute:
iv_model_py._f_stat_1st_stage

# 3. Via the IV_Diag method, you can compute additional IV diagnostics,
#    such as the effective F-statistic following Olea & Pflueger (2013):
iv_model_py.IV_Diag()
iv_model_py._eff_F
```

Problem: Endogeneity (\(\Cov(X, \epsilon) \neq 0\)) makes OLS biased and inconsistent for estimating causal effects.
Interpretation: IV estimates the Local Average Treatment Effect (LATE) - the causal effect for the subpopulation of “compliers” whose behavior is changed by the instrument.
Estimation: Use the Wald estimator for simple cases, and Two-Stage Least Squares (2SLS) for the general case.
In Practice: Finding a valid instrument is the hardest part. Always check for weak instruments (First-stage F-statistic) and be prepared to defend your exclusion restriction.
Empirical Economics: Lecture 8 - Instrumental Variables