Solutions Tutorial 8

Estimating the AJR Model

The results:

import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from statsmodels.sandbox.regression.gmm import IV2SLS
import pandas as pd

colonial_data = pd.read_csv('../tutorials/datafiles/colonial_origins.csv', decimal=',', sep=";")
# Model 1: OLS
ols_model = smf.ols('logpgp95 ~ avexpr + africa', data=colonial_data).fit()

# Model 2: Reduced Form
reduced_form = smf.ols('logpgp95 ~ logem4 + africa', data=colonial_data).fit()

# Model 3: First Stage
first_stage = smf.ols('avexpr ~ logem4 + africa', data=colonial_data).fit()

# Model 4: 2SLS / IV
iv_model = IV2SLS(
  endog=colonial_data['logpgp95'],
  exog=sm.add_constant(colonial_data[['africa', 'avexpr']]),
  instrument=sm.add_constant(colonial_data[['africa', 'logem4']])).fit()

print(summary_col([ols_model, reduced_form, first_stage, iv_model], stars=True, model_names=['OLS', 'Reduced Form', 'First Stage (DV=avexpr)', 'IV']))

========================================================================
                  OLS     Reduced Form First Stage (DV=avexpr)     IV   
------------------------------------------------------------------------
Intercept      5.6397***  10.3221***   9.1419***                        
               (0.4123)   (0.3852)     (0.6721)                         
avexpr         0.4225***                                       0.8023***
               (0.0572)                                        (0.1891) 
africa         -0.7830*** -0.5962**    -0.2913                 -0.3625  
               (0.1688)   (0.2304)     (0.4020)                (0.2933) 
logem4                    -0.4313***   -0.5375***                       
                          (0.0912)     (0.1591)                         
const                                                          2.9877** 
                                                               (1.3271) 
R-squared      0.6600     0.5289       0.2764                  0.4145   
R-squared Adj. 0.6489     0.5135       0.2526                  0.3953   
========================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
print("First Stage F Stat: 10.270")
First Stage F Stat: 10.270
  1. Verify the IV Estimate: The IV estimate is the ratio of the reduced form coefficient to the first stage coefficient for the instrument.
  • Reduced Form Coefficient (coefficient on logem4 in the logpgp95 regression): -0.4313
  • First Stage Coefficient (coefficient on logem4 in the avexpr regression): -0.5375

\[ \text{Calculated IV Estimate} = \frac{\text{Reduced Form}}{\text{First Stage}} = \frac{-0.4313}{-0.5375} \approx 0.80 \]

This calculated value of 0.80 matches the 2SLS coefficient for avexpr reported in the IV column of the table (0.8023). With one instrument, one endogenous regressor, and the same controls in every equation, the two quantities are identical, so any remaining difference comes only from rounding in the table.

  2. Compare OLS and 2SLS: The 2SLS estimate (0.80) is almost double the OLS estimate (0.42). This suggests that the OLS estimate is biased downwards. According to the omitted variable bias formula, this would happen if the omitted factors (captured in the error term) that negatively affect GDP are positively correlated with the quality of institutions. For example, a country’s history of wars and conflicts might be unfavorable to long-term growth (negative effect on GDP) and might be positively correlated with institutional quality, since war often requires organization, bureaucracy and fiscal capacity (positive correlation with avexpr). This would cause OLS to underestimate the true positive effect of institutions.

  3. Assess Instrument Strength: By the usual rule of thumb, the instrument, settler mortality (logem4), is not weak. The rule of thumb is that the first-stage F-statistic on the excluded instrument should be greater than 10. Here, the F-statistic is 10.27, which is above this threshold, although only marginally, indicating a sufficiently strong relationship between settler mortality and the quality of institutions.
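
The ratio in the first item can be reproduced in a couple of lines. This is a minimal sketch: the two coefficients are copied by hand from the regression table above, and the variable names are illustrative rather than part of the tutorial code.

```python
# Coefficients on logem4, copied from the regression table above.
reduced_form_coef = -0.4313   # from: logpgp95 ~ logem4 + africa
first_stage_coef = -0.5375    # from: avexpr ~ logem4 + africa

# Indirect least squares: IV estimate = reduced form / first stage.
iv_ratio = reduced_form_coef / first_stage_coef
print(f"Implied IV estimate: {iv_ratio:.4f}")
```

Using the four-decimal table entries rather than two-decimal roundings keeps the ratio close to the reported 2SLS coefficient.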

Colonial Origins IV Strategy

  1. Relevance: The first-stage logic follows a clear causal chain:
     1. High Settler Mortality: In places where European settlers faced high mortality rates (e.g., from malaria, yellow fever), they were discouraged from establishing long-term settlements.
     2. Extractive Institutions: Consequently, the colonial powers set up “extractive” institutions designed to transfer resources from the colony to the metropole as quickly as possible, with little emphasis on private property rights or checks and balances on government power.
     3. Persistence: These extractive institutions have persisted long after independence. Conversely, in places with low mortality rates, Europeans settled in large numbers and demanded institutions that protected their rights and property, leading to the development of “inclusive” institutions that also persisted. Thus, historical settler mortality is strongly correlated with the quality of institutions today.
  2. Exclusion Restriction: For settler mortality to be a valid instrument, it must affect current GDP per capita only through its effect on institutions. It cannot have a direct effect on today’s GDP or affect it through any other channel. For example, the historical disease environment that caused high settler mortality must not itself lower health or productivity in the country today.

  3. LATE and “Compliers”: In this context, the “compliers” are the countries whose choice of institutional path (extractive vs. inclusive) was causally determined by the settler mortality rates they faced. The LATE represents the average treatment effect of having good institutions for this specific group of countries. It is the causal effect of institutions for countries whose institutional path was effectively “assigned” by their disease environment during the colonial era.

Defending and Challenging the Exclusion Restriction

  1. Argument to Defend: The most compelling defense is that the diseases that determined settler mortality in the 18th and 19th centuries (primarily malaria and yellow fever) are no longer the primary determinants of health and wealth in the 21st century. Due to medical advances (like quinine and modern medicine) and public health infrastructure, the lethality of this historical disease environment has been massively reduced. Therefore, it is plausible that this historical factor has no significant direct impact on modern GDP, and its influence is only felt through the persistent institutional structures it helped to create.

  2. Argument to Challenge (Violation): A plausible channel that violates the exclusion restriction is the persistence of the disease environment itself. If the factors that led to high settler mortality historically (e.g., a tropical climate suitable for malaria-carrying mosquitoes) are also responsible for a high disease burden today, then settler mortality could be affecting current GDP through both institutions and current public health. A high contemporary disease burden directly reduces labor productivity and human capital, meaning the instrument would have an effect on the outcome that is not channeled through institutions, thus violating the exclusion restriction.

Estimating and Interpreting 2SLS Results

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col
from statsmodels.sandbox.regression.gmm import IV2SLS
url = 'https://github.com/basm92/ee_website/raw/refs/heads/master/tutorials/datafiles/CARD.DTA'
data = pd.read_stata(url)
# 1. Dummy for education > 12
data['treated'] = np.where(data['educ']>12, 1, 0)

# 2. Wald estimator
numerator = data.query('nearc4 == 1').wage.mean() - data.query('nearc4 == 0').wage.mean()
denominator = data.query('nearc4 == 1').treated.mean() - data.query('nearc4 == 0').treated.mean()
wald = numerator/denominator

# 3. Estimate 2SLS
iv_model = IV2SLS(
  endog=data['wage'],
  exog=sm.add_constant(data['treated']),
  instrument=sm.add_constant(data['nearc4'])
  ).fit()

print(summary_col([iv_model], stars=True))

==========================
                   wage   
--------------------------
const          207.6927***
               (65.8448)  
treated        731.4036***
               (129.4895) 
R-squared      -1.3523    
R-squared Adj. -1.3531    
==========================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
# 4. Estimate the First Stage
first_stage = ols("treated ~ nearc4", data=data).fit()
print(summary_col([first_stage], stars=True))

========================
                treated 
------------------------
Intercept      0.4222***
               (0.0161) 
nearc4         0.1219***
               (0.0194) 
R-squared      0.0129   
R-squared Adj. 0.0126   
========================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
# 5. Check and Verify Estimate of the Reduced Form
iv = iv_model.params['treated']
fs = first_stage.params['nearc4']
## According to the theory, IV=Reduced Form/First Stage
print("IV * First Stage =", iv*fs)
IV * First Stage = 89.17950691220824
rf = ols("wage ~ nearc4", data=data).fit()
print("Reduced Form = ", rf.params['nearc4'])
Reduced Form =  89.17950691217489
library(haven); library(fixest); library(dplyr)
url <- 'https://github.com/basm92/ee_website/raw/refs/heads/master/tutorials/datafiles/CARD.DTA'
data <- read_stata(url)

# 1. Dummy for education > 12
data <- data |>
  mutate(treated = if_else(educ > 12, 1, 0))

# 2. Wald Estimator
numerator <- mean(data$wage[data$nearc4 == 1]) - mean(data$wage[data$nearc4 == 0])
denominator <- mean(data$treated[data$nearc4==1]) - mean(data$treated[data$nearc4 == 0])
wald <- numerator / denominator

# 3. Estimate 2SLS
iv_model <- feols(wage ~ 1 | treated ~ nearc4, data = data)
iv_model
TSLS estimation - Dep. Var.: wage
                  Endo.    : treated
                  Instr.   : nearc4
Second stage: Dep. Var.: wage
Observations: 3,010
Standard-errors: IID 
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept)  207.693    65.8448 3.15428 1.6248e-03 ** 
fit_treated  731.404   129.4895 5.64836 1.7710e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 403.2   Adj. R2: 0.024626
F-test (1st stage), treated: stat = 39.301, p = 4.151e-10, on 1 and 3,008 DoF.
                 Wu-Hausman: stat = 58.481, p = 2.744e-14, on 1 and 3,007 DoF.
# 4. Estimate the First Stage
first_stage <- feols(treated ~ nearc4, data = data)
first_stage
OLS estimation, Dep. Var.: treated
Observations: 3,010
Standard-errors: IID 
            Estimate Std. Error  t value   Pr(>|t|)    
(Intercept) 0.422153   0.016063 26.28176  < 2.2e-16 ***
nearc4      0.121929   0.019449  6.26909 4.1512e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.496737   Adj. R2: 0.012569
# 5. Check and Verify Estimate of the Reduced Form
coef(iv_model)[2] * coef(first_stage)[2]
fit_treated 
   89.17951 
rf <- feols(wage ~ nearc4, data = data)
coef(rf)[2]
  nearc4 
89.17951 
* 1. Load data and create a dummy variable for education > 12
* Load data from the URL
use "https://github.com/basm92/ee_website/raw/refs/heads/master/tutorials/datafiles/CARD.DTA", clear

* Generate the 'treated' dummy variable where education is greater than 12
gen treated = (educ > 12)

* 2. Calculate the Wald estimator
* Calculate the mean wage for individuals near a 4-year college
quietly summarize wage if nearc4 == 1
scalar mean_wage_nearc4 = r(mean)

* Calculate the mean wage for individuals not near a 4-year college
quietly summarize wage if nearc4 == 0
scalar mean_wage_not_nearc4 = r(mean)

* Calculate the mean of the 'treated' variable for individuals near a 4-year college
quietly summarize treated if nearc4 == 1
scalar mean_treated_nearc4 = r(mean)

* Calculate the mean of the 'treated' variable for individuals not near a 4-year college
quietly summarize treated if nearc4 == 0
scalar mean_treated_not_nearc4 = r(mean)

* Calculate the numerator and denominator for the Wald estimator
scalar numerator = mean_wage_nearc4 - mean_wage_not_nearc4
scalar denominator = mean_treated_nearc4 - mean_treated_not_nearc4

* Calculate and display the Wald estimator
scalar wald = numerator / denominator
display "Wald Estimator: " wald

* 3. Estimate 2SLS (Two-Stage Least Squares)
* The endogenous variable is 'treated', and the instrument is 'nearc4'
ivregress 2sls wage (treated = nearc4)
estimates store iv_model

* Display the 2SLS regression results
estimates table iv_model, star 

* Extract the IV coefficient for 'treated'
scalar iv_coef = _b[treated]
display iv_coef

* 4. Estimate the First Stage regression
regress treated nearc4
estimates store first_stage

* Display the first stage regression results
estimates table first_stage, star

* 5. Check and Verify the Estimate of the Reduced Form
* Extract the first-stage coefficient for 'nearc4'
scalar fs_coef = _b[nearc4]

* According to the theory, IV = Reduced Form / First Stage
display "IV * First Stage = " iv_coef * fs_coef

* Estimate the reduced form regression
regress wage nearc4
display "Reduced Form = " _b[nearc4]

Estimate and Interpret the First Stage: The first stage of the 2SLS estimation isolates the relationship between the instrument (nearc4) and the endogenous treatment (treated). It answers the question: “Does growing up near a college actually make people more likely to get more than a high school education?”

The first-stage regression is: \(\text{treated} = \beta_0 + \beta_1 \text{nearc4} + u\)

Estimation and Interpretation: Running this simple regression reveals that the coefficient (\(\beta_1\)) on nearc4 is approximately 0.12.

This coefficient is positive and statistically significant. The interpretation is as follows: Growing up near a four-year college increases the probability of obtaining more than a high school education by about 12 percentage points, on average.

This result confirms that the instrument is “relevant”—it has a meaningful impact on the treatment decision.

Interpret the 2SLS Estimate: Under the IV assumptions, the 2SLS estimate identifies the causal effect of the treatment on the outcome. With a single binary instrument and no controls, as in this analysis, it is numerically identical to the Wald estimator. The regression of wage on treated using nearc4 as an instrument yields a coefficient of approximately 731.
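
That the Wald ratio and the IV covariance ratio coincide for a binary instrument can be checked on a tiny example. The sketch below uses made-up toy data (not the CARD data), and the helper functions are hypothetical names introduced only for this illustration.

```python
# Toy data (invented for illustration): binary instrument z,
# binary treatment x, and outcome y.
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [0, 0, 1, 0, 1, 1, 0, 1]
y = [10, 12, 20, 11, 22, 25, 13, 24]

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance; the n-1 divisor cancels in the ratio below."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def group_mean(values, group):
    """Mean of `values` among observations with z == group."""
    return mean([v for v, zi in zip(values, z) if zi == group])

# Wald estimator: difference in mean outcomes over difference in treatment rates.
wald = (group_mean(y, 1) - group_mean(y, 0)) / (group_mean(x, 1) - group_mean(x, 0))

# IV estimator with a single instrument: Cov(Z, Y) / Cov(Z, X).
iv = cov(z, y) / cov(z, x)

print("Wald:", wald, "IV:", iv)
```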

Interpretation: This 2SLS coefficient is a Local Average Treatment Effect (LATE). It estimates the causal return to education for a specific subgroup of the population known as “compliers.” In this context, compliers are the individuals who pursued more than 12 years of education because they grew up near a college, and would not have done so otherwise.

The interpretation is:

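For the group of people who were induced to get more than a high school education by having a college nearby, doing so caused their wages to increase by approximately 731 on average, measured in the units of the wage variable (hourly wage in cents in the CARD data, so roughly 7.31 dollars per hour).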

This is a causal estimate because the IV strategy isolates the variation in education that is random (due to college proximity) rather than driven by confounding factors like innate ability.

Calculate and Interpret the Reduced Form Coefficient: The reduced form equation directly relates the outcome variable to the instrumental variable, showing the instrument’s total effect on the outcome.

The reduced form equation is: \(\text{wage} = \pi_0 + \pi_1 \text{nearc4} + v\).

The Wald estimator has a precise relationship with the first stage and the reduced form: 2SLS Estimate = (Reduced Form Coefficient) / (First Stage Coefficient)

Therefore, we can calculate the reduced form coefficient by rearranging the formula: Reduced Form Coefficient = 2SLS Estimate * First Stage Coefficient.

Interpretation: This calculated coefficient (\(\pi_1\)) of about 89 is the estimate you would get if you directly regressed wage on nearc4. The interpretation is:

On average, individuals who grew up near a four-year college earn wages that are approximately 89 higher than those who did not, in the units of the wage variable (hourly wage in cents in the CARD data). This effect captures the entire causal chain of events: growing up near a college makes people more likely to get more education, which in turn leads to higher wages.
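
The rearranged formula can be evaluated directly from the printed estimates. A minimal sketch, with the two coefficients copied by hand from the 2SLS and first-stage output above (the variable names are illustrative):

```python
# Estimates copied from the output above.
iv_coef = 731.4036            # 2SLS coefficient on treated
first_stage_coef = 0.121929   # first-stage coefficient on nearc4

# Reduced form = 2SLS estimate * first-stage coefficient.
reduced_form_coef = iv_coef * first_stage_coef
print(f"Implied reduced form coefficient: {reduced_form_coef:.2f}")
```

The result agrees, up to rounding of the inputs, with the coefficient from regressing wage on nearc4 directly.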

Designing an Alternative IV Study

A potential alternative instrument for the quality of institutions is Legal Origin (see footnote 1).

  1. Relevance: The legal system a country inherited from its colonizer (e.g., British Common Law vs. French Civil Law) is a fundamental component of its modern institutional framework. Different legal traditions have different approaches to protecting property rights and limiting state power. Therefore, a country’s legal origin is likely to be strongly correlated with its score on “protection against expropriation risk”.
  2. Validity (Exclusion Restriction): The argument for validity is that the identity of a country’s colonizer was, from the perspective of the colonized country, a historical accident. Whether a country was colonized by Britain versus France should not affect its long-run economic development today through any channel other than the persistent legal and political institutions that were established.
  3. Potential Weaknesses: The exclusion restriction is highly debatable. The identity of the colonizer (e.g., Britain) could be correlated with many other factors that affect growth, such as the introduction of a specific language (English), culture, or integration into different global trade networks. If these other factors have a direct impact on GDP, then legal origin would not be a valid instrument.

Deriving the IV Estimator

Given the model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
And the two core IV assumptions:

  1. Relevance: \(Cov(Z_i, X_i) \neq 0\)
  2. Exclusion Restriction: \(Cov(Z_i, \epsilon_i) = 0\)

We can derive the estimator for \(\beta_1\).

Step 1: Take the covariance of the entire model equation with the instrument \(Z_i\).
\[ Cov(Z_i, Y_i) = Cov(Z_i, \beta_0 + \beta_1 X_i + \epsilon_i) \]

Step 2: Use the additive property of covariance to expand the right side.
\[ Cov(Z_i, Y_i) = Cov(Z_i, \beta_0) + Cov(Z_i, \beta_1 X_i) + Cov(Z_i, \epsilon_i) \]

Step 3: Simplify each term.

  • The covariance with a constant is zero: \(Cov(Z_i, \beta_0) = 0\).
  • Constants can be factored out of covariance: \(Cov(Z_i, \beta_1 X_i) = \beta_1 Cov(Z_i, X_i)\).
  • By the exclusion restriction assumption: \(Cov(Z_i, \epsilon_i) = 0\).

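Step 4: Substitute the simplified terms back into the equation.
\[ Cov(Z_i, Y_i) = 0 + \beta_1 Cov(Z_i, X_i) + 0 \]
\[ Cov(Z_i, Y_i) = \beta_1 Cov(Z_i, X_i) \]

Step 5: Solve for \(\beta_1\). Dividing by \(Cov(Z_i, X_i)\) is allowed because the relevance assumption guarantees it is non-zero.
\[ \beta_1 = \frac{Cov(Z_i, Y_i)}{Cov(Z_i, X_i)} \]
Replacing the population covariances with their sample counterparts gives the estimator:
\[ \hat{\beta}_1^{\text{IV}} = \frac{Cov(Z, Y)}{Cov(Z, X)} \]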
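
The derivation can be illustrated with a small simulation. The sketch below is not part of the tutorial data: the data-generating process, the true coefficients, and the helper `cov` are all assumptions chosen for illustration. An unobserved confounder makes OLS inconsistent, while the covariance ratio from Step 5 recovers the true slope.

```python
import random

random.seed(0)
n = 20_000
beta0, beta1 = 2.0, 3.0   # true parameters (illustrative assumption)

z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)             # instrument, independent of the confounder
    ui = random.gauss(0, 1)             # unobserved confounder (part of the error)
    xi = zi + ui + random.gauss(0, 1)   # regressor depends on z and on u
    yi = beta0 + beta1 * xi + 2.0 * ui + random.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

ols_slope = cov(x, y) / cov(x, x)   # inconsistent: x is correlated with the error
iv_slope = cov(z, y) / cov(z, x)    # sample analogue of Cov(Z, Y) / Cov(Z, X)

print(f"OLS slope: {ols_slope:.3f}")
print(f"IV slope:  {iv_slope:.3f}  (true value {beta1})")
```

In this design the OLS slope converges to roughly 3.67 rather than 3, because the confounder enters both the regressor and the outcome.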

The Wald Estimator

  1. Calculate Key Quantities:
  • \(E[X | Z=1]\) (Treatment rate when encouraged): Always-Takers (20%) + Compliers (40%) = 0.6
  • \(E[X | Z=0]\) (Treatment rate when not encouraged): Only Always-Takers (20%) = 0.2
  • \(E[Y | Z=1]\) (Avg. outcome when encouraged): (0.40 * 30) [Never-Takers] + (0.20 * 50) [Always-Takers] + (0.40 * 60) [Compliers, Treated] = 12 + 10 + 24 = 46
  • \(E[Y | Z=0]\) (Avg. outcome when not encouraged): (0.40 * 30) [Never-Takers] + (0.20 * 50) [Always-Takers] + (0.40 * 40) [Compliers, Not Treated] = 12 + 10 + 16 = 38
  2. Compute the Wald Estimator:
\[ \hat{\beta}_{\text{Wald}} = \frac{E[Y | Z=1] - E[Y | Z=0]}{E[X | Z=1] - E[X | Z=0]} = \frac{46 - 38}{0.6 - 0.2} = \frac{8}{0.4} = 20 \]
The estimated treatment effect is 20.

  3. LATE: The Local Average Treatment Effect is the average effect of the treatment specifically for the subpopulation of Compliers. For this group, the average outcome with treatment is 60, and the average outcome without treatment is 40.
\[ \text{LATE} = E[Y(1) - Y(0) | \text{Complier}] = 60 - 40 = 20 \]
The LATE is 20, which is exactly what the Wald estimator calculates.
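
The same arithmetic can be written as a short script. This is a sketch using the population shares and average outcomes from the calculation above; the dictionary layout and the helper `average` are illustrative choices.

```python
# Population shares of each type, as used in the calculation above.
shares = {"never_taker": 0.40, "always_taker": 0.20, "complier": 0.40}

# Average outcome and treatment status for each type: (when Z=0, when Z=1).
outcomes = {"never_taker": (30, 30), "always_taker": (50, 50), "complier": (40, 60)}
treated = {"never_taker": (0, 0), "always_taker": (1, 1), "complier": (0, 1)}

def average(table, z):
    """Population average of `table` for instrument value z."""
    return sum(shares[t] * table[t][z] for t in shares)

reduced_form = average(outcomes, 1) - average(outcomes, 0)   # 46 - 38
first_stage = average(treated, 1) - average(treated, 0)      # 0.6 - 0.2
wald = reduced_form / first_stage

# LATE: effect of treatment for compliers only.
late = outcomes["complier"][1] - outcomes["complier"][0]

print(f"Wald: {wald:.1f}, LATE: {late}")
```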

The Danger of Weak Instruments

  1. Intuitive Explanation: The approximate bias of the IV estimator is \(Bias(\hat{\beta}_{IV}) \approx \frac{Cov(Z, \epsilon)}{Cov(Z, X)}\).
  • The numerator, \(Cov(Z, \epsilon)\), represents the degree to which the exclusion restriction is violated.
  • The denominator, \(Cov(Z, X)\), represents the strength (relevance) of the instrument.

  A “weak instrument” means the denominator is very close to zero. Even if the instrument is almost perfect (the numerator is a tiny, non-zero number), dividing a small number by a very small number results in a large ratio. Therefore, a weak instrument dramatically magnifies any small violation of the exclusion restriction, leading to a large bias in the final estimate.

  2. Testing in Practice: Researchers test for weak instruments by examining the first-stage regression (the regression of the endogenous variable on the instrument(s) and controls). They use the F-statistic that tests the joint significance of all excluded instruments. The common rule of thumb is that an F-statistic greater than 10 indicates that the instruments are not weak.
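
The magnification effect is easy to see numerically. In the sketch below, the size of the exclusion-restriction violation and the instrument strengths are hypothetical values chosen purely for illustration.

```python
# Hypothetical small violation of the exclusion restriction: Cov(Z, eps).
violation = 0.01

# Hypothetical instrument strengths Cov(Z, X), from strong to weak.
strengths = [0.5, 0.1, 0.02]

# Approximate bias = Cov(Z, eps) / Cov(Z, X).
biases = [violation / s for s in strengths]

for s, b in zip(strengths, biases):
    print(f"Cov(Z, X) = {s:<5} -> approximate bias = {b:.2f}")
```

The violation is held fixed throughout; only the denominator shrinks, and the bias grows from 0.02 to 0.50.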

Understanding 2SLS

  1. First Stage Regression: The purpose of the first stage is to predict the endogenous variable using only exogenous information.
  • Dependent Variable: The endogenous variable, \(X_i\) (police numbers).
  • Independent Variables: All exogenous variables in the model, which include the instruments (\(Z_{1i}\) for firefighters and \(Z_{2i}\) for election year) and any exogenous controls (\(W_i\) for poverty level).
  • Equation: \(X_i = \pi_0 + \pi_1 Z_{1i} + \pi_2 Z_{2i} + \gamma_1 W_i + \nu_i\)
  2. Second Stage Regression: The second stage uses the predicted values from the first stage to estimate the causal effect.
  • Equation: \(Y_i = \beta_0 + \beta_1 \hat{X}_i + \delta_1 W_i + \text{error}_i\)
  • Key Difference: The regressor of interest is no longer the original endogenous variable \(X_i\), but its predicted value, \(\hat{X}_i\). This predicted value, \(\hat{X}_i\), is a linear combination of the exogenous variables only. Provided the instruments and controls are exogenous, it is uncorrelated with the structural error term \(\epsilon_i\), which solves the endogeneity problem and allows for a consistent estimate of \(\beta_1\).
  3. Standard Errors: You should not perform 2SLS manually because the standard errors calculated by OLS in the second stage are incorrect. The regressor \(\hat{X}_i\) is an estimate generated from the first stage, and a manual OLS regression in the second stage treats it as if it were observed data. In particular, OLS computes the residuals using \(\hat{X}_i\), whereas the structural residual \(\epsilon_i\) is defined using the actual \(X_i\). The estimated error variance is therefore wrong, and with it every standard error, t-statistic, and confidence interval; depending on the data they can be too small or too large, so the reported statistical significance cannot be trusted. Specialized software commands (ivreg, IV2SLS, ivregress) are designed to calculate the correct standard errors for the entire two-stage process.
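
The problem with manual 2SLS can be illustrated on simulated data with one instrument and no controls. Everything in the sketch below (the data-generating process, the coefficients, the helper functions) is an assumption made for illustration. The manual second stage reproduces the IV point estimate, but the residual variance it would use for standard errors differs sharply from the one based on the actual regressor.

```python
import random

random.seed(1)
n = 20_000

# Simulated data (all coefficients are illustrative assumptions).
z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)             # instrument
    ui = random.gauss(0, 1)             # unobserved confounder
    xi = zi + ui + random.gauss(0, 1)   # endogenous regressor
    yi = 2.0 + 3.0 * xi + 2.0 * ui + random.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def simple_ols(regressor, outcome):
    """Intercept and slope of a one-regressor OLS fit."""
    slope = cov(regressor, outcome) / cov(regressor, regressor)
    return mean(outcome) - slope * mean(regressor), slope

# First stage: regress x on z and form fitted values.
pi0, pi1 = simple_ols(z, x)
x_hat = [pi0 + pi1 * zi for zi in z]

# Manual second stage: regress y on the fitted values.
b0, b1 = simple_ols(x_hat, y)

# Residual variance as a manual second-stage OLS would compute it (uses x_hat) ...
naive_var = mean([(yi - b0 - b1 * xh) ** 2 for yi, xh in zip(y, x_hat)])
# ... versus the residual variance based on the actual regressor x.
correct_var = mean([(yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x)])

print(f"Manual 2SLS slope: {b1:.3f}")
print(f"Naive residual variance:   {naive_var:.2f}")
print(f"Correct residual variance: {correct_var:.2f}")
```

In this particular design the naive residual variance is far too large; with a different correlation structure it could instead be too small. Either way it is not the quantity the correct standard errors are built on.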

Footnotes

  1. La Porta, R., Lopez-de-Silanes, F., Shleifer, A., & Vishny, R. W. (1998). Law and finance. Journal of Political Economy, 106(6), 1113-1155.