This calculated value of 0.81 is indeed very close to the 2SLS coefficient for avexpr reported in column (4), which is 0.80. The minor difference is due to rounding in the table.
Compare OLS and 2SLS: The 2SLS estimate (0.80) is almost double the OLS estimate (0.42). This implies that the OLS estimate is biased downwards. According to the omitted variable bias formula, this would happen if the omitted factors (captured in the error term) that negatively affect GDP are positively correlated with the quality of institutions. For example, a country’s history of wars and conflicts might be unfavorable to long-term growth (negative effect on GDP) and might be positively correlated with institutional quality, since war often requires organization, bureaucracy and fiscal capacity (positive correlation with avexpr). This would cause OLS to underestimate the true positive effect of institutions.
Assess Instrument Strength: By the usual rule of thumb, the instrument, settler mortality (logem4), is not weak. The rule of thumb is that the first-stage F-statistic on the excluded instrument should be greater than 10. Here, the F-statistic is 10.27, which clears this threshold, though only narrowly, indicating a reasonably strong relationship between settler mortality and the quality of institutions.
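The downward bias described in the comparison of OLS and 2SLS can be illustrated with a small simulation. This is only a sketch under assumed numbers: the true effect of 0.8, the coefficients, and the variable names (`u` for the omitted factor, `x` for institutions, `y` for log GDP) are hypothetical and are not taken from the paper's data.

```python
import random

random.seed(0)
n = 20_000


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


true_beta = 0.8  # hypothetical true effect of institutions on log GDP

# u: omitted factor (for example, a history of conflict); hypothetical
u = [random.gauss(0, 1) for _ in range(n)]
# Institutions are positively correlated with the omitted factor
x = [0.5 * ui + random.gauss(0, 1) for ui in u]
# The omitted factor lowers GDP (coefficient -1) and ends up in the error term
y = [true_beta * xi - 1.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# OLS slope of y on x, ignoring u
beta_ols = cov(x, y) / cov(x, x)

print("true effect:", true_beta)
print("OLS slope  :", round(beta_ols, 3))
```

Because the omitted factor has a negative effect on the outcome and a positive correlation with the regressor, the OLS slope in this setup falls below the true effect.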
Colonial Origins IV Strategy
Relevance: The first-stage logic follows a clear causal chain:
High Settler Mortality: In places where European settlers faced high mortality rates (e.g., from malaria, yellow fever), they were discouraged from establishing long-term settlements.
Extractive Institutions: Consequently, the colonial powers set up “extractive” institutions designed to transfer resources from the colony to the metropole as quickly as possible, with little emphasis on private property rights or checks and balances on government power.
Persistence: These extractive institutions have persisted long after independence. Conversely, in places with low mortality rates, Europeans settled in large numbers and demanded institutions that protected their rights and property, leading to the development of “inclusive” institutions that also persisted. Thus, historical settler mortality is strongly correlated with the quality of institutions today.
Exclusion Restriction: For settler mortality to be a valid instrument, it must affect current GDP per capita only through its effect on institutions. It cannot have a direct effect on today’s GDP or affect it through any other channel. For example, the historical disease environment that caused high mortality must not be the primary cause of poor health or low productivity in the country today.
LATE and “Compliers”: In this context, the “compliers” are the countries whose choice of institutional path (extractive vs. inclusive) was causally determined by the settler mortality rates they faced. The LATE represents the average treatment effect of having good institutions for this specific group of countries. It is the causal effect of institutions for countries whose institutional path was effectively “assigned” by their disease environment during the colonial era.
Defending and Challenging the Exclusion Restriction
Argument to Defend: The most compelling defense is that the diseases that determined settler mortality in the 18th and 19th centuries (primarily malaria and yellow fever) are no longer the primary determinants of health and wealth in the 21st century. Due to medical advances (like quinine and modern medicine) and public health infrastructure, the lethality of this historical disease environment has been massively reduced. Therefore, it is plausible that this historical factor has no significant direct impact on modern GDP, and its influence is only felt through the persistent institutional structures it helped to create.
Argument to Challenge (Violation): A plausible channel that violates the exclusion restriction is the persistence of the disease environment itself. If the factors that led to high settler mortality historically (e.g., a tropical climate suitable for malaria-carrying mosquitoes) are also responsible for a high disease burden today, then settler mortality could be affecting current GDP through both institutions and current public health. A high contemporary disease burden directly reduces labor productivity and human capital, meaning the instrument would have an effect on the outcome that is not channeled through institutions, thus violating the exclusion restriction.
# 6. Check and Verify Estimate of the Reduced Form
iv = iv_model.params['treated']
fs = first_stage.params['nearc4']
# According to the theory, IV = Reduced Form / First Stage
print("IV * First Stage =", iv*fs)
IV * First Stage = 89.17950691220824
rf = ols("wage ~ nearc4", data=data).fit()
print("Reduced Form = ", rf.params['nearc4'])
# 6. Check and Verify Estimate of the Reduced Form
coef(iv_model)[2] * coef(first_stage)[2]
fit_treated
89.17951
rf <- feols(wage ~ nearc4, data = data)
coef(rf)[2]
nearc4
89.17951
* 1. Load data and create a dummy variable for education > 12
* Load data from the URL
use "https://github.com/basm92/ee_website/raw/refs/heads/master/tutorials/datafiles/CARD.DTA", clear
* Generate the 'treated' dummy variable where education is greater than 12
gen treated = (educ > 12)

* 2. Calculate the Wald estimator
* Calculate the mean wage for individuals near a 4-year college
quietly summarize wage if nearc4 == 1
scalar mean_wage_nearc4 = r(mean)
* Calculate the mean wage for individuals not near a 4-year college
quietly summarize wage if nearc4 == 0
scalar mean_wage_not_nearc4 = r(mean)
* Calculate the mean of the 'treated' variable for individuals near a 4-year college
quietly summarize treated if nearc4 == 1
scalar mean_treated_nearc4 = r(mean)
* Calculate the mean of the 'treated' variable for individuals not near a 4-year college
quietly summarize treated if nearc4 == 0
scalar mean_treated_not_nearc4 = r(mean)
* Calculate the numerator and denominator for the Wald estimator
scalar numerator = mean_wage_nearc4 - mean_wage_not_nearc4
scalar denominator = mean_treated_nearc4 - mean_treated_not_nearc4
* Calculate and display the Wald estimator
scalar wald = numerator / denominator
display "Wald Estimator: " wald

* 3. Estimate 2SLS (Two-Stage Least Squares)
* The endogenous variable is 'treated', and the instrument is 'nearc4'
ivregress 2sls wage (treated = nearc4)
estimates store iv_model
* Display the 2SLS regression results
estimates table iv_model, star
* Extract the IV coefficient for 'treated'
scalar iv_coef = _b[treated]
display iv_coef

* 4. Estimate the First Stage regression
regress treated nearc4
estimates store first_stage
* Display the first stage regression results
estimates table first_stage, star

* 5. Check and Verify the Estimate of the Reduced Form
* Extract the first-stage coefficient for 'nearc4'
scalar fs_coef = _b[nearc4]
* According to the theory, IV = Reduced Form / First Stage
display "IV * First Stage = " iv_coef * fs_coef
* Estimate the reduced form regression
regress wage nearc4
display "Reduced Form = " _b[nearc4]
Estimate and Interpret the First Stage: The first stage of the 2SLS estimation isolates the relationship between the instrument (nearc4) and the endogenous treatment (treated). It answers the question: “Does growing up near a college actually make people more likely to get more than a high school education?”
Estimation and Interpretation: Running this simple regression reveals that the coefficient (\(\beta_1\)) on nearc4 is approximately 0.12.
This coefficient is positive and statistically significant. The interpretation is as follows: Growing up near a four-year college increases the probability of obtaining more than a high school education by about 12 percentage points, on average.
This result confirms that the instrument is “relevant”—it has a meaningful impact on the treatment decision.
Interpret the 2SLS Estimate: The 2SLS estimate provides the causal effect of the treatment on the outcome. For this analysis, it is the Wald Estimator. The regression of wage on treated using nearc4 as an instrument yields a coefficient of approximately 731.
Interpretation: This 2SLS coefficient is a Local Average Treatment Effect (LATE). It estimates the causal return to education for a specific subgroup of the population known as “compliers.” In this context, compliers are the individuals who pursued more than 12 years of education because they grew up near a college, and would not have done so otherwise.
The interpretation is:
For the group of people who were induced to get more than a high school education by having a college nearby, doing so caused their wages to increase by approximately 731 dollars on average.
This is a causal estimate because the IV strategy isolates the variation in education that is random (due to college proximity) rather than driven by confounding factors like innate ability.
Calculate and Interpret the Reduced Form Coefficient: The reduced form equation directly relates the outcome variable to the instrumental variable, showing the instrument’s total effect on the outcome.
The reduced form equation is: \(\text{wage} = \pi_0 + \pi_1 \text{nearc4} + v\).
The Wald estimator has a precise relationship with the first stage and the reduced form: 2SLS Estimate = (Reduced Form Coefficient) / (First Stage Coefficient)
Therefore, we can calculate the reduced form coefficient by rearranging the formula: Reduced Form Coefficient = 2SLS Estimate * First Stage Coefficient.
Interpretation: This calculated coefficient (\(\pi_1\)) of about 89 is the estimate you would get if you directly regressed wage on nearc4. The interpretation is:
On average, individuals who grew up near a four-year college earn approximately 89 dollars more than those who did not. This effect captures the entire causal chain of events: growing up near a college makes people more likely to get more education, which in turn leads to higher wages.
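The relationship between the first stage, the reduced form, and the Wald estimator can also be checked numerically. The sketch below uses simulated data, not the Card data: the sample size, the coefficients, and the unobserved `ability` confounder are hypothetical, and only the variable roles (`z` for nearc4, `d` for treated, `wage`) mirror the exercise.

```python
import random

random.seed(2)
n = 5000


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def mean_if(values, flags, flag):
    """Mean of `values` over the observations where `flags` equals `flag`."""
    picked = [v for v, f in zip(values, flags) if f == flag]
    return sum(picked) / len(picked)


# Hypothetical data-generating process
z = [random.randint(0, 1) for _ in range(n)]        # binary instrument
ability = [random.gauss(0, 1) for _ in range(n)]    # unobserved confounder
d = [1 if 0.3 * zi + 0.5 * ai + random.gauss(0, 1) > 0.2 else 0
     for zi, ai in zip(z, ability)]                 # binary treatment
wage = [500 + 100 * di + 80 * ai + random.gauss(0, 50)
        for di, ai in zip(d, ability)]

# With a binary instrument, both regressions reduce to differences in means
fs = mean_if(d, z, 1) - mean_if(d, z, 0)            # first stage
rf = mean_if(wage, z, 1) - mean_if(wage, z, 0)      # reduced form
wald = rf / fs                                      # Wald estimator

# The same number as a ratio of covariances (the IV formula)
iv_ratio = cov(z, wage) / cov(z, d)

print("first stage :", round(fs, 4))
print("reduced form:", round(rf, 4))
print("Wald        :", round(wald, 4))
print("Cov ratio   :", round(iv_ratio, 4))
```

The Wald estimate and the covariance ratio agree up to floating-point error, and multiplying the Wald estimate by the first stage returns the reduced form, which is the same check performed in step 6 of the code above.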
Designing an Alternative IV Study
A potential alternative instrument for the quality of institutions is Legal Origin (see footnote 1).
Relevance: The legal system a country inherited from its colonizer (e.g., British Common Law vs. French Civil Law) is a fundamental component of its modern institutional framework. Different legal traditions have different approaches to protecting property rights and limiting state power. Therefore, a country’s legal origin is likely to be strongly correlated with its score on “protection against expropriation risk”.
Validity (Exclusion Restriction): The argument for validity is that the identity of a country’s colonizer was, from the perspective of the colonized country, a historical accident. Whether a country was colonized by Britain versus France should not affect its long-run economic development today through any channel other than the persistent legal and political institutions that were established.
Potential Weaknesses: The exclusion restriction is highly debatable. The identity of the colonizer (e.g., Britain) could be correlated with many other factors that affect growth, such as the introduction of a specific language (English), culture, or integration into different global trade networks. If these other factors have a direct impact on GDP, then legal origin would not be a valid instrument.
Deriving the IV Estimator
Given the model:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
and the two core IV assumptions:
1. Relevance: \(Cov(Z_i, X_i) \neq 0\)
2. Exclusion Restriction: \(Cov(Z_i, \epsilon_i) = 0\)
We can derive the estimator for \(\beta_1\).
Step 1: Take the covariance of the entire model equation with the instrument \(Z_i\).
\[ Cov(Z_i, Y_i) = Cov(Z_i, \beta_0 + \beta_1 X_i + \epsilon_i) \]
Step 2: Use the additive property of covariance to expand the right side.
\[ Cov(Z_i, Y_i) = Cov(Z_i, \beta_0) + Cov(Z_i, \beta_1 X_i) + Cov(Z_i, \epsilon_i) \]
Step 3: Simplify each term.
The covariance with a constant is zero: \(Cov(Z_i, \beta_0) = 0\).
Constants can be factored out of covariance: \(Cov(Z_i, \beta_1 X_i) = \beta_1 Cov(Z_i, X_i)\).
By the exclusion restriction assumption: \(Cov(Z_i, \epsilon_i) = 0\).
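Step 4: Substitute the simplified terms back into the equation.
\[ Cov(Z_i, Y_i) = 0 + \beta_1 Cov(Z_i, X_i) + 0 \]
\[ Cov(Z_i, Y_i) = \beta_1 Cov(Z_i, X_i) \]
Step 5: Solve for \(\beta_1\). The relevance assumption guarantees \(Cov(Z_i, X_i) \neq 0\), so we can divide both sides by it:
\[ \beta_1 = \frac{Cov(Z_i, Y_i)}{Cov(Z_i, X_i)} \]
Replacing the population covariances with their sample counterparts gives the IV estimator:
\[ \hat{\beta}_1^{\text{IV}} = \frac{\widehat{Cov}(Z, Y)}{\widehat{Cov}(Z, X)} \]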
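As a rough numerical check of this derivation, the covariance ratio can be computed on simulated data where the true \(\beta_1\) is known. Everything in the sketch is an assumption made for illustration: the true coefficient of 2, the confounder `u` that makes \(X_i\) endogenous, and the noise levels.

```python
import random

random.seed(3)
n = 20_000


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


beta1 = 2.0  # hypothetical true coefficient

z = [random.gauss(0, 1) for _ in range(n)]   # instrument
u = [random.gauss(0, 1) for _ in range(n)]   # unobserved confounder
# X depends on Z (relevance) and on u (the source of endogeneity)
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
# The error term contains u, so Cov(X, eps) != 0 while Cov(Z, eps) = 0
y = [1.0 + beta1 * xi + 2.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # inconsistent under endogeneity
beta_iv = cov(z, y) / cov(z, x)    # sample analogue of Cov(Z, Y) / Cov(Z, X)

print("true beta1:", beta1)
print("OLS       :", round(beta_ols, 3))
print("IV        :", round(beta_iv, 3))
```

In this setup the IV ratio lands close to the true coefficient while the OLS slope does not, because only the instrument is uncorrelated with the error term.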
LATE: The Local Average Treatment Effect is the average effect of the treatment specifically for the subpopulation of Compliers. For this group, the average outcome with treatment is 60, and the average outcome without treatment is 40. \[ \text{LATE} = E[Y(1) - Y(0) | \text{Complier}] = 60 - 40 = 20 \] The LATE is 20, which is exactly what the Wald estimator calculates.
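To see why the Wald estimator recovers the complier effect, the calculation can be sketched with explicit population shares. Only the complier outcomes (60 and 40) come from the text above; the shares of compliers, always-takers, and never-takers and the outcomes of the last two groups are hypothetical, and the sketch assumes there are no defiers.

```python
from fractions import Fraction

# Population shares of each type (hypothetical; they sum to 1, no defiers)
share = {
    "complier": Fraction(1, 2),
    "always": Fraction(1, 5),
    "never": Fraction(3, 10),
}

# Complier outcomes are from the text; the other two are hypothetical
y1_complier, y0_complier = 60, 40
y_always = 70   # always-takers are treated regardless of the instrument
y_never = 30    # never-takers are untreated regardless of the instrument

# Mean outcome and treatment rate when the instrument is switched on (Z = 1)
ey_z1 = (share["complier"] * y1_complier
         + share["always"] * y_always
         + share["never"] * y_never)
ed_z1 = share["complier"] + share["always"]

# Mean outcome and treatment rate when the instrument is switched off (Z = 0)
ey_z0 = (share["complier"] * y0_complier
         + share["always"] * y_always
         + share["never"] * y_never)
ed_z0 = share["always"]

reduced_form = ey_z1 - ey_z0      # only compliers change their outcome
first_stage = ed_z1 - ed_z0       # equals the complier share
wald = reduced_form / first_stage
late = y1_complier - y0_complier

print("Wald estimator:", wald)
print("LATE          :", late)
```

The always-takers and never-takers cancel out of the reduced form, so whatever outcomes are assumed for them, the ratio equals the complier effect of 20.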
The Danger of Weak Instruments
Intuitive Explanation: The approximate bias of the IV estimator is \(Bias(\hat{\beta}_{IV}) \approx \frac{Cov(Z, \epsilon)}{Cov(Z, X)}\).
The numerator, \(Cov(Z, \epsilon)\), represents the degree to which the exclusion restriction is violated.
The denominator, \(Cov(Z, X)\), represents the strength (relevance) of the instrument. A “weak instrument” means the denominator is very close to zero. Even if the instrument is almost perfect (the numerator is a tiny, non-zero number), dividing a small number by a very small number results in a large ratio. Therefore, a weak instrument dramatically magnifies any small violation of the exclusion restriction, leading to a large bias in the final estimate.
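The magnification can be made concrete with a few numbers. The covariance values below are hypothetical and chosen only to show how the same small violation of the exclusion restriction translates into very different biases as the instrument gets weaker.

```python
def iv_bias(cov_z_eps, cov_z_x):
    """Approximate IV bias: Cov(Z, eps) / Cov(Z, X)."""
    return cov_z_eps / cov_z_x


violation = 0.01  # small, fixed violation of the exclusion restriction

# Same violation, progressively weaker instrument
for strength in (0.5, 0.1, 0.02):
    bias = iv_bias(violation, strength)
    print(f"Cov(Z, X) = {strength:<4}  ->  approximate bias = {bias:.3f}")
```

Shrinking the denominator by a factor of 25 inflates the bias by the same factor, even though the instrument is no less exogenous than before.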
Testing in Practice: Researchers test for weak instruments by examining the first-stage regression (the regression of the endogenous variable on the instrument(s) and controls). They use the F-statistic that tests the joint significance of all excluded instruments. The common rule of thumb is that an F-statistic greater than 10 indicates that the instruments are not weak.
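The first-stage check can be sketched for the simplest case. With a single excluded instrument and no controls, the first-stage F-statistic equals the squared t-statistic on the instrument; with several instruments a joint F-test is needed, which this sketch does not cover. The data are simulated, the coefficients and sample size are hypothetical, and the standard error assumes homoskedastic errors.

```python
import random

random.seed(1)
n = 64  # hypothetical sample size

# Hypothetical first stage: x = 9 - 0.6 * z + noise
z = [random.gauss(4.6, 1.2) for _ in range(n)]
x = [9.0 - 0.6 * zi + random.gauss(0, 1.3) for zi in z]

mz, mx = sum(z) / n, sum(x) / n
szz = sum((zi - mz) ** 2 for zi in z)
szx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))

# OLS of x on a constant and z
slope = szx / szz
intercept = mx - slope * mz

# Residual variance and the (homoskedastic) standard error of the slope
resid = [xi - intercept - slope * zi for zi, xi in zip(z, x)]
sigma2 = sum(e ** 2 for e in resid) / (n - 2)
se_slope = (sigma2 / szz) ** 0.5

t_stat = slope / se_slope
f_stat = t_stat ** 2  # one excluded instrument: F = t^2

print("first-stage slope:", round(slope, 3))
print("first-stage F    :", round(f_stat, 2))
print("F > 10 rule of thumb satisfied:", f_stat > 10)
```

In applied work the same statistic is normally read off the first-stage regression output of the software rather than computed by hand.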
Understanding 2SLS
First Stage Regression: The purpose of the first stage is to predict the endogenous variable using only exogenous information.
Dependent Variable: The endogenous variable, \(X_i\) (police numbers).
Independent Variables: All exogenous variables in the model, which include the instruments (\(Z_{1i}\) for firefighters and \(Z_{2i}\) for election year) and any exogenous controls (\(W_i\) for poverty level).
Key Difference: The regressor of interest is no longer the original endogenous variable \(X_i\), but its predicted value, \(\hat{X}_i\). This predicted value, \(\hat{X}_i\), is a linear combination of the exogenous variables only. By construction, it is uncorrelated with the structural error term \(\epsilon_i\), which solves the endogeneity problem and allows for a consistent estimate of \(\beta_1\).
Standard Errors: You should not perform 2SLS manually because the standard errors calculated by OLS in the second stage are incorrect. The regressor \(\hat{X}_i\) is an estimate generated from the first stage, so a manual second-stage OLS computes its residuals from \(\hat{X}_i\) rather than from the actual \(X_i\). Its estimate of the error variance, and every standard error built on it, therefore ignores the first stage. The reported standard errors, t-statistics, and confidence intervals are invalid; when the standard errors come out too small, the t-statistics are too large and the confidence intervals too narrow, which overstates statistical significance. Specialized software commands (ivreg, IV2SLS, ivregress) are designed to calculate the correct standard errors for the entire two-stage process.
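The difference between the manual and the corrected standard error can be sketched for the one-regressor case. The data-generating process below is hypothetical (true coefficient 0.5, one instrument `z`, one confounder `u`), and the correction shown, recomputing the residuals with the actual regressor, is the textbook homoskedastic formula rather than the exact routine used by any particular package.

```python
import random

random.seed(4)
n = 2000


def mean(v):
    return sum(v) / len(v)


def simple_ols(xs, ys):
    """OLS of ys on a constant and xs; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: u is an unobserved confounder, z is a valid instrument
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [1.0 + 0.5 * xi - 2.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# Stage 1: regress x on z and form the fitted values
a0, a1 = simple_ols(z, x)
x_hat = [a0 + a1 * zi for zi in z]

# Stage 2: regress y on the fitted values; the slope is the 2SLS estimate
b0, b1 = simple_ols(x_hat, y)

m_hat = mean(x_hat)
sxx_hat = sum((xh - m_hat) ** 2 for xh in x_hat)

# Naive residuals, as used by a manual second-stage OLS: built from x_hat
res_naive = [yi - b0 - b1 * xh for yi, xh in zip(y, x_hat)]
# 2SLS residuals: same coefficients, but built from the actual x
res_2sls = [yi - b0 - b1 * xi for yi, xi in zip(y, x)]

se_naive = (sum(e ** 2 for e in res_naive) / (n - 2) / sxx_hat) ** 0.5
se_2sls = (sum(e ** 2 for e in res_2sls) / (n - 2) / sxx_hat) ** 0.5

print("2SLS coefficient :", round(b1, 3))
print("naive manual SE  :", round(se_naive, 4))
print("corrected 2SLS SE:", round(se_2sls, 4))
```

The coefficient is identical either way; only the residuals, and hence the standard error, change. In this particular setup the manual standard error comes out smaller than the corrected one, but the size and direction of the gap depend on the data-generating process.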
Footnotes
1. La Porta, R., Lopez-de-Silanes, F., Shleifer, A., & Vishny, R. W. (1998). Law and finance. Journal of Political Economy, 106(6), 1113-1155.