Solutions Tutorial 5
The Omitted Variable Bias Problem in Pooled OLS
The Pooled OLS model ignores the panel structure of the data and can suffer from omitted variable bias if unobserved individual characteristics are correlated with the regressors. The Fixed Effects model is designed to solve this problem.
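To see this bias mechanically, here is a minimal simulation sketch (synthetic data, not the `wagepan` file): the individual effect `alpha_i` is built to be correlated with the regressor, so the pooled slope is biased upward while the within (de-meaned) slope recovers the true coefficient.

```python
import numpy as np

# Synthetic panel: alpha_i is correlated with x, so Pooled OLS suffers
# omitted variable bias while the within transformation removes it.
rng = np.random.default_rng(0)
N, T, beta = 500, 8, 0.10

alpha = rng.normal(0, 1, N)                      # unobserved individual effect
x = alpha[:, None] + rng.normal(0, 1, (N, T))    # regressor correlated with alpha_i
y = beta * x + alpha[:, None] + rng.normal(0, 0.5, (N, T))

# Pooled OLS slope: ignores the panel structure
b_pooled = np.cov(x.ravel(), y.ravel())[0, 1] / np.var(x.ravel())

# Within (fixed effects) slope: de-mean x and y per individual first
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd * yd).sum() / (xd ** 2).sum()

print(f"true beta:   {beta:.3f}")
print(f"pooled OLS:  {b_pooled:.3f}")   # biased upward (roughly beta + 0.5 in this design)
print(f"within (FE): {b_within:.3f}")   # close to the true beta
```

Here the pooled estimate absorbs part of the effect of `alpha_i`, exactly the mechanism at work for the `union` coefficient below.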
- Load the `wagepan` dataset (and declare it as a panel dataset using `nr` as the individual identifier and `year` as the time identifier).
- Estimate a Pooled OLS model to predict `lwage` using `exper` (experience), `expersq` (experience squared), and `union`.
- Estimate a Fixed Effects ("within") model using the same variables.
- Present the results of both models side-by-side in a single table.
- Interpret the difference: Pay close attention to the coefficient for the `union` variable. Why is the estimated return to union membership different in the FE model compared to the Pooled OLS model? What does this suggest about how union members may differ from non-union members in ways not captured by the other variables?
The coefficient for `union` decreases in magnitude by about 50% when estimated by fixed effects (0.167 vs. 0.083). The fixed effects estimate compares wages for the same person before and after switching union membership. This comparison is arguably more sensible than comparing union members with non-union members, since these groups might differ in ways that are not captured by the experience control variables. For example, union workers might work disproportionately in higher-paid professions, such as aviation.
import pandas as pd
import pyfixest as pf
# 1. Load the `wagepan` dataset (and declare it as a panel dataset using `nr` as the individual identifier and `year` as the time identifier).
# Declaring the data as a panel is not necessary in Python
wagepan = pd.read_stata("../tutorials/datafiles/WAGEPAN.DTA")
# 2. Estimate a **Pooled OLS** model to predict `lwage` using `exper` (experience), `expersq` (experience squared), and `union`.
pooled_ols = pf.feols("lwage ~ exper + expersq + union", data = wagepan)
# 3. Estimate a **Fixed Effects ("within")** model using the same variables.
fixed_ef = pf.feols("lwage ~ exper + expersq + union | nr", data = wagepan)
# 4. Present the results of both models side-by-side in a single table.
pf.etable([pooled_ols, fixed_ef])

| lwage        | (1)               | (2)               |
|:-------------|:------------------|:------------------|
| exper        | 0.133*** (0.011)  | 0.122*** (0.011)  |
| expersq      | -0.007*** (0.001) | -0.004*** (0.001) |
| union        | 0.167*** (0.018)  | 0.083*** (0.023)  |
| Intercept    | 1.103*** (0.035)  |                   |
| fe: nr       | -                 | x                 |
| Observations | 4360              | 4360              |
| S.E. type    | iid               | by: nr            |
| R2           | 0.073             | 0.619             |
| R2 Within    | -                 | 0.177             |

Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
# 1. Load the `wagepan` dataset (and declare it as a panel dataset using `nr` as the individual identifier and `year` as the time identifier).
library(haven); library(fixest)
# Declaring the data as a panel is not necessary with fixest; the fixed effect is given in the formula
wagepan <- read_dta("../tutorials/datafiles/WAGEPAN.DTA")
# 2. Estimate a **Pooled OLS** model to predict `lwage` using `exper` (experience), `expersq` (experience squared), and `union`.
pooled_ols <- feols(lwage ~ exper + expersq + union, data = wagepan)
# 3. Estimate a **Fixed Effects ("within")** model using the same variables.
fixed_ef <- feols(lwage ~ exper + expersq + union | nr, data = wagepan)
# 4. Present the results of both models side-by-side in a single table.
etable(list(pooled_ols, fixed_ef))
##                             model 1             model 2
## Dependent Var.:               lwage               lwage
## Constant          1.103*** (0.0353)
## exper            0.1325*** (0.0105)  0.1219*** (0.0082)
## expersq         -0.0071*** (0.0007) -0.0045*** (0.0006)
## union            0.1673*** (0.0181)  0.0833*** (0.0193)
## Fixed-Effects:  ------------------- -------------------
## nr                               No                 Yes
## _______________ ___________________ ___________________
## S.E. type                       IID                 IID
## Observations                  4,360               4,360
## R2                          0.07278             0.61913
## Within R2                        --             0.17672
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
* 1. Load the dataset
use "../tutorials/datafiles/WAGEPAN.DTA", clear
* 2. Estimate a Pooled OLS model
regress lwage exper expersq union
* Store the results for table
estimates store pooled_ols
* 3. Estimate a Fixed Effects ("within") model
xtset nr // declare panel structure with nr as ID variable
xtreg lwage exper expersq union, fe
* Store the results for table
estimates store fixed_ef
* 4. Present results side-by-side
estimates table pooled_ols fixed_ef, b(%9.4f) se(%9.4f) stats(N r2 r2_a)
* Alternative with more formatting options (requires estout package)
* ssc install estout, replace // install if needed
esttab pooled_ols fixed_ef, cells(b(star fmt(4)) se(par fmt(4))) ///
stats(N r2 r2_a, fmt(%9.0g %9.3f %9.3f)) ///
mtitle("Pooled OLS" "Fixed Effects") ///
star(* 0.10 ** 0.05 *** 0.01)

Choosing Your Model
While the Fixed Effects (FE) model is robust to correlation between unobserved effects and regressors, the Random Effects (RE) model is more efficient if its key assumption (that this correlation is zero) holds. The Hausman test is the standard tool for making this choice.
- Using the `wagepan` panel data, estimate a Random Effects model where `lwage` is a function of the time-varying predictors `married` and `union`, and the (mostly) time-invariant predictor `educ`.
- Estimate the corresponding Fixed Effects model with the same variables.
- Perform a Hausman test to formally compare the two models.
- Conclude: Based on the p-value of the test, which model should you choose? Clearly state the null hypothesis of the Hausman test and explain what your result implies about the relationship between the unobserved individual effects (\(\alpha_i\)) and the explanatory variables in your model.
The null hypothesis is consistency of both the RE and FE estimators, and the alternative hypothesis is that only the FE estimator is consistent. The test rejects the null hypothesis, implying there is a correlation between the individual \(\alpha_i\)’s and the explanatory variables in our model. Hence, the RE model suffers from omitted variable bias and its estimates cannot be trusted.
import pandas as pd
import pyfixest as pf
import numpy as np
from scipy.stats import chi2
from linearmodels import RandomEffects
from linearmodels import PanelOLS
from linearmodels.panel import compare
# Load example dataset (or use your own panel data)
wagepan = pd.read_stata("../tutorials/datafiles/WAGEPAN.DTA")
wagepan_panel = wagepan.set_index(['nr', 'year'])
# 1. Fit Random Effects Model
formula = 'lwage ~ married + union + educ'
re_model = RandomEffects.from_formula(formula, wagepan_panel).fit()
print(re_model)
## RandomEffects Estimation Summary
## ================================================================================
## Dep. Variable: lwage R-squared: 0.7260
## Estimator: RandomEffects R-squared (Between): 0.9535
## No. Observations: 4360 R-squared (Within): 0.0492
## Date: Wed, Oct 29 2025 R-squared (Overall): 0.9140
## Time: 12:55:14 Log-likelihood -1945.1
## Cov. Estimator: Unadjusted
## F-statistic: 3847.9
## Entities: 545 P-value 0.0000
## Avg Obs: 8.0000 Distribution: F(3,4357)
## Min Obs: 8.0000
## Max Obs: 8.0000 F-statistic (robust): 3847.9
## P-value 0.0000
## Time periods: 8 Distribution: F(3,4357)
## Avg Obs: 545.00
## Min Obs: 545.00
## Max Obs: 545.00
##
## Parameter Estimates
## ==============================================================================
## Parameter Std. Err. T-stat P-value Lower CI Upper CI
## ------------------------------------------------------------------------------
## married 0.2388 0.0162 14.698 0.0000 0.2069 0.2706
## union 0.1028 0.0190 5.4174 0.0000 0.0656 0.1401
## educ 0.1280 0.0015 86.360 0.0000 0.1251 0.1309
## ==============================================================================
# 2. Fit Fixed Effects Model (This time also through linearmodels)
formula = 'lwage ~ married + union + EntityEffects' # educ cannot be estimated
fe_model = PanelOLS.from_formula(formula, wagepan_panel).fit()
print(fe_model)
## PanelOLS Estimation Summary
## ================================================================================
## Dep. Variable: lwage R-squared: 0.0498
## Estimator: PanelOLS R-squared (Between): 0.1392
## No. Observations: 4360 R-squared (Within): 0.0498
## Date: Wed, Oct 29 2025 R-squared (Overall): 0.1353
## Time: 12:55:14 Log-likelihood -1647.6
## Cov. Estimator: Unadjusted
## F-statistic: 99.998
## Entities: 545 P-value 0.0000
## Avg Obs: 8.0000 Distribution: F(2,3813)
## Min Obs: 8.0000
## Max Obs: 8.0000 F-statistic (robust): 99.998
## P-value 0.0000
## Time periods: 8 Distribution: F(2,3813)
## Avg Obs: 545.00
## Min Obs: 545.00
## Max Obs: 545.00
##
## Parameter Estimates
## ==============================================================================
## Parameter Std. Err. T-stat P-value Lower CI Upper CI
## ------------------------------------------------------------------------------
## married 0.2417 0.0177 13.675 0.0000 0.2070 0.2763
## union 0.0700 0.0207 3.3798 0.0007 0.0294 0.1107
## ==============================================================================
##
## F-test for Poolability: 7.9689
## P-value: 0.0000
## Distribution: F(544,3813)
##
## Included effects: Entity
# 3. Perform a **Hausman test** to formally compare the two models.
# Manual Hausman Test
# Extract coefficients and covariance matrices
# (educ is absorbed by the fixed effects, so compare only married and union)
beta_fe, beta_re = fe_model.params, re_model.params[:2]
cov_fe, cov_re = fe_model.cov, re_model.cov.iloc[:2, :2]
# Compute difference and its covariance
diff = beta_fe - beta_re
cov_diff = cov_fe - cov_re  # under H0, Var(b_FE - b_RE) = Var(b_FE) - Var(b_RE), since RE is efficient
# Hausman test statistic
hausman_stat = diff.T @ np.linalg.inv(cov_diff) @ diff
df = len(diff) # degrees of freedom
p_value = 1 - chi2.cdf(hausman_stat, df)
print(f"Hausman Statistic: {hausman_stat:.4f}")
## Hausman Statistic: 15.8755
print(f"P-value: {p_value:.4f}")
## P-value: 0.0004
#1. Using the `wagepan` panel data, estimate a **Random Effects** model where `lwage` is a function of the time-varying predictors `married` and `union`, and the (mostly) time-invariant predictor `educ`.
library(haven); library(fixest); library(plm)
wagepan <- read_dta('../tutorials/datafiles/WAGEPAN.DTA')
formula <- 'lwage ~ married + union + educ'
random <- plm(formula, data=wagepan, index=c("nr", "year"), model="random") #random model
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = formula, data = wagepan, model = "random", index = c("nr",
## "year"))
##
## Balanced Panel: n = 545, T = 8, N = 4360
##
## Effects:
## var std.dev share
## idiosyncratic 0.1426 0.3776 0.574
## individual 0.1057 0.3251 0.426
## theta: 0.6202
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.526174 -0.160645 0.025922 0.206346 1.649449
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 0.6306824 0.1029879 6.1239 9.134e-10 ***
## married 0.2328783 0.0161908 14.3834 < 2.2e-16 ***
## union 0.0991208 0.0189042 5.2433 1.577e-07 ***
## educ 0.0758092 0.0086334 8.7809 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 667.92
## Residual Sum of Squares: 622.56
## R-Squared: 0.067925
## Adj. R-Squared: 0.067283
## Chisq: 317.443 on 3 DF, p-value: < 2.22e-16
#2. Estimate the corresponding **Fixed Effects** model with the same variables.
fixed <- plm(formula, data=wagepan, index=c("nr", "year"), model="within") #fixed model
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula, data = wagepan, model = "within", index = c("nr",
## "year"))
##
## Balanced Panel: n = 545, T = 8, N = 4360
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.116348 -0.137546 0.016979 0.180612 1.555540
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## married 0.241684 0.017673 13.6750 < 2.2e-16 ***
## union 0.070044 0.020724 3.3798 0.0007326 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 572.05
## Residual Sum of Squares: 543.54
## R-Squared: 0.049837
## Adj. R-Squared: -0.086221
## F-statistic: 99.9981 on 2 and 3813 DF, p-value: < 2.22e-16
#3. Perform a **Hausman test** to formally compare the two models.
phtest(fixed,random) #Hausman test
##
## Hausman Test
##
## data: formula
## chisq = 13.453, df = 2, p-value = 0.001199
## alternative hypothesis: one model is inconsistent

use "../tutorials/datafiles/WAGEPAN.DTA", clear
xtset nr // declare panel structure with nr as ID variable
* 1. Random Effects
xtreg lwage married union educ, re
* Store the results for test
estimates store re
* 2. Estimate a Fixed Effects ("within") model
xtreg lwage married union educ, fe
* Store results
estimates store fe
* 3. Hausman test
hausman fe re

Time-Invariant Variables
A critical drawback of the Fixed Effects model is that it cannot estimate the coefficients of time-invariant variables. The “within” transformation, which subtracts the individual-specific mean, wipes out any variable that is constant over time for an individual.
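A tiny numeric sketch (made-up numbers, one individual) of why the within transformation wipes out time-invariant variables:

```python
import numpy as np

# Toy example: one individual observed for T=4 periods.
# `exper` varies over time; `black` is constant for the individual.
exper = np.array([1.0, 2.0, 3.0, 4.0])   # time-varying regressor
black = np.array([1.0, 1.0, 1.0, 1.0])   # time-invariant regressor

# The within transformation subtracts the individual's time mean.
exper_demeaned = exper - exper.mean()
black_demeaned = black - black.mean()

print(exper_demeaned)  # [-1.5 -0.5  0.5  1.5] -> variation survives
print(black_demeaned)  # [0. 0. 0. 0.]         -> identically zero, coefficient not identified
```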
- Attempt to estimate a Fixed Effects model for `lwage` that includes both time-varying (`union`, `married`) and time-invariant (`black`, `hisp`) predictors.
- Observe the model output. What happens to the coefficients for the `black` and `hisp` variables?

The coefficients cannot be estimated, and the variables are dropped from the model.
- Now, estimate a Random Effects model using the exact same set of predictors.
- Explain the difference: Why can the RE model provide estimates for `black` and `hisp` while the FE model cannot? Relate your answer directly to the underlying mathematical transformation of each estimator.
The RE model is able to estimate the effects of time-invariant variables, while the FE model is not, due to collinearity of the time-invariant variables with the fixed effects. In other words, the FE model cannot provide estimates due to a lack of within variation.
import pandas as pd
import pyfixest as pf
import matplotlib.pyplot as plt
from linearmodels import RandomEffects
# Load data
df = pd.read_stata("../tutorials/datafiles/WAGEPAN.DTA")
# Estimate FE model
fe = pf.feols("lwage ~ union + married + black + hisp | nr", data=df)
## UserWarning: 2 variables dropped due to multicollinearity.
## The following variables are dropped: ['black', 'hisp'].
fe.summary()
###
Estimation: OLS
Dep. var.: lwage, Fixed effects: nr
Inference: CRV1
Observations: 4360
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| union | 0.070 | 0.025 | 2.785 | 0.006 | 0.021 | 0.119 |
| married | 0.242 | 0.022 | 10.990 | 0.000 | 0.198 | 0.285 |
---
RMSE: 0.353 R2: 0.56 R2 Within: 0.05
# Estimate RE model
# (note: linearmodels formulas do not include an intercept by default; add "1 +" to the formula to include one)
wagepan_panel = df.set_index(['nr', 'year'])
re_model = RandomEffects.from_formula("lwage ~ union + married + black + hisp", wagepan_panel).fit()
print(re_model)
                   RandomEffects Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1478
Estimator: RandomEffects R-squared (Between): 0.3875
No. Observations: 4360 R-squared (Within): 0.0402
Date: Wed, Oct 29 2025 R-squared (Overall): 0.3723
Time: 12:55:14 Log-likelihood -2218.4
Cov. Estimator: Unadjusted
F-statistic: 188.85
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(4,4356)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 188.85
P-value 0.0000
Time periods: 8 Distribution: F(4,4356)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
union 0.1350 0.0217 6.2207 0.0000 0.0925 0.1776
married 0.3354 0.0183 18.296 0.0000 0.2995 0.3714
black 1.3864 0.1256 11.037 0.0000 1.1402 1.6327
hisp 1.4340 0.1083 13.239 0.0000 1.2216 1.6463
==============================================================================
wagepan <- read_dta('../tutorials/datafiles/WAGEPAN.DTA')
formula <- 'lwage ~ union + married + black + hisp'
#1 Estimate FE.
fixed <- plm(formula, data=wagepan, index=c("nr", "year"), model="within") #fixed model
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = formula, data = wagepan, model = "within", index = c("nr",
## "year"))
##
## Balanced Panel: n = 545, T = 8, N = 4360
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.116348 -0.137546 0.016979 0.180612 1.555540
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## union 0.070044 0.020724 3.3798 0.0007326 ***
## married 0.241684 0.017673 13.6750 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 572.05
## Residual Sum of Squares: 543.54
## R-Squared: 0.049837
## Adj. R-Squared: -0.086221
## F-statistic: 99.9981 on 2 and 3813 DF, p-value: < 2.22e-16
# 2. Random Effects
random <- plm(formula, data=wagepan, index=c("nr", "year"), model="random") #random model
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = formula, data = wagepan, model = "random", index = c("nr",
## "year"))
##
## Balanced Panel: n = 545, T = 8, N = 4360
##
## Effects:
## var std.dev share
## idiosyncratic 0.1426 0.3776 0.541
## individual 0.1212 0.3481 0.459
## theta: 0.642
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.47729 -0.16103 0.02310 0.21081 1.64228
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1.545371 0.020615 74.9638 < 2.2e-16 ***
## union 0.098757 0.019126 5.1635 2.423e-07 ***
## married 0.232461 0.016381 14.1908 < 2.2e-16 ***
## black -0.118913 0.050842 -2.3389 0.01934 *
## hisp -0.055309 0.044639 -1.2390 0.21534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 657.23
## Residual Sum of Squares: 622.67
## R-Squared: 0.052587
## Adj. R-Squared: 0.051717
## Chisq: 241.73 on 4 DF, p-value: < 2.22e-16

* Load data
use "../tutorials/datafiles/WAGEPAN.DTA", clear
* Declare panel structure
xtset nr year
* 1. Fixed Effects estimation
xtreg lwage union married black hisp, fe
estimates store fixed
* 2. Random Effects estimation
xtreg lwage union married black hisp, re
estimates store random
* (Optional) Display both results for comparison
estimates table fixed random, b se stats(N r2)

Interpreting the Hausman Test Statistic
In mathematical terms, the null hypothesis (\(H_0\)) of the Hausman test is that the unobserved individual-specific effects, \(\alpha_i\), are not correlated with the explanatory variables, \(x_{it}\).
This can be formally stated as:
\(H_0: Cov(\alpha_i, x_{it}) = 0\)
Under this null hypothesis, both the Fixed Effects and Random Effects estimators are consistent, but the Random Effects estimator is more efficient. The alternative hypothesis (\(H_A\)) is that a correlation does exist, meaning \(Cov(\alpha_i, x_{it}) \neq 0\). Under the alternative, only the Fixed Effects estimator remains consistent.
A large, statistically significant test statistic from a Hausman test leads to the rejection of the null hypothesis. This directly implies that the Random Effects estimates are inconsistent due to the following chain of logic:
- Consistency of the Estimators:
- Fixed Effects (FE): The FE estimator is consistent whether the individual effects (\(\alpha_i\)) are correlated with the regressors or not. It achieves consistency by eliminating the \(\alpha_i\) term from the equation entirely (through de-meaning or first-differencing).
- Random Effects (RE): The RE estimator’s consistency critically depends on the assumption that the individual effects (\(\alpha_i\)) are uncorrelated with the regressors. This is the very condition being tested by the null hypothesis. If this assumption is violated, the RE model suffers from omitted variable bias because it fails to properly account for the influence of the correlated \(\alpha_i\) term. This bias makes the RE estimator inconsistent.
- What the Hausman Test Measures: The test statistic is constructed based on the difference between the coefficient vectors from the two models: \((\hat{\beta}_{FE} - \hat{\beta}_{RE})\).
- If the null hypothesis is true (\(Cov(\alpha_i, x_{it}) = 0\)), both estimators are consistent. This means as the sample size grows, both \(\hat{\beta}_{FE}\) and \(\hat{\beta}_{RE}\) should converge to the same true parameter values. Any difference between them should be small and statistically insignificant (due to random sampling variation).
- If the null hypothesis is false (\(Cov(\alpha_i, x_{it}) \neq 0\)), the two estimators will converge to different values. The FE estimator will converge to the true \(\beta\), but the RE estimator will converge to a different, biased value. Therefore, the difference \((\hat{\beta}_{FE} - \hat{\beta}_{RE})\) will be systematic and statistically significant.
- Conclusion from a Significant Result: A large and statistically significant Hausman test statistic indicates that the difference between the FE and RE coefficients is too large to be attributed to chance. Since we know the FE estimator is consistent under either scenario, the systematic difference must be due to the failure of the RE estimator. By rejecting the null hypothesis, you are concluding that the core assumption required for the RE model’s consistency is violated. Therefore, the Random Effects estimates are inconsistent.
Choosing an Empirical Strategy
1. Fixed Effects (FE) Estimator: Advantage and Limitation
Primary Advantage: The primary advantage of the Fixed Effects estimator is its ability to control for omitted variable bias arising from unobserved, time-invariant heterogeneity. The model assumes an unobserved effect, \(a_i\), for each firm that is constant over time. This \(a_i\) could represent factors like a firm’s inherent managerial quality or its long-standing brand reputation. If these factors are correlated with both CEO salary and the explanatory variables (e.g., more reputable firms hire more experienced CEOs and pay them more), then estimators that ignore \(a_i\) (like Pooled OLS) will produce biased results. The FE estimator eliminates \(a_i\) by de-meaning the data (i.e., subtracting the time-mean from each variable for each individual firm). This process removes any bias from time-invariant confounders, making the resulting estimates for \(\beta_1\) and \(\beta_2\) consistent.
Significant Limitation: The main limitation of the FE estimator for this specific model is that it cannot estimate the effect of time-invariant explanatory variables. In the proposed equation, the variable `IvyLeague_i` does not change over time for a given CEO. The de-meaning process of the FE estimator subtracts the mean of each variable from itself: \(\ddot{z}_{it} = z_{it} - \bar{z}_i\). For a time-invariant variable, \(z_{it} = \bar{z}_i\) for all \(t\), so the transformed variable becomes zero. As a result, the effect of graduating from an Ivy League university, \(\gamma_1\), cannot be estimated using the FE model.
2. Random Effects (RE) Model: Key Assumption and Advantage
Key Assumption: The critical assumption of the Random Effects model is that the unobserved individual effect, \(a_i\), is uncorrelated with all explanatory variables in all time periods. Formally, this is expressed as: \[ E(a_i | \text{FirmProfit}_{i1}, ..., \text{FirmProfit}_{iT}, \text{Experience}_{i1}, ..., \text{Experience}_{iT}, \text{IvyLeague}_i) = 0 \] In this context, it means that unobserved factors like a firm’s “corporate culture” are not systematically related to its profits or the experience of its CEO.
Advantage over FE: If this key assumption holds, the RE model has two major advantages over the FE model:
- It can estimate the effects of time-invariant variables. Unlike the FE model, the RE model does not wipe out variables like `IvyLeague_i`, allowing the economist to estimate the salary premium associated with an Ivy League education.
- It is more efficient. The RE estimator uses both the "within" (over time) and "between" (across firms) variation in the data, whereas the FE estimator only uses the "within" variation. This makes the RE estimator more efficient (i.e., it produces estimates with smaller standard errors), assuming its key assumption is met.
3. Step-by-Step Testing Procedure
To determine the most appropriate estimator, the economist should follow this procedure:
- Step 1: Choose between Pooled OLS and Random Effects.
- Test: Perform the Breusch-Pagan Lagrange Multiplier (LM) test for random effects.
- Hypotheses:
- \(H_0: \sigma_a^2 = 0\) (There are no significant firm-specific effects. The variance of the unobserved effect is zero).
- \(H_1: \sigma_a^2 > 0\) (There are significant firm-specific effects).
- Decision: If the p-value is high (e.g., > 0.05), fail to reject the null. This suggests firm-specific effects are negligible, and the simpler Pooled OLS model is adequate. If the p-value is low (< 0.05), reject the null, indicating that a model accounting for individual effects (Random Effects) is superior to Pooled OLS.
- Step 2: Choose between Random Effects and Fixed Effects.
- Test: Perform the Hausman test. This test compares the coefficient estimates from the RE and FE models.
- Hypotheses:
- \(H_0: E(a_i | X_{it}) = 0\) (The key RE assumption holds; the unobserved effects are not correlated with the regressors. Any difference between RE and FE coefficients is not systematic).
- \(H_1: E(a_i | X_{it}) \neq 0\) (The RE assumption is violated. The difference between coefficients is systematic).
- Decision: If the p-value is high (e.g., > 0.05), fail to reject the null. This means there is no evidence that the RE assumption is violated, so the economist should choose the more efficient Random Effects model. If the p-value is low (< 0.05), reject the null. This indicates the RE model would suffer from omitted variable bias and is inconsistent. The economist must therefore use the consistent Fixed Effects model, even if it means not being able to estimate the effect of `IvyLeague_i`.
- Step 3: Check for Serial Correlation.
- Regardless of the chosen model (POLS, RE, or FE), the economist should test for serial correlation in the error terms using a test like the Breusch-Godfrey test. If serial correlation is present, standard errors will be incorrect. The solution is to re-estimate the chosen model using clustered standard errors (clustered at the firm level) to produce valid statistical inferences.
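The Breusch-Pagan LM statistic from Step 1 can also be computed by hand from pooled OLS residuals. Below is a sketch on simulated data using the standard balanced-panel formula (chi-squared with 1 degree of freedom under the null); in practice one would use `plm::plmtest` in R or `xttest0` in Stata.

```python
import numpy as np
from scipy.stats import chi2

# Hand-rolled Breusch-Pagan LM test for random effects (balanced panel),
# on simulated data with a genuine individual effect (sigma_a > 0).
rng = np.random.default_rng(1)
N, T = 300, 6

alpha = rng.normal(0, 0.5, N)                    # individual effect
x = rng.normal(0, 1, (N, T))
y = 1.0 + 0.5 * x + alpha[:, None] + rng.normal(0, 1, (N, T))

# Pooled OLS residuals
X = np.column_stack([np.ones(N * T), x.ravel()])
b = np.linalg.lstsq(X, y.ravel(), rcond=None)[0]
e = (y.ravel() - X @ b).reshape(N, T)

# LM = NT / (2(T-1)) * [ sum_i (sum_t e_it)^2 / sum_it e_it^2 - 1 ]^2  ~ chi2(1) under H0
lm = N * T / (2 * (T - 1)) * ((e.sum(axis=1) ** 2).sum() / (e ** 2).sum() - 1) ** 2
p = chi2.sf(lm, df=1)

print(f"LM = {lm:.2f}, p = {p:.4f}")  # small p -> reject H0, prefer RE over Pooled OLS
```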
Interpreting Conflicting Results
1. Breusch-Pagan Test: RE vs. Pooled OLS
Based on the Breusch-Pagan test, the researcher should prefer the Random Effects model over the Pooled OLS model.
- Interpretation: The null hypothesis of the Breusch-Pagan test is that the variance of the unobserved individual-specific effects is zero (\(H_0: \sigma_a^2 = 0\)). This implies that there are no unique, time-invariant characteristics across farms that influence crop yield. The reported p-value of 0.001 is far below any conventional significance level (e.g., 0.05). Therefore, we strongly reject the null hypothesis. This provides statistical evidence that significant unobserved heterogeneity exists across farms (e.g., differences in soil quality, drainage, or farmer skill). Because Pooled OLS ignores this heterogeneity, it is an inappropriate model. The Random Effects model, which explicitly models this farm-specific variance, is the better choice.
2. Explanation for Different Fertilizer Coefficients
The drastic difference in the Fertilizer coefficient between the RE model (6.2) and the FE model (2.1) is a classic sign of omitted variable bias, which the Hausman test later confirms.
Econometric Explanation: The Random Effects model assumes that the unobserved farm-specific effects (\(a_i\)) are uncorrelated with the regressors, including `Fertilizer_it`. The Fixed Effects model makes no such assumption. The large, downward shift in the coefficient when moving from RE to FE suggests that \(a_i\) is positively correlated with both fertilizer use and crop yield.

Real-World Example: The unobserved effect \(a_i\) could represent the farm's innate soil quality. It is plausible that farmers with better soil quality (high \(a_i\)) are more progressive or have more capital, making them more likely to adopt the new, effective fertilizer (a positive correlation between \(a_i\) and `Fertilizer_it`). At the same time, better soil naturally leads to higher yields (a positive correlation between \(a_i\) and `Yield_it`).

Conclusion: The RE model fails to control for this correlation. It incorrectly attributes some of the yield-enhancing effect of good soil to the fertilizer, leading to an upwardly biased and inconsistent estimate of the fertilizer's true effect. The FE model, by controlling for all time-invariant factors like soil quality, isolates the true causal effect of changing fertilizer use on yield for a given farm, providing a much lower and likely more accurate estimate.
3. Final Model Choice and Interpretation
The researcher should select the Fixed Effects (FE) model as the most reliable.
Justification: The Hausman test is designed to choose between the RE and FE models. Its null hypothesis is that the RE model is appropriate (i.e., the unobserved effects are not correlated with the regressors). The test yields a p-value of 0.000, which is highly significant. This leads us to reject the null hypothesis and conclude that the key assumption of the RE model is violated. The systematic difference between the RE and FE coefficients is statistically significant, confirming the presence of omitted variable bias in the RE estimates. Therefore, the consistent FE model must be chosen.
Final Estimated Effect: According to the chosen Fixed Effects model, a one-unit increase in the application of the fertilizer is associated with a 2.1-unit increase in crop yield, holding rainfall constant. This effect is statistically significant at the 10% level (p < 0.1).
Random Effects and the Role of \(\theta\)
- As the variance of the individual effect approaches zero (\(\sigma^2_\alpha \to 0\))
Behavior of \(\theta\): As \(\sigma^2_\alpha \to 0\), the term \(T \sigma^2_\alpha\) also goes to zero. The formula for \(\theta\) becomes: \(\theta \to 1 - \sqrt{\frac{\sigma^2_\epsilon}{(0 + \sigma^2_\epsilon)}} = 1 - \sqrt{1} = 0\) The value of \(\theta\) approaches zero.
Model Simplification: The Random Effects transformation is \(y_{it} - \theta \bar{y}_i\). When \(\theta=0\), this becomes \(y_{it} - 0 \cdot \bar{y}_i = y_{it}\). The data is left untransformed. Therefore, the Random Effects model simplifies to the Pooled OLS model. This is logical, as a zero variance for the individual effect implies there are no unique individual effects to control for.
- As the number of time periods becomes very large (\(T \to \infty\))
Behavior of \(\theta\): As \(T \to \infty\), the denominator \((T \sigma^2_\alpha + \sigma^2_\epsilon)\) also approaches infinity. This causes the fraction inside the square root to approach zero: \(\frac{\sigma^2_\epsilon}{(T \sigma^2_\alpha + \sigma^2_\epsilon)} \to 0\) The formula for \(\theta\) thus becomes: \(\theta \to 1 - \sqrt{0} = 1\) The value of \(\theta\) approaches one.
Model Equivalence: When \(\theta=1\), the Random Effects transformation \(y_{it} - \theta \bar{y}_i\) becomes \(y_{it} - \bar{y}_i\). This is the “de-meaning” or “within” transformation. Consequently, the Random Effects estimator becomes equivalent to the Fixed Effects (FE) estimator.
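Both limiting cases can be checked numerically; a short sketch plugging arbitrary variance components into the formula for \(\theta\):

```python
import numpy as np

# theta = 1 - sqrt( sigma_e^2 / (T * sigma_a^2 + sigma_e^2) )
def theta(T, sigma_a2, sigma_e2):
    return 1 - np.sqrt(sigma_e2 / (T * sigma_a2 + sigma_e2))

sigma_e2 = 1.0

# Case 1: individual-effect variance shrinks -> theta -> 0 (RE -> Pooled OLS)
for s_a2 in [1.0, 0.1, 0.001, 0.0]:
    print(f"sigma_a^2 = {s_a2:<6} theta = {theta(8, s_a2, sigma_e2):.4f}")

# Case 2: number of periods grows -> theta -> 1 (RE -> FE / within)
for T in [2, 10, 1000, 100000]:
    print(f"T = {T:<7} theta = {theta(T, 1.0, sigma_e2):.4f}")
```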