Empirical Economics

Tutorial 5: Panel Data II

Recapitulation of the Lecture

New Estimators

The fundamental static panel data model is: \[ y_{it} = \beta_1 x_{1it} + ... + \beta_k x_{kit} + a_i + u_{it} \]

\(a_i\) is the unobserved, time-invariant individual effect.
\(u_{it}\) is the idiosyncratic error term.

New estimators:

Pooled OLS: Ignores the panel structure and treats the data as one large cross-section. It estimates: \[ y_{it} = \beta_0 + \beta_1 x_{1it} + ... + \beta_k x_{kit} + v_{it} \] where the composite error is \(v_{it} = a_i + u_{it}\).
Key Assumption: Consistency requires that the unobserved effect \(a_i\) is uncorrelated with the explanatory variables (\(E(a_i | X_i) = 0\)). Pooled OLS is efficient, and its conventional standard errors are valid, only if in addition the variance of \(a_i\) is zero (\(\sigma_a^2 = 0\)).
Problem: If \(\sigma_a^2 > 0\), the presence of \(a_i\) in the composite error induces serial correlation within each individual, so correct inference requires clustered standard errors.
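
A short derivation shows where this serial correlation comes from. It assumes, as is standard for this model but not stated above, that \(a_i\) and \(u_{it}\) are uncorrelated with each other, that \(u_{it}\) is serially uncorrelated, and that their variances are \(\sigma_a^2\) and \(\sigma_u^2\) (the symbol \(\sigma_u^2\) is introduced here for illustration). For two periods \(t \neq s\) of the same individual:

\[ \text{Cov}(v_{it}, v_{is}) = \text{Cov}(a_i + u_{it}, a_i + u_{is}) = \sigma_a^2, \qquad \text{Corr}(v_{it}, v_{is}) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_u^2} \]

Under these assumptions the correlation is the same for every pair of periods and vanishes only when \(\sigma_a^2 = 0\).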

New Estimators (Cont.)

Random Effects (RE): Acknowledges the unobserved effect \(a_i\) but treats it as a random component of the error term. It uses (feasible) Generalized Least Squares (GLS) to account for the serial correlation that \(a_i\) induces in the composite error.

Key Assumption: \(a_i\) is uncorrelated with the explanatory variables (\(E(a_i | X_i) = 0\)).
Main Advantages:
It can estimate the coefficients of time-invariant variables (e.g., gender, race, a firm’s industry).
It is more efficient (has smaller standard errors) than Fixed Effects if the key assumption holds.

Dilemma - Fixed Effects vs. Random Effects

The most important choice in panel data analysis is often between Fixed Effects (FE) and Random Effects (RE).

Feature	Fixed Effects (FE)	Random Effects (RE)
Key Assumption	Allows for correlation between the unobserved effect \(a_i\) and regressors \(X_{it}\).	Assumes no correlation between \(a_i\) and \(X_{it}\).
Consistency	Consistent even if the RE assumption is violated (strict exogeneity with respect to \(u_{it}\) is still required).	Inconsistent if \(a_i\) and \(X_{it}\) are correlated (suffers from omitted variable bias).
Efficiency	Less efficient (uses only within-individual variation).	More efficient if its assumption holds (uses both within and between variation).
Time-Invariant Variables	Cannot estimate their effects (they are wiped out by the de-meaning process).	Can estimate their effects.
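
The last row of the table can be checked by hand. The following is a minimal sketch of the de-meaning ("within") step only, not a full FE estimator; the two individuals and their values are hypothetical and do not come from any dataset used in this tutorial.

```python
from statistics import mean


def within_transform(values_by_unit):
    """Subtract each unit's own time average from its observations."""
    return {
        unit: [v - mean(values) for v in values]
        for unit, values in values_by_unit.items()
    }


# Hypothetical data: two individuals observed for three years.
union = {"A": [0, 1, 1], "B": [1, 1, 0]}        # varies over time
educ = {"A": [12, 12, 12], "B": [16, 16, 16]}   # constant within each individual

union_dm = within_transform(union)
educ_dm = within_transform(educ)

print("de-meaned union:", union_dm)
print("de-meaned educ: ", educ_dm)
```

After the transformation the time-invariant variable is zero for every observation, so it has no variation left from which a coefficient could be estimated, while the time-varying variable keeps its within-individual variation.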

Hausman Test

The Hausman test is the formal procedure for choosing between FE and RE.

Intuition: It compares the coefficient estimates from the FE and RE models. If the RE assumption holds, the estimates should be similar. If they are systematically different, the RE assumption is likely violated.
Hypotheses:
\(H_0\): The Random Effects model is appropriate (\(E(a_i | X_{it}) = 0\)).
\(H_A\): The Fixed Effects model is appropriate (\(E(a_i | X_{it}) \neq 0\)).
Decision Rule:
If the p-value is low (< 0.05): Reject \(H_0\). The FE and RE estimates differ significantly. Use the consistent Fixed Effects model.
If the p-value is high (> 0.05): Fail to reject \(H_0\). There is insufficient evidence against the RE assumption. Use the more efficient Random Effects model.
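
For a single coefficient the test reduces to simple arithmetic, which the sketch below illustrates. It assumes the scalar form of the statistic, \((\hat{\beta}_{FE} - \hat{\beta}_{RE})^2 / (\text{Var}(\hat{\beta}_{FE}) - \text{Var}(\hat{\beta}_{RE}))\), compared against a chi-squared distribution with one degree of freedom. The function name and all numbers are hypothetical; in practice statistical software computes the multi-coefficient version.

```python
import math


def hausman_scalar(b_fe, se_fe, b_re, se_re):
    """Hausman statistic and p-value for one coefficient (chi-squared, 1 df)."""
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        raise ValueError("Var(FE) - Var(RE) must be positive in this scalar sketch")
    stat = (b_fe - b_re) ** 2 / var_diff
    # Upper tail of a chi-squared(1) variable: P(Z^2 > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical estimates, not taken from any dataset in this tutorial.
stat, p_value = hausman_scalar(b_fe=0.08, se_fe=0.020, b_re=0.11, se_re=0.015)
print(f"Hausman statistic: {stat:.2f}, p-value: {p_value:.3f}")
print("Use Fixed Effects" if p_value < 0.05 else "Use Random Effects")
```

With these made-up inputs the statistic is about 5.14 and the p-value falls below 0.05, so the decision rule points to Fixed Effects.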

Line of Reasoning for Model Selection

Follow this flowchart to select and justify your panel data model:

Step 1: Is there an individual effect? (Pooled OLS vs. Random Effects)

Action: Run a Breusch-Pagan LM Test after estimating a Pooled OLS model.
Hypothesis: \(H_0: \sigma_a^2 = 0\) (no individual effect: the variance of \(a_i\) is zero).
Decision:
- p-value HIGH \(\rightarrow\) Stop. Use Pooled OLS (with clustered standard errors).
- p-value LOW \(\rightarrow\) An individual effect exists. Proceed to Step 2.
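
As an illustration of what the test in Step 1 computes, here is a minimal sketch of the classic Breusch-Pagan LM statistic for a balanced panel, built from Pooled OLS residuals and compared against a chi-squared distribution with one degree of freedom. Software packages may implement variants of this test; the function name and the toy residuals are hypothetical.

```python
import math


def breusch_pagan_lm(residuals):
    """LM statistic for H0: sigma_a^2 = 0 from pooled OLS residuals.

    residuals: one list per individual, each holding that individual's T residuals.
    """
    n = len(residuals)
    t = len(residuals[0])
    if t < 2 or any(len(r) != t for r in residuals):
        raise ValueError("this sketch needs a balanced panel with T >= 2")
    sum_sq_of_sums = sum(sum(r) ** 2 for r in residuals)       # sum_i (sum_t e_it)^2
    sum_of_squares = sum(e ** 2 for r in residuals for e in r)  # sum_i sum_t e_it^2
    lm = n * t / (2 * (t - 1)) * (sum_sq_of_sums / sum_of_squares - 1) ** 2
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-squared(1) upper tail
    return lm, p_value


# Toy residuals: each individual's residuals share the same sign in both periods.
lm, p_value = breusch_pagan_lm([[1.0, 1.0], [-1.0, -1.0]])
print(f"LM = {lm:.2f}, p-value = {p_value:.3f}")
```

The statistic grows when residuals of the same individual move together across periods, which is what a non-zero \(\sigma_a^2\) would produce. The two-individual example only checks the arithmetic; the chi-squared approximation needs a large number of individuals.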

Step 2: Is the individual effect correlated with regressors? (RE vs. FE)

Action: Run a Hausman Test comparing your Random Effects and Fixed Effects models.
Hypothesis: \(H_0: E(a_i | X_{it}) = 0\) (the effect is not correlated with regressors).
Decision:
- p-value HIGH \(\rightarrow\) The RE assumption holds. Your final model is Random Effects.
- p-value LOW \(\rightarrow\) The RE assumption is violated. Your final model is Fixed Effects.

Step 3: Final Checks

Always check for serial correlation in the residuals of your chosen model (e.g., with a Breusch-Godfrey test).
If serial correlation is present, use robust (clustered) standard errors to ensure your statistical inferences are valid. This is standard practice in panel data analysis.
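
The decision logic of Steps 1 and 2 can be written down compactly. The sketch below is only a restatement of the rules above; the function name, the default significance level of 0.05, and the example p-values are assumptions for illustration.

```python
def select_model(bp_pvalue, hausman_pvalue=None, alpha=0.05):
    """Return the model suggested by the Breusch-Pagan and Hausman p-values."""
    # Step 1: no evidence of an individual effect, so stay with Pooled OLS.
    if bp_pvalue >= alpha:
        return "Pooled OLS"
    if hausman_pvalue is None:
        raise ValueError("a Hausman p-value is needed once Breusch-Pagan rejects")
    # Step 2: no evidence against the RE assumption, so keep the efficient estimator.
    if hausman_pvalue >= alpha:
        return "Random Effects"
    return "Fixed Effects"


# Hypothetical p-values.
print(select_model(0.40))         # Step 1 does not reject
print(select_model(0.001, 0.30))  # Step 1 rejects, Step 2 does not
print(select_model(0.001, 0.01))  # both tests reject
```

Whatever the function returns, the check in Step 3 still applies: report clustered standard errors for the chosen model.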

Questions

The Omitted Variable Bias Problem in Pooled OLS

The Pooled OLS model ignores the panel structure of the data and can suffer from omitted variable bias if unobserved individual characteristics are correlated with the regressors. The Fixed Effects model is designed to solve this problem.

Load the wagepan dataset and declare it as a panel dataset, using nr as the individual identifier and year as the time identifier.
Estimate a Pooled OLS model to predict lwage using exper (experience), expersq (experience squared), and union.
Estimate a Fixed Effects (“within”) model using the same variables.
Present the results of both models side-by-side in a single table.
Interpret the difference: Pay close attention to the coefficient for the union variable. Why is the estimated return to union membership different in the FE model compared to the Pooled OLS model? What does this suggest about how union members may differ from non-union members in ways not captured by the other variables?
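
The mechanism behind this exercise can be previewed on simulated data, where the true coefficient is known. The sketch below is not the wagepan analysis: it generates a hypothetical panel with one regressor that is deliberately correlated with the individual effect, then compares the Pooled OLS slope with the slope after the within transformation. All parameter values are assumptions chosen for illustration.

```python
import random
from statistics import mean


def slope(x, y):
    """OLS slope from a regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n_units, n_periods, true_beta = 500, 5, 1.0

x_all, y_all, x_dm, y_dm = [], [], [], []
for _ in range(n_units):
    a_i = random.gauss(0, 1)  # unobserved individual effect
    # The regressor is correlated with a_i by construction.
    x_i = [a_i + random.gauss(0, 1) for _ in range(n_periods)]
    y_i = [true_beta * x + a_i + random.gauss(0, 1) for x in x_i]
    x_bar, y_bar = mean(x_i), mean(y_i)
    x_all += x_i
    y_all += y_i
    x_dm += [x - x_bar for x in x_i]  # within transformation removes a_i
    y_dm += [y - y_bar for y in y_i]

pooled = slope(x_all, y_all)
within = slope(x_dm, y_dm)
print(f"true beta: {true_beta}, pooled OLS: {pooled:.2f}, within (FE): {within:.2f}")
```

In this design the Pooled OLS slope absorbs part of the effect of \(a_i\) and overstates the true coefficient, while the within estimate stays close to it. The direction and size of the bias in real data depend on how the unobserved effect correlates with the regressor.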

Choosing Your Model

While the Fixed Effects (FE) model is robust to correlation between unobserved effects and regressors, the Random Effects (RE) model is more efficient if its key assumption (that this correlation is zero) holds. The Hausman test is the standard tool for making this choice.

Using the wagepan panel data, estimate a Random Effects model where lwage is a function of the time-varying predictors married and union, and the (mostly) time-invariant predictor educ.
Estimate the corresponding Fixed Effects model with the same variables.
Perform a Hausman test to formally compare the two models.
Conclude: Based on the p-value of the test, which model should you choose? Clearly state the null hypothesis of the Hausman test and explain what your result implies about the relationship between the unobserved individual effects (\(a_i\)) and the explanatory variables in your model.

Time-Invariant Variables

A critical drawback of the Fixed Effects model is that it cannot estimate the coefficients of time-invariant variables. The “within” transformation, which subtracts the individual-specific mean, wipes out any variable that is constant over time for an individual. Again using the wagepan dataset:

Attempt to estimate a Fixed Effects model for lwage that includes both time-varying (union, married) and time-invariant (black, hisp) predictors.
Observe the model output. What happens to the coefficients for the black and hisp variables?
Now, estimate a Random Effects model using the exact same set of predictors.
Explain the difference: Why can the RE model provide estimates for black and hisp while the FE model cannot? Relate your answer directly to the underlying mathematical transformation of each estimator.

Interpreting the Hausman Test Statistic

The Hausman test is based on the difference between the coefficient vectors from the Fixed Effects and Random Effects models, \((\hat{\beta}_{FE} - \hat{\beta}_{RE})\).
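
For reference, the textbook form of the statistic built from this difference is given below. The notation \(\widehat{\text{Var}}(\cdot)\) for the estimated covariance matrices is introduced here and is not part of the original text.

\[ H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ \widehat{\text{Var}}(\hat{\beta}_{FE}) - \widehat{\text{Var}}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \]

Under the null hypothesis, \(H\) is asymptotically chi-squared distributed, with degrees of freedom equal to the number of coefficients being compared (those on the time-varying regressors that both models can estimate).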

In mathematical terms, what is the null hypothesis (\(H_0\)) of the Hausman test regarding the unobserved individual-specific effects, \(a_i\)?

Why would a large, statistically significant test statistic lead you to conclude that the Random Effects estimates are inconsistent?

Choosing an Empirical Strategy

An economist is studying the determinants of CEO salary in the tech industry. They have collected panel data for 100 tech firms over a 10-year period (2015-2024). The proposed model is:

\[ \log(\text{Salary}_{it}) = \beta_0 + \beta_1 \text{FirmProfit}_{it} + \beta_2 \text{Experience}_{it} + \gamma_1 \text{IvyLeague}_{i} + a_i + u_{it} \]

Where:
\(\text{Salary}_{it}\) is the salary of the CEO in firm \(i\) at time \(t\).
\(\text{FirmProfit}_{it}\) is the annual profit of firm \(i\) at time \(t\).
\(\text{Experience}_{it}\) is the number of years the CEO has been in their role at firm \(i\) at time \(t\).
\(\text{IvyLeague}_{i}\) is a time-invariant dummy variable, equal to 1 if the CEO graduated from an Ivy League university and 0 otherwise.
\(a_i\) represents unobserved, time-invariant firm-specific effects (e.g., corporate culture, brand reputation).
\(u_{it}\) is the idiosyncratic error term.

The economist first considers using a Fixed Effects (FE) / Within estimator. What is the primary advantage of this estimator in the context of the unobserved effect a_i? What is a significant practical limitation of the FE estimator for the specific model proposed above?
Next, the economist considers a Random Effects (RE) model. What is the key assumption the RE model makes about the relationship between the unobserved effect a_i and the explanatory variables? How does this assumption, if true, make the RE model advantageous over the FE model?
Outline a complete, step-by-step testing procedure the economist should follow to decide which estimator is most appropriate for their data. Specifically, name the statistical tests they should perform and explain what the null and alternative hypotheses for each test tell them about which model to choose (i.e., Pooled OLS vs. RE, and RE vs. FE).

Interpreting Conflicting Results

A researcher wants to determine the effect of a new fertilizer (Fertilizer_it) on crop yield (Yield_it) using panel data from 500 farms over 5 years. They also control for annual rainfall (Rainfall_it). They estimate three different models and conduct the relevant specification tests. The results are presented in the table below.

\[ \text{Yield}_{it} = \beta_0 + \beta_1 \text{Fertilizer}_{it} + \beta_2 \text{Rainfall}_{it} + v_{it} \]

Variable	(1) Pooled OLS	(2) Random Effects (RE)	(3) Fixed Effects (FE)
Fertilizer	10.5***	6.2***	2.1*
	(1.1)	(1.5)	(1.2)
Rainfall	0.8***	0.7***	0.6**
	(0.2)	(0.2)	(0.3)
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Specification Test Results:

Breusch-Pagan Test for Random Effects: p-value = 0.001
Hausman Test (RE vs. FE): Chi-squared = 25.4, p-value = 0.000

Interpreting Conflicting Results (Cont.)

Based on the result of the Breusch-Pagan test, should the researcher prefer the Random Effects model or the Pooled OLS model? Explain what the p-value of 0.001 implies about the unobserved individual effects (a_i) in the data.
The coefficient for Fertilizer is positive and highly significant in the Pooled OLS and RE models but is much smaller and only marginally significant in the FE model. Provide a clear and concise econometric explanation for why the Fertilizer coefficient might be so different between the RE and FE models. What does this difference suggest about the nature of the unobserved farm-specific effects (a_i)?
Using the result of the Hausman test, which of the three models (Pooled OLS, RE, or FE) should the researcher select as the most reliable? Justify your choice by interpreting what the test’s p-value indicates about the core assumption that distinguishes the RE and FE models. What is the final estimated effect of the fertilizer on crop yield according to your chosen model?

Random Effects and the Role of \(\theta\)

The Random Effects model can be estimated using a transformation of the data involving a parameter \(\theta\), where \(\theta = 1 - \sqrt{\frac{\sigma_u^2}{T \sigma_a^2 + \sigma_u^2}}\), \(\sigma_a^2\) is the variance of the individual-specific effect \(a_i\), and \(\sigma_u^2\) is the variance of the idiosyncratic error \(u_{it}\). Analyze the behavior of \(\theta\) under two extreme scenarios:

What happens to the value of \(\theta\) as the variance of the individual-specific effect, \(\sigma_a^2\), approaches zero? What does the Random Effects model simplify to in this case?
What happens to the value of \(\theta\) as the number of time periods, \(T\), becomes very large (\(T \to \infty\))? To which other estimator does the Random Effects estimator become equivalent in this scenario?
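
The two limits can be checked numerically before arguing them analytically. The sketch below evaluates the formula for \(\theta\) given above and applies the standard RE quasi-demeaning, which subtracts \(\theta\) times the individual mean from each observation. The function names and all numbers are hypothetical; `sigma2_a` stands for the variance of the individual effect and `sigma2_u` for the variance of the idiosyncratic error.

```python
import math
from statistics import mean


def re_theta(sigma2_a, sigma2_u, n_periods):
    """theta = 1 - sqrt(sigma2_u / (T * sigma2_a + sigma2_u))."""
    return 1 - math.sqrt(sigma2_u / (n_periods * sigma2_a + sigma2_u))


def quasi_demean(values, theta):
    """Subtract theta times the individual's time average from each observation."""
    m = mean(values)
    return [v - theta * m for v in values]


y_i = [2.0, 4.0, 6.0]  # hypothetical outcomes for one individual

for label, theta in [
    ("no individual effect", re_theta(0.0, 1.0, 5)),
    ("moderate case", re_theta(1.0, 1.0, 5)),
    ("very long panel", re_theta(1.0, 1.0, 10 ** 6)),
]:
    print(f"{label}: theta = {theta:.3f}, transformed y = {quasi_demean(y_i, theta)}")
```

When \(\theta = 0\) the data are left untouched, and as \(\theta\) approaches 1 the transformation approaches full de-meaning, which is the comparison the two questions ask for.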