Tutorial 5: Panel Data II
The fundamental static panel data model is: \[ y_{it} = \beta_1 x_{1it} + ... + \beta_k x_{kit} + a_i + u_{it} \]
New estimators:
The most important choice in panel data analysis is often between Fixed Effects (FE) and Random Effects (RE).
| Feature | Fixed Effects (FE) | Random Effects (RE) |
|---|---|---|
| Key Assumption | Allows for correlation between the unobserved effect \(a_i\) and regressors \(X_{it}\). | Assumes no correlation between \(a_i\) and \(X_{it}\). |
| Consistency | Consistent under strict exogeneity of \(X_{it}\) with respect to \(u_{it}\), even if the RE assumption is violated. | Inconsistent if \(a_i\) and \(X_{it}\) are correlated (suffers from omitted variable bias). |
| Efficiency | Less efficient (uses only within-individual variation). | More efficient if its assumption holds (uses both within and between variation). |
| Time-Invariant Variables | Cannot estimate their effects (they are wiped out by the de-meaning process). | Can estimate their effects. |
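The last row of the table can be read off the within transformation. As a sketch, suppose the model also contains a time-invariant regressor \(z_i\) with coefficient \(\gamma\) (both symbols are introduced here for illustration only and are not part of the model above). Averaging over \(t\) for each individual and subtracting gives:
\[ \begin{aligned} y_{it} &= \beta_1 x_{1it} + ... + \beta_k x_{kit} + \gamma z_i + a_i + u_{it} \\ \bar{y}_{i} &= \beta_1 \bar{x}_{1i} + ... + \beta_k \bar{x}_{ki} + \gamma z_i + a_i + \bar{u}_{i} \\ y_{it} - \bar{y}_{i} &= \beta_1 (x_{1it} - \bar{x}_{1i}) + ... + \beta_k (x_{kit} - \bar{x}_{ki}) + (u_{it} - \bar{u}_{i}) \end{aligned} \]
Both \(a_i\) and \(\gamma z_i\) cancel in the third line, so FE removes the unobserved effect but leaves no variation from which \(\gamma\) could be estimated.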
The Hausman test is the formal procedure for choosing between FE and RE.
Follow this flowchart to select and justify your panel data model:
Step 1: Is there an individual effect? (Pooled OLS vs. Random Effects)
Step 2: Is the individual effect correlated with regressors? (RE vs. FE)
Step 3: Final Checks
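The first two steps can be sketched as a small decision rule. This is only an illustration under stated assumptions: the function name `choose_panel_model`, the 5% default significance level, and the use of p-values from a Breusch-Pagan LM test (Step 1) and a Hausman test (Step 2) are choices made here, and the final checks of Step 3 are not covered.

```python
def choose_panel_model(bp_pvalue: float, hausman_pvalue: float, alpha: float = 0.05) -> str:
    """Return the estimator suggested by Steps 1 and 2 (hypothetical helper)."""
    # Step 1: Breusch-Pagan LM test, H0: Var(a_i) = 0 (no individual effect).
    # Failing to reject H0 means Pooled OLS is adequate.
    if bp_pvalue >= alpha:
        return "Pooled OLS"
    # Step 2: Hausman test, H0: a_i is uncorrelated with the regressors.
    # Failing to reject H0 means RE is consistent and more efficient than FE.
    if hausman_pvalue >= alpha:
        return "Random Effects"
    # Rejecting H0 means RE is inconsistent, so FE is preferred.
    return "Fixed Effects"


# Illustrative p-values only; they are not taken from any dataset.
print(choose_panel_model(bp_pvalue=0.40, hausman_pvalue=0.01))  # Pooled OLS
print(choose_panel_model(bp_pvalue=0.001, hausman_pvalue=0.30))  # Random Effects
print(choose_panel_model(bp_pvalue=0.001, hausman_pvalue=0.01))  # Fixed Effects
```

A mechanical rule like this is a starting point; the choice should still be justified with the economics of the problem.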
Wooclap code: OFZFSD (Tutorial 5)
The Pooled OLS model ignores the panel structure of the data and can suffer from omitted variable bias if unobserved individual characteristics are correlated with the regressors. The Fixed Effects model is designed to solve this problem.
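A small simulation can make the omitted variable bias concrete. The sketch below does not use wagepan: it generates a hypothetical balanced panel in which a single regressor is positively correlated with \(a_i\), then compares the pooled OLS slope with the within (FE) slope. The panel dimensions, the true slope of 1, and the strength of the correlation are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 500, 5, 1.0  # hypothetical panel size and true slope

a = rng.normal(size=(N, 1))                  # unobserved individual effect a_i
x = 0.8 * a + rng.normal(size=(N, T))        # regressor correlated with a_i
y = beta * x + a + rng.normal(size=(N, T))   # y_it = beta * x_it + a_i + u_it


def slope(x, y):
    """OLS slope of y on x after removing the overall means (i.e. with an intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc ** 2).sum())


# Pooled OLS: stack all N*T observations and ignore the panel structure.
b_pooled = slope(x.ravel(), y.ravel())

# Fixed Effects: subtract each individual's time mean, then run OLS.
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
b_fe = float((x_within * y_within).sum() / (x_within ** 2).sum())

print(f"Pooled OLS slope: {b_pooled:.2f}")
print(f"Fixed Effects slope: {b_fe:.2f}")
```

In this design the pooled slope is biased upward because \(a_i\) ends up in the error term, while the within slope stays close to the true value of 1.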
Load the wagepan dataset and declare it as a panel dataset, using nr as the individual identifier and year as the time identifier.
Estimate a Pooled OLS model and a Fixed Effects model of lwage using exper (experience), expersq (experience squared), and union.
Compare the coefficient on the union variable across the two models. Why is the estimated return to union membership different in the FE model compared to the Pooled OLS model? What does this suggest about how union members may differ from non-union members in ways not captured by the other variables?
While the Fixed Effects (FE) model is robust to correlation between unobserved effects and regressors, the Random Effects (RE) model is more efficient if its key assumption (that this correlation is zero) holds. The Hausman test is the standard tool for making this choice.
Using the wagepan panel data, estimate a Random Effects model where lwage is a function of the time-varying predictors married and union, and the (mostly) time-invariant predictor educ.
A critical drawback of the Fixed Effects model is that it cannot estimate the coefficients of time-invariant variables. The “within” transformation, which subtracts the individual-specific mean, wipes out any variable that is constant over time for an individual. Again using the wagepan dataset:
Estimate a Fixed Effects model and a Random Effects model of lwage that includes both time-varying (union, married) and time-invariant (black, hisp) predictors.
What happens to the black and hisp variables in the FE model?
Why can the RE model estimate coefficients for black and hisp while the FE model cannot? Relate your answer directly to the underlying mathematical transformation of each estimator.
The Hausman test is based on the difference between the coefficient vectors from the Fixed Effects and Random Effects models, \((\hat{\beta}_{FE} - \hat{\beta}_{RE})\).
In mathematical terms, what is the null hypothesis (\(H_0\)) of the Hausman test regarding the unobserved individual-specific effects, \(a_i\)?
Why would a large, statistically significant test statistic lead you to conclude that the Random Effects estimates are inconsistent?
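To make the construction of the test concrete, the sketch below computes the usual quadratic form \((\hat{\beta}_{FE} - \hat{\beta}_{RE})' [\widehat{Var}(\hat{\beta}_{FE}) - \widehat{Var}(\hat{\beta}_{RE})]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})\), which is compared with a chi-square distribution whose degrees of freedom equal the number of coefficients compared. The function name `hausman_statistic` and the numbers in the example are hypothetical. Only coefficients on time-varying regressors, which both models estimate, should be passed in, and the variance difference is assumed to be positive definite, which can fail in finite samples.

```python
import numpy as np


def hausman_statistic(b_fe, b_re, V_fe, V_re):
    """Return (statistic, degrees of freedom) for the Hausman test (hypothetical helper)."""
    d = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    # Under H0, RE is efficient, so Var(b_FE - b_RE) = Var(b_FE) - Var(b_RE).
    V_diff = np.asarray(V_fe, dtype=float) - np.asarray(V_re, dtype=float)
    stat = float(d @ np.linalg.solve(V_diff, d))
    return stat, d.size


# Made-up one-regressor example: FE slope 0.10 (s.e. 0.03), RE slope 0.16 (s.e. 0.02).
stat, df = hausman_statistic([0.10], [0.16], [[0.03 ** 2]], [[0.02 ** 2]])
print(round(stat, 2), df)  # 7.2 1
```

With one degree of freedom the 5% critical value is about 3.84, so a statistic of 7.2 would reject \(H_0\) in this made-up example.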
An economist is studying the determinants of CEO salary in the tech industry. They have collected panel data for 100 tech firms over a 10-year period (2015-2024). The proposed model is:
\[ \log(\text{Salary}_{it}) = \beta_0 + \beta_1 \text{FirmProfit}_{it} + \beta_2 \text{Experience}_{it} + \gamma_1 \text{IvyLeague}_{i} + a_i + u_{it} \]
where \(\text{Salary}_{it}\) is the salary of the CEO in firm \(i\) at time \(t\); \(\text{FirmProfit}_{it}\) is the annual profit of firm \(i\) at time \(t\); \(\text{Experience}_{it}\) is the number of years the CEO has been in their role at firm \(i\) at time \(t\); \(\text{IvyLeague}_{i}\) is a time-invariant dummy variable, equal to 1 if the CEO graduated from an Ivy League university and 0 otherwise; \(a_i\) represents unobserved, time-invariant firm-specific effects (e.g., corporate culture, brand reputation); and \(u_{it}\) is the idiosyncratic error term.
The economist first considers using a Fixed Effects (FE) / Within estimator. What is the primary advantage of this estimator in the context of the unobserved effect a_i? What is a significant practical limitation of the FE estimator for the specific model proposed above?
Next, the economist considers a Random Effects (RE) model. What is the key assumption the RE model makes about the relationship between the unobserved effect a_i and the explanatory variables? How does this assumption, if true, make the RE model advantageous over the FE model?
Outline a complete, step-by-step testing procedure the economist should follow to decide which estimator is most appropriate for their data. Specifically, name the statistical tests they should perform and explain what the null and alternative hypotheses for each test tell them about which model to choose (i.e., Pooled OLS vs. RE, and RE vs. FE).
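For the Pooled OLS vs. RE comparison, one commonly used statistic is the Breusch-Pagan Lagrange multiplier test, which is computed from pooled OLS residuals and is chi-square with one degree of freedom under \(H_0: \sigma^2_a = 0\). The sketch below assumes a balanced panel with residuals arranged as an \(N \times T\) array. The function name `breusch_pagan_lm` is hypothetical, and the "residuals" in the example are simulated directly rather than taken from a fitted regression.

```python
import math

import numpy as np


def breusch_pagan_lm(resid):
    """LM statistic and p-value from an (N, T) array of pooled OLS residuals."""
    e = np.asarray(resid, dtype=float)
    N, T = e.shape
    # Ratio of squared within-individual residual sums to the total sum of squares.
    ratio = (e.sum(axis=1) ** 2).sum() / (e ** 2).sum()
    lm = N * T / (2 * (T - 1)) * (ratio - 1) ** 2
    # Survival function of a chi-square(1) variable: P(Z^2 > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


rng = np.random.default_rng(1)
N, T = 200, 5
noise_only = rng.normal(size=(N, T))                        # no individual effect
with_effect = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))  # a_i + u_it

lm_noise, p_noise = breusch_pagan_lm(noise_only)
lm_effect, p_effect = breusch_pagan_lm(with_effect)
print(f"No individual effect:   LM = {lm_noise:.2f}")
print(f"With individual effect: LM = {lm_effect:.2f}")
```

When the residuals of each individual share a common component, the within-individual sums become large relative to the total sum of squares, and the statistic grows accordingly.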
A researcher wants to determine the effect of a new fertilizer (Fertilizer_it) on crop yield (Yield_it) using panel data from 500 farms over 5 years. They also control for annual rainfall (Rainfall_it). They estimate three different models and conduct the relevant specification tests. The results are presented in the table below.
\[ \text{Yield}_{it} = \beta_0 + \beta_1 \text{Fertilizer}_{it} + \beta_2 \text{Rainfall}_{it} + v_{it} \]
| Variable | (1) Pooled OLS | (2) Random Effects (RE) | (3) Fixed Effects (FE) |
|---|---|---|---|
| Fertilizer | 10.5*** | 6.2*** | 2.1* |
| | (1.1) | (1.5) | (1.2) |
| Rainfall | 0.8*** | 0.7*** | 0.6** |
| | (0.2) | (0.2) | (0.3) |

Standard errors in parentheses. Significance levels: *** p<0.01, ** p<0.05, * p<0.1.
Specification Test Results:
Based on the result of the Breusch-Pagan test, should the researcher prefer the Random Effects model or the Pooled OLS model? Explain what the p-value of 0.001 implies about the unobserved individual effects (a_i) in the data.
The coefficient for Fertilizer is positive and highly significant in the Pooled OLS and RE models but is much smaller and only marginally significant in the FE model. Provide a clear and concise econometric explanation for why the Fertilizer coefficient might be so different between the RE and FE models. What does this difference suggest about the nature of the unobserved farm-specific effects (a_i)?
Using the result of the Hausman test, which of the three models (Pooled OLS, RE, or FE) should the researcher select as the most reliable? Justify your choice by interpreting what the test’s p-value indicates about the core assumption that distinguishes the RE and FE models. What is the final estimated effect of the fertilizer on crop yield according to your chosen model?
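The Random Effects model can be estimated using a transformation of the data involving a parameter \(\theta\), where \(\theta = 1 - \sqrt{\frac{\sigma^2_\epsilon}{(T \sigma^2_\alpha + \sigma^2_\epsilon)}}\). Here \(\sigma^2_\alpha\) is the variance of the individual-specific effect (\(a_i\) in the model above) and \(\sigma^2_\epsilon\) is the variance of the idiosyncratic error (\(u_{it}\) in the model above). Analyze the behavior of \(\theta\) under two extreme scenarios: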
What happens to the value of \(\theta\) as the variance of the individual-specific effect, \(\sigma^2_\alpha\), approaches zero? What does the Random Effects model simplify to in this case?
What happens to the value of \(\theta\) as the number of time periods, \(T\), becomes very large (\(T \to \infty\))? To which other estimator does the Random Effects estimator become equivalent in this scenario?
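A quick numerical check can support the algebra. The sketch below evaluates the formula for \(\theta\) at a few hypothetical variance values. The function name `re_theta` is an assumption made here, and the comments refer to the standard quasi-demeaning form of the RE transformation, \(y_{it} - \theta \bar{y}_i\).

```python
import math


def re_theta(sigma2_alpha: float, sigma2_eps: float, T: int) -> float:
    """theta = 1 - sqrt(sigma2_eps / (T * sigma2_alpha + sigma2_eps))."""
    return 1.0 - math.sqrt(sigma2_eps / (T * sigma2_alpha + sigma2_eps))


# Scenario 1: the variance of the individual effect shrinks towards zero.
# theta falls towards 0, so y_it - theta * ybar_i approaches the untransformed y_it.
for s2a in (1.0, 0.1, 0.001, 0.0):
    print(f"sigma2_alpha = {s2a}: theta = {re_theta(s2a, 1.0, T=5):.4f}")

# Scenario 2: the number of time periods grows.
# theta rises towards 1, so y_it - theta * ybar_i approaches full de-meaning.
for T in (3, 30, 300, 3000):
    print(f"T = {T}: theta = {re_theta(1.0, 1.0, T):.4f}")
```

The two loops trace out the limits the questions ask about, which can then be matched to the estimators that use no de-meaning and full de-meaning.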
Empirical Economics: Tutorial - Panel Data II