Lecture 4: Panel Data I
Motivation
Advantage of Panel Data
Main features of panel models
The individual specific effect
Strict exogeneity
Between and within variation
Least Squares Dummy Variable estimator
Within estimator (or fixed effect estimator)
Material: Wooldridge Chapters 13.3, 13.4, 13.5, 14.1
Imagine a government is considering increasing the tax on alcohol to reduce traffic-related deaths. A crucial question for economists is: how can we estimate the causal impact of such a policy?
To answer this, we need to overcome a fundamental challenge: states with higher beer taxes might also have other characteristics that influence traffic fatalities, such as better roads or stricter law enforcement.
This is where panel data becomes an invaluable tool.
We need two requirements on our data:
Requirement 1: Information Before and After the Policy Change (Time Dimension) We need to observe traffic fatality rates in states before and after any changes in beer taxes. This allows us to see what happens within a state when the policy changes.
Requirement 2: Information from Multiple Entities (Cross-Sectional Dimension) We need data on multiple states over the same periods. This allows us to compare the changes in fatality rates in states that changed their beer tax to those that did not.
How can we estimate the impact of the March 2020 Covid outbreak on the behaviour of individual persons?
The Netherlands: Lockdown from March 16th 2020 onwards.
What data do we need to investigate the impact of the lockdown?
Methodological claim: empirical analyses that are not based on panel data are, in general, not very strong (i.e., the results can easily be falsified).
The two figures below give an impression of what happened in 2020 in the Netherlands.
Examples: Why Panel Data Is Needed?
“Year of Birth” cohorts are followed across time. The research question is “do households sell their house when they become old?” The figure below cannot address this question because from one cross-section to another, it is not possible to disentangle cohort effects from age effects.

Examples: Why Panel Data Is Needed?
The figure below is constructed from panel data and indicates strong cohort effects! For each birth cohort, the average Dutch home ownership is given in various years (\(t= 1,2,3,\dots,12\)).

From the cross-section it looks like the home ownership rate peaks, on average, at around 69%. However, this is not necessarily the same for two different cohorts. E.g., compare the 1953 cohort with the 1948 cohort.
Between variation (comparing two different firms at one point in time):

| | Profit (= dependent variable) | Innovation (= explanatory variable) |
|---|---|---|
| Firm A | 500 thousand Euros | 1 percent |
| Firm B | 750 thousand Euros | 3 percent |
Within variation (the same firm observed at two points in time):

| | Profit (= dependent variable) | Innovation (= explanatory variable) |
|---|---|---|
| Time 1 | 500 thousand Euros | 1 percent |
| Time 2 | 550 thousand Euros | 3 percent |
Within variation (for a given firm): as a result of the increased innovation (from 1 percent to 3 percent, i.e. by 2 percentage points), profits increase from 500 thousand Euros to 550 thousand Euros from \(t=1\) to \(t=2\).
Economists are usually interested in the within variation more than in the between variation.
To estimate the effect of x on y from within variation, we need to control for individual effects. Doing so allows for correlation between the individual effect and the explanatory variable.
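The between and within components of the innovation variable can be computed explicitly. A minimal pandas sketch: Firm A's values and Firm B's first-period value come from the tables above, while Firm B's second-period value is hypothetical, added only to complete the toy panel.

```python
import pandas as pd

# Toy panel: two firms observed in two periods.
# Firm B's innovation at time 2 (5 percent) is a made-up illustration value.
df = pd.DataFrame({
    "firm":       ["A", "A", "B", "B"],
    "time":       [1, 2, 1, 2],
    "innovation": [1.0, 3.0, 3.0, 5.0],   # percent
})

firm_mean = df.groupby("firm")["innovation"].transform("mean")

between = firm_mean - df["innovation"].mean()   # variation across firm averages
within = df["innovation"] - firm_mean           # variation around each firm's average

print(between.tolist())  # [-1.0, -1.0, 1.0, 1.0]
print(within.tolist())   # [-1.0, 1.0, -1.0, 1.0]
```

The between component compares firm averages with the overall average; the within component compares each observation with its own firm's average. The two components are orthogonal by construction.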
Example: Why Panel Data is Needed?
Let’s assume that a cross-section study suggests that female labor force participation is equal to 50%. There are two extreme possibilities that we cannot distinguish with cross-sections: either every woman participates half of the time (high turnover), or half of the women always participate and the other half never do (no turnover).
We need panel data to solve this issue.
Specification 1: Cross section: In the first week we considered cross sections:
\[ \text{Profit} = \beta_0 + \beta_1 \text{Innovation} + \beta_2 \text{Firm Size} + u \]
Cross-sectional dimension \(N\).
\[ \text{Profit}_i = \beta_0 + \beta_1 \text{Innovation}_i + \beta_2 \text{Firm Size}_i + u_i \]
Specification 2: Time Series: In the third week we considered the following static time-series model. It is based on a data set containing outcomes for one firm, which is observed over T periods.
\[ \text{Profit}_t = \beta_0 + \beta_1 \text{Innovation}_t + \beta_2 \text{Firm Size}_t + u_t \]
Time dimension \(T\).
Again, there is one intercept \(\beta_0\)
In this equation, the subscript \(t\) indexes the \(t\)-th period.
Specification 3: Panel data is a combination of the previous two equations:
\[ \text{Profit}_{it} = \beta_0 + \beta_1 \text{Innovation}_{it} + \beta_2 \text{Firm Size}_{it} + u_{it} \]
for \(i=1,\dots, N\), \(t=1,\dots, T\).
Again, the only intercept is \(\beta_0\). It has a cross-sectional dimension N and a time dimension T.
Issue 1: Can equation (3) be generalized to \(N\) intercepts \(\alpha_i\)? Does each firm (subscript \(i\)) have its own intercept?
\[ \text{Profit}_{it} = \alpha_i + \beta_1 \text{Innovation}_{it} + \beta_2 \text{Firm Size}_{it} + u_{it} \]
Issue 2: Is there any correlation between these \(N\) intercepts \(\alpha_i\) and each of the explanatory variables innovation and firm size?
Issue 3: Should variables that remain constant within individual firms be treated differently? E.g., in the following specification, firm size does not change across time. Thus, \(\text{Firm Size}_i\) has no subscript \(t\), in case the size of the firms is constant in all of the \(T\) periods. \[ \text{Profit}_{it} = \beta_0 + \beta_1 \text{Innovation}_{it} + \beta_2 \text{Firm Size}_{i} + u_{it} \]
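Issue 3 matters because a time-constant variable has no within variation: after demeaning over time, nothing is left to identify its coefficient. A quick numpy check (the firm-size value of 120 is hypothetical):

```python
import numpy as np

T = 4
firm_size = np.full(T, 120.0)            # hypothetical firm size, constant over T periods
demeaned = firm_size - firm_size.mean()  # within transformation for one firm
print(demeaned)                          # [0. 0. 0. 0.] -- no within variation remains
```

Because the demeaned series is identically zero, \(\beta_2\) cannot be estimated from within variation alone.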
Issue 4: Are the explanatory variables of the regression equation strictly exogenous? This is an econometric requirement for unbiased estimators; it is explained in more detail below.
| Country (i) | Year (t) | GDP (\(y_{it}\)) | Investment (\(X_{it}\)) |
|---|---|---|---|
| USA | 2019 | 21.4 | 0.25 |
| USA | 2020 | 20.9 | 0.16 |
| USA | 2021 | 23.0 | 0.36 |
| USA | 2022 | 25.4 | 0.13 |
| Germany | 2019 | 3.8 | 0.14 |
| Germany | 2020 | 3.8 | 0.13 |
| … | … | … | … |
| Japan | … | … | … |
Suppose the following “true model”:
\[ y_{it} = a_i + \beta_1x_{1it} + ... + \beta_kx_{kit} + u_{it} \quad i=1,...,N;t=1,...,T \]
Where:
The constant \(a_i\) captures all individual-specific variables that are not observed by the researcher; e.g. motivation (it is referred to as unobserved heterogeneity).
It is possible that \(E(a_i | x_{i11},..,x_{i1k},..,x_{iT1},...,x_{iTk}) \neq 0\) (e.g. in an equation where wage is the dependent variable, motivation (subsumed in \(a_i\)) might be correlated with the RHS-variable experience).
Assumption TS.3 (Strict exogeneity)
For each \(t\), the expected value of \(u_t\) given ALL of the \(k\) explanatory variables FOR ALL \(T\) time periods, is equal to zero: \(E[u_t | X] = 0\)
Assumption TS.3′ (Contemporaneous exogeneity)
For each t, the expected value of \(u_t\), given ALL of the k explanatory variables in period t, is equal to zero: \(E(u_t | x_{t1},...,x_{tk}) = E(u_t | \mathbf{x}_t) = 0\)
This assumption implies that the error term in period t is uncorrelated with all k regressors in the same period, \(t\): \(Corr(u_t,x_{tj})=0 \quad j=1,...,k\)
Definition: Fixed Effects Model
\[ y_{it} = (\beta_0 + \alpha_i) + \beta_1 X_{it} + u_{it} \] or more simply: \[ y_{it} = \alpha_i + \beta_1 X_{it} + u_{it} \] (Here, the \(\alpha_i\) represent the individual-specific intercepts)
Definition: Within Transformation
Suppose \(y_{it} = \alpha_i + \beta_1 X_{it} + u_{it}\).
Step 1: For each individual \(i\), calculate the time-average of their variables: \[ \bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it} \quad \text{and} \quad \bar{X}_i = \frac{1}{T} \sum_{t=1}^{T} X_{it} \quad \text{and} \quad \bar{\alpha_i} = \frac{1}{T} \sum_{t=1}^{T} \alpha_i = \alpha_i\]
Step 2: Subtract the individual-specific average from the original model:
\[ \begin{align} y_{it} - \bar{y}_i &= \beta_1 (X_{it} - \bar{X}_i) + (\alpha_i - \bar{\alpha}_i) + (u_{it} - \bar{u}_i) \\ \ddot{y}_{it} &= \beta_1 \ddot{x}_{it} + \ddot{u}_{it} \end{align} \] The fixed effect \(\alpha_i\) is time-constant, so \(\alpha_i - \bar{\alpha}_i = \alpha_i - \alpha_i = 0\). It drops out!
Step 3: Run OLS on the “de-meaned” data: \[ (y_{it} - \bar{y}_i) = \beta_1 (X_{it} - \bar{X}_i) + (u_{it} - \bar{u}_{i}) \Leftrightarrow \ddot{y}_{it} = \beta_1 \ddot{x}_{it} + \ddot{u}_{it}\] This gives a consistent estimate of \(\beta_1\).
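Steps 1–3 can be illustrated numerically: simulate a panel in which \(\alpha_i\) is correlated with \(X_{it}\), then compare pooled OLS (which is biased) with OLS on the demeaned data. All simulation parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta1 = 200, 5, 2.0

a = rng.normal(size=(N, 1))                # unobserved individual effect
x = 0.8 * a + rng.normal(size=(N, T))      # regressor correlated with a_i
u = rng.normal(size=(N, T))
y = a + beta1 * x + u

# Pooled OLS ignores a_i: biased upward here because Corr(x, a) > 0
b_pooled = np.polyfit(x.ravel(), y.ravel(), 1)[0]

# Within transformation: subtract individual means, then OLS on demeaned data
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd.ravel() @ yd.ravel()) / (xd.ravel() @ xd.ravel())

print(f"pooled: {b_pooled:.2f}, within: {b_within:.2f}")  # within is close to 2.0
```

The within estimate recovers the true \(\beta_1 = 2\) even though \(x\) is correlated with the individual effect, while pooled OLS does not.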
LSDV Model
Create a dummy (0/1) variable for each individual \(i\) (except for one, to avoid the dummy variable trap).
Run a single OLS regression including these \(N-1\) dummy variables.
\[ y_{it} = \beta_0 + \beta_1 X_{it} + d_1\alpha_1 + d_2\alpha_2 + ... + d_{N-1}\alpha_{N-1} + u_{it} \]
The estimated coefficient \(\beta_1\) from the LSDV model is identical to the one from the “Within” estimator.
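The numerical equivalence of the LSDV and within estimators can be checked directly. A small numpy sketch on simulated data (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 10, 6
a = rng.normal(size=(N, 1))
x = a + rng.normal(size=(N, T))
y = 5.0 + 1.5 * x + a + rng.normal(size=(N, T))

# Within estimator: OLS on demeaned data
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
b_within = (xd @ yd) / (xd @ xd)

# LSDV: one dummy per firm (no common intercept, so all N dummies enter)
D = np.kron(np.eye(N), np.ones((T, 1)))      # (N*T, N) dummy matrix
X = np.column_stack([x.ravel(), D])
b_lsdv = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]

print(np.isclose(b_within, b_lsdv))  # True: the two slope estimates coincide
```

The equality is exact (up to floating-point precision), a consequence of the Frisch–Waugh–Lovell theorem.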
Interpretation of Fixed Effects models
In a Fixed Effects model, the coefficient \(\beta_1\) measures:
The average change in \(y\) for a one-unit increase in \(X\) within a given individual over time.
Example: if ability is correlated with both education (your X) and wage (your Y), FE solves this problem because ability is constant for an individual.
Limitation: you cannot estimate the effect of time-constant variables such as gender or race on wages with a standard FE model, because these variables do not change over time for an individual.
Arguably, the fixed effects model is one of the most often-used models in modern econometrics.
Standard statistical software such as lm() in R or sm.OLS() in Python can implement FE using the LSDV method, but this is often tedious.
The fixest (R) and pyfixest (Python) packages provide a very easy way to estimate FE, proceeding from a dataset that looks like the one on Slide 4.5.
This is how that works in practice:
Why is \(\hat{\beta}_{fdif}\) a consistent (or unbiased) estimator? Let’s assume for simplicity we have a bivariate model: \(y_{it} = x_{it}\beta + a_i + u_{it}\) and \(\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}\)
Consistency requires that the error term is uncorrelated with the explanatory variable: \(Corr(\Delta u_{it}, \Delta x_{it}) = Corr(u_{it} - u_{i,t-1}, x_{it} - x_{i,t-1}) = 0\)
It means that \(u_{it}\) is uncorrelated with \(x_{i,t-1}, x_{it}, x_{i,t+1}\). In other words, strict exogeneity is needed (no lagged dependent variable, no feedback effects) for consistency (unbiasedness) of the first-difference estimator.
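Under strict exogeneity, first-differencing removes \(a_i\), and OLS on the differenced data recovers \(\beta\) even when \(a_i\) is correlated with \(x_{it}\). A minimal simulation sketch (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 300, 6, 1.0
a = rng.normal(size=(N, 1))                # individual effect, correlated with x below
x = a + rng.normal(size=(N, T))
u = rng.normal(size=(N, T))
y = a + beta * x + u

# First differences within each individual: a_i drops out
dx = np.diff(x, axis=1).ravel()
dy = np.diff(y, axis=1).ravel()
b_fd = (dx @ dy) / (dx @ dx)
print(f"{b_fd:.2f}")  # close to the true beta = 1.0
```

Because \(\Delta a_i = 0\), the differenced regression contains no unobserved heterogeneity, and the estimate converges to the true \(\beta\).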
What do the estimators have in common?
They allow for correlation between \(a_i\) and the explanatory variables: \(E(a_i | x_{i11},\dots,x_{i1k},\dots,x_{iT1},\dots,x_{iTk}) \neq 0\) is allowed.
Both require the assumption of strict exogeneity (the regression equation contains no feedback mechanism and no lagged dependent variable).
Consequence: The parameter estimates of both estimators should be about the same (if the assumption of strict exogeneity is true).
How do they differ? Assumption about the error term u:
Consequence: Fixed-effects estimator gives smaller standard errors if the specification is correct and there is strict exogeneity. This estimator is more efficient.
First-difference estimator is preferred if there is a unit root in the error terms: \(u_{it} = u_{i,t-1} + e_{it}\)
If the regression equation is correctly specified, the within-estimation procedure and the first-difference estimation procedure should yield similar estimates for the parameters \(\beta\).
\(y_{it} = \beta_1x_{1it} +...+ \beta_kx_{kit} + a_i + u_{it}\)
Question: which of the two estimation procedures is preferable?
Motivation for the procedure
For ease of exposition, we take one explanatory variable \(x\) here.
Fixed-effects estimator: \(u_{it}\) is identically and independently distributed.
First-difference estimator: \(\Delta u_{it}\) is identically and independently distributed.
Correlation of \(u_{it}\) in the FD model
Assumption: \(u_{it}\) is identically and independently distributed:

\(Cov(u_{it},u_{is}) = 0\) for \(t \neq s\) (same individual \(i\))

\(Cov(u_{it},u_{it}) = \sigma_u^2\) for \(t = s\)
Next, we estimate the model: \(\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it} \quad i=1,...,N; t = 2,...,T\)
The correlation between \(\Delta u_{it}\) and \(\Delta u_{i,t-1}\) is:
\(Corr(\Delta u_{it}, \Delta u_{i,t-1}) = Corr(u_{it} - u_{i,t-1}, u_{i,t-1} - u_{i,t-2}) = -0.5\).
Hence, the intertemporal correlation of the FD-estimator is -0.5 if \(u_{it}\) is i.i.d.
Correlation of \(u_{it}\) in the FD model
Covariance for the same individual across time:
\[ Cov[aX+bY,\,cW+dZ] = ac\,Cov[X,W] + ad\,Cov[X,Z] + bc\,Cov[Y,W] + bd\,Cov[Y,Z] \]
\[ \begin{align} Cov(u_{it}-u_{i,t-1},\, u_{i,t-1}-u_{i,t-2}) &= Cov(u_{it},u_{i,t-1}) + Cov(u_{it},-u_{i,t-2}) \\ &\quad + Cov(-u_{i,t-1},u_{i,t-1}) + Cov(-u_{i,t-1},-u_{i,t-2}) \\ &= 0 + 0 - Var(u_{i,t-1}) + 0 \\ &= -\sigma_u^2 \end{align} \]
\[ \begin{align} Var(u_{it} - u_{i,t-1}) &= Var(u_{it}) + Var(-u_{i,t-1}) + 2Cov(u_{it},-u_{i,t-1}) \\ &= \sigma_u^2 + \sigma_u^2 + 0 = 2\sigma_u^2 \end{align} \]
\[ \begin{align} Corr(\Delta u_{it}, \Delta u_{i,t-1}) &= \frac{Cov(\Delta u_{it}, \Delta u_{i,t-1})}{\sqrt{Var(\Delta u_{it}) \cdot Var(\Delta u_{i,t-1})}} \\ &= \frac{-\sigma_u^2}{2\sigma_u^2} = -0.5 \end{align} \]
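The \(-0.5\) correlation is easy to verify by simulation; a quick sketch with numpy (series length chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(size=100_000)   # i.i.d. errors for one individual over many periods
du = np.diff(u)                # first differences: du_t = u_t - u_{t-1}
corr = np.corrcoef(du[1:], du[:-1])[0, 1]
print(f"{corr:.2f}")           # approximately -0.5
```

Adjacent differences share the term \(u_{i,t-1}\) with opposite signs, which produces the negative correlation derived above.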
FE vs. FD Procedure
To choose between FE and FD, follow the procedure below:
Step 1: Run a first-difference regression equation of \(\Delta y_{it}\) on \(\Delta x_{it}\)
Step 2: Predict the residuals (which gives \(\Delta \hat{u}_{it}\)) and run a Breusch-Godfrey test for autocorrelation of \(\Delta \hat{u}_{it}\) on \(\Delta \hat{u}_{i,t-1}\) and \(\Delta x_{it}\).
Step 3A: If the estimated coefficient on \(\Delta \hat{u}_{i,t-1}\) is about -0.5 (i.e. -0.5 is within the 95% confidence interval) then there is an indication that \(u_{it}\) is an independent error term (\(u_{it}\) is i.i.d.). Conclusion: prefer within-estimates (fixed effects).
Step 3B: If the estimated coefficient on \(\Delta \hat{u}_{i,t-1}\) is not equal to -0.5 (i.e. -0.5 is outside the 95% confidence interval) then there is an indication that \(u_{it}\) is not an independent error term. Conclusion: prefer first-differences.
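The three steps above can be simulated. In the sketch below the errors \(u_{it}\) are generated i.i.d., so Step 3A should apply and the lag coefficient should come out near \(-0.5\); a plain least-squares regression stands in for the Breusch-Godfrey test, and all simulation parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta = 500, 6, 1.0
a = rng.normal(size=(N, 1))
x = a + rng.normal(size=(N, T))
u = rng.normal(size=(N, T))      # i.i.d. errors: Step 3A should apply
y = a + beta * x + u

# Step 1: first-difference regression of dy on dx
dx, dy = np.diff(x, axis=1), np.diff(y, axis=1)
b_fd = (dx.ravel() @ dy.ravel()) / (dx.ravel() @ dx.ravel())

# Step 2: residuals and their one-period lags within each individual
res = dy - b_fd * dx
r_t, r_lag = res[:, 1:].ravel(), res[:, :-1].ravel()
dx_t = dx[:, 1:].ravel()

# Step 3: regress the residual on its lag (and dx); inspect the lag coefficient
X = np.column_stack([np.ones_like(r_t), r_lag, dx_t])
rho = np.linalg.lstsq(X, r_t, rcond=None)[0][1]
print(f"{rho:.2f}")  # near -0.5: u_it looks i.i.d., so prefer fixed effects
```

If instead \(u_{it}\) had a unit root, \(\Delta u_{it}\) would be i.i.d., the lag coefficient would be near zero, and Step 3B would point to first differences.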
If the two procedures yield dramatically different estimates for \(\beta\), two conclusions are possible: either the regression equation is misspecified, or the assumption of strict exogeneity does not hold.
It is useful to compare the results of both regression procedures.
In R: the plm function from the plm library, using plm(y ~ x1 + x2, data = df, model = "fd").

In Stata: reg d.y d.x, cluster(number) or xtreg d.y d.x

# Install the packages if you don't have them already
# install.packages("plm")
# install.packages("AER")
# Load the libraries
library(plm)
library(AER)
# Load the Grunfeld dataset
data("Grunfeld")
# Estimate the first difference model
# We are modeling the change in investment (inv) as a function of
# the change in the value of the firm (value) and the change in capital stock (capital).
fd_model <- plm(inv ~ value + capital, data = Grunfeld, model = "fd")
# Print the summary of the regression results
summary(fd_model)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = inv ~ value + capital, data = Grunfeld, model = "fd")
##
## Balanced Panel: n = 10, T = 20, N = 200
## Observations used in estimation: 190
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -200.889558 -13.889063 0.016677 9.504223 195.634938
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -1.8188902 3.5655931 -0.5101 0.6106
## value 0.0897625 0.0083636 10.7325 < 2.2e-16 ***
## capital 0.2917667 0.0537516 5.4281 1.752e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 584410
## Residual Sum of Squares: 345460
## R-Squared: 0.40888
## Adj. R-Squared: 0.40256
## F-statistic: 64.6736 on 2 and 187 DF, p-value: < 2.22e-16

import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
# Load the Grunfeld dataset
# The statsmodels library includes the Grunfeld dataset
grunfeld_data = sm.datasets.grunfeld.load_pandas().data
# Set firm and year as the index for the panel data model
# The linearmodels package requires a MultiIndex of (entity, time)
grunfeld_data = grunfeld_data.set_index(['firm', 'year'])
# Define the dependent and independent variables
dependent = grunfeld_data['invest']
exog = sm.add_constant(grunfeld_data[['value', 'capital']])
# Estimate the first difference model
# linearmodels also provides a FirstDifferenceOLS estimator for this;
# here we difference the data manually and run OLS, which is equivalent.
# 1. Manual First-Differencing
grunfeld_diff = grunfeld_data.groupby('firm').diff().dropna()
dependent_diff = grunfeld_diff['invest']
exog_diff = sm.add_constant(grunfeld_diff[['value', 'capital']])
# We use OLS on the differenced data.
# Note: For a first-difference model, the intercept represents the average time trend.
fd_model_manual = sm.OLS(dependent_diff, exog_diff)
fd_results = fd_model_manual.fit()
# Print the summary of the regression results from our manual first difference
print("First Difference Model Results (Manual Approach)")
## First Difference Model Results (Manual Approach)
print(fd_results.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: invest R-squared: 0.411
## Model: OLS Adj. R-squared: 0.405
## Method: Least Squares F-statistic: 71.76
## Date: Wed, 29 Oct 2025 Prob (F-statistic): 2.25e-24
## Time: 12:56:40 Log-Likelihood: -1071.0
## No. Observations: 209 AIC: 2148.
## Df Residuals: 206 BIC: 2158.
## Df Model: 2
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -1.6539 3.200 -0.517 0.606 -7.963 4.656
## value 0.0897 0.008 11.271 0.000 0.074 0.105
## capital 0.2906 0.051 5.741 0.000 0.191 0.390
## ==============================================================================
## Omnibus: 46.816 Durbin-Watson: 1.611
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 550.345
## Skew: 0.348 Prob(JB): 3.12e-120
## Kurtosis: 10.919 Cond. No. 409.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

* 1. Load Stata's built-in Grunfeld dataset
* The 'clear' option removes any data currently in memory
webuse grunfeld, clear
* 2. Declare the data as a panel dataset
* Stata needs to know the panel identifier ('company') and the time variable ('time')
xtset company time
* 3. Estimate the first difference model
* Note: Stata's 'xtreg' has no 'fd' option; instead, regress the first
* differences directly using Stata's time-series 'd.' operator
* (this requires the 'xtset' declaration above).
* Stata's Grunfeld dataset uses the variable names mvalue and kstock.
reg d.invest d.mvalue d.kstock

Empirical Economics: Lecture 4 - Panel Data I