Empirical Economics

Lecture 3: Time Series and Prediction

ECU Message

Introduction

Course Overview

  • Linear Model I
  • Linear Model II
  • Time Series and Prediction
  • Panel Data I
  • Panel Data II
  • Binary Outcome Data
  • Potential Outcomes and Difference-in-differences
  • Instrumental Variables

What do we do today?

  • Have a first look at time series data

  • Discuss statistics measuring the persistence of univariate time series (autocovariance, autocorrelation)

  • Discuss various particular univariate processes

  • A key property: stationarity

  • Switch to bivariate time series and consider the danger of spurious regression

  • Consider more advanced time series models

  • Prediction in all aforementioned models

  • Material: Wooldridge Chapters 10, 11, 12, 18.5

Motivation

Motivation

  • Central question: how long does it take before a change of a key economic variable has an influence on economic processes?
    • For instance, an interest rate increase by the ECB influences the economy through several intertemporal transmission mechanisms (e.g., via lending conditions).
    • Any further examples on transmission mechanisms in your Master’s programme?
    • Other example: effect of interest on inflation.
    • Thus: A change of inflation in period \(t\) has an effect on unemployment at \(t+1\) (or \(t+2\)), etc.
  • What if a researcher ignores the intertemporal transmission mechanism? She specifies the following regression equation: \(\text{Inflation}_t = \beta_0 + \beta_1 \text{InterestRate}_t + u_t\)
    • Result 1: It leads to correlation of the error term over time: autocorrelation.
    • Result 2: Autocorrelation will lead to biased and inconsistent parameter estimates.
    • Result 3: The t-statistics and F-statistics are wrong!

Nature of Time Series

What is Time Series Data?

Definition: Time Series Data

Time series data consists of observations of a variable or several variables over time.

  • Key Feature: The data is ordered chronologically.

Examples: Time Series Data

  • Gross Domestic Product (GDP) measured quarterly
  • Monthly inflation rates
  • Daily stock prices
  • Annual government budget deficits

Characteristics of Time Series Data

  • Temporal Ordering: Unlike cross-sectional data, the order of observations in time series data matters. The past can affect the future, but the future cannot affect the past.
  • Serial Correlation (or Autocorrelation): Observations in a time series are often correlated with their own past values. For example, a high GDP in one quarter may suggest a high GDP in the next quarter.
  • Seasonality: Many time series exhibit regular patterns at certain times of the year (e.g., retail sales are higher in the fourth quarter).
  • Trends: A time series may have a long-term upward or downward movement.

Objectives of Time Series Analysis

  • There are various objectives of time series econometrics:
  1. Forecasting: Predicting future values of a variable is a primary objective. For instance, forecasting inflation is crucial for central banks.
  2. Policy Analysis: Economists use time series models to assess the impact of policy changes. For example, what is the effect of an interest rate hike on unemployment?
  3. Understanding Dynamic Economic Relationships: Time series analysis helps us understand how economic variables interact over time.

Notation and Basic Concepts

  • \(Y_t\): The value of the variable Y at time period t.
  • \(Y_{t-1}\): The value of Y in the previous period (the first lag).
  • \(Y_{t-s}\): The value of Y s periods ago (the s-th lag).
  • \(\Delta Y_t = Y_t - Y_{t-1}\): The first difference of \(Y\), which represents the change in \(Y\) from the previous period to the current period.
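These operations are easy to compute directly. A minimal sketch in Python (the series below is invented for illustration):

```python
import numpy as np

# A short hypothetical series (values invented for illustration)
Y = np.array([100.0, 102.0, 101.5, 104.0, 106.5])

# Align Y_t with its first lag Y_{t-1} (both cover t = 2, ..., T)
Y_t = Y[1:]
Y_t_minus_1 = Y[:-1]

# First difference: Delta Y_t = Y_t - Y_{t-1}
delta_Y = np.diff(Y)

print(delta_Y)  # [ 2.  -0.5  2.5  2.5]
```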

Interpretation of Dynamic Models

Interpreting Dynamic Econometric Models

  • What Makes a Model “Dynamic”?

    • A model is considered dynamic if it explicitly includes lagged values of the dependent variable, the independent variables, or both. This structure allows us to move beyond simple static relationships and analyze how variables evolve and interact over time.
  • A common dynamic specification is the Autoregressive model:

    \[ y_t = \alpha + \sum_{i=1}^{p} \beta_i y_{t-i} + \epsilon_t \]

  • Interpreting the coefficients (\(\beta_i\)) directly is often not the primary goal. Instead, we focus on key dynamic characteristics derived from them.

Key Questions

Key Questions to Answer When Interpreting Dynamic Models

  1. What is the immediate impact of a change?
    • Short-Run or Impact Multiplier: This measures the instantaneous effect of a one-unit change in an independent variable (\(x\)) on the dependent variable (\(y\)).
  2. What is the total, long-run effect?
    • Long-Run Multiplier (or Long-Run Propensity): This captures the total cumulative effect on \(y\) after a permanent one-unit change in \(x\), once the system has fully adjusted to its new equilibrium. It is calculated from the estimated coefficients.
  3. How quickly does the system adjust?
    • Speed of Adjustment: This indicates how fast the dependent variable responds to deviations from its long-run equilibrium. It is determined by the coefficients on the lagged dependent variable (\(\beta_i\)).
  • By analyzing these dynamic properties, we gain a much richer understanding of the underlying economic relationships than a static model could ever provide. We can distinguish between temporary shocks and permanent changes and understand the full adjustment path of the system over time.
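To make these multipliers concrete, consider a sketch with hypothetical coefficients for a model with one lag of \(y\) and of \(x\): \(y_t = \alpha + \rho y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + u_t\). The numbers are invented; the formulas follow from evaluating the model at its steady state:

```python
# Hypothetical estimates for y_t = alpha + rho*y_{t-1} + beta0*x_t + beta1*x_{t-1} + u_t
rho, beta0, beta1 = 0.6, 0.5, 0.2

# Impact (short-run) multiplier: the instantaneous effect of a unit change in x
impact = beta0

# Long-run multiplier: in the new equilibrium y* = alpha + rho*y* + (beta0+beta1)*x*,
# so dy*/dx* = (beta0 + beta1) / (1 - rho)
long_run = (beta0 + beta1) / (1 - rho)

print(impact)    # 0.5
print(long_run)  # 1.75 (up to floating point)
```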

Time Trend Example

Introduction to Dynamic Processes

A Time Series as a Stochastic Process

Definition: Time Series (Mathematical)

A time series is a set of random variables indexed by time, denoted as \(\{Y_1, Y_2, \dots, Y_T\}\) or simply \(\{Y_t\}\).

  • Stochastic vs. Deterministic: We treat a time series as a stochastic process because we don’t know the future values with certainty. We can only talk about their probability distributions.

  • One Realization: The actual data we have (e.g., GDP from 1950-2024) is just one of many possible paths the process could have taken.

  • Our goal is to infer the properties of the underlying process from this single realization.

Basic Statistical Properties

  • Just like any random variable, each point in a time series has statistical properties.

    • Mean: The expected value of the process at time t: \(E(Y_t) = \mu_t\)
    • Variance: The variance of the process at time t: \(Var(Y_t) = E[(Y_t - \mu_t)^2] = \sigma_t^2\)
  • Crucially, in general, the mean and variance could be different at each point in time.

Autocovariance

  • Covariance:
    • Recall that covariance measures how two variables move together.
  • Autocovariance
    • Autocovariance is the covariance of a time series with its own past values. It measures the linear dependence between different points in the series.

Definition: Autocovariance

The \(k\)-th Order Autocovariance (\(\gamma_k\)):

\[ \gamma(t, t-k) = Cov(Y_t, Y_{t-k}) = E[(Y_t - \mu_t)(Y_{t-k} - \mu_{t-k})] \]

  • This measures the covariance between the series at time t and time t-k.

Autocorrelation Function (ACF)

  • The value of the autocovariance (\(\gamma_k\)) depends on the units of the variable \(Y\). This makes it hard to compare across different series.

  • Autocorrelation: standardize the autocovariance to get the autocorrelation, which is unit-free.

Definition: Autocorrelation

The \(k\)-th Order Autocorrelation (\(\rho_k\)):

\[ \rho_k = \frac{ Cov(Y_t, Y_{t-k}) }{\sqrt{Var(Y_t) Var(Y_{t-k})}} \]

If the series is stationary, this simplifies to:

\[ \rho_k = \frac{\gamma_k}{\gamma_0} \]

where \(\gamma_0\) is the variance, \(Var(Y_t)\).

Interpretation of the Autocorrelation Function (ACF)

  • The ACF, denoted by \(\rho_k\), measures the “memory” or persistence of a time series.
    • It is a value between -1 and +1.
    • \(\rho_1\): The correlation between \(Y_t\) and \(Y_{t-1}\) (“today” and “yesterday”).
    • \(\rho_k\): The correlation between \(Y_t\) and \(Y_{t-k}\).
    • A high \(\rho_k\) suggests that a shock in one period will have a persistent effect k periods later.
  • For a stationary series, we expect \(\rho_k \rightarrow 0\) as \(k\) gets larger.

The Correlogram (ACF Plot)

  • A correlogram is a bar chart that plots the sample autocorrelations (\(r_k\)) for different lags (\(k = 1, 2, 3, \dots\)).
    • It provides a visual summary of the persistence in a time series. We can see how quickly the correlations die out.
    • Correlograms usually include confidence bands. Autocorrelations that fall outside these bands are considered statistically different from zero.
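The sample autocorrelations behind a correlogram can be computed by hand. A minimal sketch, simulating a persistent AR(1) series, so the estimates only approximate the theoretical values \(0.8, 0.8^2, \dots\):

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_{max_lag}: r_k = gamma_k / gamma_0."""
    y = np.asarray(y, dtype=float)
    yd = y - y.mean()
    gamma0 = np.sum(yd ** 2) / len(y)  # sample variance (gamma_0)
    return np.array([np.sum(yd[k:] * yd[:-k]) / len(y) / gamma0
                     for k in range(1, max_lag + 1)])

# Simulate a persistent AR(1) series: Y_t = 0.8*Y_{t-1} + u_t
rng = np.random.default_rng(0)
u = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + u[t]

# The correlations should start near 0.8 and die out geometrically
print(sample_acf(y, 3))
```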

Modeling a Single Time Series

Why Model a Single Series?

  • Before we ask how X affects Y, we must first understand the behavior of Y itself.
  • Univariate models describe the dynamic properties of a single time series using its own past.
  • Main uses:
    1. Understanding Persistence: How long do shocks to the economy last?
    2. Forecasting: Using the past of a series to predict its future.

The White Noise Process

  • The Simplest Time Series: A process called “white noise” is the building block for more complex models. We often denote it as \(u_t\).

Definition: White Noise Process

A white noise process is a time series such that:

  • \(E(u_t) = 0\) (Zero mean)
  • \(Var(u_t) = \sigma^2\) (Constant variance)
  • \(Cov(u_t, u_s) = 0\) for all \(t\neq s\) (No autocorrelation)
  • Interpretation: A white noise process is completely random and unpredictable. Its ACF will be zero for all lags \(k > 0\). It is the ideal error term in a time series regression.

The Autoregressive (AR) Model: AR(1)

  • The simplest dynamic model is the first-order autoregressive, or AR(1), model:

\[ Y_t = \alpha + \rho Y_{t-1} + u_t \]

  • \(Y_t\): The value of Y at time t.
  • \(Y_{t-1}\): The value of Y in the previous period.
  • \(\rho\): The autoregressive coefficient, which measures the persistence of the series.
  • \(u_t\): A white noise error term (uncorrelated with past values).
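A quick simulation sketch (parameter values invented) illustrates mean reversion: with \(|\rho| < 1\), the series fluctuates around the unconditional mean \(\alpha/(1-\rho)\):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_ar1(alpha, rho, n, rng):
    """Simulate Y_t = alpha + rho*Y_{t-1} + u_t with standard-normal white noise."""
    y = np.zeros(n)
    u = rng.normal(size=n)
    y[0] = alpha / (1 - rho)  # start at the unconditional mean
    for t in range(1, n):
        y[t] = alpha + rho * y[t - 1] + u[t]
    return y

# Stationary case: |rho| < 1, unconditional mean alpha/(1-rho) = 2
y = simulate_ar1(alpha=1.0, rho=0.5, n=5000, rng=rng)
print(y.mean())  # close to 2.0
```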

Visualization AR(1) Process

Stationarity

  • For analysis to be tractable, we often assume a time series is covariance stationary (or weakly stationary).

  • This means the three aforementioned statistical properties do not change over time. Three conditions must hold:

Definition: Stationarity

  • Constant Mean: \(E(Y_t) = \mu\) for all t. The series fluctuates around a constant level.
  • Constant Variance: \(Var(Y_t) = \sigma^2\) for all t. The volatility of the series is constant.
  • Constant Autocovariance: \(Cov(Y_t, Y_{t-k}) = \gamma_k\) for all t. The covariance between two points depends only on the lag \(k\) (how far apart they are), not on their position in time.

Testing for Stationarity

  • We start with an AR(1) model, which is the building block for understanding persistence:

\[ y_t = \rho y_{t-1} + \epsilon_t \]

  • The entire dynamic behavior of this series is governed by the value of \(\rho\).
    • If \(|\rho| < 1\), the process is stationary. Shocks (\(\epsilon_t\)) are temporary and their effects decay over time. The series will always revert to its mean.
    • If \(\rho = 1\), the process has a unit root and is non-stationary. Shocks are permanent, fundamentally altering the future path of the series. This is a “random walk”.

Dickey-Fuller Test

  • Directly testing \(\rho=1\) is problematic. Instead, Dickey and Fuller proposed a clever transformation. We subtract \(y_{t-1}\) from both sides of the AR(1) equation:

    \[ y_t - y_{t-1} = \rho y_{t-1} - y_{t-1} + \epsilon_t \]

  • This simplifies to the classic Dickey-Fuller regression form:

    \[ \Delta y_t = (\rho - 1) y_{t-1} + \epsilon_t \quad \implies \quad \Delta y_t = \gamma y_{t-1} + \epsilon_t \]

  • The test is now focused on the coefficient \(\gamma = (\rho - 1)\):

    • Null Hypothesis (\(H_0\)): \(\gamma = 0\) (This is equivalent to \(\rho = 1\), meaning a unit root exists).
    • Alternative Hypothesis (\(H_A\)): \(\gamma < 0\) (This is equivalent to \(\rho < 1\), meaning the series is stationary).

Dickey-Fuller Distribution

  • We cannot simply perform a standard t-test on \(\gamma\).
    • Under the null hypothesis, the regressor \(y_{t-1}\) is non-stationary.
    • This violates the standard assumptions required for the t-statistic to follow a t-distribution.
  • Instead, the test statistic follows a unique Dickey-Fuller distribution, which has its own set of critical values.
    • These critical values are larger in absolute value (more negative) than the standard t-distribution values, making it harder to reject the null hypothesis of a unit root.

Augmented Dickey-Fuller (ADF) Test

  • The simple DF test assumes the error term \(\epsilon_t\) is white noise. What if the true underlying process is more complex, like an AR(2), causing the errors in the simple DF regression to be serially correlated?

  • The Augmented Dickey-Fuller (ADF) Test adds as many lagged difference terms as necessary to eliminate serial correlation in the residuals of the test equation:

\[ \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \epsilon_t \]

  • The test for the unit root remains the t-test on the \(\gamma\) coefficient using the Dickey-Fuller critical values.
    • The lagged difference terms are only included to ensure the validity of the test; their coefficients are not part of the unit root test itself.
    • The terms \(\alpha\) (constant) and \(\beta t\) (time trend) are included to capture a potential drift or deterministic trend in the series.

Dickey-Fuller Test in Software

  • The most common specification (including a constant) is shown. The decision is always the same: If p-value < 0.05, reject the null and conclude the series is stationary.
Code
# Import the function
from statsmodels.tsa.stattools import adfuller

# Assume 'y' is a pandas Series or NumPy array
# The most common test includes a constant ('c')
result = adfuller(y, regression='c', autolag='AIC')

# Print the key results
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')

# result[1] is the p-value.
# To test with a constant and trend, use regression='ct'
Code
# Using the simpler 'tseries' package
# Install if needed: install.packages("tseries")
library(tseries)

# Assume 'y' is a time series object
# The default test includes a constant
result <- adf.test(y)

# Print the result
print(result)

# The p-value is directly reported.
# For a test with constant and trend using the more robust 'urca' package:
# library(urca)
# summary(ur.df(y, type = "trend", lags = ...))
Code
/* Assume the time series variable is named 'y' and you have tsset your data */

/* Test with a constant (the default) */
/* The 'lags(k)' option can be used to specify the number of lags */
dfuller y

/* Test with a constant and a trend */
dfuller y, trend

/* Stata reports the test statistic, the 1%, 5%, and 10% critical values,
   and a MacKinnon approximate p-value for the test statistic. */

Stationarity of an AR(1) Process

  • For the AR(1) model \(Y_t = \alpha + \rho Y_{t-1} + u_t\):
    • If \(|\rho| < 1\), the series is stationary. Any shock (\(u_t\)) will eventually die out.
    • If \(\rho = 1\), the series is non-stationary and is called a random walk. Shocks have a permanent effect.
    • If \(|\rho| > 1\), the series is explosive (this is rare in economics).
  • Speed of Adjustment: A value of \(\rho\) close to 0 suggests that the effect of a past value dies out quickly.
  • Persistence: A value of \(\rho\) close to 1 indicates that a shock to the system will persist for a long time. The series will return to its mean slowly.
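One common way to quantify the speed of adjustment is the half-life of a shock: in the AR(1) model, a shock moves \(Y_{t+k}\) by \(\rho^k\), so the half-life solves \(\rho^k = 0.5\). A small sketch:

```python
import numpy as np

# Half-life of a shock in an AR(1): rho**k = 0.5  =>  k = ln(0.5)/ln(rho)
for rho in (0.2, 0.5, 0.9):
    half_life = np.log(0.5) / np.log(rho)
    print(f"rho = {rho}: half-life ~ {half_life:.2f} periods")

# rho = 0.2 dies out almost immediately (~0.43 periods);
# rho = 0.5 takes ~1 period; rho = 0.9 is persistent (~6.58 periods)
```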

The Random Walk Model

  • The special case of non-stationarity in the AR(1) model occurs when \(\rho=1\):

    \[ Y_t = Y_{t-1} + u_t \]

  • The best forecast for tomorrow’s value is today’s value.

  • Permanent Shocks: The impact of a shock \(u_t\) is permanent; it is carried forward in all future periods.

    • Examples: The prices of financial assets are often modeled as random walks.
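The permanence of shocks shows up in the variance: for a random walk starting at zero, \(Var(Y_t) = t\,\sigma^2\), which grows without bound. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate 2000 independent random-walk paths: Y_t = Y_{t-1} + u_t, Y_0 = 0
n_paths, T = 2000, 100
u = rng.normal(size=(n_paths, T))
paths = u.cumsum(axis=1)  # a random walk is the cumulative sum of its shocks

# Var(Y_t) = t * sigma^2 grows with t: the process is non-stationary
print(paths[:, 9].var())   # close to 10
print(paths[:, 99].var())  # close to 100
```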

Visualization Random Walk

Higher-Order AR(p) Models

  • We can include more lags of the dependent variable. An \(AR(p)\) model includes \(p\) lags:

\[ Y_t = \alpha + \rho_1 Y_{t-1} + \rho_2 Y_{t-2} + \dots + \rho_p Y_{t-p} + u_t \]

  • Choosing p: The number of lags (\(p\)) can be selected using the Adjusted \(R^2\) or information criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). More about this later in the lecture.

The Moving Average Model

  • In the MA model, the current value is a function of the current and past random shocks.

    \[ Y_t = \mu + u_t + \theta u_{t-1} \]

  • \(\mu\): The mean of the series.

  • \(\theta\): The moving average coefficient.

  • The MA(1) process has a finite memory. A shock \(u_{t-1}\) affects \(Y_{t-1}\) and \(Y_t\), but has no direct effect on \(Y_{t+1}\).

Estimation and Interpretation of MA Coefficients (\(\theta\))

  • As you may have noticed, there are no independent variables in an MA model.
    • Usually, what we do is assume \(u_0=0\), take a guess for \(\mu\) and \(\theta\), and solve for \(u_t\): \(u_t = Y_t - \mu - \theta u_{t-1}\)
    • If the true shocks came from a Normal distribution \(N(0, \sigma^2_u)\), how likely was it to get this specific sequence of inferred shocks? The joint probability of observing this sequence is called the Likelihood.
    • We try to find the most likely \(\mu\) and \(\theta\) that could have generated the actual data.
  • Unlike an AR(1) model, a shock in an MA(1) model only persists for one period. \(Y_t\) is affected by \(u_t\) and \(u_{t-1}\), but \(Y_{t+1}\) will not be affected by \(u_t\).
  • The coefficient \(\theta\) determines the extent to which a past shock affects the current observation.
  • The MA model is always stationary.
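The finite memory of the MA(1) is visible in its ACF: \(\rho_1 = \theta/(1+\theta^2)\) and \(\rho_k = 0\) for \(k > 1\). A simulation sketch (parameter values invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate an MA(1): Y_t = mu + u_t + theta*u_{t-1}
theta, mu, n = 0.5, 0.0, 20000
u = rng.normal(size=n + 1)
y = mu + u[1:] + theta * u[:-1]

def r(y, k):
    """Sample autocorrelation at lag k."""
    yd = y - y.mean()
    return np.sum(yd[k:] * yd[:-k]) / np.sum(yd ** 2)

# Theory: rho_1 = theta/(1+theta^2) = 0.4, rho_2 = 0 (finite memory)
print(r(y, 1))  # close to 0.4
print(r(y, 2))  # close to 0
```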

The Autoregressive Moving Average (ARMA) Model

  • We can combine the features of autoregressive and moving average models to create an ARMA(p,q) model.
  • For example, an ARMA(1,1) Model: \(Y_t = \alpha + \rho Y_{t-1} + u_t + \theta u_{t-1}\)
  • ARMA models are very flexible and can capture a wide range of time series dynamics. This is how they are estimated in Python/R/Stata:
Code
# Generate sample data
set.seed(123)
data <- rnorm(100)

# Fit ARMA(1,1) model
arma_model <- arima(data, order = c(1, 0, 1))  # (p, d, q) where d=0 for ARMA

# Print summary
summary(arma_model)
##           Length Class  Mode     
## coef        3    -none- numeric  
## sigma2      1    -none- numeric  
## var.coef    9    -none- numeric  
## mask        3    -none- logical  
## loglik      1    -none- numeric  
## aic         1    -none- numeric  
## arma        7    -none- numeric  
## residuals 100    ts     numeric  
## call        3    -none- call     
## series      1    -none- character
## code        1    -none- numeric  
## n.cond      1    -none- numeric  
## nobs        1    -none- numeric  
## model      10    -none- list
Code
import numpy as np
import statsmodels.api as sm

# Generate sample data
np.random.seed(123)
data = np.random.randn(100)

# Fit ARMA(1,1) model
arma_model = sm.tsa.ARIMA(data, order=(1,0,1))
arma_results = arma_model.fit()

# Print summary
print(arma_results.summary())
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                  100
## Model:                 ARIMA(1, 0, 1)   Log Likelihood                -153.957
## Date:                Wed, 29 Oct 2025   AIC                            315.915
## Time:                        12:56:11   BIC                            326.335
## Sample:                             0   HQIC                           320.132
##                                 - 100                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.0270      0.114      0.237      0.812      -0.196       0.250
## ar.L1          0.0035     16.295      0.000      1.000     -31.935      31.942
## ma.L1          0.0036     16.302      0.000      1.000     -31.949      31.956
## sigma2         1.2729      0.219      5.809      0.000       0.843       1.702
## ===================================================================================
## Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 1.74
## Prob(Q):                              1.00   Prob(JB):                         0.42
## Heteroskedasticity (H):               0.70   Skew:                             0.03
## Prob(H) (two-sided):                  0.31   Kurtosis:                         2.36
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
Code
* Generate sample data  
set obs 100  
set seed 123  
gen y = rnormal()  

* Estimate ARMA(1,1)  
arima y, ar(1) ma(1)  

* Display results  
estimates table  

How to pick a model?

  • To build a dynamically complete model, we need a model whose residuals are white noise (i.e., free from autocorrelation), ensuring that we have captured all the systematic, predictable information in the data.

  • Step 1: Analyze the Variables (Pre-Modeling)

    • Tool: Augmented Dickey-Fuller (ADF) Test.
    • Test each variable (dependent and independent) for unit roots.
      • If a variable is stationary (I(0)), use it in its level form (\(y_t, x_t\)).
      • If a variable is non-stationary (I(1)), use it in its first-differenced form (\(\Delta y_t, \Delta x_t\)) to induce stationarity.
  • This step determines the correct form of the variables to include in our initial regression.

Example Stationarity

  • First differences can induce stationarity.
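A minimal sketch of this idea: a random walk is I(1), but its first difference is exactly the white-noise shock series, which is stationary:

```python
import numpy as np

rng = np.random.default_rng(4)

# Random walk in levels: non-stationary
u = rng.normal(size=1000)   # white-noise shocks
y = u.cumsum()              # Y_t = Y_{t-1} + u_t

# First differencing recovers the stationary shocks
dy = np.diff(y)
print(np.allclose(dy, u[1:]))  # True
```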

How to pick a model? (Cont.)

  • Step 2: Run a regression and perform diagnostics:
    • Run an initial OLS regression based on economic theory, using the variable forms decided in Step 1.
    • Beware of spurious regression: when one non-stationary series is regressed on another, the resulting OLS model can appear to fit well, but the relationship is nonsensical.
    • The “error” in such a model is not a random, white-noise process. Instead, the residuals themselves are highly persistent and non-stationary.
  • Step 3: Diagnose the Model
    • Use the Breusch-Godfrey (BG) test to check for serial correlation in the error terms.
    • Are the residuals, \(\hat{u}_t\), white noise? Or is there still predictable information left in them that our model has failed to capture?

Breusch-Godfrey Test

Definition: Breusch-Godfrey Test

The BG test assesses whether the error term, \(\epsilon_t\), follows an AR(p) process. You, the researcher, choose the number of lags (p) to test for. The assumed model for the error is: \[\epsilon_t = \rho_1 \epsilon_{t-1} + \rho_2 \epsilon_{t-2} + \dots + \rho_p \epsilon_{t-p} + v_t,\]

where \(v_t\) is a white noise term. \(H_0: \rho_1 = \dots = \rho_p = 0\), and \(H_A\) is that at least one \(\rho_j\) coefficient is non-zero: there is serial correlation up to order \(p\).

Breusch-Godfrey Test Steps

Breusch-Godfrey Test: Procedure

The BG test is performed in three simple steps:

  1. Estimate your original model using OLS, regardless of what it is, and obtain the series of residuals \(e_t\).
  2. Run a new regression where the residual \(e_t\) is the dependent variable. The independent variables are:
    • All the original regressors from the main model.
    • The lagged residuals up to order \(p\): \(e_{t-1}, \dots, e_{t-p}\).
  3. Test the significance of the coefficients on the lagged residuals.

  • When testing for first-order serial correlation (p=1), a t-test can be used to determine the significance of the coefficient on the single lagged residual in the auxiliary regression.
    • Under the null hypothesis of no serial correlation, this t-statistic follows a t-distribution.
    • If the absolute value of the calculated t-statistic exceeds the critical value from the t-distribution, the null hypothesis is rejected, indicating the presence of serial correlation.
  • When testing for higher orders of serial correlation (p > 1), multiple lagged residuals are included in the auxiliary regression.
    • In this situation, an F-test is employed to test the joint significance of all the coefficients on the lagged residuals.
    • A significant F-statistic suggests the presence of serial correlation.
    • For the case of a single lagged residual (p=1), the F-statistic is equal to the square of the t-statistic, yielding the same conclusion.

Example BG Test

  • We demonstrate the BG test using autocorrelated errors: \(Y_t = \alpha + \beta X_t + u_t\) where \(u_t = \rho u_{t-1} + e_t\), with \(e_t\) being white noise.
Code
library(lmtest)
# Step 1: Simulate Data with Serially Correlated Errors

# For reproducibility of the random numbers
set.seed(42) 

# Define simulation parameters
n <- 200                 # Sample size
alpha <- 5               # True intercept
beta <- 2                # True slope for X

# *** Key step: Define the autocorrelation structure for the errors ***
# We will create an error term u_t = rho * u_{t-1} + e_t
# where e_t is white noise.
rho <- 0.8               # High positive autocorrelation coefficient
sigma_e <- 1.5           # Standard deviation of the white noise component

# Generate the independent variable X
X <- rnorm(n, mean = 20, sd = 10)

# Generate the serially correlated error term 'u'
u <- numeric(n)
e <- rnorm(n, mean = 0, sd = sigma_e) # White noise component

# The first error term has no preceding term
u[1] <- e[1] 

# Generate the rest of the errors using the AR(1) formula
for (t in 2:n) {
  u[t] <- rho * u[t-1] + e[t]
}

# Generate the dependent variable Y using the regression equation
# Y_t = alpha + beta*X_t + u_t
Y <- alpha + beta * X + u

# Combine into a data frame for easy use
sim_data <- data.frame(Y = Y, X = X)

# Step 2: Estimate the OLS Regression Model

# We estimate the model Y ~ X, pretending we don't know about the
# autocorrelation in the errors.
ols_model <- lm(Y ~ X, data = sim_data)

# Print the summary. The coefficients will likely be close to the true
# values, but the standard errors and p-values are unreliable.
print(summary(ols_model))
## 
## Call:
## lm(formula = Y ~ X, data = sim_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0295 -1.3534 -0.0239  1.3439  5.1120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.52239    0.31093   17.76   <2e-16 ***
## X            1.97522    0.01414  139.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.944 on 198 degrees of freedom
## Multiple R-squared:   0.99,  Adjusted R-squared:  0.9899 
## F-statistic: 1.952e+04 on 1 and 198 DF,  p-value: < 2.2e-16


#--------------------------------------------------------------------
# Step 3: Conduct the Breusch-Godfrey Test
#--------------------------------------------------------------------

# The bgtest() function from the 'lmtest' package performs the test.
# We need to specify the model and the order of autocorrelation to test for.
# Since we created an AR(1) process, we will test for order 1.
bg_test_order1 <- bgtest(ols_model, order = 1)
print(bg_test_order1)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  ols_model
## LM test = 96.14, df = 1, p-value < 2.2e-16

# You can also test for higher orders. For example, testing up to order 3.
# This would test the joint null hypothesis that the coefficients on the first
# three lags of the residuals are all zero in the auxiliary regression.
bg_test_order3 <- bgtest(ols_model, order = 3)
print(bg_test_order3)
## 
##  Breusch-Godfrey test for serial correlation of order up to 3
## 
## data:  ols_model
## LM test = 96.794, df = 3, p-value < 2.2e-16
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Step 1: Simulate Data with Serially Correlated Errors

# For reproducibility of the random numbers
np.random.seed(42)

# Define simulation parameters
n = 200                 # Sample size
alpha = 5               # True intercept
beta = 2                # True slope for X

# *** Key step: Define the autocorrelation structure for the errors ***
# We will create an error term u_t = rho * u_{t-1} + e_t
# where e_t is white noise.
rho = 0.8               # High positive autocorrelation coefficient
sigma_e = 1.5           # Standard deviation of the white noise component

# Generate the independent variable X
# In numpy, mean is 'loc' and standard deviation is 'scale'
X = np.random.normal(loc=20, scale=10, size=n)

# Generate the serially correlated error term 'u'
# Initialize arrays for the white noise and the final error term
e = np.random.normal(loc=0, scale=sigma_e, size=n)
u = np.zeros(n)

# The first error term has no preceding term (Python uses 0-based indexing)
u[0] = e[0]

# Generate the rest of the errors using the AR(1) formula
for t in range(1, n):
  u[t] = rho * u[t-1] + e[t]

# Generate the dependent variable Y using the regression equation
# Y_t = alpha + beta*X_t + u_t
Y = alpha + beta * X + u

# Combine into a data frame for easy use
sim_data = pd.DataFrame({'Y': Y, 'X': X})

# Step 2: Estimate the OLS Regression Model

# We estimate the model Y ~ X, pretending we don't know about the
# autocorrelation in the errors.
# The formula syntax in statsmodels is identical to R's.
# We must call .fit() to perform the estimation.
ols_model = smf.ols('Y ~ X', data=sim_data).fit()

# Print the summary. The coefficients will likely be close to the true
# values, but the standard errors and p-values are unreliable.
print(ols_model.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                      Y   R-squared:                       0.985
## Model:                            OLS   Adj. R-squared:                  0.985
## Method:                 Least Squares   F-statistic:                 1.273e+04
## Date:                Wed, 29 Oct 2025   Prob (F-statistic):          1.24e-181
## Time:                        12:56:11   Log-Likelihood:                -452.92
## No. Observations:                 200   AIC:                             909.8
## Df Residuals:                     198   BIC:                             916.4
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept      5.3778      0.387     13.914      0.000       4.616       6.140
## X              2.0112      0.018    112.817      0.000       1.976       2.046
## ==============================================================================
## Omnibus:                        4.086   Durbin-Watson:                   0.452
## Prob(Omnibus):                  0.130   Jarque-Bera (JB):                4.160
## Skew:                           0.196   Prob(JB):                        0.125
## Kurtosis:                       3.587   Cond. No.                         50.7
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# Step 3: Conduct the Breusch-Godfrey Test

# The acorr_breusch_godfrey() function performs the test.
# It takes the fitted model results and the order of autocorrelation (nlags).
# It returns (lm_statistic, lm_pvalue, f_statistic, f_pvalue)

# Test for order 1, as we created an AR(1) process.
bg_test_order1 = acorr_breusch_godfrey(ols_model, nlags=1)

print("\n" + "="*50)
## 
## ==================================================
print("Breusch-Godfrey Test (Order 1)")
## Breusch-Godfrey Test (Order 1)
print(f"  LM Statistic: {bg_test_order1[0]:.4f}")
##   LM Statistic: 119.7827
print(f"  p-value (LM): {bg_test_order1[1]:.4f}")
##   p-value (LM): 0.0000
# A very low p-value indicates rejection of the null hypothesis (no serial correlation).

# You can also test for higher orders. For example, testing up to order 3.
# This tests the joint null hypothesis that the coefficients on the first
# three lags of the residuals are all zero in the auxiliary regression.
bg_test_order3 = acorr_breusch_godfrey(ols_model, nlags=3)

print("\n" + "="*50)
## 
## ==================================================
print("Breusch-Godfrey Test (Order 3)")
## Breusch-Godfrey Test (Order 3)
print(f"  LM Statistic: {bg_test_order3[0]:.4f}")
##   LM Statistic: 119.9787
print(f"  p-value (LM): {bg_test_order3[1]:.4f}")
##   p-value (LM): 0.0000
print("="*50)
## ==================================================
Code
* Housekeeping: Clear memory and stop the output from pausing
clear all
set more off

* Step 1: Simulate Data with Serially Correlated Errors
set seed 42

* Define simulation parameters using local macros
local n = 200                 // Sample size
local alpha = 5               // True intercept
local beta = 2                // True slope for X
local rho = 0.8               // High positive autocorrelation coefficient
local sigma_e = 1.5           // Standard deviation of the white noise component

* Set the number of observations for the dataset
set obs `n'

* Generate an index variable 't' to declare this as time-series data
gen t = _n
tsset t

* Generate the independent variable X
gen X = rnormal(20, 10)

* Generate the serially correlated error term 'u'
* First, create the white noise component 'e'
gen e = rnormal(0, `sigma_e')

* Now, generate 'u' using the recursive AR(1) formula
* u_t = rho * u_{t-1} + e_t
* We initialize u as missing, set the first value, then generate the rest
gen u = .
replace u = e in 1
replace u = `rho' * L.u + e in 2/l // L. is the lag operator, 2/l means "from obs 2 to the last"

* Generate the dependent variable Y using the regression equation
* Y_t = alpha + beta*X_t + u_t
gen Y = `alpha' + `beta'*X + u


* Step 2: Estimate the OLS Regression Model

* We estimate the model Y ~ X, pretending we don't know about the
* autocorrelation in the errors. The output is printed automatically.
regress Y X

* Step 3: Conduct the Breusch-Godfrey Test

* The `estat bgodfrey` command is a post-estimation command that must be
* run immediately after the regression.

* Test for order 1, as we created an AR(1) process.
estat bgodfrey, lags(1)

* You can also test for higher orders. For example, testing up to order 3.
* This tests the joint null hypothesis that the coefficients on the first
* three lags of the residuals are all zero in the auxiliary regression.
estat bgodfrey, lags(3)

How to pick a model? (Cont.)

  • Case 1: The BG Test Fails to Reject \(H_0\) (No Autocorrelation)
    • Interpretation: Your simple model may be dynamically complete. No further dynamic terms are needed. You can proceed with interpreting the coefficients (cautiously, pending other diagnostic tests).
  • Case 2: The BG Test Rejects \(H_0\) (Autocorrelation is Present)
    • Interpretation: Your model is dynamically misspecified and incomplete. The significant autocorrelation in the residuals is a clear signal that there is persistence in the system that your current model is ignoring.
    • Solution: the information “trapped” in the autocorrelated errors must be brought into the model explicitly. We do this by considering more extensive models.
    • Potentially by incorporating independent variables, lagged dependent variables and lagged independent variables.

Relationship Between Time Series

Linear Model in Time Series

  • The starting point for a relationship between time series is often the linear model.

  • The model form is identical to its cross-sectional counterpart:

    \[ Y_t = \alpha + \beta X_t + \epsilon_t \]

  • This model posits a contemporaneous relationship between X and Y.

  • Question: Can we estimate this with OLS and trust the results?

  • For OLS to be a good estimator, the standard assumptions must hold. The most problematic one for time series data is the assumption that the regressor has a finite, time-constant variance:

    \[ \text{Var}(X_t) = \sigma^2 \quad \text{for all } t \]

The Problem of Spurious Regression

  • OLS can still work in special cases, for instance when \(X_t\) is stationary, or when \(X_t\) and \(Y_t\) are cointegrated.
  • In many other situations, however, OLS may suggest a deceptive relationship:
    • A spurious regression occurs when we find a statistically significant relationship between two time series variables that are actually unrelated.
  • This typically happens when both variables have a strong trend (are non-stationary), and in particular when \(X\) is non-stationary.
    • The trend can be deterministic (predictable) or stochastic (random).

Spurious Regression and Non-Stationarity

  • Let’s look at the most common non-stationary process: the random walk.

    \[ X_t = X_{t-1} + u_t \]

Theorem: Variance of a Random Walk

Suppose \(X_0=0\). Let us find the variance of \(X_t\).

We know that \(X_1 = u_1\), \(X_2 = X_1 + u_2 = u_1 + u_2\), etc.

Hence, \(X_t = \sum_{i=1}^{t} u_i\), and \(Var(X_t) = Var(u_1 + u_2 + \dots + u_t)\).

Since the \(u_i\) are uncorrelated (white noise), the variance of the sum is the sum of the variances: \[ \begin{align} Var(X_t) &= \underbrace{\sigma^2_u + \dots + \sigma^2_u}_{t \text{ times}} \\ &= t \times \sigma^2_u \end{align} \]

The variance depends on time \(t\) and grows without bound. This is a clear violation of stationarity.
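A quick simulation confirms the theorem. The sketch below (our own illustration, with arbitrarily chosen sample size and \(\sigma_u\)) simulates many independent random walks and checks that the cross-sectional variance at time \(t\) is approximately \(t \sigma^2_u\):

```python
import numpy as np

# Simulate many independent random walks X_t = X_{t-1} + u_t with X_0 = 0,
# then compare the sample variance at each t with the theoretical t * sigma_u^2.
rng = np.random.default_rng(0)
n_paths, T, sigma_u = 100_000, 50, 1.0

u = rng.normal(0, sigma_u, size=(n_paths, T))  # white noise innovations
X = u.cumsum(axis=1)                           # each row is one random walk

for t in (10, 25, 50):
    print(f"t = {t:2d}: sample Var(X_t) = {X[:, t - 1].var():.2f} "
          f"(theory: {t * sigma_u**2:.0f})")
```

The sample variance tracks \(t \sigma^2_u\) closely, growing without bound exactly as the derivation predicts.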

Consequences of Spurious Regression

  • If you run a regression of \(Y_t\) on \(X_t\), where both are trending but unrelated:
    • You will likely find a high R-squared.
    • The t-statistics for the estimated coefficients will likely be significant.
    • You might conclude that X causes Y, but this conclusion would be meaningless and misleading.
  • Detecting spurious regression: Dickey-Fuller tests
    • If both series \(Y_t\) and \(X_t\) have a unit root (i.e., you fail to reject \(H_0\) for both), then running a simple OLS regression of Y on X is at high risk of being spurious.

Avoiding Spurious Regression

  • If the static model \(Y_t = \alpha + \beta X_t + \epsilon_t\) is so dangerous, what is the alternative?

  • Differencing the Data: If two variables are non-stationary, their first differences (\(\Delta Y_t\) and \(\Delta X_t\)) may be stationary.

  • Dynamic models: We need models that explicitly account for dynamics—the idea that a change in X can affect Y both today and in the future.

    • This makes it more likely that the error term is white noise.

Distributed Lag (DL) Models

  • Distributed lag models are used when we want to model how a change in an explanatory variable (X) affects the dependent variable (Y) over time, and we want to focus on potential delayed effects.
    • Example: How does a change in government spending (X) affect GDP (Y) in the current quarter and in future quarters?

Definition: Distributed Lag Model

A finite distributed lag model of order \(q\) is:

\[Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \dots + \beta_q X_{t-q} + u_t\] where:

\(\beta_0\): The impact multiplier - the immediate effect of a one-unit change in \(X_t\) on \(Y_t\).

\(\beta_s (s > 0)\): The dynamic multipliers - the effect of a one-unit change in \(X_{t-s}\) on \(Y_t\).

Interpreting the DL Coefficients

\[Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \dots + \beta_q X_{t-q} + u_t\]

  • Short-Run (Cumulative) Multiplier: The cumulative effect after a given number of periods (e.g., \(\beta_0 + \beta_1\) is the cumulative effect after one period).
  • Long-Run (or Total) Multiplier: The total effect of a sustained one-unit change in X on Y. It is the sum of all the \(\beta\) coefficients: \(\sum_{s=0}^{q} \beta_s\).
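A tiny numerical sketch makes the multipliers concrete. The coefficient values below are hypothetical, chosen only for illustration:

```python
# Hypothetical DL(2) coefficients (illustrative, not estimated):
# Y_t = alpha + 0.5*X_t + 0.3*X_{t-1} + 0.1*X_{t-2} + u_t
betas = [0.5, 0.3, 0.1]

impact    = betas[0]              # immediate effect of a unit change in X_t
after_one = betas[0] + betas[1]   # cumulative effect after one period
long_run  = sum(betas)            # total effect of a sustained unit change

print(f"impact = {impact:.1f}, after one period = {after_one:.1f}, "
      f"long run = {long_run:.1f}")
```

The impact multiplier is 0.5, the cumulative effect after one period is 0.8, and a permanently higher X eventually raises Y by 0.9 units.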

The Autoregressive Distributed Lag (ARDL) Model

  • ARDL models are a class of models that include lags of both the dependent variable and the explanatory variable(s).

Definition: ARDL Model

An ARDL(p,q) model is defined as follows:

\[Y_t = \alpha + \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=0}^{q} \beta_j X_{t-j} + u_t\]

  • It includes both autoregressive (lags of the DV) and distributed lag (lags of the IV) terms.
  • ARDL models are very flexible and can be used to estimate both short-run and long-run effects.

Interpreting the ARDL Coefficients

  • The individual coefficients represent direct, short-term impacts before feedback effects occur.

  • Impact Multiplier (\(\beta_0\)): The immediate effect of a change in \(X_t\) on \(Y_t\).

    • Derivation: \(\frac{\partial Y_t}{\partial X_t} = \beta_0\)
  • Delay Multipliers (\(\beta_j\)): The effect of past changes in \(X\) on current \(Y\).

    • Interpretation: A one-unit increase in \(X_{t-j}\) is associated with a change of \(\beta_j\) units in today’s \(Y_t\).
  • Autoregressive Coefficients (\(\phi_i\)): Measure the persistence or inertia in the process.

    • Interpretation: \(\phi_i\) captures the proportion of \(Y_{t-i}\) that is carried over to \(Y_t\). It describes how quickly the variable returns to its equilibrium after a shock.

Long-Run Multiplier (LRM) Derivation & Interpretation

  • The LRM shows the total effect on Y after a permanent change in X has fully worked through the system.

  • Derivation (via Equilibrium Condition):

    • In a long-run steady state, we assume variables are constant:
      • \(Y_t = Y_{t-1} = \dots = Y_{eq}\)
      • \(X_t = X_{t-1} = \dots = X_{eq}\)
    • Substitute this into the ARDL equation (assuming \(E[u_t]=0\)):
      • \(Y_{eq} = c + \sum \phi_i Y_{eq} + \sum \beta_j X_{eq}\)
    • Solve for \(Y_{eq}\):
      • \(Y_{eq} (1 - \sum \phi_i) = c + (\sum \beta_j) X_{eq}\)
      • \(Y_{eq} = \frac{c}{1 - \sum \phi_i} + \left( \frac{\sum \beta_j}{1 - \sum \phi_i} \right) X_{eq}\)

Long-Run Multiplier (LRM) Derivation & Interpretation (Cont.)

  • The Long-Run Multiplier (\(\theta\)):

    • The coefficient on \(X_{eq}\) in the long-run equation is the LRM: \[ \theta = \text{LRM} = \frac{\text{Total effect of X}}{\text{Persistence of Y}} = \frac{\sum_{j=0}^{q} \beta_j}{1 - \sum_{i=1}^{p} \phi_i} \]
  • Interpretation: A permanent one-unit increase in \(X\) is associated with a total long-run change of \(\theta\) units in \(Y\), after all dynamic adjustments are complete.

  • The ARDL model powerfully separates the immediate impacts (\(\beta_j\)) from the total equilibrium effect (\(\theta\)), which accounts for the feedback from Y’s own past values.
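As a minimal numerical check of the LRM formula, consider a hypothetical ARDL(1,1) with made-up coefficients (these values are ours, not from the slides):

```python
# Hypothetical ARDL(1,1): Y_t = 0.4 + 0.5*Y_{t-1} + 0.3*X_t + 0.1*X_{t-1} + u_t
phi  = [0.5]        # autoregressive coefficients
beta = [0.3, 0.1]   # distributed-lag coefficients

# theta = (sum of beta_j) / (1 - sum of phi_i)
theta = sum(beta) / (1 - sum(phi))
print(f"LRM = {theta:.2f}")  # LRM = 0.80
```

The impact effect of X is only 0.3, but because half of Y persists into the next period, a permanent unit increase in X ultimately raises Y by 0.4 / 0.5 = 0.8 units.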

Estimating ARDL Models

  • In R, the ARDL package is the standard for estimating these models.
  • Stata has a powerful built-in ardl command (in newer versions) or a widely used user-written one (ssc install ardl). It automates lag selection and estimation.
  • In Python, the statsmodels library has incorporated a dedicated ARDL class, which makes estimation much more straightforward than the previous manual method of creating lagged variables.
Code
library(ARDL)
library(urca) # Contains the example 'denmark' dataset

# Load the data
data(denmark)
# The variables are LRM, LRY, IBO, IDE (inflation)

# Find the optimal lag order automatically
# We model LRM as a function of LRY and IBO, with a max lag of 4 for each
auto_model <- auto_ardl(LRM ~ LRY + IBO, data = denmark, max_order = 4)

# Print the selected orders
print(auto_model$best_order)
## LRM LRY IBO 
##   3   1   0

# Estimate an ARDL model with a manually chosen lag order
# (here ARDL(3, 1, 4), overriding the automatically selected (3, 1, 0))
ardl_model <- ardl(LRM ~ LRY + IBO, data = denmark, order = c(3, 1, 4))

# Get a detailed summary of the model
# This output neatly separates short-run and long-run coefficients
print(summary(ardl_model))
## 
## Time series regression with "ts" data:
## Start = 5, End = 55
## 
## Call:
## dynlm::dynlm(formula = full_formula, data = data, start = start, 
##     end = end)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033673 -0.012617 -0.002257  0.008639  0.073178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.83728    0.68121   2.697 0.010189 *  
## L(LRM, 1)    0.46368    0.14626   3.170 0.002920 ** 
## L(LRM, 2)    0.56593    0.14483   3.908 0.000351 ***
## L(LRM, 3)   -0.28115    0.11100  -2.533 0.015338 *  
## LRY          0.57978    0.14671   3.952 0.000307 ***
## L(LRY, 1)   -0.36536    0.15667  -2.332 0.024817 *  
## IBO         -1.14595    0.35807  -3.200 0.002688 ** 
## L(IBO, 1)    0.01182    0.61215   0.019 0.984693    
## L(IBO, 2)    0.59817    0.63372   0.944 0.350890    
## L(IBO, 3)   -0.81318    0.56768  -1.432 0.159783    
## L(IBO, 4)    0.37907    0.37942   0.999 0.323763    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02095 on 40 degrees of freedom
## Multiple R-squared:  0.9848, Adjusted R-squared:  0.981 
## F-statistic: 259.3 on 10 and 40 DF,  p-value: < 2.2e-16
Code
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import ARDL, ardl_select_order

# Access the 'denmark' dataset from the R session (the `r` object is
# exposed by reticulate when mixing R and Python chunks)
data = r.denmark
data.index = pd.to_datetime(data['ENTRY'], format='%Y:%m').dt.to_period('Q')
# Select the variables of interest
data_subset = data[['LRM', 'LRY', 'IBO']]

# Assign endogenous (Y) and exogenous (X) variables
endog = data_subset['LRM']
exog = data_subset[['LRY', 'IBO']]

# Find the optimal lag order
# maxlag is for the AR (dependent) part, maxorder is for the DL (independent) part
# aic is the information criterion to use for selection
sel_order = ardl_select_order(
    endog, maxlag=4, exog=exog, maxorder=4, ic='aic', trend='c'
)

# Instantiate the ARDL model with the selected order
# sel_order.ardl_order gives the optimal (p, q1, q2, ...)
ardl_result = sel_order.model.fit()

# Print the summary of the results
# This shows the short-run coefficients
print(ardl_result.summary())
##                               ARDL Model Results                              
## ==============================================================================
## Dep. Variable:                    LRM   No. Observations:                   55
## Model:                  ARDL(3, 1, 0)   Log Likelihood                 132.589
## Method:               Conditional MLE   S.D. of innovations              0.019
## Date:                Wed, 29 Oct 2025   AIC                           -249.178
## Time:                        12:56:12   BIC                           -233.568
## Sample:                    06-30-1974   HQIC                          -243.194
##                          - 03-31-1987                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          1.9372      0.364      5.326      0.000       1.205       2.670
## LRM.L1         0.4330      0.118      3.661      0.001       0.195       0.671
## LRM.L2         0.5434      0.129      4.213      0.000       0.284       0.803
## LRM.L3        -0.2511      0.093     -2.696      0.010      -0.439      -0.063
## LRY.L0         0.6018      0.132      4.574      0.000       0.337       0.867
## LRY.L1        -0.3562      0.141     -2.521      0.015      -0.641      -0.072
## IBO.L0        -1.0497      0.178     -5.883      0.000      -1.409      -0.690
## ==============================================================================

params = ardl_result.params
denominator = 1 - params.filter(like='LRM.L').sum()

# 3. Calculate the sum of coefficients for each independent variable and divide
long_run_effects = pd.Series({
    'LRY': params.filter(like='LRY.L').sum() / denominator,
    'IBO': params.filter(like='IBO.L').sum() / denominator,
    'const': params['const'] / denominator  # Optional: long-run intercept
})

print(long_run_effects)
## LRY      0.893800
## IBO     -3.821235
## const    7.052197
## dtype: float64
Code
* Load a comparable quarterly macro dataset (West German data;
* the Danish data from the R example is not shipped with Stata)
webuse lutkepohl2, clear

* Declare the data as time series
tsset qtr

* Estimate an ARDL model, letting Stata select the best lags up to a max of 4.
* This dataset contains 'ln_consump' (log consumption) and 'ln_inc' (log income),
* so we model consumption as a function of income.
ardl ln_consump ln_inc, maxlags(4)

* After estimation, Stata shows the optimal lags and the short-run results.

* To see the long-run coefficients, use the post-estimation command `estat ec`.
estat ec

Model Selection and Forecasting

What is Forecasting?

Definition: Forecasting

Forecasting means predicting future values of a time series based on its own past and, potentially, other related variables.

  • We use the information available up to time \(t\) to form the best possible guess about the value of the series at a future time \(t+h\).

    • Information Set (\(I_t\)): All known values of the series (and other relevant variables) up to and including time \(t\). \(I_t = \{Y_t, Y_{t-1}, ...\}\)
    • Forecast Horizon (\(h\)): How many steps into the future we are predicting.
      • \(h=1\): One-step-ahead forecast.
      • \(h>1\): Multi-step-ahead forecast.
  • We want to find the conditional expectation of the future value, given our current information.

    \[ \text{Forecast} = E[Y_{t+h} | I_t] \]

Which Model Should I Pick?

  • How to pick a model for forecasting? Recall the standard R-squared: \(R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}\)
  • Problem: The standard \(R^2\) is a non-decreasing function of the number of included regressors (\(k\)).
    • Adding any variable to a model, even an irrelevant one, will almost always increase the \(R^2\) (it will never decrease).
  • Relying solely on \(R^2\) to compare models is dangerous. It incentivizes “kitchen sink” regressions—throwing in as many variables as possible to maximize the fit on the sample data. This often leads to:
    • Overfitting: The model captures random noise in the sample rather than the true underlying relationship.
    • Poor out-of-sample prediction: An overfit model performs poorly on new data.
    • Lack of Parsimony: We prefer simpler, more elegant models (Occam’s razor).

The Adjusted R-squared

  • The Adjusted R-squared modifies the standard \(R^2\) by incorporating a penalty for each additional regressor. It adjusts the RSS and TSS by their respective degrees of freedom.

Definition: Adjusted \(R^2\)

\[ R^2_{adj} = 1 - \frac{RSS / (n - k - 1)}{TSS / (n - 1)} \]

Where \(n\) = number of observations, and \(k\) = number of predictors (independent variables)

This can also be expressed in terms of the standard \(R^2\):

\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - k - 1} \]

Intuition Behind the Penalty

  • Let’s examine the penalty term: \(\frac{n-1}{n-k-1}\)
    • As you add a new predictor, \(k\) increases by 1.
    • This causes the denominator \((n-k-1)\) to decrease.
    • This makes the entire fraction \(\frac{n-1}{n-k-1}\) larger.
    • Therefore, the adjustment factor \((1 - R^2) \frac{n-1}{n-k-1}\) becomes larger, which imposes a bigger penalty.
  • For \(R^2_{adj}\) to increase when adding a new variable, the increase in the standard \(R^2\) must be substantial enough to outweigh the penalty from losing a degree of freedom.
  • If you add an irrelevant variable, \(R^2\) will increase slightly, but the penalty will dominate, causing \(R^2_{adj}\) to decrease.
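The penalty mechanics can be verified with a few lines of code. The \(R^2\) values and sample size below are illustrative numbers chosen to show the effect, not estimates:

```python
def adj_r2(r2, n, k):
    # R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers: a new regressor lifts R^2 from 0.500 to 0.502,
# but with n = 50 the degrees-of-freedom penalty outweighs the tiny gain.
print(round(adj_r2(0.500, 50, 3), 4))  # 0.4674
print(round(adj_r2(0.502, 50, 4), 4))  # 0.4577
```

Even though the standard \(R^2\) rose, the adjusted \(R^2\) fell from 0.4674 to 0.4577, correctly flagging the extra regressor as not worth its degree of freedom.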

Information Criterion

  • Another way to evaluate a model is by considering its information content.
  • A key feature of Akaike Information Criterion (AIC) is that it balances the goodness of fit of the model with its simplicity.
    • It penalizes models that have more parameters, which helps to prevent overfitting—a scenario where a model is too complex and captures noise in the data rather than the underlying pattern.

Definition: AIC

\[AIC = 2k - 2\ln(L)\]

Where:

  • k is the number of estimated parameters in the model.
  • L is the maximized value of the likelihood function for the model.

Note: For OLS models assuming normal errors, the AIC can be expressed using the RSS: \(AIC = n \cdot \ln(RSS/n) + 2k\)

  • When comparing multiple models, the one with the minimum AIC value is preferred.

Optimal Lags

  • The AIC provides a systematic way to select the optimal number of lags. The general procedure is as follows:

    1. Define a maximum number of lags to consider. This is an upper limit to prevent the model from becoming overly complex.
    2. Fit a series of AR models. Starting from a model with one lag, fit successive AR models, increasing the number of lags by one up to the defined maximum.
    3. Calculate the AIC for each model. For each fitted model, compute the AIC value.
    4. Select the model with the lowest AIC. The number of lags corresponding to the model with the minimum AIC is considered the optimal lag order.
Code
# Load a sample time series dataset
data(sunspot.year)

# Fit an AR model and let the function select the best lag order based on AIC
ar_model <- ar(sunspot.year)

# The optimal number of lags is stored in the 'order' component of the model object
optimal_lags <- ar_model$order

# Print the optimal number of lags
print(paste("Optimal number of lags based on AIC:", optimal_lags))
## [1] "Optimal number of lags based on AIC: 9"

# You can also view the AIC values for different lag orders
print(ar_model$aic)
##          0          1          2          3          4          5          6 
## 500.451015 188.271695  37.689005  31.834674  33.427738  35.353944  28.912532 
##          7          8          9         10         11         12         13 
##  23.654977   9.099444   0.000000   1.973243   3.377547   5.376389   7.146043 
##         14         15         16         17         18         19         20 
##   8.037731   7.970507   9.526070   5.107497   6.785910   8.666711  10.661378 
##         21         22         23         24 
##  10.633330  12.467419  13.087979  14.552615
Code
import pandas as pd
from statsmodels.tsa.ar_model import ar_select_order
import statsmodels.api as sm

# Load a sample time series dataset (e.g., Sunspots data)
# This data is available in statsmodels
data = sm.datasets.sunspots.load_pandas().data
series = data['SUNACTIVITY']

# Use ar_select_order to find the optimal lag
# We specify a maximum lag to consider and the information criterion to use ('aic')
selection = ar_select_order(series, maxlag=20, ic='aic')

# The selected lags are stored in the 'ar_lags' attribute
optimal_lags = selection.ar_lags

# Print the optimal number of lags
print(f"Optimal number of lags based on AIC: {optimal_lags}")
## Optimal number of lags based on AIC: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Code
* Load a sample time series dataset (Sunspot data)
* webuse is Stata's command to load data from the internet
webuse sunspot

* Declare the data as a time series
* The 'year' variable indicates the time period
tsset year

* Use varsoc to get the selection criteria for lags 1 through 20
* Stata will automatically mark the optimal lag for each criterion with a '*'
varsoc sunspot, maxlag(20)

BIC

  • In addition to the AIC, there is also the Bayesian Information Criterion (BIC), which has a different penalty for larger models.

Definition: BIC

\[ BIC = -2 \ln(\hat{L}) + k' \ln(n) \]

Where:

  • \(\ln(\hat{L})\) = The maximized value of the log-likelihood function for the model. This is the goodness-of-fit term.
  • \(k'\) = The number of estimated parameters in the model. (For a standard linear regression, this is the number of predictors \(k\) plus the intercept, so \(k' = k+1\)).
  • \(n\) = The number of observations.
  • \(k' \ln(n)\) = The penalty term.

Note: For OLS models assuming normal errors, the BIC can be expressed using the RSS: \(BIC = n \cdot \ln(RSS/n) + k' \ln(n)\)

AIC vs BIC

  • The Difference: The penalty in BIC is \(k' \ln(n)\), while in AIC it is \(2k'\).
  • Which is stricter? For any sample size \(n \ge 8\), we have \(\ln(n) > 2\).
    • Therefore, BIC imposes a larger penalty for extra parameters than AIC.
    • As a result, BIC has a stronger tendency to select more parsimonious (simpler) models than AIC.
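The crossover point is easy to check numerically (a small sketch of our own, comparing the per-parameter penalties of the two criteria):

```python
import math

# Per extra parameter, AIC adds 2 to the criterion while BIC adds ln(n).
# BIC's penalty exceeds AIC's as soon as ln(n) > 2, i.e. n > e^2 ≈ 7.39.
for n in (5, 8, 100, 1000):
    print(f"n = {n:4d}: BIC penalty per parameter = {math.log(n):.2f} "
          f"vs AIC's 2.00")
```

At n = 5 the BIC penalty (1.61) is still below AIC's, but from n = 8 onward it is larger and keeps growing with the sample size, which is why BIC favours smaller models in typical time-series samples.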

Evaluation

Evaluating Forecasts

  • After we’ve picked a model, we need to know how well it predicts.
  • We do this by comparing the actual, observed values (\(Y_t\)) with our forecasted values (\(\hat{Y}_t\)).

Definition: Forecast Error

The forecast error is the difference between an actual observation and a forecast.

\[ e_t = Y_t - \hat{Y}_t \]

Common Evaluation Metrics

  • Mean Absolute Error (MAE): \(\text{MAE} = \frac{1}{n} \sum_{t=1}^{n} |Y_t - \hat{Y}_t|\)
    • Interpretation: The average absolute size of the errors. Easy to understand and robust to outliers.
  • Mean Squared Error (MSE): \(\text{MSE} = \frac{1}{n} \sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2\)
    • Interpretation: Penalizes large errors much more heavily than small ones due to the squaring. Widely used in statistics.
  • Root Mean Squared Error (RMSE): \(\text{RMSE} = \sqrt{\text{MSE}}\)
    • Interpretation: Same units as the original data, making it more interpretable than MSE.
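The three metrics follow directly from the definitions. The sketch below uses made-up actual and forecast values purely for illustration:

```python
import math

# Illustrative actual vs. forecast values (not from the slides)
actual   = [1.40, 1.10, 0.90]
forecast = [1.25, 1.075, 1.00]

errors = [a - f for a, f in zip(actual, forecast)]
mae  = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse  = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                              # back in original units

print(f"MAE = {mae:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Note how the RMSE exceeds the MAE here: the squaring gives the largest error (0.15) disproportionate weight, which is exactly the property described above.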

Relative and Percentage Errors

  • Sometimes the scale of the data matters. An error of 10 is huge for a series with an average of 5, but tiny for a series with an average of 5,000.

  • Scale-Free Metric:

    • Mean Absolute Percentage Error (MAPE): \(\text{MAPE} = \left( \frac{1}{n} \sum_{t=1}^{n} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right| \right) \times 100\%\)
      • Interpretation: The average percentage error. Very intuitive (“we were off by about 5% on average”).
      • Can be problematic if actual values (\(Y_t\)) are close to or equal to zero.
  • In-Sample vs. Out-of-Sample Evaluation:

    • In-Sample: Evaluating the model on the same data used for training. Tells you about model fit.
    • Out-of-Sample: Splitting your data (e.g., 80% train, 20% test). You train the model on the first part and evaluate its forecasting performance on the part it has never seen.
    • This is the true test of a forecast model!

Forecasting with AR Models

  • Let’s start predicting with a simple AR(p) model.

    \[ Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t \]

  • Produce forecasts by iterative substitution:

    1. One-Step-Ahead Forecast (\(h=1\)):
      • We want \(E[Y_{t+1} | I_t]\). The future error term has an expectation of zero, \(E[\varepsilon_{t+1}|I_t] = 0\).
      • We simply plug in all the known values of Y up to time \(t\): \[ \hat{Y}_{t+1|t} = c + \phi_1 Y_t + \phi_2 Y_{t-1} + \dots + \phi_p Y_{t-p+1} \]
    2. Two-Step-Ahead Forecast (\(h=2\)):
      • We want \(E[Y_{t+2} | I_t]\). The formula is \(Y_{t+2} = c + \phi_1 Y_{t+1} + \dots\)
      • The problem: We don’t know \(Y_{t+1}\)!
      • Solution: We substitute our forecast for it, \(\hat{Y}_{t+1|t}\). \[ \hat{Y}_{t+2|t} = c + \phi_1 \mathbf{\hat{Y}_{t+1|t}} + \phi_2 Y_t + \dots + \phi_p Y_{t-p+2} \]
  • For stationary AR processes, the long-run forecast (as \(h \to \infty\)) will converge to the unconditional mean of the process: \(E[Y] = \frac{c}{1 - \sum \phi_i}\).

Example: AR Forecast

Example: GDP Growth

Let’s assume we are modeling a country’s quarterly GDP growth rate, \(g_t\). After analyzing the data, we have estimated the following AR(1) model:

\[ g_t = 0.20 + 0.70 g_{t-1} + \epsilon_t \]

Where:

  • \(g_t\) is the GDP growth in quarter \(t\).
  • The intercept \(c = 0.20\).
  • The autoregressive coefficient \(\phi_1 = 0.70\).
  • \(\epsilon_t\) is a white noise error term, for which our best forecast is its expected value, \(E[\epsilon_t]=0\).

We are at the end of Quarter 4, 2024 (let’s call this period T). The last observed GDP growth rate was 1.5%.

Our goal is to forecast GDP growth for Q1 2025 (period T+1) and Q2 2025 (period T+2).

Example: AR Forecast (Cont.)

  • The core principle of AR forecasting is to use the model iteratively. For any future period, our best guess for the unknown error term (\(\epsilon_{T+h}\)) is always zero.

Example: GDP Growth (Continued)

A. Forecast for Q1 2025 (1-Step-Ahead Forecast)

To forecast for period T+1, we use the actual, known data from period T. The forecast, denoted with a “hat”, is:

\[ \hat{g}_{T+1} = c + \phi_1 g_T \]

Plugging in our known values:

\[ \hat{g}_{T+1} = 0.20 + 0.70 \times (1.5) \]

\[ \hat{g}_{T+1} = 0.20 + 1.05 = 1.25 \]

Our forecast for Q1 2025 growth is 1.25%.

Example: AR Forecast (Cont.)

Example: GDP Growth (Continued)

B. Forecast for Q2 2025 (2-Step-Ahead Forecast)

To forecast for period T+2, we need the value from the previous period (T+1). Since we are at time T, we do not know the actual value of \(g_{T+1}\). Therefore, we must use our own forecast, \(\hat{g}_{T+1}\), as the input.

\[ \hat{g}_{T+2} = c + \phi_1 \hat{g}_{T+1} \]

Plugging in the forecast we just calculated:

\[ \hat{g}_{T+2} = 0.20 + 0.70 \times (1.25) \]

\[ \hat{g}_{T+2} = 0.20 + 0.875 = 1.075 \]

Our forecast for Q2 2025 growth is 1.075%.

Example: AR Forecast (Cont.)

Example: GDP Growth (Continued)

  • Now, time passes. We observe the actual GDP growth rates for the two quarters we forecasted.
  Period      Forecast (\(\hat{g}\))    Actual (\(g\))
  Q1 2025     1.25%                     1.40%
  Q2 2025     1.075%                    1.10%
  • We can now calculate the Root Mean Squared Error (RMSE) to measure the accuracy of our forecast.

Step 1: Calculate the Forecast Errors (\(e_t = g_t - \hat{g}_t\))

  • Error for Q1 2025: \(e_1 = 1.40 - 1.25 = 0.15\)
  • Error for Q2 2025: \(e_2 = 1.10 - 1.075 = 0.025\)

Example: AR Forecast (Cont.)

Example: GDP Growth (Continued)

Step 2: Square the Errors and Calculate the Mean (MSE)

The Mean Squared Error (MSE) is the average of the squared errors.

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} e_i^2 \]

\[ MSE = \frac{1}{2} \left( (0.15)^2 + (0.025)^2 \right) \]

\[ MSE = \frac{1}{2} \left( 0.0225 + 0.000625 \right) = \frac{0.023125}{2} = 0.0115625 \]

Example: AR Forecast (Cont.)

Example: GDP Growth (Continued)

Step 3: Take the Square Root of the MSE (RMSE)

The Root Mean Squared Error (RMSE) brings the metric back into the original units of the data (in this case, percentage points).

\[ RMSE = \sqrt{MSE} \]

\[ RMSE = \sqrt{0.0115625} \approx 0.1075 \]

Example: AR Forecast (Cont.)

Example: GDP Growth (Continued)

The RMSE for our two-period forecast is 0.1075%.

This value can be interpreted as the typical, or average, magnitude of our forecast error.

It means that, over this forecast horizon, our AR(1) model’s predictions were, on average, off by about 0.11 percentage points.

This single number provides a concise summary of the overall predictive accuracy of the model for this specific forecast period.
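The entire worked example (both forecasts and the RMSE) can be reproduced in a few lines of Python:

```python
import math

# AR(1) model from the example: g_t = 0.20 + 0.70 * g_{t-1}, with g_T = 1.5
c, phi, g_T = 0.20, 0.70, 1.5

g_hat_1 = c + phi * g_T        # one-step-ahead forecast
g_hat_2 = c + phi * g_hat_1    # two-step-ahead: feed the forecast back in
print(round(g_hat_1, 4), round(g_hat_2, 4))  # 1.25 1.075

# RMSE against the realised values 1.40 and 1.10
errors = [1.40 - g_hat_1, 1.10 - g_hat_2]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(round(rmse, 4))  # 0.1075

# As h grows, the forecast converges to the unconditional mean c / (1 - phi)
print(round(c / (1 - phi), 3))  # 0.667
```

The last line illustrates the general result from earlier: with \(\phi_1 = 0.70 < 1\), the process is stationary and long-horizon forecasts revert to the unconditional mean of about 0.667%.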

Forecasting with MA Models

  • Now for an MA(q) model. The “memory” of these models is very different.

    \[ Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \]

    1. One-Step-Ahead Forecast (\(h=1\)):
      • We want \(E[Y_{t+1} | I_t]\). The equation is \(Y_{t+1} = \mu + \varepsilon_{t+1} + \theta_1 \varepsilon_t + \dots\)
      • The only future term is \(\varepsilon_{t+1}\), whose expectation is 0. We can estimate the past error terms from the data (using residuals). \[ \hat{Y}_{t+1|t} = \mu + \theta_1 \varepsilon_t + \dots + \theta_q \varepsilon_{t-q+1} \]
    2. Two-Step-Ahead Forecast (\(h=2\)):
      • \(Y_{t+2} = \mu + \varepsilon_{t+2} + \theta_1 \varepsilon_{t+1} + \dots\)
      • Both \(\varepsilon_{t+2}\) and \(\varepsilon_{t+1}\) are unknown future shocks. Their conditional expectation is 0. \[ \hat{Y}_{t+2|t} = \mu + \theta_2 \varepsilon_t + \dots + \theta_q \varepsilon_{t-q+2} \]
  • For any forecast horizon \(h\) greater than the order \(q\) (\(h > q\)), all the error terms in the equation will be in the future.

    • \(\mathbf{\hat{Y}_{t+h|t} = \mu}\) for all \(h > q\).
    • The forecast quickly reverts to the process mean. MA models have a “short memory.”
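The "short memory" of MA forecasts can be seen numerically. The sketch below uses an MA(2) process with illustrative coefficients and residuals (not values from the lecture's data):

```python
# Sketch: how MA(q) forecasts revert to the mean for h > q.
# Coefficients and residuals are illustrative, not estimated values.
mu = 2.0                      # process mean
theta1, theta2 = 0.6, 0.3     # MA(2) coefficients (q = 2)
eps_t, eps_tm1 = 0.5, -0.2    # residuals at t and t-1 (known at time t)

# h = 1: both lagged shocks are still in the information set
f1 = mu + theta1 * eps_t + theta2 * eps_tm1   # 2.0 + 0.30 - 0.06 = 2.24

# h = 2: only eps_t remains; E[eps_{t+1} | I_t] = 0
f2 = mu + theta2 * eps_t                      # 2.0 + 0.15 = 2.15

# h = 3 > q: every shock in the equation lies in the future
f3 = mu                                       # 2.0 -- back at the mean

print(f1, f2, f3)
```

After exactly q = 2 steps the forecast is the unconditional mean, with no gradual decay.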

Forecasting with ARMA(p,q) Models

  • An ARMA model combines the features of both AR and MA processes.

    \[ Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \]

  • Forecasting combines the two previous methods:

    • For the AR part, we use an iterative process, plugging in past forecasts for unknown future values of \(Y\).
    • For the MA part, the influence of past shocks (\(\varepsilon_t\)) fades and disappears after \(q\) periods.

Example: ARMA(1,1) Forecast

  • \(Y_{t+1} = c + \phi_1 Y_t + \varepsilon_{t+1} + \theta_1 \varepsilon_t\)
    • Forecast \(\hat{Y}_{t+1|t} = c + \phi_1 Y_t + \theta_1 \varepsilon_t\)
  • \(Y_{t+2} = c + \phi_1 Y_{t+1} + \varepsilon_{t+2} + \theta_1 \varepsilon_{t+1}\)
    • Forecast: \(\hat{Y}_{t+2|t} = c + \phi_1 \mathbf{\hat{Y}_{t+1|t}} \quad (\text{The MA term vanishes as } E[\varepsilon_{t+1}|I_t]=0)\)

The AR component gives the forecast a persistent, decaying memory, while the MA component captures the effect of short-term, finite shocks.
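The iteration above can be sketched in code. The parameter values (c, phi, theta) and the last observation are illustrative assumptions, chosen only to show the mechanics:

```python
# Sketch: multi-step ARMA(1,1) forecasts via the iterative scheme.
# All numbers below are illustrative, not estimated values.
c, phi, theta = 0.5, 0.8, 0.4
y_t, eps_t = 2.0, 0.3       # last observation and its estimated residual

def arma11_forecasts(h):
    """Return [Y_{t+1|t}, ..., Y_{t+h|t}]."""
    forecasts = []
    # Step 1 uses the known residual eps_t; the MA term vanishes afterwards.
    f = c + phi * y_t + theta * eps_t
    forecasts.append(f)
    for _ in range(h - 1):
        f = c + phi * f        # pure AR recursion from step 2 onwards
        forecasts.append(f)
    return forecasts

fcs = arma11_forecasts(3)
print(fcs)
# The path decays geometrically toward the mean c / (1 - phi) = 2.5
```

Note how the MA coefficient only affects the first step; from h = 2 onwards the forecast follows the AR recursion alone.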

Forecasting with ARDL(p,q) Models

  • Why limit ourselves to just the past of Y? Other variables might help predict it.

    \[ Y_t = c + \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=0}^{q} \beta_j X_{t-j} + \varepsilon_t \]

  • The process is similar to a standard AR forecast, but with an added component.

    • One-Step-Ahead Forecast (\(h=1\)): \[ \hat{Y}_{t+1|t} = c + \sum_{i=1}^{p} \phi_i Y_{t-i+1} + \sum_{j=0}^{q} \beta_j X_{t-j+1} \]
    • Plug in the known values of past \(Y\) and past \(X\). Note, however, that the \(j=0\) term involves \(X_{t+1}\), which is not yet observed at time \(t\).

Challenge of Multivariate Forecasting

  • To forecast \(Y_{t+1}\), the model requires the value of \(X_{t+1}\) if \(\beta_0 \neq 0\)!

  • To forecast \(Y_{t+h}\), you need values for \(X_{t+h}, X_{t+h-1}, \dots, X_{t+1}\).

  • Solutions:

    1. Build a separate forecast model for X. (e.g., use an ARIMA model to forecast future temperatures).
    2. Create scenarios for X. (e.g., “What will sales be if we assume a 10% increase in our advertising budget (\(X\))?”).
    3. Only use lagged X’s. If \(\beta_0 = 0\), then you only need \(X\) values up to time \(t\) to predict \(Y_{t+1}\), which are known.
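Solution 3 is the simplest to implement: with no contemporaneous term, the one-step forecast uses only values observed by time t. A minimal sketch for an ARDL(1,1) model with \(\beta_0 = 0\) (all numbers illustrative):

```python
# Sketch: one-step ARDL(1,1) forecast when the contemporaneous X term
# is excluded (beta_0 = 0), so every required input is already known.
# Coefficients and data are illustrative, not estimated values.
c = 0.2
phi1 = 0.7         # coefficient on Y_{t-1}
beta1 = 0.5        # coefficient on X_{t-1}

y_t, x_t = 1.8, 1.0   # values observed at time t

# Y_{t+1|t} = c + phi1 * Y_t + beta1 * X_t
y_hat = c + phi1 * y_t + beta1 * x_t
print(y_hat)
```

For horizons beyond h = 1 even this specification needs future X values, so one of the other two solutions (a separate model for X, or scenarios) becomes necessary.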

Takeaways

  • The Foundation is Conditional Expectation: all standard forecasting methods are an attempt to calculate \(E[Y_{t+h} | I_t]\).

  • Test your forecasts on out-of-sample data using metrics like MAE, RMSE, or MAPE to understand their real-world performance.

  • Model Structure dictates forecast behavior:

    • AR(p): Forecasts are iterative and have long memory. They converge gradually (geometrically) to the unconditional mean.
    • MA(q): Forecasts have short memory. They revert to the mean after \(q\) periods.
    • ARMA(p,q): A hybrid approach capturing both long-term persistence and short-term shocks.
    • ARDL(p,q): Incorporates external information but introduces the challenge of needing to forecast your predictors (\(X\) variables).

Other Topics in Time Series Econometrics

  • This lecture has been an introduction. More advanced topics include:

    • Cointegration: A method for analyzing the long-run relationships between non-stationary variables (the proper way to handle trending variables that are truly related).
    • Volatility Modeling (ARCH/GARCH): Modeling the changing variance of a time series, which is crucial in finance.

Summary

What did we do?

  • Univariate Time Series: We looked at ways to analyze a univariate time series, in particular, at:
    • Autoregressive models: Explain a variable using its own past values.
    • Moving Average models: Explain a variable using past random shocks.
  • The Linear Model: We focused on the possibility of using OLS on bivariate time series data.
    • We saw cases (e.g., where \(X_t\) is stationary) in which it is valid to estimate \(Y_t = \alpha + \beta X_t + u_t\) by OLS.
    • Stationary errors are a good indication that this specification is justified.
    • However, if the series are non-stationary, we run into the problem of spurious regression: a misleading apparent relationship between two non-stationary time series.

What did we do? (Cont.)

  • Distributed Lag (DL) Models:
    • To model a more realistic dynamic structure, we focused on models that investigate the dynamic impact of an explanatory variable over time, allowing us to distinguish between short-run and long-run effects. These models make it more likely that the error term is white noise.
  • Autoregressive Distributed Lag Models:
    • A generalization of distributed lag models that also includes autoregressive components, making it possible to separate short-run feedback effects from the long-run impact of \(X\).
  • Forecasting:
    • We saw all of these models in a forecasting setting. We discussed conditional expectation as the main vehicle for forecasting, talked about particular processes and discussed some of the evaluation metrics.

The End