
Lecture 2: The Linear Model II
First two lectures devoted to the linear model.
Prerequisite knowledge:
This lecture:
Material: Wooldridge Chapters 3 and 4
Simple Linear Regression is often inadequate because we can’t control for other factors that might be important. This leads to omitted variable bias.
The solution is to include those other factors in the model.
Definition: Multiple Linear Regression Model
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + u \]
Now we have \(k\) explanatory variables.
The principle is the same: we choose \(\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_k\) to minimize the Sum of Squared Residuals (SSR).
The formulas are complex (usually done with matrix algebra) but are easily handled by software:
- R: lm(y ~ x1 + x2, data = df)
- Stata: reg y x1 x2
- Python: pf.feols("y ~ x1 + x2", data = df), after import pyfixest as pf

In multiple linear regression, our primary goal is often to estimate the causal effect of a specific variable of interest (say \(X_1\)) on an outcome (\(Y\)). The challenge is that in the real world, many factors are changing at once.
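The matrix-algebra solution \(\hat{\beta} = (X'X)^{-1}X'y\) that this software implements can be sketched in a few lines. A minimal illustration with simulated data (the coefficient values and seed are arbitrary, chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
# Hypothetical DGP: beta0 = 1, beta1 = 2, beta2 = -0.5
y = 1.0 + 2.0 * x1 - 0.5 * x2 + u

# OLS in matrix form: beta_hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With \(n = 500\) the estimates land close to the true values used in the simulation.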
The purpose of control variables (\(X_2, X_3, \dots\)) is to statistically “hold constant” other relevant factors that could be confounding our results.
The validity of your estimated causal effect hinges on choosing the right controls. Including the wrong ones can be worse than including none at all.
Example: Good Control Variable
In estimating the effect of education (\(X_1\)) on wages (\(Y\)), innate ability is a classic confounder. Ability is likely correlated with how much education someone pursues and their potential earnings.
Controlling for a proxy of ability (like an IQ test score, \(X_2\)) helps to prevent the estimated return to education from being biased upwards.
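The upward bias from omitting ability can be demonstrated by simulation. A sketch with made-up numbers (ability raises both education and wages; the true return to education is set to 1.0):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)              # ability raises education
wage = 5 + 1.0 * educ + 3 * ability + rng.normal(size=n)  # true return to educ: 1.0

def ols(X, y):
    # closed-form OLS: (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_short = ols(np.column_stack([ones, educ]), wage)           # omits ability
b_long = ols(np.column_stack([ones, educ, ability]), wage)   # controls for ability
# b_short[1] is biased upward; b_long[1] recovers the true 1.0
```

The short regression attributes part of ability's effect to education, exactly the omitted variable bias described above.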
Example: Bad Control
Suppose we study the relationship between having an MSc degree (\(X_1\)) and having a start-up idea (\(Y\)) among people who receive research grants (\(Z\)). Getting a grant (\(Z\)) may be caused by both having an MSc and having a good idea.
If we only look at people who received grants (i.e., control for \(Z\)), we might find a spurious negative correlation. Within the grant-recipient pool, someone with an MSc but a weak idea could get the grant, as could someone without an MSc but a brilliant idea. This creates an artificial trade-off in our selected sample.
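This selection effect (a "collider" problem) is easy to reproduce in a simulation. A sketch under assumed numbers: MSc status and idea quality are generated independently, but the grant depends on both:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
msc = rng.binomial(1, 0.5, size=n)       # having an MSc
idea = rng.normal(size=n)                # idea quality, independent of MSc
grant = (msc + idea > 1.0).astype(int)   # grant depends on both (the collider)

# Unconditionally, MSc and idea quality are (nearly) uncorrelated:
r_all = np.corrcoef(msc, idea)[0, 1]
# Conditioning on receiving a grant induces a negative correlation:
sel = grant == 1
r_grant = np.corrcoef(msc[sel], idea[sel])[0, 1]
```

In the full sample the correlation is essentially zero; within grant recipients it turns clearly negative, even though there is no causal trade-off.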
Definition: (Estimated) Variance of the OLS Estimator (Multivariate)
\[ \widehat{Var}(\hat{\beta}_j) = \frac{\hat{\sigma}^2}{SST_j (1 - R_j^2)} \]
The \(t\)-statistic (Multivariate)
\[ t = \frac{\text{Estimate} - \text{Hypothesized Value}}{\text{Standard Error}} = \frac{\hat{\beta}_j - 0}{se(\hat{\beta}_j)} \]
where the \(se(\hat{\beta}_j)\) is the square root of \(Var(\hat{\beta}_j)\) as presented earlier.
F Test for Overall Significance
This tests whether any of our independent variables have an effect on the dependent variable.
Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u\)
Null Hypothesis (\(H_0\)): None of the independent variables have an effect on \(y\). The model has no explanatory power. \(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
Alternative Hypothesis (\(H_A\)): \(H_0\) not true.
F Statistic: Definition (General)
The F statistic is defined as:
\[ F = \frac{(SSR_{restricted} - SSR_{unrestricted}) / q}{SSR_{unrestricted} / (n - k - 1)} \]
Where:

- \(SSR_{restricted}\) is the SSR from the model estimated under \(H_0\), and \(SSR_{unrestricted}\) is the SSR from the full model,
- \(q\) is the number of restrictions being tested,
- \(n\) is the sample size and \(k\) is the number of regressors in the unrestricted model.
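The formula can be verified by hand: estimate the restricted and unrestricted models, collect their SSRs, and form the ratio. A sketch on simulated data (coefficients chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

def ssr(X, y):
    # sum of squared residuals from OLS of y on X
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

ones = np.ones(n)
ssr_r = ssr(ones[:, None], y)                     # restricted: intercept only
ssr_u = ssr(np.column_stack([ones, x1, x2]), y)   # unrestricted
q = 2  # restrictions tested: beta1 = beta2 = 0
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - 1))
```

Since both slopes are nonzero in the simulated DGP, the resulting F is far above typical critical values and we reject \(H_0\).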
Example: F Distribution

| Feature | t-test | F-test |
|---|---|---|
| Scope | One coefficient at a time | Two or more coefficients at a time |
| Typical Use | Is this specific variable significant? | Is this group of variables jointly significant? OR Is the model as a whole useful? |
| Null Hypothesis | \(H_0: \beta_j = 0\) | \(H_0: \beta_1 = \beta_2 = \dots = 0\) |
| Test Statistic | \(t = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}\) | Compares Restricted vs. Unrestricted sum of squares |
| Key Question | “Does education significantly affect wage, holding other factors constant?” | “Does a person’s work experience, measured by exper and exper^2, jointly affect their wage?” |
The power of dummy variables becomes clear when we add other regressors. Let’s add years of education (Educ) to our model:
\[ Wage_i = \beta_0 + \beta_1 Female_i + \beta_2 Educ_i + u_i \]
We again analyze the regression equation for each group, now holding Educ constant.
The standard linear regression model \(\text{Wage}_i = \beta_0 + \beta_1 \text{Gender}_i + \beta_2 \text{Educ}_i + u_i\) assumes the effect of education on wages (\(\beta_2\)) is identical for men and women.
What if an extra year of education has a different return for females than for males? To allow for this, we must let the slope differ between the groups. We do this by adding an interaction term.
The interaction term is simply the product of the dummy variable and the continuous variable. \(Wage_i = \beta_0 + \beta_1 Female_i + \beta_2 Educ_i + \beta_3 (Female_i \cdot Educ_i) + u_i\)
Once more, we derive the regression line for each group.
The interpretation of the “main effects” (\(\beta_1\) and \(\beta_2\)) changes fundamentally when an interaction term is present.
\(\beta_2\) is the effect of an additional year of education on wages for the reference group (males) only.
\(\beta_1\) is the difference in expected wages between females and males when Educ = 0. This is the difference in the intercepts. This coefficient is often not meaningful on its own if Educ=0 is not a relevant value in the data.
\(\beta_3\) (Interaction Coefficient) is the difference in the slopes between males and females. It measures by how much the effect of an additional year of education differs for females compared to males. This is often the coefficient of primary interest.
Hence, the effect of Educ on Wage for men is \(\beta_2\), and \(\beta_2 + \beta_3\) for women.
Example: Education Gender Interaction
The marginal effect of Education for Females is: \(\frac{\partial E[wage_i | Female_i=1, Educ_i]}{\partial Educ_i} = \beta_2 + \beta_3\)
The wage differential between Females and Males is: \(E[wage|F=1] - E[wage|F=0] = ((\beta_0 + \beta_1) + (\beta_2 + \beta_3)Educ_i) - (\beta_0 + \beta_2 Educ_i) = \beta_1 + \beta_3 Educ_i\). The wage gap is no longer constant; it depends on the level of education.
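These group-specific slopes fall straight out of a regression with the interaction term. A simulation sketch (hypothetical numbers: return to education 1.2 for men, 1.2 + 0.4 for women):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
female = rng.binomial(1, 0.5, size=n)
educ = rng.uniform(8, 18, size=n)
# Hypothetical DGP with an interaction effect of 0.4
wage = 2 + 1.0 * female + 1.2 * educ + 0.4 * female * educ + rng.normal(size=n)

# Regressors: intercept, Female, Educ, Female x Educ
X = np.column_stack([np.ones(n), female, educ, female * educ])
b = np.linalg.solve(X.T @ X, X.T @ wage)

slope_men = b[2]             # effect of Educ for the reference group (males)
slope_women = b[2] + b[3]    # effect of Educ for females
```

The estimated \(\hat{\beta}_2\) recovers the male slope and \(\hat{\beta}_2 + \hat{\beta}_3\) the female slope, as in the derivation above.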
Example: Multiple Dummies with Reference Category
With “West” as the baseline, the model would be:
\[ y_i = \beta_0 + \beta_1 \text{North}_i + \beta_2 \text{South}_i + \beta_3 \text{East}_i + \dots + u_i \]
Where:
The Intercept (\(\beta_0\)) now represents the average value of the dependent variable for the baseline category (the one you omitted).
The Dummy Coefficients (\(\beta_1, \beta_2, \dots\)): Each dummy variable’s coefficient represents the average difference in the dependent variable between that category and the baseline category, holding all other variables constant.
The coefficients do not show the average value of \(y\) for that category directly. They show the difference relative to the baseline.
Statistical Significance: If the \(p\)-value for a dummy variable’s coefficient is statistically significant, it suggests that there is a meaningful difference in the outcome variable between that category and the baseline category.
Scenario: We want to predict an individual’s wage based on their education level, which we have categorized as “High School,” “Bachelors,” and “Masters.” We will use “High School” as our baseline category.
\[ \text{Wage}_i = \beta_0 + \beta_1 \text{Bachelors}_i + \beta_2 \text{Masters}_i + u_i \]
Let’s say we run the regression and get the following results: \[ \widehat{\text{Wage}}_i = 35000 + 15000 \cdot \text{Bachelors}_i + 30000 \cdot \text{Masters}_i \]
Interpretation:

- Intercept (35000): the predicted average wage for the baseline category, High School.
- Bachelors (15000): holding all else constant, people with a Bachelor’s degree earn, on average, 15,000 more than those with only a High School diploma.
- Masters (30000): people with a Master’s degree earn, on average, 30,000 more than the High School baseline.
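As a quick arithmetic check, the fitted equation implies the predicted wage for each group (using the coefficient values from the example above):

```python
# Coefficients from the example regression
b0, b1, b2 = 35000, 15000, 30000

wage_hs = b0        # baseline: High School
wage_ba = b0 + b1   # Bachelors dummy switched on
wage_ma = b0 + b2   # Masters dummy switched on
```

Each dummy coefficient is a difference relative to the baseline, not a group mean itself.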
While your choice of a baseline category changes the interpretation of the individual coefficients, it does not change the overall significance of the categorical variable as a whole.
When you include a set of dummy variables for a single categorical predictor (e.g., “Bachelors” and “Masters” for the “Education” variable), the F-statistic tests the joint null hypothesis that all of these dummy coefficients are equal to zero.
Why it is invariant: The overall fit of the model (e.g., the R-squared) remains identical regardless of which category you choose as the baseline.
While the F-test looks at the variable as a whole, the t-test for each individual dummy coefficient examines a more specific comparison.
Each dummy coefficient (\(\beta_j\)) represents the estimated average difference in the outcome variable between category j and the baseline category.
Example: In our wage regression, the t-test for \(\beta_{\text{Bachelors}}\) answers the question: “Is the average wage for people with a Bachelor’s degree significantly different from the average wage for people with a High School diploma (the baseline)?”
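The invariance of the model's fit to the choice of baseline can be verified directly: re-coding the dummies changes the coefficients but not the residuals. A sketch with simulated category data (group means 35, 50, 65 chosen to echo the wage example):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 900
cat = rng.integers(0, 3, size=n)  # 0 = HS, 1 = BA, 2 = MA
wage = np.array([35.0, 50.0, 65.0])[cat] + rng.normal(size=n)

def fit(X, y):
    # returns OLS coefficients and the SSR
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return b, e @ e

ones = np.ones(n)
Xa = np.column_stack([ones, cat == 1, cat == 2]).astype(float)  # baseline: HS
Xb = np.column_stack([ones, cat == 0, cat == 1]).astype(float)  # baseline: MA
ba, ssr_a = fit(Xa, wage)
bb, ssr_b = fit(Xb, wage)
# Coefficients differ, but the fit (SSR, hence R-squared) is identical.
```

Note also that the coefficients are internally consistent: the HS intercept plus the MA dummy in the first coding equals the MA intercept in the second.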
mlr_model <- lm(wage ~ educ + exper, data = dat_mlr)
summary(mlr_model)
##
## Call:
## lm(formula = wage ~ educ + exper, data = dat_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3544 -1.9212 0.1218 2.1522 7.5290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.82941 2.64978 1.068 0.2883
## educ 1.02959 0.17534 5.872 6.03e-08 ***
## exper 0.16733 0.06647 2.518 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.142 on 97 degrees of freedom
## Multiple R-squared: 0.2825, Adjusted R-squared: 0.2677
## F-statistic: 19.1 on 2 and 97 DF, p-value: 1.016e-07

import statsmodels.api as sm
import pandas as pd
# Create a DataFrame for X (automatically keeps variable names)
X = pd.DataFrame({
'educ': r.dat_mlr['educ'],
'exper': r.dat_mlr['exper']
})
X = sm.add_constant(X) # Adds 'const' column
y = r.dat_mlr['wage']
model = sm.OLS(y, X).fit()
print(model.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: wage R-squared: 0.283
## Model: OLS Adj. R-squared: 0.268
## Method: Least Squares F-statistic: 19.10
## Date: Wed, 29 Oct 2025 Prob (F-statistic): 1.02e-07
## Time: 12:56:16 Log-Likelihood: -254.87
## No. Observations: 100 AIC: 515.7
## Df Residuals: 97 BIC: 523.6
## Df Model: 2
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 2.8294 2.650 1.068 0.288 -2.430 8.088
## educ 1.0296 0.175 5.872 0.000 0.682 1.378
## exper 0.1673 0.066 2.518 0.013 0.035 0.299
## ==============================================================================
## Omnibus: 0.187 Durbin-Watson: 2.158
## Prob(Omnibus): 0.911 Jarque-Bera (JB): 0.359
## Skew: -0.059 Prob(JB): 0.836
## Kurtosis: 2.731 Cond. No. 175.
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Key elements of the output:

- Estimate or coef: These are \(\hat{\beta}_0\) (Intercept), \(\hat{\beta}_1\) (educ) and \(\hat{\beta}_2\) (exper).
- Std. Error: The standard errors of the estimates, \(se(\hat{\beta}_j)\), which measure their sampling uncertainty.
- t value: The t-statistic used for hypothesis testing (Estimate / Std. Error).
- Pr(>|t|): The p-value for the t-test.
- Residual standard error (R only): This is the SER (3.14).
- R-squared: This is our R-squared (0.28). The model explains about 28% of the variation in wages.

Interpreting the coefficients:

- Educ: \(\hat{\beta}_1 \approx 1.02\): Holding experience constant, one more year of education is associated with a €1.02/hr increase in wages, on average.
- Exper: \(\hat{\beta}_2 \approx 0.16\): Holding education constant, one more year of experience is associated with a €0.16/hr increase in wages, on average.

Our goal is to find the variance of the OLS estimator, \(Var(\hat{\beta}_1)\), which is the basis for all statistical inference. Conditional on the regressors \(x_i\), we have:
\[ Var(\hat{\beta}_1) = \frac{1}{\left(\sum (x_i - \bar{x})^2\right)^2} Var\left(\sum_{i=1}^N (x_i - \bar{x})u_i\right) \]
To evaluate the variance of the sum, we must make assumptions about the error terms, \(u_i\). The standard output in R/Python/Stata imposes two key assumptions on the errors: homoskedasticity, \(Var(u_i) = \sigma^2\) for all \(i\), and no correlation across observations, \(Cov(u_i, u_j) = 0\) for \(i \neq j\).
Under homoskedasticity, the variance of \(\hat{\beta_1}\) equals:
\[ Var(\hat{\beta}_1) = \frac{\sigma^2 \sum (x_i - \bar{x})^2}{\left(\sum (x_i - \bar{x})^2\right)^2} = \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \]
The assumption of homoskedasticity (“same scatter”) is often violated in economic data.
Heteroskedasticity occurs if the variance of the error term is not constant across observations.
If we ignore this and use the classical formula, our inference can be severely misleading.
Formal Definition: \(Var(u_i) = \sigma_i^2\). The variance is indexed by i, meaning it can take on a different value for each observation.
Example: Consider a regression of household food expenditure on household income. Low-income households have limited budgets, so their food spending will be tightly clustered around a certain amount (low variance). High-income households have more discretion; some may spend a lot on gourmet food while others spend relatively little, leading to a much wider spread of data points (high variance). The variance of the error term, which captures this deviation from the average, increases with income.
Under heteroskedasticity, the HC-Robust variance equals:
\[ Var(\hat{\beta}_1) = \frac{\sum_{i=1}^N (x_i - \bar{x})^2 \sigma_i^2}{\left(\sum_{i=1}^N (x_i - \bar{x})^2\right)^2} \]
This is the true variance of the OLS estimator in the presence of heteroskedasticity.
The challenge is that we cannot compute the true variance because we do not know the individual error variances, \(\sigma_i^2\).
The insight (White, 1980) is to use the squared OLS residual for each observation, \(\hat{u}_i^2 = (y_i - \hat{y}_i)^2\), as an estimator for the unobserved error variance, \(\sigma_i^2\).
We construct the estimator by taking the correct formula for the variance and “plugging in” \(\hat{u}_i^2\) for \(\sigma_i^2\). This gives the heteroskedasticity-robust variance estimator, also called the White estimator or HC estimator:
\[ \widehat{Var}_{HC}(\hat{\beta}_1) = \frac{\sum_{i=1}^N (x_i - \bar{x})^2 \hat{u}_i^2}{\left(\sum_{i=1}^N (x_i - \bar{x})^2\right)^2} \]
The square root of this value is the heteroskedasticity-robust standard error (or HC-robust SE).
This is a consistent estimator of the standard error; its justification is asymptotic, so it is reliable only in large samples.
- R: feols(y ~ x1 + x2, data = df, vcov = 'hc1')
- Python: pf.feols("y ~ x1 + x2", data = df).vcov("HC1")
- Stata: reg y x1 x2, robust

In many economic datasets, the assumption that \(Cov(u_i, u_j) = 0\) is unrealistic.
Examples are students nested within schools, individuals within states, or firms over time (panel data).
Unobserved factors at the group level can induce correlation among the error terms within that group.
This structure is called clustered data.
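A simulation makes the within-cluster correlation concrete: give every observation in a cluster a shared group-level shock on top of an idiosyncratic one. A sketch with equal variances for both shocks (so the intraclass correlation is about 0.5 by construction):

```python
import numpy as np

rng = np.random.default_rng(8)
G, ng = 50, 20                       # 50 clusters of 20 observations
group = np.repeat(np.arange(G), ng)
alpha = rng.normal(size=G)[group]    # shared cluster-level shock
eps = rng.normal(size=G * ng)        # idiosyncratic shock
u = alpha + eps                      # composite error

# Average covariance between two errors from the same cluster:
U = u.reshape(G, ng)
within_cov = np.mean([np.cov(U[:, i], U[:, j])[0, 1]
                      for i in range(ng) for j in range(ng) if i != j])
icc = within_cov / u.var()           # roughly Var(alpha) / (Var(alpha) + Var(eps))
```

Two errors from the same cluster are clearly positively correlated, which is exactly what the standard OLS variance formula assumes away.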
Let’s introduce notation for clusters. Let \(g\) index the cluster (e.g., a school) and \(i\) index the individual within the cluster. An observation is denoted by \(gi\). Our model is now:
\[ y_{gi} = \beta_0 + \beta_1 x_{gi} + u_{gi} \]
The key feature of clustered data is the assumed error structure: errors may be correlated within a cluster, \(Cov(u_{gi}, u_{gj}) \neq 0\) for observations \(i \neq j\) in the same cluster \(g\), but are uncorrelated across clusters.
If we ignore this structure and use the standard OLS formula, we are incorrectly zeroing out all the covariance terms in the variance calculation.
Procedure Clustered Standard Errors
- R: feols(y ~ x1 + x2, data = df, vcov = ~cluster_variable)
- Python: pf.feols("y ~ x1 + x2", data = df).vcov({"CRV3": "cluster_variable"}).summary()
- Stata: reg y x1 x2, vce(cluster cluster_variable)

We now express the variance of the OLS estimator, \(Var(\hat{\beta}_1)\), in a form recognizing the cluster structure:
\[ \begin{align} Var(\hat{\beta}_1) &= Var \left( \beta_1 + \frac{\sum_{i=1}^N (x_i - \bar{x})u_i}{\sum_{i=1}^N (x_i - \bar{x})^2} \right) \\ &= \frac{1}{\left(\sum (x_i - \bar{x})^2\right)^2} Var\left(\sum_{i=1}^N (x_i - \bar{x})u_i\right) \\ &= \frac{1}{\left(\sum (x_i - \bar{x})^2\right)^2} {Var\left(\sum_{g=1}^G \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi}\right)} \end{align} \]
In the final equality, we recognize that the sum over all observations can be partitioned into the sum over all observations within one cluster (from \(i = 1\) to \(N_g\)), and then summing over each cluster (from \(g = 1\) to \(G\)).
Restating the result from the previous slide:
\[ Var(\hat{\beta}_1) = \frac{1}{\left(\sum_{g,i} (x_{gi} - \bar{x})^2\right)^2} \color{blue}{Var\left(\sum_{g=1}^G \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi}\right)} \qquad(1)\]
where \(G\) is the number of clusters and \(N_g\) is the size of cluster \(g\).
Let’s focus on the variance term (the blue part). We can rewrite the sum over individuals as a sum over clusters of within-cluster sums.
\[ Var\left(\sum_{g=1}^G \left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi} \right) \right) \]
Since errors are uncorrelated across clusters, the variance of this sum is the sum of the variances:
\[ Var\left(\sum_{g=1}^G \left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi} \right) \right) = \sum_{g=1}^G Var\left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi} \right) \]
Now, let’s expand the variance term for a single cluster \(g\). This is where the non-zero covariances appear:
\[ Var\left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi} \right) = \sum_{i=1}^{N_g} (x_{gi} - \bar{x})^2 Var(u_{gi}) + \sum_{i \neq j \in g} (x_{gi} - \bar{x})(x_{gj} - \bar{x}) Cov(u_{gi}, u_{gj}) \]
This expression is the complete variance for cluster \(g\).
We cannot directly calculate the true variance because the error terms \(u_{gi}\) and their variances/covariances are unknown. We must estimate it from the data using the OLS residuals,
\[ \hat{u}_{gi} = y_{gi} - \hat{y}_{gi} \]
Let’s define a score for each observation: \(s_{gi} = (x_{gi} - \bar{x})\hat{u}_{gi}\).
Let the sum of these scores within a cluster be \(s_g = \sum_{i=1}^{N_g} s_{gi} = \sum_{i=1}^{N_g} (x_{gi} - \bar{x})\hat{u}_{gi}\).
The variance expression we derived for a single cluster, \(Var\left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi} \right)\), can be thought of as the expected value of the squared sum, \(E\left[ \left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi} \right)^2 \right]\).
A natural estimator for this quantity is simply the squared sum of the estimated scores for that cluster: \(s_g^2 = \left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})\hat{u}_{gi} \right)^2\).
By summing this quantity over all clusters, we get an estimate of the total variance of the numerator of \(\hat{\beta}_1\):
\[ {\widehat{Var}\left(\sum_{g=1}^G \sum_{i=1}^{N_g} (x_{gi} - \bar{x})u_{gi}\right)} = \sum_{g=1}^G \left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})\hat{u}_{gi} \right)^2 \]
Plugging this back into our main variance formula for \(\hat{\beta}_1\), we get the cluster-robust variance estimator:
\[ \widehat{Var}_C(\hat{\beta}_1) = \frac{1}{\left(\sum_{g,i} (x_{gi} - \bar{x})^2\right)^2} \left[ \sum_{g=1}^G \left( \sum_{i=1}^{N_g} (x_{gi} - \bar{x})\hat{u}_{gi} \right)^2 \right] \]
A small-sample correction factor, \(\frac{G}{G-1}\frac{N-k}{N-1}\), is typically applied, but the core formula above is the key insight.
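The scalar formula above is the slope element of the general matrix "sandwich" estimator \((X'X)^{-1}\left[\sum_g X_g'\hat{u}_g\hat{u}_g'X_g\right](X'X)^{-1}\), and the two can be checked against each other numerically. A sketch on simulated clustered data (no small-sample correction applied, so the two forms should agree exactly):

```python
import numpy as np

rng = np.random.default_rng(9)
G, ng = 40, 25
n = G * ng
group = np.repeat(np.arange(G), ng)
x = rng.normal(size=n) + rng.normal(size=G)[group]  # x correlated within clusters
u = rng.normal(size=n) + rng.normal(size=G)[group]  # errors correlated within clusters
y = 1 + 2 * x + u

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b

# Scalar formula from the text: sum inside the cluster, square, sum over clusters
xd = x - x.mean()
num = sum(np.sum(xd[group == g] * u_hat[group == g]) ** 2 for g in range(G))
var_scalar = num / np.sum(xd**2) ** 2

# Matrix sandwich: (X'X)^{-1} [sum_g X_g'u_g u_g'X_g] (X'X)^{-1}, slope element
meat = sum(np.outer(X[group == g].T @ u_hat[group == g],
                    X[group == g].T @ u_hat[group == g]) for g in range(G))
bread = np.linalg.inv(X.T @ X)
var_matrix = (bread @ meat @ bread)[1, 1]

# Classical variance for comparison (ignores the clustering)
var_naive = (u_hat @ u_hat / (n - 2)) / np.sum(xd**2)
```

With positive within-cluster correlation in both \(x\) and \(u\), the cluster-robust variance is substantially larger than the classical one; ignoring the clustering would overstate precision.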
Let’s compare this to the heteroskedasticity-robust estimator:
\[ \widehat{Var}_W(\hat{\beta}_1) = \frac{1}{\left(\sum_i (x_i - \bar{x})^2\right)^2} \left[ \sum_{i=1}^N (x_i - \bar{x})^2 \hat{u}_i^2 \right] \]
The HC estimator assumes observations are independent but allows their variances to differ. It estimates the variance contribution of each observation \(i\) and sums them up.
The Clustered estimator relaxes the independence assumption within clusters.
It first sums the scores within each cluster (sum inside), squares each cluster total (square outside), and then sums these squared cluster-level totals.