Empirical Economics

Tutorial 2: The Linear Model II

Recapitulation of the Lecture

Multiple Linear Regression & Hypothesis Testing

The multiple linear regression (MLR) model extends simple regression by including several explanatory variables, which lets us control for confounding factors.

The general form is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + u \]

Core Interpretation: The coefficient \(\beta_j\) measures the effect of a one-unit change in \(x_j\) on \(y\), holding all other explanatory variables constant. This is the principle of ceteris paribus.
Goal of Control Variables: The primary reason to include variables (\(x_2, ..., x_k\)) is to prevent omitted variable bias in our coefficient of interest (\(\beta_1\)). A good control is a confounder—a variable correlated with both \(x_1\) and \(y\).
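The role of a control variable can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the data-generating process, the coefficient values, and the helper `ols` are all hypothetical and chosen only for illustration. It compares a "long" regression that includes the confounder \(x_2\) with a "short" regression that omits it.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5_000

# Hypothetical data-generating process (all numbers are illustrative assumptions)
beta0, beta1, beta2 = 1.0, 2.0, 3.0
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)  # x2 is correlated with x1, so it is a confounder
u = rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + u


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


long_coef = ols(y, x1, x2)   # controls for x2
short_coef = ols(y, x1)      # omits x2
aux_coef = ols(x2, x1)       # auxiliary regression of x2 on x1

print("long  regression slope on x1:", long_coef[1])
print("short regression slope on x1:", short_coef[1])
print("long slope + (x2 coefficient) * (auxiliary slope):",
      long_coef[1] + long_coef[2] * aux_coef[1])
```

In this setup the short-regression slope moves away from the true value of 2 because it also picks up part of the effect of \(x_2\). The last printed line reproduces the short-regression slope from the long-regression and auxiliary-regression estimates, an identity that holds exactly in any sample.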

Multivariate Hypothesis Testing

We use statistical tests to determine if the estimated coefficients are reliably different from zero.

Test	Scope & Purpose	Null Hypothesis (\(H_0\))	Key Question
t-test	Tests a single coefficient.	\(\beta_j = 0\)	Does variable \(x_j\) have a statistically significant effect on \(y\), holding other factors constant?
F-test	Tests multiple coefficients at once. Most often used for the overall significance of the model.	\(\beta_1 = \beta_2 = \dots = \beta_k = 0\)	Does our model, as a whole, have any explanatory power? Are the variables jointly significant?

An F-test can find a group of variables to be jointly significant even if no single variable has a significant t-statistic.
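For reference, the F-test compares the fit of a restricted model (with the null hypothesis imposed) against the unrestricted model. A standard textbook form of the statistic is given below; the symbol \(q\) for the number of restrictions is notation assumed here, and \(n - k - 1\) is the residual degrees of freedom of the unrestricted model:

\[ F = \frac{(SSR_{restricted} - SSR_{unrestricted}) / q}{SSR_{unrestricted} / (n - k - 1)} \]

For the test of overall significance (\(q = k\)), the same statistic can be written in terms of the R-squared of the unrestricted model:

\[ F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \]

A large value of \(F\) means that imposing the restrictions worsens the fit by more than chance alone would explain.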

Interactions

Sometimes, the effect of one variable depends on the level of another. We model this by including an interaction term (the product of the two variables).

Model with interaction: \[ Wage = \beta_0 + \beta_1 \text{Female} + \beta_2 \text{Educ} + \beta_3 (\text{Female} \cdot \text{Educ}) + u \]

\(\beta_2\): The effect of an additional year of education for the reference group (males, where Female=0).
\(\beta_3\): The difference in the effect of education for females compared to males. It measures how the slope of the wage-education profile differs between the two groups.
Total Effect: The effect of education for females is no longer a single coefficient, but the sum \((\beta_2 + \beta_3)\). Similarly, the wage gap between genders, \((\beta_1 + \beta_3 \text{Educ})\), now depends on the level of education.
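The arithmetic behind these three bullet points can be sketched in a few lines. The coefficient values below are hypothetical and chosen only for illustration; the function names are likewise assumptions, not part of the lecture.

```python
# Hypothetical coefficient values, chosen only for illustration
b0, b1, b2, b3 = 5.0, -2.0, 1.0, 0.25  # intercept, Female, Educ, Female*Educ


def predicted_wage(female: int, educ: float) -> float:
    """Predicted wage from the interaction model."""
    return b0 + b1 * female + b2 * educ + b3 * female * educ


def educ_slope(female: int) -> float:
    """Effect of one more year of education: beta_2 + beta_3 * Female."""
    return b2 + b3 * female


def gender_gap(educ: float) -> float:
    """Female-minus-male wage gap at a given education level: beta_1 + beta_3 * Educ."""
    return b1 + b3 * educ


print("education slope, males:  ", educ_slope(0))
print("education slope, females:", educ_slope(1))
print("wage gap at 8 and 12 years of education:", gender_gap(8), gender_gap(12))
```

The gap returned by `gender_gap` equals the difference between the two predicted wages at the same education level, which is why it changes with Educ whenever \(\beta_3 \neq 0\).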

Robust Standard Errors

The default standard errors calculated by statistical software rely on strong assumptions that are often violated in economic data. Using them can lead to incorrect conclusions (e.g., finding a significant effect when none exists).

Problem	Description	When It Occurs	Solution
Heteroskedasticity	The variance of the error term (\(u_i\)) is not constant across observations. \(Var(u_i) = \sigma_i^2\).	Common in cross-sectional data where the scale or volatility of the outcome varies (e.g., spending vs. income).	Use Heteroskedasticity-Robust (HC) Standard Errors. Also known as White or robust standard errors.
Intra-Cluster Correlation	The error terms are correlated for observations within the same group (e.g., students in the same school). \(Cov(u_{gi}, u_{gj}) \neq 0\).	Common in panel data or when data is sampled at a group level (states, schools, villages).	Use Cluster-Robust Standard Errors, clustered at the level where shocks are shared.

Practical Takeaway: In modern empirical work, using HC-robust or cluster-robust standard errors is the default practice to ensure statistical inference is valid.
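To make the idea concrete, the following is a minimal sketch of how classical and heteroskedasticity-robust standard errors are computed, assuming simulated data in which the error spread grows with the regressor. The data-generating process is hypothetical, and the HC1 variant (the sandwich formula with an \(n/(n-k)\) small-sample scaling) is one common choice among several HC estimators.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
n = 2_000

# Simulated heteroskedastic data: the error spread grows with x (illustrative assumption)
x = rng.uniform(1, 10, size=n)
u = rng.normal(scale=0.5 * x)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
k = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical (homoskedastic) variance: sigma^2 * (X'X)^{-1}
sigma2_hat = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# HC1 sandwich: (X'X)^{-1} [sum_i u_i^2 x_i x_i'] (X'X)^{-1}, scaled by n / (n - k)
meat = (X * resid[:, None] ** 2).T @ X
V_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
se_hc1 = np.sqrt(np.diag(V_hc1))

print("coefficients: ", beta_hat)
print("classical SEs:", se_classical)
print("HC1 SEs:      ", se_hc1)
```

The coefficient estimates are identical under both approaches; only the standard errors change. In practice one would rarely code this by hand: in Python's statsmodels, for example, `OLS(...).fit(cov_type="HC1")` requests HC1 standard errors and `cov_type="cluster"` with `cov_kwds={"groups": ...}` requests cluster-robust ones.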

Wooclap

Link to Wooclap quiz: https://app.wooclap.com/OFZFSD/questionnaires/68c70d19411de4235e9a9d81
Code for Wooclap.com: OFZFSD (select Tutorial 2)

Questions

Interpreting Formatted Regression Output

The results of two regression models analyzing housing prices are reported. The first model estimates the house price based on the sqrft (square footage of the house), bdrms (number of bedrooms), and lotsize (size of the parcel of land). The second model predicts the price based on sqrft, bdrms, and whether the house is of colonial style.

In model (1), the coefficient for lotsize is 0.002 and is statistically significant (p < 0.01). What is the practical interpretation of this coefficient in the context of predicting housing prices?

In model (2), the coefficient for the colonial variable is 13.078, but it is not statistically significant. What does this suggest about the relationship between a house being of colonial style and its price, according to this model?

Regression of Housing Price on Characteristics
	(1)	(2)
* p < 0.1, ** p < 0.05, *** p < 0.01
Standard Errors in parentheses.
(Intercept)	-21.770	-21.552
	(29.475)	(31.210)
sqrft	0.123***	0.130***
	(0.013)	(0.014)
bdrms	13.853	12.487
	(9.010)	(10.024)
lotsize	0.002***
	(0.001)
colonial		13.078
		(15.436)
Num.Obs.	88	88
R2	0.672	0.635

Wage Determinants

You are provided with a dataset named SLEEP75.DTA. Your task is to investigate the relationship between hourly wages and several explanatory variables using linear regression. The variables of interest are: lhrwage, the natural logarithm of the hourly wage; educ, years of education; exper, years of potential work experience; union, a binary variable where 1 indicates union membership and 0 indicates non-membership; and male, a binary variable where 1 indicates male and 0 indicates female.

Estimate a multiple linear regression model where lhrwage is the dependent variable and educ, exper, union, and male are the independent variables.
Present the summary of your regression model, including the estimated coefficients, standard errors, t-statistics, and p-values.
Based on your regression output, provide a clear and concise interpretation for the educ and union coefficients.
Perform a hypothesis test for the exper coefficient to determine if it is statistically significant at the 5% significance level. State your null and alternative hypotheses, report the relevant test statistic and p-value from your model output, and conclude whether you reject or fail to reject the null hypothesis. What does this conclusion imply about the relationship between experience and the logarithm of hourly wage in this model?

Heteroskedasticity

Using the same multiple linear regression model from the previous question, where lhrwage is regressed on educ, exper, union, and male, you will now investigate the presence and consequences of heteroskedasticity.

After estimating the OLS model, obtain the residuals.¹
Create a scatter plot with the values of exper on the x-axis and the squared residuals on the y-axis.
Examine the plot. Does the spread of the squared residuals appear to change as exper increases? Describe the pattern you see and explain why it might suggest the presence of heteroskedasticity.
Re-estimate the model, but this time calculate heteroskedasticity-robust standard errors.
Compare the “normal” OLS standard errors with the robust standard errors for each of the four coefficients (educ, exper, union, and male).

Omitted Variable Bias

This is a challenging but crucial derivation. Suppose the true population model is a multiple regression: \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \] However, you mistakenly estimate a simple regression, omitting \(x_2\): \[ y = \gamma_0 + \gamma_1 x_1 + v \] Let \(\hat{\gamma}_1\) be the OLS estimate from your incorrect (short) regression. Show that the expected value of this estimator is:

\[ E(\hat{\gamma}_1) = \beta_1 + \beta_2 \cdot \delta_1 \]

where \(\delta_1\) is the slope coefficient from an auxiliary regression of the omitted variable (\(x_2\)) on the included variable (\(x_1\)): \(x_2 = \delta_0 + \delta_1 x_1 + \text{error}\).

(Hint: Start with the formula for \(\hat{\gamma}_1\), substitute the true model for y, and then take the expectation. The term \(\beta_2 \cdot \delta_1\) represents the omitted variable bias.)
(Hint: \(\frac{\sum(x_{1i}-\bar{x}_1)x_{2i}}{\sum(x_{1i}-\bar{x}_1)^2} = \frac{\sum(x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2)}{\sum(x_{1i}-\bar{x}_1)^2}\))

Perfect Multicollinearity

The variance of a coefficient estimator in a multiple regression model with two variables (\(x_1, x_2\)) is given by:

\[ Var(\hat{\beta}_1) = \frac{\sigma^2}{SST_1 (1 - R_1^2)} \]

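where \(\sigma^2\) is the variance of the error term, \(SST_1 = \sum (x_{1i} - \bar{x}_1)^2\) is the total sample variation in \(x_1\), and \(R_1^2\) is the R-squared from a regression of \(x_1\) on \(x_2\).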

What does it mean for \(x_1\) and \(x_2\) to have perfect multicollinearity in terms of their relationship?
Analytically, what happens to the value of \(R_1^2\) under perfect multicollinearity?
Using the variance formula, explain mathematically why it is impossible to calculate the OLS estimate \(\hat{\beta}_1\) in this scenario. What happens to the variance of the estimator?

Zero Conditional Mean

The lecture states that the Zero Conditional Mean assumption (\(E(u|x) = 0\)) is the most crucial assumption for causality.

Explain in your own words what this assumption means.
Using the lecture’s example of wage on education, explain why “innate ability” is a potential unobserved factor (\(u\)) that likely violates this assumption.
If higher ability is positively correlated with both education and wages, in which direction will the OLS estimate for the effect of education on wages (\(\hat{\beta}_1\)) be biased? Explain your reasoning.

Mitigating Omitted Variable Bias

The lecture introduces multiple regression as a way to control for other factors and mitigate omitted variable bias. Let’s return to the wage on education model.

Besides experience (which was added in the lecture), what are two or three other variables you would want to include in the model to get a more credible estimate of the true return to education?

What practical challenges might you face in obtaining data for these variables?

Hypothesis Testing

You are given the following summary output for a regression model that predicts the hourly wage based on years of education (educ) and years of experience (exper). The sample size is \(n=100\).

Variable	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	1.145	1.550	0.739	0.462
educ	1.021	0.095	10.747	< 0.001
exper	0.163	0.041	3.976	< 0.001

Model Fit Statistics:

R-squared: 0.583
F-statistic: 67.5 on 2 and 97 DF
p-value (F-test): < 0.001
Sum of Squared Residuals (SSR): 875.2
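As a quick consistency check on the coefficient table above, each entry in the "t value" column is the estimate divided by its standard error (the t-statistic for \(H_0: \beta_j = 0\)). A minimal sketch, using only the numbers reported in the table; the dictionary name `rows` is just an illustrative choice:

```python
# (estimate, standard error) pairs copied from the coefficient table
rows = {
    "(Intercept)": (1.145, 1.550),
    "educ": (1.021, 0.095),
    "exper": (0.163, 0.041),
}

# t-statistic for H0: beta_j = 0 is estimate / standard error
t_values = {name: round(est / se, 3) for name, (est, se) in rows.items()}

for name, t in t_values.items():
    print(f"{name:12s} t = {t}")
```

The computed values match the reported column up to rounding.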

Hypothesis Testing (Cont.)

A colleague claims that work experience has no effect on wages once you account for education. State the null (\(H_0\)) and alternative (\(H_A\)) hypotheses to test this claim. Based on the output, would you reject the null hypothesis at a 5% significance level? Why?
State the null and alternative hypotheses for the F-test of overall significance. What does the p-value of the F-test tell you about the model?
You are asked to test the hypothesis that both education and experience have no effect on wages. If you were to run a “restricted” model for this F-test, what would that regression equation look like? What would its Sum of Squared Residuals (\(SSR_{restricted}\)) be equal to in this specific case? (Hint: Think about what the model predicts if \(\beta_1=0\) and \(\beta_2=0\)).

Interpreting Interaction Effects

A researcher studies the relationship between a person’s age and their annual charitable donations (in €). They want to see if this relationship differs for individuals who are homeowners (is_homeowner = 1) versus those who are not (is_homeowner = 0).

They estimate the following model: \[ \widehat{\text{donations}} = 50 + 150 \cdot \text{is_homeowner} + 5 \cdot \text{age} + 3 \cdot (\text{is_homeowner} \cdot \text{age}) \]

Write the specific regression equation for non-homeowners. What is the estimated effect of an additional year of age on their donations?
Write the specific regression equation for homeowners. What is the estimated effect of an additional year of age on their donations?
Interpret the meaning of the coefficient on the interaction term (\(\hat{\beta}_3 = 3\)).
Calculate the predicted difference in annual donations between a 50-year-old homeowner and a 50-year-old non-homeowner.

Good vs. Bad Controls

A researcher wants to estimate the causal effect of completing a marathon (ran_marathon, a dummy variable) on an individual’s self-reported happiness level (happiness, on a scale of 1-10). The goal is to isolate the psychological or physical boost from the marathon itself.

For each potential control variable below, decide if it is a “good control” (confounder) or a “bad control” (mediator or collider). Justify your reasoning.

fitness_level: The individual’s baseline fitness level measured before they started training for the marathon.
hours_trained: The total number of hours the individual spent training to prepare for the marathon.
has_gym_membership: A dummy variable for whether the individual had a gym membership after completing the marathon. Assume that both the feeling of accomplishment from the marathon and the resulting happiness could lead someone to invest more in their fitness by getting a membership.

Correct Standard Errors

For each of the following research scenarios, choose the most appropriate type of standard error to use and briefly explain your choice. The options are standard (Homoskedastic) OLS SEs, Heteroskedasticity-Robust (HC) SEs, or Cluster-Robust SEs (if you choose this, specify the level of clustering).

Study A: You are analyzing the relationship between household income and electricity consumption using a random sample of 2,000 households from across a country for a single year. You suspect that high-income households have much more variation in their electricity use (e.g., pools, electric cars) than low-income households.
Study B: You are studying the effect of a country’s GDP on its level of CO2 emissions. Your dataset consists of annual data for 150 countries over 30 years (a panel dataset). You believe that unobserved country-specific factors (like environmental policy culture) persist over time.
Study C: You are evaluating the impact of a new teaching method on student exam scores. The experiment was conducted on 5,000 students spread across 100 different schools. All students within a given school were taught by teachers who received similar training and used the same school resources.