Empirical Economics

Tutorial 1: The Linear Model

Tutorial 1

Design of the Tutorials

Tutorials consist of two parts: a short recapitulation of the lecture, followed by exercises.

Two kinds of exercises, applied and theoretical, mixed in order
Usually about 6/7 exercises in class, others for self-study.
Catching up? Prerequisite practice questions also available on Brightspace.
Policy: tutorial answers released after the tutorials are finished

Role of the teacher:

Recapitulate lecture
Moderate discussion about exercises
Demonstrate and explain exercises
Code policy: show Stata, other languages are self-study
Not debugging code

On exam: no code output - only formatted output like in journals

Recapitulation of the Lecture

The Linear Model

Goal: Use sample data to estimate the true, unknown relationship between variables in a population.

We propose a linear relationship between a dependent variable \(y\) (e.g., wages) and an independent variable \(x\) (e.g., education).
- Population Regression Function (PRF): The true, unobservable relationship is \(y=\beta_0 + \beta_1x+u\). \(\beta_0\) and \(\beta_1\) are the unknown population intercept and slope. \(u\) is the unobserved error term, capturing all other factors affecting \(y\).
- Since we can’t see the whole population, we use a sample to estimate the model. OLS is the method used to find the “best-fitting” line through the sample data.
- OLS finds the specific intercept (\(\hat{\beta}_0\)) and slope (\(\hat{\beta}_1\)) that minimize the sum of the squared residuals (SSR). The residual is the difference between the actual \(y\) value and the value predicted by our line (\(\hat{y}\)).

Estimated Coefficients
- The OLS estimation gives us the Sample Regression Function (SRF): \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\).
- The slope \(\hat{\beta}_1\) is calculated as the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\).
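
The covariance-over-variance formula can be sketched in a few lines of Python. This is only an illustration: the data are invented (years of education and an hourly wage), and the helper name `ols_fit` is hypothetical, not a function from any library. The intercept is obtained from the sample means, so the fitted line passes through \((\bar{x}, \bar{y})\).

```python
# Minimal sketch of simple OLS; the data below are invented for illustration.

def ols_fit(x, y):
    """Return (intercept, slope) of the OLS line for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and sample variance; the (n - 1) divisors cancel in the ratio.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

educ = [8, 10, 12, 14, 16]             # hypothetical years of education
wage = [10.0, 12.5, 13.0, 17.5, 19.0]  # hypothetical hourly wages

b0, b1 = ols_fit(educ, wage)
fitted = [b0 + b1 * xi for xi in educ]
residuals = [yi - fi for yi, fi in zip(wage, fitted)]

print(f"SRF: wage_hat = {b0:.2f} + {b1:.2f} * educ")
print(f"Sum of residuals: {sum(residuals):.10f}")
```

With these invented numbers the slope is 1.15 and the intercept is 0.6, and the residuals sum to zero up to floating-point error.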

Interpretation and Model Fit

Goal: Understand what the estimated coefficients mean and how well the model explains our data.

Interpreting Coefficients: The meaning of \(\hat{\beta}_1\) depends on how the variables are measured (their functional form). This allows for modeling different types of economic relationships.

Model	Equation	Interpretation of `β̂₁`
Level-Level	`y = β₀ + β₁x`	A 1-unit change in `x` is associated with a `β̂₁` unit change in `y`.
Log-Level	`log(y) = β₀ + β₁x`	A 1-unit change in `x` is associated with a `(100 × β̂₁)%` change in `y`.
Level-Log	`y = β₀ + β₁log(x)`	A 1% change in `x` is associated with a `(β̂₁/100)` unit change in `y`.
Log-Log	`log(y) = β₀ + β₁log(x)`	A 1% change in `x` is associated with a `β̂₁%` change in `y` (elasticity).
Dummy Variable	`y = β₀ + β₁D`	`β̂₁` is the average difference in `y` between the group where `D=1` and the reference group (`D=0`).
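
The percentage interpretations in the table are approximations that work well when the coefficient is small. As a rough sketch for the log-level case, the snippet below compares the rule of thumb `100 × β̂₁` with the exact percentage change `100 × (exp(β̂₁) − 1)`. The coefficient values and function names are hypothetical, chosen only to show how the gap grows.

```python
import math

def log_level_exact_pct(beta1):
    """Exact % change in y for a 1-unit increase in x when log(y) = b0 + b1*x."""
    return 100.0 * (math.exp(beta1) - 1.0)

def log_level_approx_pct(beta1):
    """Rule-of-thumb % change from the table: 100 * b1."""
    return 100.0 * beta1

for beta1 in (0.02, 0.08, 0.30):  # hypothetical coefficient values
    approx = log_level_approx_pct(beta1)
    exact = log_level_exact_pct(beta1)
    print(f"b1 = {beta1:.2f}: approx {approx:.1f}%, exact {exact:.1f}%")
```

For a coefficient of 0.08 the two numbers are close (8% versus roughly 8.3%); for 0.30 the approximation understates the exact change by about five percentage points.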

Measuring Goodness-of-Fit: R-squared (R²)
- R² measures the proportion of the total variation in the dependent variable (y) that is explained by the independent variable (x).
- It is a value between 0 and 1. A higher R² indicates a better in-sample fit.
- Caution: The primary goal is often finding the true causal effect, not just maximizing R².
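
As an illustration of the definition, R² can be computed as one minus the ratio of the sum of squared residuals to the total sum of squares. The sketch below uses invented data and hypothetical helper names. It also checks a fact specific to simple regression: R² equals the squared sample correlation between x and y.

```python
# Minimal sketch: R-squared in a simple regression, on invented data.

def r_squared(x, y):
    """R^2 = 1 - SSR/SST for the OLS line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ssr / sst

def squared_correlation(x, y):
    """Squared sample correlation between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

educ = [8, 10, 12, 14, 16]             # hypothetical years of education
wage = [10.0, 12.5, 13.0, 17.5, 19.0]  # hypothetical hourly wages

r2 = r_squared(educ, wage)
corr2 = squared_correlation(educ, wage)
print(f"R-squared: {r2:.4f}, squared correlation: {corr2:.4f}")
```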

Quality of Estimates

Goal: Know when our estimates are reliable and how to test their statistical significance.

Unbiasedness: A Key Property of OLS
- An estimator is unbiased if its expected value equals the true population parameter (E(β̂₁) = β₁).
- This means that, on average across repeated samples, the OLS procedure gives the correct answer. This property holds under the first four SLR assumptions (linearity in parameters, random sampling, sample variation in x, and zero conditional mean).
- Most Critical Assumption: The Zero Conditional Mean assumption (E(u|x) = 0). This implies the unobserved factors in the error term u are not correlated with the explanatory variable x.
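
Unbiasedness is a statement about the average over many samples, which a small simulation can illustrate. The sketch below assumes a made-up population model (intercept 1, slope 2, standard normal errors drawn independently of x) and a hypothetical helper `ols_slope`. It draws 2,000 samples and averages the slope estimates.

```python
import random

def ols_slope(x, y):
    """OLS slope: sum of cross-deviations over sum of squared x-deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

random.seed(0)                       # fixed seed so the run is reproducible
beta0, beta1 = 1.0, 2.0              # assumed "true" population parameters
x = [float(i) for i in range(1, 21)] # regressor values held fixed across samples

slopes = []
for _ in range(2000):
    # Errors are drawn independently of x, so E(u|x) = 0 holds by construction.
    y = [beta0 + beta1 * xi + random.gauss(0.0, 1.0) for xi in x]
    slopes.append(ols_slope(x, y))

mean_slope = sum(slopes) / len(slopes)
print(f"True slope: {beta1}, average estimate: {mean_slope:.4f}")
```

Any single estimate differs from 2, but the average across samples should land very close to it.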
Omitted Variable Bias (OVB)
- The Zero Conditional Mean assumption is violated if we omit a relevant variable from the model that is correlated with x. This leads to Omitted Variable Bias.
- The OLS estimate β̂₁ will be biased, capturing both the effect of x and part of the effect of the omitted variable.
- Example: Estimating the effect of education on wages while omitting “innate ability.” Since ability is plausibly positively correlated with both education and wages, the estimate for the education coefficient will be biased upwards.
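
The education and ability example can be mimicked with simulated data. Everything in this sketch is an assumption made for illustration: the true return to education is set to 0.5, ability raises both education and wages, and the regression of wages on education alone leaves ability in the error term.

```python
import random

def ols_slope(x, y):
    """OLS slope of y on x in a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

random.seed(1)        # fixed seed so the run is reproducible
n = 5000
true_return = 0.5     # assumed true effect of education on wages

# Unobserved ability raises both education and wages (hypothetical numbers).
ability = [random.gauss(0.0, 1.0) for _ in range(n)]
educ = [12.0 + 2.0 * a + random.gauss(0.0, 1.0) for a in ability]
wage = [1.0 + true_return * e + 1.0 * a + random.gauss(0.0, 1.0)
        for e, a in zip(educ, ability)]

# Regressing wage on education alone omits ability from the model.
short_slope = ols_slope(educ, wage)
print(f"True return: {true_return}, estimate omitting ability: {short_slope:.3f}")
```

Under these assumed parameters the estimate settles near 0.9 rather than 0.5, because the education coefficient also picks up part of the effect of ability.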

Hypothesis Testing

We often want to test if a variable has a statistically significant effect on y (e.g., H₀: β₁ = 0).
- We use a t-test for this purpose. The t-statistic is calculated as: t = (β̂₁ - 0) / SE(β̂₁)
- The Standard Error (SE) measures the precision of our estimate β̂₁. A t-statistic that is large in absolute value (with a correspondingly small p-value) indicates that an estimate this far from zero would be unlikely if the null hypothesis were true, allowing us to reject the null hypothesis.
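
As a sketch of where these numbers come from, the snippet below computes the slope, its standard error, and the t-statistic for H₀: β₁ = 0 on invented data. It assumes homoskedastic errors, estimates the error variance as SSR/(n − 2), and uses SE(β̂₁) = sqrt(σ̂² / SST_x). The function name is hypothetical.

```python
import math

def slope_t_statistic(x, y):
    """Return (slope, standard error, t-statistic) for H0: slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sst_x
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ssr / (n - 2)          # two estimated parameters, so n - 2 df
    se_slope = math.sqrt(sigma2_hat / sst_x)
    return slope, se_slope, slope / se_slope

educ = [8, 10, 12, 14, 16]             # hypothetical years of education
wage = [10.0, 12.5, 13.0, 17.5, 19.0]  # hypothetical hourly wages

b1, se_b1, t_stat = slope_t_statistic(educ, wage)
print(f"slope = {b1:.3f}, SE = {se_b1:.3f}, t = {t_stat:.2f}")
```

The resulting t-statistic is then compared with the critical value of the t-distribution with n − 2 degrees of freedom.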

Wooclap

Wooclap Questions

Go to wooclap.com and use the code OFZFSD.

Questions

Housing prices

Consider the following regression output from a model trying to explain the price of a house: price = 300,000 + 1,500 * sqmtr

where price is the sale price of a house in euro, and sqmtr is the interior surface in square meters. The R-squared is 0.64.

Interpret the intercept (300,000) and the slope coefficient (1,500) in plain English.
What does the R-squared value of 0.64 tell us about this model?
If you were to re-estimate the model with price measured in thousands of euros (e.g., a 250,000 euro house becomes 250), what would the new equation be?

Log-Log Model

Suppose you run a log-log regression to analyze the relationship between online ad spending and product sales, and you get the following result:

log(Sales) = 2.1 - 0.85 * log(Ad_Price)

How would you interpret the coefficient -0.85? What is the economic term for this value?

CEO Compensation and Firm Sales

This question explores the relationship between a CEO’s salary and the sales of their firm. In economics, this relationship is often modeled with a log-log specification. You will use the ceosal1 dataset.

Estimate a linear regression model where the natural log of CEO salary (log(salary)) is the dependent variable, and the natural log of firm sales (log(sales)) is the independent variable. Write down the estimated regression equation.
Provide a precise interpretation of the slope coefficient on the log(sales) variable. What does this elasticity value tell you about the relationship between CEO salary and firm sales?
Is the relationship between log-salary and log-sales statistically significant at the 1% level? Justify your answer by referencing the appropriate value from the R model summary output.
What is the R-squared value of this regression? What does this number tell you about how well the model fits the data?

Socioeconomic Factors and Fertility

This question investigates the factors influencing fertility rates in Botswana around 1988. The dataset contains information on fertility (the number of children each woman has) and several socioeconomic indicators for about 4,400 women in Botswana.

Estimate a linear regression model where children is the dependent variable, and catholic is the independent variable. Write out the estimated regression equation.
Provide a precise interpretation of the coefficient on the catholic variable. What do its sign and magnitude imply?
Provide a summary of the model. What is the \(t\)-value corresponding to the catholic variable? What does that imply?

The Algebra of OLS

Consider the following small dataset of 4 observations for variables \(x\) and \(y\):

Observation (i)	y	x
1	2	1
2	7	2
3	6	4
4	9	5

Calculate the sample means, \(\bar{x}\) and \(\bar{y}\).
Using the OLS formula from the lecture, calculate the slope estimator \(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\).
Using the formula \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), calculate the intercept estimator \(\hat{\beta}_0\).
Write down the Sample Regression Function (SRF). Then, calculate the four residuals (\(\hat{u}_1, \hat{u}_2, \hat{u}_3, \hat{u}_4\)) and verify that their sum is zero, confirming one of the algebraic properties of OLS.

Hypothesis Testing and the t-statistic

You estimate a simple regression model to understand the relationship between house prices and house size. Your software produces the following output for a sample of 202 houses:

\[ \widehat{\text{Price}} = 40,000 + 150 \times \text{Size} \] The standard error for the intercept coefficient is 10,000, and the standard error for the Size coefficient is 25. The \(R^2\) is 0.65. Price is in euros and Size is in square meters.

You want to test if house size has a statistically significant effect on price. State the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)) for this test.
Using the information provided, manually calculate the t-statistic for the coefficient on the Size variable. Show your formula and calculation.

Hypothesis Testing and the t-statistic (Cont.)

The lecture notes state that the t-distribution is characterized by its degrees of freedom (df). What are the degrees of freedom for this t-statistic?
For a 5% significance level with this many degrees of freedom, the critical t-value is approximately 1.96. Based on your calculated t-statistic, would you reject or fail to reject the null hypothesis? Explain what this conclusion means in plain language.

Error Term and Residual

What is the fundamental difference between the population error term (\(u_i\)) and the OLS residual (\(\hat{u}_i\))?

Why can we observe one but not the other?

Proving a Fundamental OLS Property

The lecture states that the OLS regression line always passes through the point of sample means, \((\bar{x}, \bar{y})\).

Using the formula for the OLS intercept estimator, \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), prove that this is true.

That is, show that if you plug \(\bar{x}\) into the estimated regression equation, the predicted value \(\hat{y}\) is exactly \(\bar{y}\).

Unbiasedness

The unbiasedness of the OLS estimator, \(E(\hat{\beta}_1) = \beta_1\), is a cornerstone result that relies on the first four SLR assumptions. Let’s prove it. Start with the formula for the slope estimator: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

First, substitute the true population model \(y_i = \beta_0 + \beta_1 x_i + u_i\) into the numerator. Show that the estimator can be rewritten as:

\[ \hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

(Hint: You will need to use the fact that \(\sum(x_i - \bar{x})c = 0\) for any constant \(c\), and in particular that \(\sum(x_i - \bar{x})\bar{y} = 0\). Another useful property is \(\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})y_i\).)

Unbiasedness (Cont.)

Now, take the expectation of your result from part (a) conditional on \(X\) (the set of all \(x_i\) values). Use the Zero Conditional Mean assumption, \(E(u_i|X) = 0\), to prove that \(E(\hat{\beta}_1|X) = \beta_1\).

Marginal Effects

The lecture explains how to interpret coefficients in models with different functional forms.

Use calculus to derive the marginal effect of a change in \(x\) on \(y\) for the following two models:

The Quadratic Model: For the model \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + u\), find the derivative \(\frac{dy}{dx}\). How does your result show that the effect of a one-unit change in \(x\) on \(y\) depends on the current level of \(x\)?
The Level-Log Model: For the model \(y = \beta_0 + \beta_1 \log(x) + u\), show that a 1% change in \(x\) leads to an approximate change of \((\beta_1/100)\) units in \(y\).

(Hint: For (b), recall that a change in \(\log(x)\), i.e., \(d(\log(x))\), is equal to \(\frac{dx}{x}\), which is the proportional change in x.)

Variance of the OLS Estimator

The variance of the OLS estimator in a simple linear regression is given by: \(Var(\hat{\beta}_1) = \frac{\sigma^2}{SST_x}\).

Imagine you are a researcher designing an experiment to find the causal effect of fertilizer (\(x\)) on crop yield (\(y\)).

Using this formula as your guide, what two things could you do in your experimental design to increase the precision of your estimate, \(\hat{\beta}_1\)?

R-squared

Suppose somebody tells you, “My model is great, it has an R-squared of 0.92!”

Why is a high R-squared not necessarily the ultimate goal of an econometric analysis, especially if we are interested in making policy decisions based on one specific variable?

What is often more important than a high R-squared?

OLS Minimization

Why do we use the sum of squared residuals as the criterion to minimize in OLS?

Why not minimize the sum of the absolute values of the residuals, or just the sum of the residuals?

What are the statistical and practical advantages of squaring them?

Polynomials

The lecture introduced polynomial terms (e.g., adding \(x^2\)) to model non-linear relationships.

When might you suspect that a simple linear model is not sufficient and that a quadratic model (like wage on experience and experience²) would be more appropriate?

What would a negative coefficient on the experience² term imply about the relationship between experience and wages?