Tutorial 1: The Linear Model
Tutorials consist of two parts: a short recapitulation of the lecture, followed by exercises.
Role of the teacher:
On the exam: no raw code output, only formatted output like that in journals.
Goal: Use sample data to estimate the true, unknown relationship between variables in a population.
Goal: Understand what the estimated coefficients mean and how well the model explains our data.
| Model | Equation | Interpretation of β̂₁ |
|---|---|---|
| Level-Level | y = β₀ + β₁x | A 1-unit change in x is associated with a β̂₁ unit change in y. |
| Log-Level | log(y) = β₀ + β₁x | A 1-unit change in x is associated with a (100 × β̂₁)% change in y. |
| Level-Log | y = β₀ + β₁log(x) | A 1% change in x is associated with a (β̂₁/100) unit change in y. |
| Log-Log | log(y) = β₀ + β₁log(x) | A 1% change in x is associated with a β̂₁% change in y (elasticity). |
| Dummy Variable | y = β₀ + β₁D | β̂₁ is the average difference in y between the group where D=1 and the reference group (D=0). |
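As a rough numerical check on the log-level row of the table, the following sketch compares the rule-of-thumb percentage interpretation with the exact change implied by the model. The coefficient values are made up for illustration, and Python is used here only as a convenient calculator; the exercises themselves refer to R.

```python
import math

# Hypothetical log-level slopes, chosen only for illustration.
beta_small = 0.05   # the approximation works well here
beta_large = 0.50   # the approximation starts to break down here


def approx_pct_change(beta, dx=1.0):
    """Rule-of-thumb % change in y for a dx-unit change in x: 100 * beta * dx."""
    return 100.0 * beta * dx


def exact_pct_change(beta, dx=1.0):
    """Exact % change in y implied by log(y) = b0 + beta * x."""
    return 100.0 * (math.exp(beta * dx) - 1.0)


for beta in (beta_small, beta_large):
    print(f"beta = {beta}: approx {approx_pct_change(beta):.2f}%, "
          f"exact {exact_pct_change(beta):.2f}%")
```

The two numbers are close for small coefficients and drift apart as the coefficient grows, so the percentage interpretations in the table are best read as approximations.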
R² measures the proportion of the total variation in the dependent variable (y) that is explained by the independent variable (x).
A higher R² indicates a better in-sample fit.
Goal: Knowing when our estimates are reliable and how to test their statistical significance.
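OLS is unbiased when, on average, the estimator equals the true parameter (E[β̂₁] = β₁).
Unbiasedness requires the zero conditional mean assumption (E(u|x) = 0). This implies the unobserved factors in the error term u are not correlated with the explanatory variable x.
The assumption fails when a factor left in u is correlated with x. This leads to Omitted Variable Bias.
In that case β̂₁ will be biased, capturing both the effect of x and part of the effect of the omitted variable.
Hypothesis tests ask whether x has any effect on y (e.g., H₀: β₁ = 0).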
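The omitted variable bias described above can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the course's R workflow: the data-generating process, the coefficient values, and all variable names are assumptions made for illustration. With these particular numbers, theory puts the biased slope near 2 + 3 × 0.5 = 3.5 rather than the true value of 2.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible


def ols_slope(x, y):
    """Simple-regression OLS slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


# Hypothetical data-generating process (all numbers are assumptions):
#   y = 1 + 2*x + 3*z + e, where z is unobserved and ends up in the error term.
n = 10_000
true_beta1 = 2.0
z = [random.gauss(0, 1) for _ in range(n)]
x_correlated = [zi + random.gauss(0, 1) for zi in z]    # x depends on z
x_independent = [random.gauss(0, 1) for _ in range(n)]  # x unrelated to z


def simulate_y(x):
    """Generate y from the assumed model, with z entering alongside x."""
    return [1 + true_beta1 * xi + 3 * zi + random.gauss(0, 1)
            for xi, zi in zip(x, z)]


# Regress y on x alone in both scenarios, leaving z out of the regression.
slope_biased = ols_slope(x_correlated, simulate_y(x_correlated))
slope_ok = ols_slope(x_independent, simulate_y(x_independent))

print(f"true slope:                     {true_beta1}")
print(f"z omitted, correlated with x:   {slope_biased:.2f}")
print(f"z omitted, uncorrelated with x: {slope_ok:.2f}")
```

Leaving out z only biases the slope when z is correlated with x; when the two are unrelated, the estimate stays close to the true value even though z is still omitted.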
The t-statistic is t = (β̂₁ - 0) / SE(β̂₁), where SE(β̂₁) is the standard error of β̂₁. A larger t-statistic in absolute value (and a correspondingly small p-value) indicates that the estimated effect is unlikely to be due to random chance, allowing us to reject the null hypothesis.
Consider the following regression output from a model trying to explain the price of a house: price = 300,000 + 1,500 * sqmtr
where price is the sale price of a house in euros, and sqmtr is the interior surface in square meters. The R-squared is 0.64.
Interpret the intercept (300,000) and the slope coefficient (1,500) in plain English.
What does the R-squared value of 0.64 tell us about this model?
If you were to re-estimate the model with price measured in thousands of euros (e.g., a 250,000 euro house becomes 250), what would the new equation be?
Suppose you run a log-log regression to analyze the relationship between online ad spending and product sales, and you get the following result:
log(Sales) = 2.1 - 0.85 * log(Ad_Price)
How would you interpret the coefficient -0.85? What is the economic term for this value?
This question explores the relationship between a CEO’s salary and the sales of their firm. In economics, this relationship is often modeled with a log-log specification. You will use the ceosal1 dataset.
Estimate a linear regression model where the natural log of CEO salary (log(salary)) is the dependent variable, and the natural log of firm sales (log(sales)) is the independent variable. Write down the estimated regression equation.
Provide a precise interpretation of the slope coefficient on the log(sales) variable. What does this elasticity value tell you about the relationship between CEO salary and firm sales?
Is the relationship between log-salary and log-sales statistically significant at the 1% level? Justify your answer by referencing the appropriate value from the R model summary output.
What is the R-squared value of this regression? What does this number tell you about how well the model fits the data?
This question investigates the factors influencing fertility rates in Botswana around 1988. The dataset contains information on fertility (the number of children each woman has) and several potential socioeconomic indicators for about 4400 women in Botswana.
Estimate a linear regression model where children is the dependent variable, and catholic is the independent variable. Write out the estimated regression equation.
Provide a precise interpretation of the coefficient on the catholic variable. What do its sign and magnitude imply?
Provide a summary of the model. What is the \(t\)-value corresponding to the catholic variable? What does that imply?
Consider the following small dataset of 4 observations for variables \(x\) and \(y\):
| Observation (i) | y | x |
|---|---|---|
| 1 | 2 | 1 |
| 2 | 7 | 2 |
| 3 | 6 | 4 |
| 4 | 9 | 5 |
Calculate the sample means, \(\bar{x}\) and \(\bar{y}\).
Using the OLS formula from the lecture, calculate the slope estimator \(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\).
Using the formula \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), calculate the intercept estimator \(\hat{\beta}_0\).
Write down the Sample Regression Function (SRF). Then, calculate the four residuals (\(\hat{u}_1, \hat{u}_2, \hat{u}_3, \hat{u}_4\)) and verify that their sum is zero, confirming one of the algebraic properties of OLS.
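Hand calculations like these are easy to get wrong, so a short script can serve as a check once the exercise has been attempted on paper. The sketch below, in plain Python rather than the R referred to elsewhere in this tutorial, applies the same formulas to the four observations; the variable names are illustrative.

```python
# Data from the four-observation table in this exercise.
x = [1, 2, 4, 5]
y = [2, 7, 6, 9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope and intercept from the OLS formulas used in the exercise.
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals: observed y minus the fitted value from the SRF.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

print("means:", x_bar, y_bar)
print("slope, intercept:", round(beta1_hat, 4), round(beta0_hat, 4))
print("residuals:", [round(u, 4) for u in residuals])
print("sum of residuals:", round(sum(residuals), 10))
```

Because of floating-point rounding, the residual sum may print as a tiny number rather than exactly zero.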
You estimate a simple regression model to understand the relationship between house prices and house size. Your software produces the following output for a sample of 202 houses:
\[
\widehat{\text{Price}} = 40{,}000 + 150 \times \text{Size}
\]
The standard error for the intercept coefficient is 10,000, and the standard error for the Size coefficient is 25. The \(R^2\) is 0.65. Price is in euros and Size is in square meters.
Size variable. Show your formula and calculation.
What is the fundamental difference between the population error term (\(u_i\)) and the OLS residual (\(e_i\))?
Why can we observe one but not the other?
The lecture states that the OLS regression line always passes through the point of sample means, \((\bar{x}, \bar{y})\).
Using the formula for the OLS intercept estimator, \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), prove that this is true.
That is, show that if you plug \(\bar{x}\) into the estimated regression equation, the predicted value \(\hat{y}\) is exactly \(\bar{y}\).
The unbiasedness of the OLS estimator, \(E(\hat{\beta}_1) = \beta_1\), is a cornerstone result that relies on the first four SLR assumptions. Let’s prove it. Start with the formula for the slope estimator: \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
First, substitute the true population model \(y_i = \beta_0 + \beta_1 x_i + u_i\) into the numerator. Show that the estimator can be rewritten as:
\[ \hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
(Hint: You will need to use the fact that \(\sum(x_i - \bar{x})c = 0\) for any constant \(c\), and that \(\sum(x_i - \bar{x})\bar{y} = 0\). Another useful property is \(\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})y_i\).)
The lecture explains how to interpret coefficients in models with different functional forms.
Use calculus to derive the marginal effect of a change in \(x\) on \(y\) for the following two models:
(a) The Quadratic Model: For the model \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + u\), find the derivative \(\frac{dy}{dx}\). How does your result show that the effect of a one-unit change in \(x\) on \(y\) depends on the current level of \(x\)?
(b) The Level-Log Model: For the model \(y = \beta_0 + \beta_1 \log(x) + u\), show that a 1% change in \(x\) leads to an approximate change of \((\beta_1/100)\) units in \(y\).
(Hint: For (b), recall that a change in \(\log(x)\), i.e., \(d(\log(x))\), is equal to \(\frac{dx}{x}\), which is the proportional change in x.)
The variance of the OLS estimator in a simple linear regression is given by: \(Var(\hat{\beta}_1) = \frac{\sigma^2}{SST_x}\).
Imagine you are a researcher designing an experiment to find the causal effect of fertilizer (\(x\)) on crop yield (\(y\)).
Using this formula as your guide, what two things could you do in your experimental design to increase the precision of your estimate, \(\hat{\beta}_1\)?
Suppose somebody tells you, “My model is great, it has an R-squared of 0.92!”
Why is a high R-squared not necessarily the ultimate goal of an econometric analysis, especially if we are interested in making policy decisions based on one specific variable?
What is often more important than a high R-squared?
Why do we use the sum of squared residuals as the criterion to minimize in OLS?
Why not minimize the sum of the absolute values of the residuals, or just the sum of the residuals?
What are the statistical and practical advantages of squaring them?
The lecture introduced polynomial terms (e.g., adding \(x^2\)) to model non-linear relationships.
When might you suspect that a simple linear model is not sufficient and that a quadratic model (like wage on experience and experience²) would be more appropriate?
What would a negative coefficient on the experience² term imply about the relationship between experience and wages?
Empirical Economics: Tutorial - The Linear Model