import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
# Load the dataset from the URL
url = 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/ceosal1.csv'
ceosal1 = pd.read_csv(url, sep=";", decimal=',')
# Create the log-transformed variables
ceosal1['lsalary'] = np.log(ceosal1['salary'])
ceosal1['lsales'] = np.log(ceosal1['sales'])
# Estimate the OLS model
model = smf.ols('lsalary ~ lsales', data=ceosal1).fit()
# Print the summary to see the coefficients
print(model.summary())
Solutions Tutorial 1
Housing prices
The regression model is: price = 300,000 + 1500 * sqmtr, with R² = 0.64.
(a) Interpret the intercept (300,000) and the slope coefficient (1,500) in plain English.
Intercept (300,000): This is the predicted price of a house with 0 square meters of interior surface. In this context, the intercept has no meaningful practical interpretation, as a house cannot have zero square meters. It is a statistical construct that helps position the regression line correctly in the data cloud.
Slope (1,500): For each additional square meter of interior surface, the sale price of a house is predicted to increase by 1,500 euros, holding all other factors constant.
(b) What does the R-squared value of 0.64 tell us about this model?
- An R-squared of 0.64 means that 64% of the total variation in house prices (price) is explained by the variation in the interior surface (sqmtr). The remaining 36% of the variation in price is due to other factors not included in the model (e.g., location, age of the house, number of bedrooms).
(c) If you were to re-estimate the model with price measured in thousands of euros (e.g., a 250,000 euro house becomes 250), what would the new equation be?
- If we divide the dependent variable price by 1,000, we must also divide the entire right-hand side of the equation by 1,000 to maintain the equality. Let price_k be the price in thousands of euros. \[ \frac{\text{price}}{1000} = \frac{300{,}000}{1000} + \frac{1500}{1000} \times \text{sqmtr} \]
- The new equation would be: price_k = 300 + 1.5 * sqmtr
- The interpretation changes accordingly: the intercept is now 300 (thousand euros), and each additional square meter increases the predicted price by 1.5 thousand euros.
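As a quick sanity check of this rescaling rule, a small simulation (with made-up data whose coefficients mimic the exercise) shows that dividing the dependent variable by 1,000 divides both estimated coefficients by 1,000:

```python
import numpy as np

# Hypothetical illustration: rescaling y rescales every coefficient.
# The data below are simulated, not the exercise's actual sample.
rng = np.random.default_rng(0)
sqmtr = rng.uniform(50, 200, 100)
price = 300_000 + 1_500 * sqmtr + rng.normal(0, 20_000, 100)

b1, b0 = np.polyfit(sqmtr, price, 1)             # price in euros
b1_k, b0_k = np.polyfit(sqmtr, price / 1000, 1)  # price in thousands

print(round(b0 / b0_k), round(b1 / b1_k))        # both ratios equal 1000
```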
Log-Log Model
The regression result is: log(Sales) = 2.1 - 0.85 * log(Ad_Price).
How would you interpret the coefficient -0.85? What is the economic term for this value?
- Interpretation: In a log-log model, the coefficient represents an elasticity. A 1% increase in the advertising price (Ad_Price) is associated with a 0.85% decrease in product sales (Sales), on average.
- Economic Term: This value is the price elasticity of demand. Since the absolute value is less than 1 (|-0.85| < 1), we would say that the demand for the product is inelastic with respect to the advertising price.
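The elasticity reading can be verified numerically from the fitted equation; the small gap between 0.85% and the exact figure reflects the usual log approximation (the price level chosen below is arbitrary):

```python
import math

# Predicted sales from the fitted log-log model, at price p and at 1.01*p
sales = lambda p: math.exp(2.1 - 0.85 * math.log(p))

p = 10.0                                   # arbitrary price level
pct_change = 100 * (sales(1.01 * p) / sales(p) - 1)
print(round(pct_change, 3))                # about -0.842, close to -0.85
```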
CEO Compensation and Firm Sales
(a) Estimate a linear regression model and write down the estimated equation.
First, we load the data and estimate the model using the statsmodels library.
library(fixest); library(tidyverse)
url <- 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/ceosal1.csv'
ceosal1 <- read_csv2(url)
model <- feols(log(salary) ~ log(sales), data = ceosal1)
summary(model)
## OLS estimation, Dep. Var.: log(salary)
## Observations: 209
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.821996 0.288340 16.72332 < 2.2e-16 ***
## log(sales) 0.256672 0.034517 7.43617 2.7034e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.501939   Adj. R2: 0.207005
* 1. Import the data from the URL
* The 'clear' option removes any data currently in memory
import delimited "https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/ceosal1.csv", delimiter(";") decimalseparator(,) clear
* 2. Generate the natural log variables
* In Stata, the function for the natural logarithm is ln()
generate ln_salary = ln(salary)
generate ln_sales = ln(sales)
* 3. Run the OLS regression
* The 'regress' command estimates the linear model.
* The first variable is the dependent, followed by the independent(s).
regress ln_salary ln_sales
The estimated regression equation is: \[ \widehat{\text{lsalary}} = 4.8220 + 0.2567 \times \text{lsales} \]
(b) Provide a precise interpretation of the slope coefficient.
Since this is a log-log model, the slope coefficient is an elasticity. The interpretation is:
A 1% increase in firm sales is associated with an estimated 0.257% increase in CEO salary, on average. This suggests that CEO salary increases with firm sales, but at a diminishing rate (a 1% increase in sales leads to a less than 1% increase in salary).
(c) Is the relationship statistically significant at the 1% level?
Yes, the relationship is statistically significant at the 1% level.
To justify this, we look at the p-value for the lsales coefficient in the model summary (listed as P>|t| in Python or Pr(>|t|) in R). The reported p-value is essentially zero (about 2.7e-12, displayed as 0.000 when rounded to three decimals). Since this is far below 0.01, we reject the null hypothesis that there is no relationship between log(sales) and log(salary).
(d) What is the R-squared value of this regression?
The summary reports an adjusted R-squared of 0.207 (the unadjusted R-squared is very close, about 0.21).
This means that approximately 21% of the variation in the natural log of CEO salaries can be explained by the variation in the natural log of firm sales in this model.
Socioeconomic Factors and Fertility
(a) Estimate a linear regression model and write out the estimated equation.
First, we load the fertil2 dataset and estimate the simple linear model.
import pandas as pd
import statsmodels.formula.api as smf
# Load the dataset from the URL
url = 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/fertil2.csv'
fertil2 = pd.read_csv(url, sep=';', decimal=',')
# Estimate the OLS model
# 'children' is the dependent variable, 'catholic' is the independent dummy variable
model_fertility = smf.ols('children ~ catholic', data=fertil2).fit()
# Print the summary
print(model_fertility.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: children R-squared: 0.001
## Model: OLS Adj. R-squared: 0.000
## Method: Least Squares F-statistic: 2.248
## Date: Wed, 29 Oct 2025 Prob (F-statistic): 0.134
## Time: 12:54:32 Log-Likelihood: -9668.3
## No. Observations: 4361 AIC: 1.934e+04
## Df Residuals: 4359 BIC: 1.935e+04
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 2.2849 0.036 64.341 0.000 2.215 2.354
## catholic -0.1663 0.111 -1.499 0.134 -0.384 0.051
## ==============================================================================
## Omnibus: 632.947 Durbin-Watson: 1.974
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 940.741
## Skew: 1.074 Prob(JB): 5.26e-205
## Kurtosis: 3.752 Cond. No. 3.34
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
library(fixest); library(tidyverse)
url <- 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/fertil2.csv'
fertil2 <- read_csv2(url)
model <- feols(children ~ catholic, data=fertil2)
summary(model)
## OLS estimation, Dep. Var.: children
## Observations: 4,361
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.284875 0.035512 64.34054 < 2.2e-16 ***
## catholic -0.166307 0.110922 -1.49931 0.13386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 2.2212   Adj. R2: 2.861e-4
import delimited "https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/fertil2.csv", delimiter(";") decimalseparator(,) clear
regress children catholic
The estimated regression equation is: \[ \widehat{\text{children}} = 2.284 - 0.166 \times \text{catholic} \]
(b) Provide a precise interpretation of the coefficient on the catholic variable.
The catholic variable is a dummy variable (taking a value of 1 if the woman is Catholic and 0 otherwise). The interpretation is as follows:
- The intercept (2.284) represents the average number of children for the reference group (non-Catholic women).
- The slope coefficient (-0.166) represents the estimated difference in the average number of children between Catholics and the rest.
Therefore, the interpretation is: On average, Catholic women are estimated to have 0.166 fewer children than non-Catholic women.
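This "difference in group means" reading is exact, not approximate: with a single 0/1 regressor, OLS reproduces the two group means. A sketch on simulated data (the numbers below are made up and only mimic the structure of fertil2):

```python
import numpy as np

# Simulated stand-in for the fertil2 variables (hypothetical numbers)
rng = np.random.default_rng(1)
catholic = rng.integers(0, 2, 500)
children = rng.poisson(2.3, 500).astype(float)

slope, intercept = np.polyfit(catholic.astype(float), children, 1)

# Intercept = mean of the reference group; intercept + slope = Catholic mean
print(np.isclose(intercept, children[catholic == 0].mean()))          # True
print(np.isclose(intercept + slope, children[catholic == 1].mean()))  # True
```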
(c) Provide a summary of the model. What is the t-value corresponding to the catholic variable? What does that imply?
The full model summary is printed by the code above.
The t-value corresponding to the catholic variable is approximately -1.50.
This t-value implies that the estimated coefficient (-0.166) is about 1.5 standard errors away from zero. Since its absolute value is smaller than the typical critical value of ~1.96 (for a 5% significance level), the difference in the number of children between Catholic and non-Catholic women is not statistically significant. The corresponding p-value (P>|t|) is 0.134, which exceeds 0.05, confirming this conclusion. We do not reject the null hypothesis that there is no difference between the two groups.
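The reported t-value is simply the estimate divided by its standard error; a minimal check using the numbers from the regression output:

```python
# t-statistic for the catholic coefficient (values from the output above)
estimate, std_err = -0.166307, 0.110922
t_value = estimate / std_err
print(round(t_value, 3))   # -1.499
```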
The Algebra of OLS
Consider the following small dataset of 4 observations for variables \(x\) and \(y\):
| Observation (i) | y | x |
|---|---|---|
| 1 | 2 | 1 |
| 2 | 7 | 2 |
| 3 | 6 | 4 |
| 4 | 9 | 5 |
(a) Calculate the sample means, \(\bar{x}\) and \(\bar{y}\).
- Sample mean of y: \(\bar{y} = \frac{2 + 7 + 6 + 9}{4} = \frac{24}{4} = 6\)
- Sample mean of x: \(\bar{x} = \frac{1 + 2 + 4 + 5}{4} = \frac{12}{4} = 3\)
(b) Using the OLS formula, calculate the slope estimator \(\hat{\beta}_1\).
First, we set up a table to calculate the numerator \(\sum (x_i - \bar{x})(y_i - \bar{y})\) and the denominator \(\sum (x_i - \bar{x})^2\).
| \(y_i\) | \(x_i\) | \((y_i - \bar{y})\) | \((x_i - \bar{x})\) | \((x_i - \bar{x})(y_i - \bar{y})\) | \((x_i - \bar{x})^2\) |
|---|---|---|---|---|---|
| 2 | 1 | \(2-6=-4\) | \(1-3=-2\) | \((-4) \times (-2) = 8\) | \((-2)^2 = 4\) |
| 7 | 2 | \(7-6=1\) | \(2-3=-1\) | \(1 \times (-1) = -1\) | \((-1)^2 = 1\) |
| 6 | 4 | \(6-6=0\) | \(4-3=1\) | \(0 \times 1 = 0\) | \((1)^2 = 1\) |
| 9 | 5 | \(9-6=3\) | \(5-3=2\) | \(3 \times 2 = 6\) | \((2)^2 = 4\) |
| Sum | - | - | - | 8 - 1 + 0 + 6 = 13 | 4 + 1 + 1 + 4 = 10 |
Now, we can calculate \(\hat{\beta}_1\): \[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{13}{10} = 1.3 \]
(c) Using the formula, calculate the intercept estimator \(\hat{\beta}_0\).
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 6 - (1.3 \times 3) = 6 - 3.9 = 2.1 \]
(d) Write down the Sample Regression Function (SRF). Then, calculate the four residuals and verify that their sum is zero.
The Sample Regression Function (SRF) is: \[ \hat{y} = 2.1 + 1.3x \]
Now we calculate the predicted values (\(\hat{y}_i\)) and the residuals (\(\hat{u}_i = y_i - \hat{y}_i\)) for each observation:
- Observation 1: \(\hat{y}_1 = 2.1 + 1.3(1) = 3.4 \implies \hat{u}_1 = 2 - 3.4 = -1.4\)
- Observation 2: \(\hat{y}_2 = 2.1 + 1.3(2) = 4.7 \implies \hat{u}_2 = 7 - 4.7 = 2.3\)
- Observation 3: \(\hat{y}_3 = 2.1 + 1.3(4) = 7.3 \implies \hat{u}_3 = 6 - 7.3 = -1.3\)
- Observation 4: \(\hat{y}_4 = 2.1 + 1.3(5) = 8.6 \implies \hat{u}_4 = 9 - 8.6 = 0.4\)
Finally, we verify that the sum of the residuals is zero: \[ \sum \hat{u}_i = (-1.4) + 2.3 + (-1.3) + 0.4 = 0 \] The sum is indeed zero, confirming this fundamental algebraic property of OLS.
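The hand calculation can be replicated in a few lines of numpy:

```python
import numpy as np

# The four observations from the table
x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([2.0, 7.0, 6.0, 9.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

print(round(b1, 4), round(b0, 4))    # 1.3 2.1
print(abs(residuals.sum()) < 1e-12)  # True: the residuals sum to zero
```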
Hypothesis Testing and the t-statistic
You estimate a simple regression model to understand the relationship between house prices and house size. Your software produces the following output for a sample of 202 houses:
\[
\widehat{\text{Price}} = 40,000 + 150 \times \text{Size}
\] The standard error for the intercept coefficient is 10,000, and the standard error for the Size coefficient is 25.
(a) State the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)) for this test.
- Null Hypothesis (\(H_0\)): House size has no effect on house price. The true population slope coefficient is zero. \[ H_0: \beta_{Size} = 0 \]
- Alternative Hypothesis (\(H_A\)): House size has a non-zero effect on house price. \[ H_A: \beta_{Size} \neq 0 \]
(b) Manually calculate the t-statistic for the coefficient on the Size variable.
The formula for the t-statistic is: \[ t = \frac{\text{Estimated Coefficient} - \text{Hypothesized Value}}{\text{Standard Error}} = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \] Plugging in the values: \[ t = \frac{150 - 0}{25} = 6 \] The t-statistic is 6.
(c) What are the degrees of freedom for this t-statistic?
For a simple linear regression, the degrees of freedom are calculated as \(n - 2\), where \(n\) is the sample size. \[ df = 202 - 2 = 200 \] There are 200 degrees of freedom.
(d) For a 5% significance level with this many degrees of freedom, the critical t-value is approximately 1.96. Based on your calculated t-statistic, would you reject or fail to reject the null hypothesis? Explain what this conclusion means in plain language.
Decision: We compare our calculated t-statistic (6) to the critical value (1.96). Since |6| > 1.96, we reject the null hypothesis.
Explanation in Plain Language: Our test result is highly statistically significant. This means that the relationship we observed in our sample (where larger houses are associated with higher prices) is extremely unlikely to have occurred by random chance if there were truly no relationship in the overall population. Therefore, we can confidently conclude that house size has a significant effect on house price.
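The decision rule in parts (b) and (d) amounts to a one-line comparison; sketched in Python with the numbers from the exercise:

```python
# Two-sided t-test at the 5% level (values taken from the exercise)
beta_hat, se, t_crit = 150.0, 25.0, 1.96
t_stat = (beta_hat - 0) / se      # hypothesized value under H0 is 0
reject_h0 = abs(t_stat) > t_crit
print(t_stat, reject_h0)          # 6.0 True
```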
Error Term and Residual
What is the fundamental difference between the population error term (\(u_i\)) and the OLS residual (\(e_i\))? Why can we observe one but not the other?
- Fundamental Difference:
- The population error term (\(u_i\)) is the vertical distance between a data point (\(y_i\)) and the true, unobservable population regression line. It represents all the unobserved factors that affect \(y_i\) besides \(x_i\). \(u_i = y_i - (\beta_0 + \beta_1 x_i)\)
- The OLS residual (\(e_i\) or \(\hat{u}_i\)) is the vertical distance between a data point (\(y_i\)) and the estimated sample regression line. It is the prediction error from our estimated model. \(e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\)
- Why we can’t observe \(u_i\): We cannot observe the population error term \(u_i\) because we do not know the true population parameters \(\beta_0\) and \(\beta_1\). We can only estimate them using a sample of data, which gives us \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Because we can calculate \(\hat{\beta}_0\) and \(\hat{\beta}_1\) from our sample, we can calculate the residual \(e_i\) for each observation.
Proving a Fundamental OLS Property
Using the formula for the OLS intercept estimator, \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), prove that the regression line passes through the point of sample means, \((\bar{x}, \bar{y})\).
The estimated OLS regression line is given by the equation: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)
To show that the line passes through the point \((\bar{x}, \bar{y})\), we need to show that when we plug in \(x=\bar{x}\), the predicted value \(\hat{y}\) is equal to \(\bar{y}\).
Substitute the formula for the intercept, \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), into the regression equation: \(\hat{y} = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 x\)
Now, set \(x = \bar{x}\): \(\hat{y} = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 \bar{x}\)
The terms \(-\hat{\beta}_1 \bar{x}\) and \(+\hat{\beta}_1 \bar{x}\) cancel each other out: \(\hat{y} = \bar{y}\)
This proves that when the input is the sample mean of x, the predicted output is the sample mean of y. Therefore, the OLS regression line always passes through the point of sample means \((\bar{x}, \bar{y})\).
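The algebraic result can also be confirmed numerically, for instance on the four-observation dataset from the OLS algebra exercise:

```python
import numpy as np

# Fit the line and evaluate it at x_bar
x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([2.0, 7.0, 6.0, 9.0])
b1, b0 = np.polyfit(x, y, 1)     # slope, intercept

y_hat_at_xbar = b0 + b1 * x.mean()
print(np.isclose(y_hat_at_xbar, y.mean()))  # True
```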
Unbiasedness
(a) Show that the estimator can be rewritten as: \(\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\)
Start with the formula for \(\hat{\beta}_1\): \[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \] A useful property is \(\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})y_i\). So, \[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})y_i}{\sum (x_i - \bar{x})^2} \]
Substitute the true population model \(y_i = \beta_0 + \beta_1 x_i + u_i\) for \(y_i\): \[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\sum (x_i - \bar{x})^2} \]
Distribute the term \((x_i - \bar{x})\) in the numerator: \[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})\beta_0 + \sum (x_i - \bar{x})\beta_1 x_i + \sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \]
Analyze each term in the numerator:
- \(\sum (x_i - \bar{x})\beta_0 = \beta_0 \sum (x_i - \bar{x}) = \beta_0 \cdot 0 = 0\).
- \(\sum (x_i - \bar{x})\beta_1 x_i = \beta_1 \sum (x_i - \bar{x})x_i\). Using the same property as step 1, \(\sum (x_i - \bar{x})x_i = \sum (x_i - \bar{x})(x_i - \bar{x}) = \sum (x_i - \bar{x})^2\). So this term is \(\beta_1 \sum (x_i - \bar{x})^2\).
- \(\sum (x_i - \bar{x})u_i\) remains as is.
Substitute these back into the expression: \[ \hat{\beta}_1 = \frac{0 + \beta_1 \sum (x_i - \bar{x})^2 + \sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \]
Separate the fraction: \[ \hat{\beta}_1 = \frac{\beta_1 \sum (x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} + \frac{\sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \]
Simplify to get the final result: \[ \hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
(b) Take the conditional expectation… to prove that \(E(\hat{\beta}_1|X) = \beta_1\).
Start with the expression from part (a): \[ \hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \]
Take the expectation of both sides, conditional on \(X = \{x_1, x_2, ..., x_n\}\): \[ E(\hat{\beta}_1|X) = E\left(\beta_1 + \frac{\sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \bigg| X\right) \]
Use the linearity of expectation: \[ E(\hat{\beta}_1|X) = E(\beta_1|X) + E\left(\frac{\sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \bigg| X\right) \]
Analyze each term:
- \(E(\beta_1|X) = \beta_1\) because \(\beta_1\) is a constant.
- For the second term, since we are conditioning on \(X\), all \(x_i\) and \(\bar{x}\) values are treated as non-random. We can pull them outside the expectation: \[ E\left(\frac{\sum (x_i - \bar{x})u_i}{\sum (x_i - \bar{x})^2} \bigg| X\right) = \frac{1}{\sum (x_i - \bar{x})^2} E\left(\sum (x_i - \bar{x})u_i \bigg| X\right) \] \[ = \frac{1}{\sum (x_i - \bar{x})^2} \sum (x_i - \bar{x}) E(u_i | X) \]
Now, use the Zero Conditional Mean assumption, \(E(u_i|X) = 0\). This means the entire second term becomes zero: \[ \frac{1}{\sum (x_i - \bar{x})^2} \sum (x_i - \bar{x}) \cdot 0 = 0 \]
Substitute back into the main equation: \[ E(\hat{\beta}_1|X) = \beta_1 + 0 \] \[ E(\hat{\beta}_1|X) = \beta_1 \] This proves that the OLS slope estimator is unbiased.
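A small Monte Carlo experiment illustrates the result: holding x fixed across repeated samples (the conditioning on X) and drawing fresh errors each time, the OLS slopes average out to the true value. The true model and all numbers below are invented for the simulation:

```python
import numpy as np

# True model: y = 1 + 2x + u, with u ~ N(0, 3^2)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)             # regressor fixed across replications

slopes = []
for _ in range(2000):
    u = rng.normal(0, 3, 50)           # fresh errors each replication
    y = 1 + 2 * x + u
    slope, _ = np.polyfit(x, y, 1)
    slopes.append(slope)

print(round(float(np.mean(slopes)), 1))   # 2.0, matching the true slope
```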
Marginal Effects
(a) The Quadratic Model: For \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + u\), find \(\frac{dy}{dx}\).
- To find the marginal effect of \(x\) on \(y\), we take the partial derivative of \(y\) with respect to \(x\): \[ \frac{\partial y}{\partial x} = \frac{\partial}{\partial x} (\beta_0 + \beta_1 x + \beta_2 x^2 + u) \] \[ \frac{\partial y}{\partial x} = 0 + \beta_1 + 2\beta_2 x + 0 \] \[ \frac{\partial y}{\partial x} = \beta_1 + 2\beta_2 x \]
- This result shows that the marginal effect of a one-unit change in \(x\) on \(y\) is not constant; it depends on the current level of \(x\). For each value of \(x\), the slope of the relationship is different.
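A quick numerical derivative confirms the formula, here for hypothetical coefficient values:

```python
# Quadratic model with made-up coefficients
b0, b1, b2 = 1.0, 0.5, -0.02
f = lambda x: b0 + b1 * x + b2 * x ** 2

x0, h = 10.0, 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central difference
analytic = b1 + 2 * b2 * x0                   # the derived marginal effect
print(round(numeric, 4), round(analytic, 4))  # 0.1 0.1
```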
(b) The Level-Log Model: For \(y = \beta_0 + \beta_1 \log(x) + u\), show that a 1% change in \(x\) leads to an approximate change of \((\beta_1/100)\) units in \(y\).
- First, find the derivative of \(y\) with respect to \(x\): \[ \frac{dy}{dx} = \beta_1 \frac{1}{x} \]
- Rearrange the equation to find an expression for an infinitesimal change in \(y\), \(dy\): \[ dy = \beta_1 \frac{dx}{x} \]
- The term \(\frac{dx}{x}\) represents the proportional or percentage change in \(x\). For discrete changes, we can write this as an approximation: \[ \Delta y \approx \beta_1 \frac{\Delta x}{x} \]
- If we consider a 1% change in \(x\), then \(\frac{\Delta x}{x} = 0.01\).
- Substitute this value into the approximation: \[ \Delta y \approx \beta_1 (0.01) = \frac{\beta_1}{100} \]
- Thus, a 1% change in \(x\) is associated with an approximate change in \(y\) of \((\beta_1/100)\) units.
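A numeric check of the approximation, with an illustrative \(\beta_1\) (the value 50 and the starting point for x are made up):

```python
import math

# Level-log model y = b0 + b1*log(x); effect of a 1% increase in x
b0, b1 = 2.0, 50.0
y = lambda x: b0 + b1 * math.log(x)

x0 = 40.0                           # arbitrary starting value of x
delta_y = y(1.01 * x0) - y(x0)      # exact change from a 1% increase
print(round(delta_y, 3), b1 / 100)  # 0.498 vs the approximation 0.5
```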
Variance of the OLS Estimator
What two things could you do to increase the precision of your estimate, \(\hat{\beta}_1\)?
The variance formula is \(Var(\hat{\beta}_1) = \frac{\sigma^2}{SST_x}\). To increase precision, we need to decrease this variance.
Decrease the error variance (\(\sigma^2\)): \(\sigma^2\) is the variance of the unobserved factors, \(u\). In an experimental setting, this means making the experimental conditions as controlled and uniform as possible. For example, ensure all crop plots have the same soil type, water access, and sunlight exposure. By minimizing the influence of other factors, you reduce the “noise” in the model, making the relationship between fertilizer and yield clearer.
Increase the Total Sum of Squares of x (\(SST_x\)): \(SST_x = \sum(x_i - \bar{x})^2\). This term measures the total variation in the explanatory variable. In your experiment, this means you should use a wider range of fertilizer amounts (\(x\)) across your different plots. Intuitively, it is easier to detect a trend line if the points are spread far apart horizontally than if they are all bunched together. More variation in \(x\) provides more information to pin down the slope of the regression line.
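The second point can be demonstrated by simulation: holding the error variance fixed, a wider spread of x (larger \(SST_x\)) produces a markedly less variable slope estimate. All numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def slope_sd(x, n_sims=1000, sigma=2.0):
    """Standard deviation of the OLS slope across simulated samples."""
    slopes = [np.polyfit(x, 1 + 2 * x + rng.normal(0, sigma, x.size), 1)[0]
              for _ in range(n_sims)]
    return float(np.std(slopes))

x_narrow = np.linspace(4.5, 5.5, 30)   # little variation in x (small SST_x)
x_wide = np.linspace(0.0, 10.0, 30)    # lots of variation in x (large SST_x)

sd_narrow, sd_wide = slope_sd(x_narrow), slope_sd(x_wide)
print(sd_narrow > sd_wide)             # True: more spread in x, more precision
```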
R-squared
Why is a high R-squared not necessarily the ultimate goal? What is often more important?
- A high R-squared is not the ultimate goal because it only measures goodness-of-fit, not causal validity. A model can have a very high R-squared but still suffer from severe omitted variable bias, making its coefficients unreliable for policy decisions. For example, a model predicting crime rates using ice cream sales might have a high R-squared in the summer, but the relationship is spurious.
- What is often more important is obtaining an unbiased and consistent estimate of a specific coefficient that represents a causal effect of interest. For policy, we need to know the true causal impact of changing a variable (e.g., years of education, police funding, carbon tax). This requires a model specification that is theoretically sound and minimizes biases (like OVB), even if it results in a lower R-squared. Unbiasedness is usually more important than fit.
OLS Minimization
Why do we use the sum of squared residuals? Why not absolute values or just the sum?
Why not the sum of residuals? Minimizing \(\sum e_i\) is not a useful criterion. An infinite number of lines can make this sum equal to zero (any line passing through the point of means, \((\bar{x}, \bar{y})\)), so it does not yield a unique solution.
Why we use the sum of squared residuals:
- Treats Positive/Negative Errors Equally: Squaring makes all errors positive, so large positive errors and large negative errors are treated as equally “bad”.
- Penalizes Large Errors More: Squaring gives much more weight to large errors than to small ones (e.g., an error of 2 becomes 4, but an error of 10 becomes 100). This is often desirable, as it forces the line to fit the bulk of the data well by avoiding large deviations.
- Mathematical Convenience: The sum of squares is a smooth, differentiable function. Using calculus, we can easily derive a unique, closed-form analytical solution for the estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Minimizing the sum of absolute values (Least Absolute Deviations, or LAD) is computationally more complex and may not have a unique solution.
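The non-uniqueness of the "minimize the plain sum" criterion is easy to demonstrate: any line forced through the point of means has residuals that sum to zero, whatever its slope (dataset reused from the OLS algebra exercise):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([2.0, 7.0, 6.0, 9.0])

for slope in (0.0, 1.3, 100.0):               # three very different lines
    intercept = y.mean() - slope * x.mean()   # force the line through means
    resid = y - (intercept + slope * x)
    print(abs(resid.sum()) < 1e-9)            # True every time
```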
Polynomials
When might you suspect a quadratic model would be more appropriate? What would a negative coefficient on the experience² term imply?
- When to Suspect Non-linearity:
- Economic Theory: Theory might suggest a non-linear relationship. For example, the “law of diminishing marginal returns” is common in economics. The effect of experience on wages is likely positive but decreases as one gets more experienced.
- Visual Inspection: A scatterplot of the dependent variable against the independent variable might reveal a curved, parabolic shape rather than a straight line.
- Residual Plots: If you fit a linear model and then plot the residuals against the independent variable, a U-shaped or inverted U-shaped pattern in the residuals suggests that a quadratic term might be missing.
- Interpretation of a negative experience² coefficient:
- In the model \(\text{wage} = \beta_0 + \beta_1 \text{experience} + \beta_2 \text{experience}^2 + u\), if \(\beta_1 > 0\) and \(\beta_2 < 0\), it implies a concave, inverted U-shaped relationship between experience and wage.
- This means that as a person gains their first few years of experience, their wage increases (\(\beta_1\) term dominates). However, the rate of this increase slows down over time (the negative \(\beta_2\) term starts to have more impact). This reflects diminishing marginal returns to experience. Eventually, after a certain point, an additional year of experience might even lead to a decrease in predicted wages (if the person becomes less adaptable or their skills become obsolete).