Empirical Economics

Tutorial 0: Statistics and Probability

Introduction

This Week

We set up Python/Stata/R.

We focus on downloading and importing datasets.

We cover statistics and probability, both “on pen and paper” and “with the help of your laptop”.

These are important building blocks for the linear model (weeks 1 and 2), which in turn is an important building block for the more complicated models later in the course.

Setting Up

This guide will walk you through installing a powerful and popular setup for Python programming. We’ll use Miniconda to manage our Python environment and Visual Studio Code (VS Code) as our code editor.

Miniconda, a lightweight version of Anaconda, makes managing Python packages (like pandas for data or matplotlib for plots) easy and avoids conflicts between them.
VS Code is a free, modern, and highly customizable code editor that works well with Python.

Follow the steps for your operating system to set up your environment.

On Windows

Step 1: Install Miniconda (Python)

Go to the Miniconda download page: https://docs.conda.io/en/latest/miniconda.html
Download the latest Python 3.x Windows 64-bit installer (it will be a .exe file).
Run the installer. When prompted, we recommend these settings:
- Install for: “Just Me” (the default).
- Destination Folder: Leave as the default.
- Advanced Options: Leave “Add Miniconda3 to my PATH” unchecked. The recommended option is to use Miniconda from the “Anaconda Prompt.”

Step 2: Install Visual Studio Code (VS Code)

Go to the VS Code download page: https://code.visualstudio.com/
Click the big blue “Download for Windows” button.
Run the installer. Accept the license agreement and click Next on all screens. We recommend keeping the default settings, especially “Add to PATH”, which should be checked by default.

On macOS

Step 1: Install Miniconda (Python)

Go to the Miniconda download page: https://docs.conda.io/en/latest/miniconda.html
Download the installer for your Mac’s chip:
- Apple M1/M2/M3: Download the “macOS Apple M1 64-bit pkg” installer.
- Intel Chip: Download the “macOS Intel x86 64-bit pkg” installer.
Run the installer (.pkg file). Click Continue and Agree through the prompts, using the default settings. It will install for the current user.

Step 2: Install Visual Studio Code (VS Code)

Go to the VS Code download page: https://code.visualstudio.com/
Click the “Download Mac Universal” button.
This will download a .zip file. Unzip it to get the Visual Studio Code.app.
Drag Visual Studio Code.app into your Applications folder. This makes it easy to find and launch.

Step 3: Setting up VS Code for Python

Now that everything is installed, let’s connect them.

Open VS Code.
Install the Python Extension:
- Click on the Extensions icon on the left-hand sidebar (it looks like four squares).
- In the search bar, type Python.
- Find the one by Microsoft and click the Install button.
Select your Conda Python Interpreter:
- Open the Command Palette by pressing Ctrl+Shift+P (on Windows) or Cmd+Shift+P (on Mac).
- Type Python: Select Interpreter and press Enter.
- A list of available Python versions will appear. Choose the one that says ‘base’: conda. It should point to your Miniconda installation.

You are now set up! VS Code knows where to find your Conda Python installation.

Step 4: Your First Project - Reading a Dataset

Let’s perform our first real-world task: opening a project folder and reading data from a file.

First, create a place for your project on your computer (e.g., in Documents or on your Desktop).

Create a new folder named MyFirstProject.
Inside MyFirstProject, create a new text file and name it main.py.
Inside MyFirstProject, create another new folder named data.
Inside the data folder, create a new text file named students.csv. Open it with a basic text editor (like Notepad or TextEdit) and paste the following content:

StudentID,Name,Score
101,Alice,88
102,Bob,92
103,Charlie,75
104,Diana,95

Your folder structure should look like this:

MyFirstProject/
├── main.py
└── data/
    └── students.csv

Step 5: Open the Project in VS Code

This is a key step! Don’t just open the .py file. Open the entire folder.

In VS Code, go to File > Open Folder….
Navigate to and select your MyFirstProject folder. Click “Open” (or “Select Folder”).
The VS Code sidebar will now show your project files (main.py and the data folder).

Step 6: Install a Package and Write the Code

We need the pandas library to read the .csv file. Let’s install it.

In VS Code, open a new terminal by going to Terminal > New Terminal.
A terminal will appear at the bottom of the screen. You should see (base) at the start of the prompt, indicating your Conda environment is active.
Type the following command and press Enter:

conda install pandas

Type y and press Enter if it asks you to proceed.

Now, click on main.py in the sidebar to open it and paste this code:

# 1. Import the pandas library, which is a powerful tool for data analysis.
# We give it the nickname 'pd' to make it easier to use.
import pandas as pd

# 2. Define the path to our dataset.
# Because we opened the 'MyFirstProject' folder, we can use a relative path.
# This means "look inside the 'data' folder for the 'students.csv' file".
file_path = 'data/students.csv'

# 3. Use pandas to read the CSV file into a data structure called a DataFrame.
print("Reading the dataset...")
student_data = pd.read_csv(file_path)

# 4. Print the first 5 rows of the data to see if it loaded correctly.
print("Here is the top of the dataset:")
print(student_data.head())

# 5. Print a success message!
print("\nSuccessfully imported and displayed the dataset!")

Step 7: Run Your Python Script!

Look for the Play button (a white triangle) in the top-right corner of the VS Code window. Click it.

The Terminal at the bottom will spring to life and you will see the output:

Reading the dataset...
Here is the top of the dataset:
   StudentID     Name  Score
0        101    Alice     88
1        102      Bob     92
2        103  Charlie     75
3        104    Diana     95

Successfully imported and displayed the dataset!

Congratulations! You are ready to start coding.

This guide will help you install the standard toolkit for statistical computing and data science: R and RStudio. You must install R first, then install RStudio. RStudio needs R to function.

Follow the steps for your operating system.

On Windows

Step 1: Install R (The Language)

Go to the official R download site, CRAN (the Comprehensive R Archive Network): https://cran.r-project.org/bin/windows/base/
Click the big link at the top that says “Download R-x.x.x for Windows”.
Run the installer (.exe file).
Use the default settings for all steps. Simply click Next through the installation wizard.

Step 2: Install RStudio (The Dashboard)

Go to the RStudio Desktop download page: https://posit.co/download/rstudio-desktop/
The website should automatically detect you’re on Windows. Click the “DOWNLOAD RSTUDIO DESKTOP FOR WINDOWS” button.
Run the installer. Again, use the default settings by clicking Next until it’s finished.

On macOS

Step 1: Install R (The Language)

Go to the CRAN download page for macOS: https://cran.r-project.org/bin/macosx/
Choose the correct package for your Mac’s chip:
- Apple M1/M2/M3 (Apple Silicon): Download the package labeled R-x.x.x.pkg (arm64).
- Intel Chip: Download the package labeled R-x.x.x.pkg (x86_64).
Run the installer (.pkg file). Click Continue and Agree through the prompts, using the default settings.

Step 2: Install RStudio (The Dashboard)

Go to the RStudio Desktop download page: https://posit.co/download/rstudio-desktop/
The website should detect you’re on a Mac. Click the “DOWNLOAD RSTUDIO DESKTOP FOR MACOS” button.
This will download a .dmg file. Open it.
A window will appear. Drag the RStudio icon into your Applications folder.

Open RStudio. You’ll see four main panes (some may be combined at first):

Source Editor (Top-Left): This is where you write and save your R scripts (.R files).
Console (Bottom-Left): This is where you can type R commands directly and where the output from your scripts will appear.
Environment/History (Top-Right): Environment shows all the objects (like datasets) you’ve created. History shows your past commands.
Files/Plots/Packages (Bottom-Right): Files shows the files in your project folder. Plots is where your graphs will appear. Packages lets you manage your installed R packages.

Using RStudio Projects is the best way to keep your work organized. A project keeps all your files, scripts, and data together.

Step 3: Create an RStudio Project

In RStudio, go to File > New Project….
Choose New Directory.
Choose New Project.
For “Directory name”, type MyFirstRProject.
For “Create project as subdirectory of”, click Browse… and choose a location like your Desktop or Documents folder.
Click Create Project.

RStudio will now restart and open your new project. Notice the Files pane (bottom-right) now shows the contents of your MyFirstRProject folder.

Step 4: Create Your Folder and Data File

Let’s create the same structure as our Python example.

In the Files pane (bottom-right), click the New Folder button and name it data.
Now, create the data file. On your computer (outside of RStudio), navigate into your new MyFirstRProject folder, then into the data folder.
Create a new text file named students.csv. Open it with a basic text editor (like Notepad or TextEdit) and paste the following content:

StudentID,Name,Score
101,Alice,88
102,Bob,92
103,Charlie,75
104,Diana,95

Your folder structure is now:

MyFirstRProject/
├── MyFirstRProject.Rproj  (RStudio creates this for you)
└── data/
    └── students.csv

Step 5: Create an R Script and Write the Code

In RStudio, go to File > New File > R Script. A blank script will open in the Source editor.
Save the file by clicking the floppy disk icon or pressing Ctrl+S (Windows) / Cmd+S (Mac). Name it main.R.

Now, paste this code into your main.R script:

# 1. Define the path to our dataset.
# Because we are using an RStudio Project, we can use a relative path.
# This means "look inside the 'data' folder for the 'students.csv' file".
file_path <- "data/students.csv"

# 2. Use R's built-in function to read the CSV file.
# The data is stored in an object called a "data frame".
print("Reading the dataset...")
student_data <- read.csv(file_path)

# 3. Print the first 6 rows of the data to see if it loaded correctly.
# The head() function is great for peeking at your data.
print("Here is the top of the dataset:")
print(head(student_data))

# 4. Print a success message!
print("Successfully imported and displayed the dataset!")

Step 6: Run Your R Script!

There are two easy ways to run your code:

Run the whole script (recommended): Click the Source button at the top of your script editor. This will execute the entire file from top to bottom.
Run line-by-line: Place your cursor on a line and press Ctrl+Enter (on Windows) or Cmd+Enter (on Mac). This is great for debugging.

The output will appear in the Console (bottom-left):

[1] "Reading the dataset..."
[1] "Here is the top of the dataset:"
  StudentID    Name Score
1       101   Alice    88
2       102     Bob    92
3       103 Charlie    75
4       104   Diana    95
[1] "Successfully imported and displayed the dataset!"

Congratulations! You have installed R and RStudio, and are ready for data analysis.

This guide covers getting started with Stata for data analysis. Stata is commercial software, so you must have a license from UU.

Step 1: Installation (Windows & macOS):

Obtain Software: Get the Stata installer and your license file (stata.lic) from your university’s software portal or IT department.

Install Stata: Run the installer, accepting the terms and choosing the Stata version you are licensed for (e.g., Stata/SE). Use all default settings.

Step 2: Setup & First Script:

Create Project Folder: On your computer, create a folder named MyStataProject. Inside it, create a sub-folder named data and place your students.csv file inside.

Set Working Directory: This is a crucial step. In Stata, go to File > Change Working Directory… and select your MyStataProject folder.

Create and Run a Do-file:

Open the Do-file Editor (click the notepad-with-pencil icon).

Save the empty file as main.do in your project folder.

Add these commands:

* Import data from the data sub-folder
import delimited "data/students.csv", clear

* Get summary statistics and list the rows
* (the file has only 4 observations, so "list in 1/5" would stop with an error)
summarize
list

Click the Execute (do) button in the Do-file Editor’s toolbar. The output will appear in the main Results window.

Congratulations! You are ready to perform statistical analysis in Stata.

Exercises

Events, Intersections, and Unions

Imagine you are drawing a single card from a standard 52-card deck. Let’s define two events:

Event A: The card drawn is a ‘King’.
Event B: The card drawn is a ‘Heart’.

Calculate the following probabilities:

\(P(A)\): The probability of drawing a King.
\(P(B)\): The probability of drawing a Heart.
\(P(A \cap B)\): The probability of drawing a King AND a Heart (i.e., the King of Hearts).
\(P(A \cup B)\): The probability of drawing a King OR a Heart. Use the formula for the union of two events.

Conditional Probability

Using the same card-drawing experiment:

Calculate \(P(A|B)\): The probability that the card is a King, given that you know it is a Heart.
Calculate \(P(B|A)\): The probability that the card is a Heart, given that you know it is a King.
Are events A and B independent? Justify your answer using the definition of independence (\(P(A \cap B) = P(A)P(B)\)).
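
As a way to check pen-and-paper answers for the two card exercises above, the following is a minimal sketch that enumerates the deck and counts outcomes. The card labels (for example 'KH' for the King of Hearts) and the helper name prob are illustrative choices, not part of the exercise.

```python
from fractions import Fraction
from itertools import product

# Build the 52-card deck as strings such as '2H' or 'KS' (labels are an assumption).
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["H", "D", "C", "S"]
deck = [rank + suit for rank, suit in product(ranks, suits)]

def prob(event):
    """Probability of an event (a set of cards) when every card is equally likely."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card.startswith("K")}   # Event A
hearts = {card for card in deck if card.endswith("H")}    # Event B

p_a = prob(kings)
p_b = prob(hearts)
p_a_and_b = prob(kings & hearts)
p_a_or_b = p_a + p_b - p_a_and_b      # formula for the union of two events
p_a_given_b = p_a_and_b / p_b         # definition of conditional probability
p_b_given_a = p_a_and_b / p_a
independent = p_a_and_b == p_a * p_b  # definition of independence

print(p_a, p_b, p_a_and_b, p_a_or_b, p_a_given_b, p_b_given_a, independent)
```

Using Fraction keeps the results exact, so they can be compared directly with answers written as fractions.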

Simulating an Experiment

Let’s verify the theoretical results from the card-drawing experiment empirically. We will simulate drawing a card a large number of times and see if the frequencies match the probabilities.

Create a list or array representing the 52 cards. A simple way is a list of strings, e.g., ['2H', '3H', ..., 'KH', 'AH', '2D', ...].
Write a Python/R/Stata script to “draw” a card from the deck 100,000 times (with replacement).
Count how many times:

A ‘King’ was drawn (Event A).
A ‘Heart’ was drawn (Event B).
The ‘King of Hearts’ was drawn (Event \(A \cap B\)).
A ‘King’ or a ‘Heart’ was drawn (Event \(A \cup B\)).

Calculate Frequencies: Divide the counts by the total number of simulations (100,000) to get the empirical probabilities. Compare these to your theoretical answers in Exercise 1.1.
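
One possible way to set up this simulation in Python is sketched below, assuming numpy is installed. The seed, the card labels, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is reproducible

# Deck as strings such as '2H' or 'KS' (labels are an assumption).
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["H", "D", "C", "S"]
deck = np.array([rank + suit for suit in suits for rank in ranks])

n_draws = 100_000
draws = rng.choice(deck, size=n_draws, replace=True)  # draw with replacement

is_king = np.char.startswith(draws, "K")   # Event A
is_heart = np.char.endswith(draws, "H")    # Event B

# The mean of a boolean array is the share of True values, i.e. the frequency.
freq_a = is_king.mean()
freq_b = is_heart.mean()
freq_a_and_b = (is_king & is_heart).mean()
freq_a_or_b = (is_king | is_heart).mean()

print(freq_a, freq_b, freq_a_and_b, freq_a_or_b)
```

The frequencies will not match the theoretical probabilities exactly, but with 100,000 draws they should differ only in the third decimal place or so.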

PMF, Expected Value, and Variance

An online store has determined that the number of items a customer buys in a single visit, \(X\), is a discrete random variable with the following Probability Mass Function (PMF):

\(x\) (items)    0      1      2      3
\(P(X = x)\)     0.1    0.5    0.3    0.1

Verify that this is a valid PMF.
Calculate the expected number of items a customer will buy, \(E[X]\).
Calculate the variance of the number of items, \(Var(X)\).
What is the probability that a customer buys more than one item?
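
The four parts above can be checked with a few lines of plain Python. This is only a sketch; it uses the shortcut \(Var(X) = E[X^2] - (E[X])^2\), and the variable names are illustrative.

```python
import math

values = [0, 1, 2, 3]
probs = [0.1, 0.5, 0.3, 0.1]

# A valid PMF has non-negative probabilities that sum to one.
is_valid = all(p >= 0 for p in probs) and math.isclose(sum(probs), 1.0)

# E[X] = sum of x * P(X = x)
expected_value = sum(x * p for x, p in zip(values, probs))

# Var(X) = E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in zip(values, probs))
variance = second_moment - expected_value**2

# P(X > 1) = P(X = 2) + P(X = 3)
p_more_than_one = sum(p for x, p in zip(values, probs) if x > 1)

print(is_valid, expected_value, variance, p_more_than_one)
```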

The Bernoulli Distribution in Theory and Programming

A loan has a probability of default of \(p=0.05\). Let \(X\) be a Bernoulli random variable where \(X=1\) if the loan defaults and \(X=0\) if it does not.

Using the formulas from the lecture for a Bernoulli distribution, calculate the expected value \(E[X]\) and the variance \(Var(X)\).
Use numpy, scipy.stats or rbinom() to simulate 1,000,000 such loans.

Calculate the sample mean and sample variance of your simulation.
Compare your simulated results to the theoretical values. They should be very close.
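
A minimal sketch of the simulation with numpy follows. It treats each loan as a binomial draw with a single trial, which is one common way to generate Bernoulli outcomes; the seed is arbitrary.

```python
import numpy as np

p = 0.05  # probability of default

# Theoretical moments of a Bernoulli(p) variable.
theoretical_mean = p
theoretical_var = p * (1 - p)

# A Bernoulli draw is a binomial draw with n=1 trial.
rng = np.random.default_rng(seed=0)
loans = rng.binomial(n=1, p=p, size=1_000_000)

sample_mean = loans.mean()
sample_var = loans.var(ddof=1)  # ddof=1 gives the sample variance

print(theoretical_mean, sample_mean)
print(theoretical_var, sample_var)
```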

Standardization and Z-scores

Suppose the annual returns of a stock portfolio are normally distributed with a mean of 12% and a standard deviation of 20%. Let \(R\) be the random variable for the portfolio’s return. So, \(R \sim N(0.12, 0.20^2)\).

What is the Z-score for a return of 32%? What does this Z-score signify?
What is the Z-score for a return of -8%?
What portfolio return corresponds to a Z-score of 1.5?
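
The standardization \(Z = (R - \mu)/\sigma\) and its inverse \(R = \mu + Z\sigma\) can be written as two small helper functions. The function names below are illustrative only.

```python
mu, sigma = 0.12, 0.20  # mean and standard deviation of the portfolio return

def z_score(r):
    """Number of standard deviations that return r lies above the mean."""
    return (r - mu) / sigma

def return_from_z(z):
    """Invert the standardization: the return that corresponds to Z-score z."""
    return mu + z * sigma

z_32 = z_score(0.32)             # return of 32%
z_minus_8 = z_score(-0.08)       # return of -8%
return_at_z = return_from_z(1.5)

print(z_32, z_minus_8, return_at_z)
```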

Calculating Probabilities and Quantiles

Continuing with the portfolio from the previous exercise (\(R \sim N(0.12, 0.20^2)\)):

Use scipy.stats.norm in Python or pnorm in R to answer the following questions:

What is the probability that the portfolio’s return is negative (\(P(R < 0)\))?
What is the probability that the return is greater than 25% (\(P(R > 0.25)\))?
What is the probability that the return is between 0% and 15% (\(P(0 < R < 0.15)\))?
Find the 5th percentile of the return distribution. This is the value below which 5% of returns fall (often used in Value-at-Risk calculations).

(Hint: you may need norm.cdf and norm.ppf from scipy.stats)
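
The sketch below answers the same four questions with NormalDist from Python's standard library, used here as a stand-in that needs no extra installation. Its cdf and inv_cdf methods play the role of norm.cdf and norm.ppf from scipy.stats, which take the mean and standard deviation through the loc and scale arguments.

```python
from statistics import NormalDist

returns = NormalDist(mu=0.12, sigma=0.20)

p_negative = returns.cdf(0.0)                      # P(R < 0), about 0.274
p_above_25 = 1 - returns.cdf(0.25)                 # P(R > 0.25), about 0.258
p_between = returns.cdf(0.15) - returns.cdf(0.0)   # P(0 < R < 0.15), about 0.285
percentile_5 = returns.inv_cdf(0.05)               # 5th percentile, about -0.209

print(p_negative, p_above_25, p_between, percentile_5)
```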

Linear Combinations of Normal Variables in Python

You manage a portfolio consisting of two assets, Asset A and Asset B. Their annual returns are independent and normally distributed:

Return of Asset A: \(R_A \sim N(0.10, 0.15^2)\)
Return of Asset B: \(R_B \sim N(0.06, 0.10^2)\)

You create a portfolio where you invest 60% in Asset A and 40% in Asset B. The portfolio return is \(R_P = 0.6 R_A + 0.4 R_B\).

Based on the rules for linear combinations of normal random variables, what are the expected value \(E[R_P]\) and variance \(Var(R_P)\) of the portfolio? What is the full distribution of \(R_P\)?
Simulate 100,000 returns for \(R_A\) and \(R_B\).

Calculate the portfolio return \(R_P\) for each of the 100,000 simulations.
Calculate the sample mean and sample variance of your simulated \(R_P\).
Compare your simulated results to your theoretical calculations from part 1.
Bonus: Plot a histogram of your simulated \(R_P\) to visually confirm it looks normally distributed.
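
A possible structure for this exercise is sketched below, assuming numpy. Because the two returns are independent, the variance of the portfolio has no covariance term. The seed and variable names are arbitrary; the histogram is left out, but matplotlib's hist function could be applied to r_p.

```python
import numpy as np

w_a, w_b = 0.6, 0.4            # portfolio weights
mu_a, sigma_a = 0.10, 0.15     # Asset A
mu_b, sigma_b = 0.06, 0.10     # Asset B

# Theory: E[aX + bY] = aE[X] + bE[Y]; with independence,
# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y).
theoretical_mean = w_a * mu_a + w_b * mu_b
theoretical_var = w_a**2 * sigma_a**2 + w_b**2 * sigma_b**2

# Simulation: draw both returns independently and combine them.
rng = np.random.default_rng(seed=1)
n_sims = 100_000
r_a = rng.normal(loc=mu_a, scale=sigma_a, size=n_sims)
r_b = rng.normal(loc=mu_b, scale=sigma_b, size=n_sims)
r_p = w_a * r_a + w_b * r_b

sample_mean = r_p.mean()
sample_var = r_p.var(ddof=1)

print(theoretical_mean, sample_mean)
print(theoretical_var, sample_var)
```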

Rules of Expectation and Variance

Let \(X\) and \(Y\) be two random variables. You are given the following information:

\(E[X] = 10\), \(Var(X) = 4\)
\(E[Y] = 5\), \(Var(Y) = 9\)
\(Cov(X, Y) = -2\)

Calculate the following:

\(E[X + Y]\)
\(Var(X + Y)\)
\(E[3X - 2Y]\)
\(Var(3X - 2Y)\)
Recalculate \(Var(X+Y)\) assuming \(X\) and \(Y\) were independent. How does it differ from your answer in part 2?
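
All five parts apply the same two rules, \(E[aX + bY] = aE[X] + bE[Y]\) and \(Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X, Y)\). A small helper, with an illustrative name, makes it easy to check the arithmetic:

```python
def linear_combination_moments(a, b, mean_x, mean_y, var_x, var_y, cov_xy):
    """Return (mean, variance) of aX + bY from the moments of X and Y."""
    mean = a * mean_x + b * mean_y
    variance = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy
    return mean, variance

# Moments given in the exercise.
moments = dict(mean_x=10, mean_y=5, var_x=4, var_y=9)

sum_mean, sum_var = linear_combination_moments(1, 1, cov_xy=-2, **moments)
combo_mean, combo_var = linear_combination_moments(3, -2, cov_xy=-2, **moments)

# Independence implies a covariance of zero.
_, sum_var_independent = linear_combination_moments(1, 1, cov_xy=0, **moments)

print(sum_mean, sum_var, combo_mean, combo_var, sum_var_independent)
```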

Interpreting Conditional Expectation

A researcher is studying the relationship between years of education (\(E\)) and annual income (\(I\)). They model income as a random variable conditional on education.

The researcher finds that the conditional expectation of income given education is: \(E[I | E=e] = 15000 + 4000e\)

What is the expected income for a person with 12 years of education?
What is the expected income for a person with 16 years of education (a college degree)?
The researcher writes down the term \(E[I|E]\). Is this a single number or a random variable? Explain your reasoning.
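
To make the last question concrete, the sketch below treats the conditional expectation as a function of education and evaluates it both at fixed values and at randomly drawn ones. The list of possible education levels is a made-up illustration and not part of the exercise.

```python
import random

def expected_income(years_of_education):
    """Conditional expectation E[I | E = e] = 15000 + 4000e from the exercise."""
    return 15000 + 4000 * years_of_education

# For a fixed education level, the result is a single number.
income_12 = expected_income(12)
income_16 = expected_income(16)

# Hypothetical: education levels of five randomly selected people.
random.seed(0)
education_draws = [random.choice([10, 12, 14, 16, 18]) for _ in range(5)]

# Applying the same function to a random E yields values that vary with the draw.
conditional_expectations = [expected_income(e) for e in education_draws]

print(income_12, income_16)
print(education_draws, conditional_expectations)
```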

Univariate Statistics

Import the WAGE1.DTA dataset of hourly wages (the URL is given in the solutions below) and calculate the sample mean, variance and standard deviation of the wage variable.

Solutions:

R

library(haven)
url <- 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/WAGE1.DTA'
dataset <- read_dta(url)
mean(dataset$wage)
## [1] 5.896103
var(dataset$wage)
## [1] 13.63888
sd(dataset$wage)
## [1] 3.693086

Python

import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/WAGE1.DTA'
dataset = pd.read_stata(url)
np.mean(dataset['wage'])
## np.float32(5.896103)
# Note: np.var and np.std default to ddof=0 (population formulas), so the values
# below differ slightly from R's var() and sd(); pass ddof=1 to match them.
np.var(dataset['wage'])
## np.float32(13.612955)
np.std(dataset['wage'])
## np.float32(3.6895738)

Stata

* Load the dataset from the URL, replacing the current dataset in memory
use "https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/WAGE1.DTA", clear

* Calculate and display summary statistics for the 'wage' variable
summarize wage, detail

The Nature of Statistics and Parameters

The lecture states that a parameter is a fixed value, while a statistic is a random variable.

Explain why this is the case. Use the example of the average height of all citizens in a country versus the average height of a 1,000-person sample to illustrate your explanation.

Constructing an Exact Sampling Distribution

Consider a tiny population consisting of only five numbers: [10, 20, 30, 40, 50].

Calculate the true population mean, \(\mu\).
Now, consider taking random samples of size n=2 without replacement. List all possible unique samples you can draw. (Hint: The number of combinations is “5 choose 2”).*
Calculate the sample mean (\(\bar{x}\)) for each of these possible samples.
The list of sample means you created in part (c) is the exact sampling distribution of the sample mean. Present this distribution as a frequency table.
Is the mean of this sampling distribution equal to the true population mean you calculated in part (a)?

*: You can use combn() in R or itertools.combinations in Python.
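
The enumeration can be sketched with itertools.combinations as follows. The variable names are illustrative, and Counter is used here as one simple way to build the frequency table.

```python
from collections import Counter
from itertools import combinations

population = [10, 20, 30, 40, 50]
mu = sum(population) / len(population)  # true population mean

# All unordered samples of size 2 drawn without replacement ("5 choose 2").
samples = list(combinations(population, 2))

# Sample mean of each possible sample.
sample_means = [sum(sample) / len(sample) for sample in samples]

# Frequency table: how often each value of the sample mean occurs.
frequency_table = Counter(sample_means)

# Mean of the sampling distribution, to compare with mu.
mean_of_sample_means = sum(sample_means) / len(sample_means)

for value in sorted(frequency_table):
    print(value, frequency_table[value])
print(mu, mean_of_sample_means)
```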

Simulating a Sampling Distribution

The lecture demonstrated simulating the sampling distribution from a Normal population. Your task is to do the same for a different type of population.

Assume the population data follows a highly skewed Exponential distribution. The exponential distribution is often used to model time until an event.

Simulate a population, i.e. a large number of draws.*
Draw 5,000 different random samples, each of size n=50, from this population.**
For each of the 5,000 samples, calculate its sample mean.
Plot a histogram of the 5,000 sample means you calculated.
Describe the shape of the histogram of sample means. Does it look like the original Exponential distribution, or does it look like something else?

*: In Python, you can simulate a large number of draws from an exponential distribution using numpy.random.exponential(scale=10, size=100000). Let’s say the scale parameter (which is the true mean, \(\mu\)) is 10.
**: You can use np.random.choice() in Python or sample() in R
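
A minimal sketch of the steps above, assuming numpy, is given below. It uses numpy's Generator interface (rng.exponential and rng.choice) in place of the np.random functions mentioned in the footnotes. The skewness helper is an addition for describing the shape numerically; the histogram itself could be drawn by passing sample_means to matplotlib's hist function.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)

# Step 1: simulate a large, right-skewed population (true mean 10).
population = rng.exponential(scale=10, size=100_000)

# Steps 2 and 3: draw 5,000 samples of size 50 and store each sample mean.
n_samples, n = 5_000, 50
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_samples)
])

def skewness(x):
    """Moment-based skewness: roughly 0 for symmetric data, positive for a right tail."""
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

print("population skewness:  ", skewness(population))
print("sample-mean skewness: ", skewness(sample_means))
print("mean of sample means: ", sample_means.mean())
```

If the sample means are much less skewed than the population, that already hints at the answer to the last question.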

Applying the CLT Conceptually

A logistics company knows that the weight of packages it processes is not normally distributed; it is right-skewed with a long tail (a few very heavy packages). The true mean weight of all packages is \(\mu = 8\) kg, and the true standard deviation is \(\sigma = 5\) kg.

The company takes a random sample of 100 packages.

According to the Central Limit Theorem, what can you say about the shape of the sampling distribution of the sample mean weight (\(\bar{x}\))?
What will be the theoretical mean of this sampling distribution?
What will be the theoretical standard deviation of this sampling distribution (i.e., the standard error)?
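
The two numerical parts follow directly from the CLT results \(E[\bar{x}] = \mu\) and \(SE = \sigma / \sqrt{n}\), as in this short sketch:

```python
import math

mu = 8.0      # true mean package weight in kg
sigma = 5.0   # true standard deviation in kg
n = 100       # sample size

mean_of_sample_mean = mu               # the sampling distribution is centred on mu
standard_error = sigma / math.sqrt(n)  # standard deviation of the sample mean

print(mean_of_sample_mean, standard_error)
```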

Verifying the CLT in Python/R/Stata

Let’s use the simulation from Slide 3.14 to verify the predictions of the Central Limit Theorem. Your population is from an Exponential distribution with a true mean (\(\mu\)) of 10 and a true standard deviation (\(\sigma\)) also of 10. Your sample size was n=50.

From the list of 5,000 sample means you generated in Exercise 4, calculate the empirical mean (the average of all your sample means) and the empirical standard deviation.
According to the CLT, what should the theoretical mean of the sampling distribution be?
According to the CLT, what should the theoretical standard deviation of the sampling distribution (the standard error) be? The formula is \(\sigma / \sqrt{n}\).
Compare the empirical results from part (a) with the theoretical results from parts (b) and (c). Are they close?
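
The comparison could look like the sketch below. To keep it self-contained, it takes a shortcut and draws the 5,000 samples directly from the Exponential distribution instead of from a stored population; with your own list of sample means from the earlier exercise, only the last part is needed.

```python
import numpy as np

mu, sigma = 10.0, 10.0      # true mean and standard deviation of the population
n, n_samples = 50, 5_000

# Shortcut (an assumption of this sketch): one row per sample, drawn directly
# from the Exponential distribution.
rng = np.random.default_rng(seed=7)
samples = rng.exponential(scale=mu, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# Empirical moments of the sampling distribution.
empirical_mean = sample_means.mean()
empirical_sd = sample_means.std(ddof=1)

# CLT predictions.
theoretical_mean = mu
theoretical_se = sigma / np.sqrt(n)

print(empirical_mean, theoretical_mean)
print(empirical_sd, theoretical_se)
```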

Formulating Hypotheses

For each research question below, formulate the appropriate null hypothesis (\(H_0\)) and alternative hypothesis (\(H_A\)). State whether a one-sided or a two-sided test would be more appropriate and justify your choice.

Scenario A: A city’s water department wants to know if the average daily water consumption per household has changed from last year’s average of 350 gallons.
Scenario B: A pharmaceutical company has developed a new drug to lower blood pressure. They want to test if the drug is effective, meaning it reduces blood pressure compared to a placebo.
Scenario C: An online retailer is testing a new website design. They want to know if the new design has a different conversion rate (proportion of visitors who make a purchase) than the old design’s rate of 15%.

Interpretation and Calculation

A car manufacturer claims its new hybrid model has a mean fuel efficiency of 50 miles per gallon (MPG). A consumer watchdog group tests a random sample of n=36 cars and finds a sample mean efficiency of \(\bar{x}\) = 48.5 MPG. Assume the population standard deviation for fuel efficiency is known to be \(\sigma = 6\) MPG.

The consumer watchdog group wants to test if the manufacturer’s claim is overstated (i.e., if the true mean is less than 50 MPG). They set \(\alpha = 0.05\).

State the null (\(H_0\)) and alternative (\(H_A\)) hypotheses for this test.
Calculate the standard error of the sample mean.
Calculate the Z-test statistic using the formula: \(Z = (\bar{x} - \mu_0) / SE\).
Using your Z-statistic, find the corresponding p-value. (You may need a Z-table or a statistical function for this).
Based on your p-value and the significance level of \(\alpha=0.05\), what is your conclusion? Do you reject or fail to reject the null hypothesis?
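
The calculation steps can be sketched with the standard library as follows. NormalDist().cdf gives the standard normal CDF, the same quantity as scipy.stats.norm.cdf. Because the alternative is that the true mean is less than 50, the p-value is the area in the left tail.

```python
import math
from statistics import NormalDist

mu_0 = 50.0    # mean under the null hypothesis (manufacturer's claim)
x_bar = 48.5   # sample mean
sigma = 6.0    # known population standard deviation
n = 36
alpha = 0.05

standard_error = sigma / math.sqrt(n)
z_stat = (x_bar - mu_0) / standard_error

# Left-tailed test: the p-value is P(Z <= z_stat) under the null.
p_value = NormalDist().cdf(z_stat)
reject_null = p_value < alpha

print(standard_error, z_stat, p_value, reject_null)
```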

Performing a Hypothesis Test

Use the WAGE2.DTA dataset available here. Import it in R/Python/Stata. Conduct a hypothesis test on the IQ variable. The null hypothesis is that \(\mu_0 = 100\).

Calculate the standard error and the Z-test statistic in Python.
The p-value for a two-tailed test is the area under the standard normal curve that is more extreme (on both sides) than your calculated Z-statistic. Use scipy.stats.norm.cdf() or pnorm() to find this p-value.
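
A generic two-sided Z-test on raw data might be sketched as below. The function name and the ten example values are made up for illustration and are not the WAGE2.DTA data; in the exercise you would pass the IQ column instead. The sample standard deviation is used as the estimate of \(\sigma\).

```python
import math
from statistics import NormalDist, mean, stdev

def two_sided_z_test(sample, mu_0):
    """Return (standard error, Z-statistic, two-sided p-value) for H0: mu = mu_0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    z_stat = (mean(sample) - mu_0) / standard_error
    # Area in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return standard_error, z_stat, p_value

# Hypothetical toy data, only to show the call.
toy_scores = [98, 102, 105, 110, 95, 101, 107, 99, 103, 100]
se, z, p = two_sided_z_test(toy_scores, mu_0=100)

print(se, z, p)
```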

Correct Interpretation

A researcher calculates a 95% confidence interval for the average number of hours students at a university study per week and gets [12.5, 15.0] hours.

Which of the following statements is the correct interpretation of this result? Explain why the other statement is incorrect.

Statement 1: “There is a 95% probability that the true average study time for all students at the university is between 12.5 and 15.0 hours.”
Statement 2: “We are 95% confident that the method we used to generate this interval captures the true average study time for all students at the university.”

Factors Affecting Interval Width

Without doing any calculations, explain how the width of a confidence interval would be affected by the following changes, assuming all other factors remain constant.

Increasing the confidence level from 90% to 99%.
Increasing the sample size from 100 to 400.
The sample having a larger standard deviation.

Constructing a Confidence Interval

You are given the same data from the fuel efficiency test: a sample of n=36 cars yielded a sample mean of \(\bar{x}\) = 48.5 MPG, and the population standard deviation is \(\sigma = 6\) MPG.

Write a Python script to calculate a 95% confidence interval for the true mean fuel efficiency.
- Find the point estimate.
- Calculate the standard error.
- Find the critical Z-value for a 95% CI. (Hint: for a 95% CI, you need the value that leaves 2.5% in each tail. You can use scipy.stats.norm.ppf(0.975)).
- Calculate the margin of error (Critical Value * Standard Error).
- Construct the interval: Point Estimate ± Margin of Error.
Now, calculate a 99% confidence interval using the same data. Is it wider or narrower than the 95% interval?
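
The steps listed above can be wrapped in one small function, sketched here with the standard library. NormalDist().inv_cdf returns the same critical value as scipy.stats.norm.ppf; the function name is illustrative.

```python
import math
from statistics import NormalDist

x_bar = 48.5   # point estimate (sample mean, MPG)
sigma = 6.0    # known population standard deviation
n = 36

def z_confidence_interval(confidence):
    """Confidence interval for the mean when sigma is known."""
    standard_error = sigma / math.sqrt(n)
    # Critical value that leaves (1 - confidence) / 2 in each tail.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * standard_error
    return x_bar - margin, x_bar + margin

ci_95 = z_confidence_interval(0.95)
ci_99 = z_confidence_interval(0.99)

print(ci_95)
print(ci_99)
```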