Tutorial 0: Statistics and Probability
We set up Python/Stata/R.
We focus on downloading and importing datasets.
We focus on statistics and probability, both “on pen and paper” and “with the help of your laptop”.
This guide will walk you through installing a powerful and popular setup for Python programming. We’ll use Miniconda to manage our Python environment and Visual Studio Code (VS Code) as our code editor.
Conda makes installing packages (such as pandas for data or matplotlib for plots) easy and avoids conflicts between them. We’ll use Miniconda, a lightweight version of Anaconda.
Follow the steps for your operating system to set up your environment.
On Windows
Step 1: Install Miniconda (Python)
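Download the Miniconda installer for Windows (the .exe file) and run it.
Step 2: Install Visual Studio Code (VS Code)
Download the VS Code installer for Windows, run it, and click Next on all screens. We recommend keeping the default settings, especially “Add to PATH”, which should be checked by default.
On macOS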
Step 1: Install Miniconda (Python)
Download the Miniconda installer for macOS (the .pkg file) and open it. Click Continue and Agree through the prompts, using the default settings. It will install for the current user.
Step 2: Install Visual Studio Code (VS Code)
Download VS Code for macOS; it comes as a .zip file. Unzip it to get Visual Studio Code.app.
Drag Visual Studio Code.app into your Applications folder. This makes it easy to find and launch.
Step 3: Setting up VS Code for Python
Now that everything is installed, let’s connect them.
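Open VS Code, go to the Extensions view, search for Python, and install the Python extension.
Open the Command Palette with Ctrl+Shift+P (on Windows) or Cmd+Shift+P (on Mac).
Type Python: Select Interpreter and press Enter, then choose the interpreter from your Miniconda installation.
You are now set up! VS Code knows where to find your Conda Python installation.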
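If you want to confirm which interpreter VS Code actually picked up, a small check like the one below may help. It is only a sketch: the function name interpreter_info is made up for this illustration, and the script uses nothing beyond Python's standard library.

```python
import sys


def interpreter_info():
    """Return the path and version of the Python interpreter running this script.

    The function name is hypothetical; it is not part of any library.
    """
    version = ".".join(str(part) for part in sys.version_info[:3])
    return sys.executable, version


path, version = interpreter_info()
print(f"Interpreter: {path}")
print(f"Python version: {version}")
```

If the setup worked, the printed path should point inside your Miniconda folder rather than to a system-wide Python.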
Step 4: Your First Project - Reading a Dataset
Let’s perform our first real-world task: opening a project folder and reading data from a file.
First, create a place for your project on your computer (e.g., in Documents or on your Desktop).
Create a new folder and name it MyFirstProject.
Inside MyFirstProject, create a new text file and name it main.py.
Inside MyFirstProject, create another new folder named data.
Inside the data folder, create a new text file named students.csv. Open it with a basic text editor (like Notepad or TextEdit) and paste the following content:
StudentID,Name,Score
101,Alice,88
102,Bob,92
103,Charlie,75
104,Diana,95
Your folder structure should look like this:
MyFirstProject/
├── main.py
└── data/
    └── students.csv
Step 5: Open the Project in VS Code
This is a key step! Don’t just open the .py file. Open the entire folder.
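In VS Code, go to File > Open Folder… and select your MyFirstProject folder. Click “Open” (or “Select Folder”).
The Explorer sidebar on the left should now show your project's contents (main.py and the data folder).
Step 6: Install a Package and Write the Code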
We need the pandas library to read the .csv file. Let’s install it.
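Open a terminal in VS Code (Terminal > New Terminal). You should see (base) at the start of the prompt, indicating your Conda environment is active.
Type the following command and press Enter:
conda install pandas
Type y and press Enter if it asks you to proceed.
Now, click on main.py in the sidebar to open it and paste this code: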
# 1. Import the pandas library, which is a powerful tool for data analysis.
# We give it the nickname 'pd' to make it easier to use.
import pandas as pd
# 2. Define the path to our dataset.
# Because we opened the 'MyFirstProject' folder, we can use a relative path.
# This means "look inside the 'data' folder for the 'students.csv' file".
file_path = 'data/students.csv'
# 3. Use pandas to read the CSV file into a data structure called a DataFrame.
print("Reading the dataset...")
student_data = pd.read_csv(file_path)
# 4. Print the first 5 rows of the data to see if it loaded correctly.
print("Here is the top of the dataset:")
print(student_data.head())
# 5. Print a success message!
print("\nSuccessfully imported and displayed the dataset!")
Step 7: Run Your Python Script!
Look for the Play button (a white triangle) in the top-right corner of the VS Code window. Click it.
The Terminal at the bottom will spring to life and you will see the output:
Reading the dataset...
Here is the top of the dataset:
   StudentID     Name  Score
0        101    Alice     88
1        102      Bob     92
2        103  Charlie     75
3        104    Diana     95
Successfully imported and displayed the dataset!
Congratulations! You are ready to start coding.
This guide will help you install the standard toolkit for statistical computing and data science: R and RStudio. You must install R first, then install RStudio. RStudio needs R to function.
Follow the steps for your operating system.
On Windows
Step 1: Install R (The Language)
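Download the R installer for Windows (the .exe file).
Run it and click Next through the installation wizard.
Step 2: Install RStudio (The Dashboard)
Download the RStudio Desktop installer for Windows, run it, and click Next until it’s finished.
On macOS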
Step 1: Install R (The Language)
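Download the R installer for macOS (the .pkg file) and open it. Click Continue and Agree through the prompts, using the default settings.
Step 2: Install RStudio (The Dashboard)
Download the RStudio Desktop installer for macOS; it comes as a .dmg file. Open it.
Open RStudio. You’ll see four main panes (some may be combined at first):
Source (top-left): where you write and save your scripts (.R files).
Console (bottom-left): where your code runs and its output appears.
Environment/History (top-right): Environment shows all the objects (like datasets) you’ve created. History shows your past commands.
Files/Plots/Packages (bottom-right): Files shows the files in your project folder. Plots is where your graphs will appear. Packages lets you manage your installed R packages.
Using RStudio Projects is the best way to keep your work organized. A project keeps all your files, scripts, and data together.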
Step 3: Create an RStudio Project
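In RStudio, go to File > New Project… > New Directory > New Project.
Name the directory MyFirstRProject.
Choose where to create it, for example your Desktop or Documents folder, and click Create Project.
RStudio will now restart and open your new project. Notice the Files pane (bottom-right) now shows the contents of your MyFirstRProject folder.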
Step 4: Create Your Folder and Data File
Let’s create the same structure as our Python example.
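In the Files pane, click New Folder and name it data.
On your computer, navigate to your MyFirstRProject folder, then into the data folder.
Create a new text file named students.csv. Open it with a basic text editor (like Notepad or TextEdit) and paste the following content:
StudentID,Name,Score
101,Alice,88
102,Bob,92
103,Charlie,75
104,Diana,95
Your folder structure is now: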
MyFirstRProject/
├── MyFirstRProject.Rproj (RStudio creates this for you)
└── data/
    └── students.csv
Step 5: Create an R Script and Write the Code
Go to File > New File > R Script to open an empty script.
Save it with Ctrl+S (Windows) / Cmd+S (Mac). Name it main.R.
Now, paste this code into your main.R script:
# 1. Define the path to our dataset.
# Because we are using an RStudio Project, we can use a relative path.
# This means "look inside the 'data' folder for the 'students.csv' file".
file_path <- "data/students.csv"
# 2. Use R's built-in function to read the CSV file.
# The data is stored in an object called a "data frame".
print("Reading the dataset...")
student_data <- read.csv(file_path)
# 3. Print the first 6 rows of the data to see if it loaded correctly.
# The head() function is great for peeking at your data.
print("Here is the top of the dataset:")
print(head(student_data))
# 4. Print a success message!
print("Successfully imported and displayed the dataset!")
Step 6: Run Your R Script!
There are two easy ways to run your code:
Run the whole script (recommended): Click the Source button at the top of your script editor. This will execute the entire file from top to bottom.
Run line-by-line: Place your cursor on a line and press Ctrl+Enter (on Windows) or Cmd+Enter (on Mac). This is great for debugging.
The output will appear in the Console (bottom-left):
[1] "Reading the dataset..."
[1] "Here is the top of the dataset:"
  StudentID    Name Score
1       101   Alice    88
2       102     Bob    92
3       103 Charlie    75
4       104   Diana    95
[1] "Successfully imported and displayed the dataset!"
Congratulations! You have installed R and RStudio, and are ready for data analysis.
This guide covers getting started with Stata for data analysis. Stata is commercial software, so you must have a license from UU.
Step 1: Installation (Windows & macOS):
Obtain Software: Get the Stata installer and your license file (stata.lic) from your university’s software portal or IT department.
Install Stata: Run the installer, accepting the terms and choosing the Stata version you are licensed for (e.g., Stata/SE). Use all default settings.
Step 2: Setup & First Script:
Create Project Folder: On your computer, create a folder named MyStataProject. Inside it, create a sub-folder named data and place your students.csv file inside.
Set Working Directory: This is a crucial step. In Stata, go to File > Change Working Directory… and select your MyStataProject folder.
Create and Run a Do-file:
Open the Do-file Editor (click the notepad-with-pencil icon).
Save the empty file as main.do in your project folder.
Add these commands:
Click the Execute (do) button in the Do-file Editor’s toolbar. The output will appear in the main Results window.
Congratulations! You are ready to perform statistical analysis in Stata.
Imagine you are drawing a single card from a standard 52-card deck. Let’s define two events:
Calculate the following probabilities:
Using the same card-drawing experiment:
Let’s verify the theoretical results from the card-drawing experiment empirically. We will simulate drawing a card a large number of times and see if the frequencies match the probabilities.
Represent the deck as a list of card labels, e.g. ['2H', '3H', ..., 'KH', 'AH', '2D', ...].
An online store has determined that the number of items a customer buys in a single visit, \(X\), is a discrete random variable with the following Probability Mass Function (PMF):
| \(x\) (items) | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| P(X=x) | 0.1 | 0.5 | 0.3 | 0.1 |
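The questions for this exercise are not reproduced above. Assuming they involve the expected value, the variance, and a simple event probability, the calculation from the PMF might be sketched as follows. The event \(X \ge 2\) is chosen here purely as an illustration.

```python
import math

# PMF from the table above: number of items -> probability
pmf = {0: 0.1, 1: 0.5, 2: 0.3, 3: 0.1}

# A valid PMF must sum to 1 (allowing for floating-point rounding).
assert math.isclose(sum(pmf.values()), 1.0)

# E[X] = sum of x * P(X = x)
expected_value = sum(x * p for x, p in pmf.items())

# Var(X) = E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in pmf.items())
variance = second_moment - expected_value**2

# Illustrative event (an assumption): the customer buys at least two items
prob_at_least_two = sum(p for x, p in pmf.items() if x >= 2)

print(f"E[X] = {expected_value:.2f}")
print(f"Var(X) = {variance:.2f}")
print(f"P(X >= 2) = {prob_at_least_two:.2f}")
```

By hand, the same steps give \(E[X] = 1.4\), \(E[X^2] = 2.6\) and \(Var(X) = 2.6 - 1.4^2 = 0.64\).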
A loan has a probability of default of \(p=0.05\). Let \(X\) be a Bernoulli random variable where \(X=1\) if the loan defaults and \(X=0\) if it does not.
Using the formulas from the lecture for a Bernoulli distribution, calculate the expected value \(E[X]\) and the variance \(Var(X)\).
Use numpy, scipy.stats or rbinom() to simulate 1,000,000 such loans.
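One possible way to set up this simulation with numpy is sketched below. It relies on the fact that a Bernoulli draw is a binomial draw with a single trial; the seed value is arbitrary and only there to make the run reproducible.

```python
import numpy as np

p = 0.05              # probability of default (from the exercise)
n_loans = 1_000_000   # number of simulated loans (from the exercise)

# Theoretical moments of a Bernoulli(p) variable
theoretical_mean = p
theoretical_var = p * (1 - p)

# Simulate: a Bernoulli draw is a binomial draw with n = 1 trial.
rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only
defaults = rng.binomial(n=1, p=p, size=n_loans)

simulated_mean = defaults.mean()
simulated_var = defaults.var()

print(f"Theory:     mean = {theoretical_mean:.4f}, variance = {theoretical_var:.4f}")
print(f"Simulation: mean = {simulated_mean:.4f}, variance = {simulated_var:.4f}")
```

With a million draws, the simulated mean and variance should land very close to \(p = 0.05\) and \(p(1-p) = 0.0475\).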
Suppose the annual returns of a stock portfolio are normally distributed with a mean of 12% and a standard deviation of 20%. Let \(R\) be the random variable for the portfolio’s return. So, \(R \sim N(0.12, 0.20^2)\).
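The specific questions for this exercise are not shown above, so the sketch below picks two illustrative ones (both assumptions): the probability of a negative return, and the return that marks the worst 5% of years. To keep it runnable without extra packages it uses statistics.NormalDist from Python's standard library in place of scipy.stats.norm; its cdf and inv_cdf methods play the roles of norm.cdf and norm.ppf.

```python
from statistics import NormalDist

mu, sigma = 0.12, 0.20            # mean and standard deviation of R (from the exercise)
standard_normal = NormalDist()    # Z ~ N(0, 1)
returns = NormalDist(mu, sigma)   # R ~ N(0.12, 0.20^2)

# Illustrative question 1 (an assumption): P(R < 0), via standardization
z = (0.0 - mu) / sigma                   # z = (r - mu) / sigma
prob_loss = standard_normal.cdf(z)       # P(Z < z)

# The same probability computed directly from the distribution of R
prob_loss_direct = returns.cdf(0.0)

# Illustrative question 2 (an assumption): the 5th percentile of R
worst_5pct = returns.inv_cdf(0.05)

print(f"z = {z:.2f}")
print(f"P(R < 0) = {prob_loss:.4f}")
print(f"5th percentile of R = {worst_5pct:.4f}")
```

The standardized value is \(z = -0.6\), so the pen-and-paper answer can be read from a standard normal table as \(\Phi(-0.6) \approx 0.274\).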
Continuing with the portfolio from the previous exercise (\(R \sim N(0.12, 0.20^2)\)):
Use scipy.stats.norm in Python or pnorm in R to answer the following questions:
(Hint: use norm.cdf and norm.ppf from scipy.stats.)
You manage a portfolio consisting of two assets, Asset A and Asset B. Their annual returns are independent and normally distributed:
You create a portfolio where you invest 60% in Asset A and 40% in Asset B. The portfolio return is \(R_P = 0.6 R_A + 0.4 R_B\).
Based on the rules for linear combinations of normal random variables, what are the expected value \(E[R_P]\) and variance \(Var(R_P)\) of the portfolio? What is the full distribution of \(R_P\)?
Simulate 100,000 returns for \(R_A\) and \(R_B\).
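A sketch of how the theory and the simulation could be compared is given below. The means and standard deviations of the two assets are not reproduced in this text, so the values in the code are placeholders and should be replaced with the ones from the exercise. Only the weights (60% and 40%) and the number of draws come from the text.

```python
import numpy as np

# Hypothetical parameters: placeholders, NOT the values from the exercise.
mu_a, sigma_a = 0.10, 0.15
mu_b, sigma_b = 0.05, 0.08
w_a, w_b = 0.6, 0.4               # portfolio weights (from the exercise)

# Theory for independent normals:
# E[R_P] = w_a * mu_a + w_b * mu_b
# Var(R_P) = w_a^2 * sigma_a^2 + w_b^2 * sigma_b^2  (no covariance term)
theory_mean = w_a * mu_a + w_b * mu_b
theory_var = w_a**2 * sigma_a**2 + w_b**2 * sigma_b**2

# Simulation with 100,000 independent draws per asset
rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
r_a = rng.normal(mu_a, sigma_a, size=100_000)
r_b = rng.normal(mu_b, sigma_b, size=100_000)
r_p = w_a * r_a + w_b * r_b

print(f"Theory:     mean = {theory_mean:.4f}, variance = {theory_var:.6f}")
print(f"Simulation: mean = {r_p.mean():.4f}, variance = {r_p.var():.6f}")
```

Because the two returns are drawn independently, the simulated variance should match the formula without a covariance term. A histogram of r_p would also show the bell shape that the linear-combination rule predicts.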
Let \(X\) and \(Y\) be two random variables. You are given the following information:
Calculate the following:
A researcher is studying the relationship between years of education (\(E\)) and annual income (\(I\)). They model income as a random variable conditional on education.
The researcher finds that the conditional expectation of income given education is: \(E[I | E=e] = 15000 + 4000e\)
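To make the conditional expectation concrete, the short sketch below evaluates it for a few education levels and then applies the law of iterated expectations, \(E[I] = E[E[I | E]]\). The distribution of education used in the second step is hypothetical and serves only as an illustration.

```python
def expected_income(years_of_education):
    """Conditional expectation E[I | E = e] = 15000 + 4000 * e (from the exercise)."""
    return 15000 + 4000 * years_of_education


for e in (12, 16):
    print(f"E[I | E = {e}] = {expected_income(e)}")

# Law of iterated expectations: E[I] = sum over e of E[I | E = e] * P(E = e).
# The distribution of education below is hypothetical, for illustration only.
education_pmf = {12: 0.5, 16: 0.3, 18: 0.2}
mean_education = sum(e * p for e, p in education_pmf.items())
unconditional_income = sum(expected_income(e) * p for e, p in education_pmf.items())

print(f"E[E] = {mean_education:.1f}")
print(f"E[I] = {unconditional_income:.0f}")
```

Since the conditional expectation is linear in \(e\), the unconditional mean equals \(15000 + 4000 \cdot E[E]\), which the two computed numbers confirm.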
Import a dataset with (hourly) wages, WAGE1.DTA, downloadable here and calculate the expected value, variance and standard deviation of the wage variable.
Solutions:
import numpy as np
import pandas as pd
# Read the Stata file directly from its URL
url = 'https://raw.githubusercontent.com/basm92/ee_website/refs/heads/master/tutorials/datafiles/WAGE1.DTA'
dataset = pd.read_stata(url)
# Sample mean, used as the estimate of the expected value
np.mean(dataset['wage'])
## np.float32(5.896103)
# Variance (np.var divides by n by default; pass ddof=1 to divide by n - 1)
np.var(dataset['wage'])
## np.float32(13.612955)
# Standard deviation (same default, ddof=0)
np.std(dataset['wage'])
## np.float32(3.6895738)
The lecture states that a parameter is a fixed value, while a statistic is a random variable.
Explain why this is the case. Use the example of the average height of all citizens in a country versus the average height of a 1,000-person sample to illustrate your explanation.
Consider a tiny population consisting of only five numbers: [10, 20, 30, 40, 50].
*: You can use combn() in R or itertools.combinations in Python.
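The enumeration hinted at above might look like the sketch below. The sample size of 2 is an assumption, because the exercise's sub-questions are not reproduced here; change sample_size to whatever the exercise asks for.

```python
from itertools import combinations
from statistics import mean

population = [10, 20, 30, 40, 50]
sample_size = 2  # assumed for illustration; not stated in the text above

# Every possible sample of the given size, drawn without replacement
samples = list(combinations(population, sample_size))

# The sampling distribution of the mean: one sample mean per possible sample
sample_means = [mean(s) for s in samples]

print(f"Number of possible samples: {len(samples)}")
print(f"Population mean: {mean(population)}")
print(f"Mean of all sample means: {mean(sample_means)}")
```

The mean of all possible sample means equals the population mean, which illustrates that the sample mean is an unbiased estimator.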
The lecture demonstrated simulating the sampling distribution from a Normal population. Your task is to do the same for a different type of population.
Assume the population data follows a highly skewed Exponential distribution. The exponential distribution is often used to model time until an event.
Create the population with numpy.random.exponential(scale=10, size=100000). Let’s say the scale parameter (which is the true mean, \(\mu\)) is 10.
Draw the samples with np.random.choice() in Python or sample() in R.
A logistics company knows that the weight of packages it processes is not normally distributed; it is right-skewed with a long tail (a few very heavy packages). The true mean weight of all packages is \(\mu = 8\) kg, and the true standard deviation is \(\sigma = 5\) kg.
The company takes a random sample of 100 packages.
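By the Central Limit Theorem, the mean of 100 packages is approximately normal with mean \(\mu\) and standard error \(\sigma / \sqrt{n}\), even though individual weights are skewed. The sketch below computes that standard error and, as an illustrative question that is an assumption rather than part of the text above, the probability that the sample mean exceeds 9 kg. It uses statistics.NormalDist from the standard library; scipy.stats.norm would work equally well.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8.0, 5.0, 100   # values from the exercise

# Standard error of the sample mean
standard_error = sigma / sqrt(n)

# Approximate sampling distribution of the sample mean (CLT)
sampling_dist = NormalDist(mu, standard_error)

# Illustrative question (an assumption): P(sample mean > 9 kg)
prob_above_9 = 1 - sampling_dist.cdf(9.0)

print(f"Standard error = {standard_error:.2f} kg")
print(f"P(sample mean > 9) = {prob_above_9:.4f}")
```

A sample mean of 9 kg lies two standard errors above \(\mu\), so the probability is small, roughly 2.3%.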
Let’s use the simulation from Slide 3.14 to verify the predictions of the Central Limit Theorem. Your population is from an Exponential distribution with a true mean (\(\mu\)) of 10 and a true standard deviation (\(\sigma\)) also of 10. Your sample size was n=50.
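A compact version of such a simulation is sketched below, in case the earlier code is not at hand. The number of repeated samples (5,000) and the seed are arbitrary choices, and the sketch uses numpy's Generator interface in place of np.random.choice.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only

# Skewed population: exponential with true mean 10 (and true sd 10)
population = rng.exponential(scale=10, size=100_000)

n = 50             # sample size (from the exercise)
n_samples = 5_000  # number of repeated samples; an arbitrary choice

# Draw all samples at once: one row per sample, then average each row
samples = rng.choice(population, size=(n_samples, n), replace=True)
sample_means = samples.mean(axis=1)

# CLT predictions: mean of sample means = mu, sd of sample means = sigma / sqrt(n)
predicted_mean = 10
predicted_se = 10 / np.sqrt(n)

print(f"Predicted: mean = {predicted_mean}, standard error = {predicted_se:.3f}")
print(f"Simulated: mean = {sample_means.mean():.3f}, standard error = {sample_means.std():.3f}")
```

A histogram of sample_means should look close to bell-shaped even though the population itself is strongly right-skewed.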
For each research question below, formulate the appropriate null hypothesis (\(H_0\)) and alternative hypothesis (\(H_A\)). State whether a one-sided or a two-sided test would be more appropriate and justify your choice.
A car manufacturer claims its new hybrid model has a mean fuel efficiency of 50 miles per gallon (mpg). A consumer watchdog group tests a random sample of n=36 cars and finds a sample mean efficiency of \(\bar{x}\) = 48.5 mpg. Assume the population standard deviation for fuel efficiency is known to be \(\sigma = 6\) mpg.
The consumer watchdog group wants to test if the manufacturer’s claim is overstated (i.e., if the true mean is less than 50 mpg). They set \(\alpha = 0.05\).
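The test can be sketched in a few lines. Because \(\sigma\) is known, a z-test applies, and because the alternative is \(\mu < 50\), the p-value is the left-tail probability. The sketch uses statistics.NormalDist from the standard library as a stand-in for scipy.stats.norm.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 50.0    # claimed mean under H0
x_bar = 48.5   # sample mean
sigma = 6.0    # known population standard deviation
n = 36         # sample size
alpha = 0.05   # significance level

standard_error = sigma / sqrt(n)
z_stat = (x_bar - mu_0) / standard_error

# Left-tailed test: H0: mu = 50 versus HA: mu < 50
p_value = NormalDist().cdf(z_stat)
critical_value = NormalDist().inv_cdf(alpha)
reject_null = p_value < alpha

print(f"z = {z_stat:.2f}, critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Here \(z = -1.5\), which is not below the critical value of about \(-1.645\), and the p-value of about 0.067 exceeds 0.05, so the null hypothesis is not rejected at the 5% level.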
Use the WAGE2.DTA dataset available here. Import it into R/Python/Stata. Conduct a hypothesis test on the IQ variable. The null hypothesis is that \(\mu_0 = 100\).
Use scipy.stats.norm.cdf() or pnorm() to find this p-value.
A researcher calculates a 95% confidence interval for the average number of hours students at a university study per week and gets [12.5, 15.0] hours.
Which of the following statements is the correct interpretation of this result? Explain why the other statement is incorrect.
Without doing any calculations, explain how the width of a confidence interval would be affected by the following changes, assuming all other factors remain constant.
You are given the same data from the fuel efficiency test: a sample of n=36 cars yielded a sample mean of \(\bar{x}\) = 48.5 mpg, and the population standard deviation is \(\sigma = 6\) mpg.
Write a Python script to calculate a 95% confidence interval for the true mean fuel efficiency.
Find the critical value (e.g. with scipy.stats.norm.ppf(0.975)).
Now, calculate a 99% confidence interval using the same data. Is it wider or narrower than the 95% interval?
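One way to write such a script is sketched below. The helper name z_confidence_interval is made up for this illustration, and statistics.NormalDist().inv_cdf is used as a standard-library stand-in for scipy.stats.norm.ppf.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 48.5, 6.0, 36   # values from the exercise


def z_confidence_interval(mean, sd, size, level):
    """Two-sided confidence interval for a mean with known population sd.

    Hypothetical helper for illustration; not part of any library.
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # e.g. 0.975 quantile for 95%
    margin = z_crit * sd / sqrt(size)
    return mean - margin, mean + margin


ci_95 = z_confidence_interval(x_bar, sigma, n, 0.95)
ci_99 = z_confidence_interval(x_bar, sigma, n, 0.99)

print(f"95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
print(f"99% CI: [{ci_99[0]:.2f}, {ci_99[1]:.2f}]")
```

The 95% interval is roughly [46.54, 50.46] and the 99% interval roughly [45.92, 51.08]. The 99% interval is wider because a higher confidence level requires a larger critical value.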
Empirical Economics: Tutorial 0 - Statistics and Probability