Create a new Quarto document (tutorial1.qmd) in a folder designated for this course.
For each question, include:
The question number and text
Your R code in a code chunk
Brief explanation of your approach (for conceptual questions)
Make sure your YAML header (the first lines of your .qmd document) looks approximately as follows:
---
title: Tutorial 1
format: html
author: Your Name And Student No.
---
Inside the console in Positron, install two fundamental packages required to preview a Quarto document: install.packages('knitr') and install.packages('rmarkdown').
Render your document to HTML to verify that all code executes correctly (click “Preview” in Positron).
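Before previewing, it can help to confirm that both packages are actually installed. The following is a minimal base R sketch to run in the console; the names `required_pkgs`, `is_installed` and `missing_pkgs` are illustrative and not part of the course material:

```r
# Packages needed to preview a Quarto document that contains R code
required_pkgs <- c("knitr", "rmarkdown")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("knitr and rmarkdown are both installed.")
}
```

If the message lists a package, install it with install.packages() in the console and run the check again.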
Part 1: Teacher Demonstration
A. R Syntax Refresher
Let’s start by reviewing core R syntax concepts that form the foundation for data work:
# 1. Creating vectors (ordered collections of values)
numeric_vec <- c(1, 2, 3, 4, 5)
character_vec <- c("Amsterdam", "Rotterdam", "Utrecht")
logical_vec <- c(TRUE, FALSE, TRUE)

# 2. Accessing elements with indexing (R uses 1-based indexing!)
numeric_vec[1] # First element: 1
[1] 1
character_vec[2:3] # Second and third elements
[1] "Rotterdam" "Utrecht"
logical_vec[-1] # All elements EXCEPT the first
[1] FALSE TRUE
logical_vec[-c(1, 2)] # All elements EXCEPT the first and second
[1] TRUE
# 3. Creating data frames (rectangular/tidy data structures)
students <- data.frame(
  id = 1:3,
  name = c("Maria", "Anton", "Sophie"),
  gpa = c(3.7, 3.9, 3.5),
  has_job = c(TRUE, FALSE, TRUE)
)

# 4. Accessing columns in different ways
students$name # Using $ operator
[1] "Maria" "Anton" "Sophie"
students[["gpa"]] # Using double brackets
[1] 3.7 3.9 3.5
students[, "has_job"] # Using matrix-style indexing
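The three access styles shown here each return the column as a plain vector. The short sketch below rebuilds the `students` data frame from the demonstration and checks this with `identical()`. Single-bracket indexing with only a column name (`students["gpa"]`) is added for contrast, because it keeps the data frame structure:

```r
students <- data.frame(
  id = 1:3,
  name = c("Maria", "Anton", "Sophie"),
  gpa = c(3.7, 3.9, 3.5),
  has_job = c(TRUE, FALSE, TRUE)
)

# $, [[ ]] and [, "col"] all extract the column as a vector
same_dollar_double <- identical(students$gpa, students[["gpa"]])
same_dollar_matrix <- identical(students$gpa, students[, "gpa"])

# Single brackets with only a column name return a one-column data frame
gpa_df <- students["gpa"]

same_dollar_double        # TRUE
same_dollar_matrix        # TRUE
is.data.frame(gpa_df)     # TRUE
```

This holds for base R data frames. For tibbles, `[, "col"]` returns a one-column tibble rather than a vector, so `$` or `[[ ]]` is the safer choice when a vector is needed.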
# Install required packages ONCE in the console (not in scripts or the .qmd file!)
# install.packages(c("readr", "readxl", "tidyverse", "here"))
library(readr)     # Modern CSV reader (part of tidyverse)
library(tidyverse) # Includes readr plus data manipulation tools
library(here)      # Builds file paths relative to the project root

# Method 1: Local CSV file using readr (fast and modern)
students_local <- read_delim(here("tutorials", "students.csv"))
students_local
# A tibble: 3 × 4
id name gpa has_job
<dbl> <chr> <dbl> <lgl>
1 1 Maria 37 TRUE
2 2 Anton 39 FALSE
3 3 Sophie 35 TRUE
# Method 2: CSV from URL
# read_delim() guesses the delimiter; this file uses ";" rather than ","
students_url <- read_delim("https://basm92.quarto.pub/intro-to-applied-data-science/tutorials/students.csv")
students_url
# Method 3: Base R alternative (slower, less informative output)
students_base <- read.csv(here("tutorials", "students.csv"), stringsAsFactors = FALSE, sep = ";")
students_base
id name gpa has_job
1 1 Maria 3,7 TRUE
2 2 Anton 3,9 FALSE
3 3 Sophie 3,5 TRUE
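In the output above, `read_delim()` reports `gpa` as 37, 39 and 35, while `read.csv()` keeps it as text such as 3,7. Both results are consistent with a file that uses a semicolon as delimiter and a comma as decimal mark. The sketch below shows one way to declare the decimal mark explicitly. The inline text `csv_text` is an assumption that mimics the layout of students.csv; it is not the actual file.

```r
library(readr)

# Inline stand-in for students.csv (assumed layout: ';' delimiter, ',' decimal mark)
csv_text <- "id;name;gpa;has_job
1;Maria;3,7;TRUE
2;Anton;3,9;FALSE
3;Sophie;3,5;TRUE"

# readr: declare the decimal mark through locale()
students_readr <- read_delim(
  I(csv_text),
  delim = ";",
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)

# Base R: the dec argument plays the same role
students_dec <- read.csv(text = csv_text, sep = ";", dec = ",", stringsAsFactors = FALSE)

students_readr$gpa
students_dec$gpa
```

`read_csv2()` and `read.csv2()` are shortcuts for this semicolon and decimal-comma convention.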
Reading Excel (.xlsx) files
library(readxl)

# Read first sheet
eurostat_data <- read_excel(here("tutorials", "eurostat.xlsx"))

# Read specific sheet by name or position
trade_data <- read_excel(here("tutorials", "trade.xlsx"), sheet = "2023")
population_data <- read_excel(here("tutorials", "population.xlsx"), sheet = 2)

# Specify column types explicitly (prevents automatic type guessing errors)
gdp_data <- read_excel(
  here("tutorials", "gdp.xlsx"),
  col_types = c("text", "numeric", "numeric", "date")
)
Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): country, education_level, income_bracket, supports_ubi
dbl (4): respondent_id, age, trust_in_government, inflation_concern
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tab_data, 5)
# A tibble: 5 × 8
respondent_id country age education_level income_bracket supports_ubi
<dbl> <chr> <dbl> <chr> <chr> <chr>
1 1 Netherlands 24 Bachelor 40-60k Yes
2 2 Germany 35 Master 80-100k No
3 3 Belgium 29 PhD 60-80k Yes
4 4 France 42 Bachelor 60-80k No
5 5 Netherlands 19 High School 20-40k Yes
# ℹ 2 more variables: trust_in_government <dbl>, inflation_concern <dbl>
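The call that produced `tab_data` is not shown above. The column specification reports a tab delimiter, so one plausible form is `read_tsv()`. The sketch below assumes that function and uses a two-row inline sample (`tsv_text`, with only three of the eight columns) in place of the real survey file:

```r
library(readr)

# Hypothetical inline sample; "\t" separates the fields
tsv_text <- "respondent_id\tcountry\tage\n1\tNetherlands\t24\n2\tGermany\t35\n"

# read_tsv() is shorthand for read_delim(..., delim = "\t")
tab_sample <- read_tsv(I(tsv_text), show_col_types = FALSE)
tab_sample
```

Wrapping the string in `I()` tells readr to treat it as literal data rather than as a file path.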
# Space-delimited text file
space_data <- read_table(here("tutorials", "census.txt"))

# Fixed-width format (columns defined by position)
fwf_data <- read_fwf(
  here("tutorials", "legacy.txt"),
  fwf_widths(c(10, 5, 8), col_names = c("id", "year", "value"))
)
Rows: 11 Columns: 3
── Column specification ────────────────────────────────────────────────────────
chr (3): id, year, value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(fwf_data, 5)
# A tibble: 5 × 3
id year value
<chr> <chr> <chr>
1 id year value
2 NL_GDP 1950 12345.67
3 DE_GDP 1950 23456.78
4 BE_GDP 1950 3456.78
5 FR_GDP 1950 45678.90
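In the output above, the header line of the file was read as the first data row, which also forces every column to character. One way to avoid this, sketched under the assumption that the file has exactly one header line, is to pass `skip = 1` to `read_fwf()`. The inline text below is built with `sprintf()` to match the widths 10, 5 and 8 and stands in for legacy.txt; `fwf_lines`, `fwf_text` and `fwf_clean` are illustrative names.

```r
library(readr)

# Build fixed-width lines: 10 characters for id, 5 for year, 8 for value
fwf_lines <- sprintf(
  "%-10s%-5s%-8s",
  c("id", "NL_GDP", "DE_GDP"),
  c("year", "1950", "1950"),
  c("value", "12345.67", "23456.78")
)
fwf_text <- paste0(paste(fwf_lines, collapse = "\n"), "\n")

fwf_clean <- read_fwf(
  I(fwf_text),
  fwf_widths(c(10, 5, 8), col_names = c("id", "year", "value")),
  skip = 1,               # drop the header line so it is not read as data
  show_col_types = FALSE
)
fwf_clean
```

With the header line skipped, readr can guess numeric types for `year` and `value` instead of falling back to character.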
D. Combining Syntax with Data Import
# Read data using base R's read.csv()
gdp_data <- read.csv(here("tutorials", "gdp.csv"), stringsAsFactors = FALSE)

# Filter rows: keep only Netherlands, Belgium, and Germany
gdp_filtered <- gdp_data[is.element(gdp_data$country, c("Netherlands", "Belgium", "Germany")), ]

# Group by country and calculate mean GDP per capita
gdp_summary <- aggregate(
  x = gdp_filtered$gdp_per_capita,           # Variable to summarize
  by = list(country = gdp_filtered$country), # Grouping variable
  FUN = mean,                                # Summary function
  na.rm = TRUE                               # Ignore missing values
)

# Rename the output column for clarity
colnames(gdp_summary) <- c("country", "avg_gdp")

# Preview the result
gdp_summary
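For comparison, the same filter, group and summarise steps can be written with dplyr, which is loaded by `library(tidyverse)`. This is a minimal sketch; `gdp_toy` and its values are made up for illustration and are not taken from gdp.csv.

```r
library(dplyr)

# Made-up illustrative data (not the contents of gdp.csv)
gdp_toy <- data.frame(
  country = c("Netherlands", "Netherlands", "Belgium", "Germany", "Germany", "France"),
  gdp_per_capita = c(50000, 52000, 45000, 48000, NA, 40000)
)

gdp_summary_tidy <- gdp_toy |>
  filter(country %in% c("Netherlands", "Belgium", "Germany")) |> # keep three countries
  group_by(country) |>                                            # one group per country
  summarise(avg_gdp = mean(gdp_per_capita, na.rm = TRUE))         # mean, ignoring NA

gdp_summary_tidy
```

The base R version needs a separate renaming step after `aggregate()`, whereas `summarise()` names the output column directly.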
Complete the following exercises in your R environment. For questions that require reading files, sample datasets can be downloaded from the course website.
R Syntax Fundamentals
Create a numeric vector called inflation_rates containing these values: 2.1, 3.4, 1.8, 4.2, 2.9. What is the result of inflation_rates[3]?
Create a character vector called countries with elements: “Netherlands”, “Germany”, “France”, “Italy”. Use negative indexing to return all countries except “Germany” (the second element). Then confirm you get the same result using positive indexing. When would negative indexing be more convenient than positive indexing?
Given the vector x <- c(10, 20, 30, 40, 50), what does x[c(2, 4)] return? What does x[x > 25] return?
Create a data frame called cities with three columns:
If your working directory is my_project/scripts/, write the relative path to access gdp.csv. Then write the equivalent here() call (assuming my_project is your Positron project root).
Using the here package approach, what single command would reliably read gdp.csv regardless of your current working directory (assuming you’ve opened my_project as your Positron project)?
What does the command list.files("../data") do when your working directory is my_project/scripts/?
Why is using here("data", "filename.csv") generally safer than "./data/filename.csv" in scripts?
Write R code to:
Check your current working directory
List all .csv files in your project’s data/ folder
Change your working directory to the project root (without hardcoding the full path)
Reading CSV Files
Download the students.csv file from the course website and put it into a folder named tutorials in your working directory. Read the file using read_delim(). How many rows and columns does the resulting data frame have?
The same students.csv file is available online at https://basm92.quarto.pub/intro-to-applied-data-science/tutorials/students.csv. Write the code to read this remote file directly into R without downloading it first.
When reading a CSV file with read_csv(), you notice that a column containing postal codes (e.g., “1012 AB”) is being converted to numeric. How would you prevent this and keep it as character/text? (Hint: look at the col_types argument.)
What is the key practical difference between read_csv() (from readr) and read.csv() (base R) when importing large datasets? Name at least two advantages of read_csv().
After reading gdp_data <- read_csv(here("tutorials", "gdp.csv")), write code to:
View the first 6 rows
Get the column names
Check the data types of each column
Reading Excel Files
You need to read an Excel file eurostat.xlsx that contains multiple sheets. How would you:
List all available sheet names in the file?
Read the sheet named “Population_2023”?
Hint: Check out ?read_excel.
When reading an Excel file with read_excel(), you notice that date columns are being imported as character strings instead of proper dates. What parameter would you use to specify the correct column type during import?
An Excel file trade_data.xlsx has column headers starting on row 3 (rows 1-2 contain metadata). How would you skip these first two rows when reading the data?
Reading Text Files and Integration
You receive a tab-delimited text file survey_results.txt where missing values are coded as “NA” and “MISSING”. Write code to read this file while treating both codes as missing values (NA).