Tutorial 1

Instructions

  1. Create a new Quarto document (tutorial1.qmd) in a folder designated for this course.[1]
  2. For each question, include:
    • The question number and text
    • Your R code in a code chunk
    • Brief explanation of your approach (for conceptual questions)
  3. Make sure your YAML header (the first lines of your .qmd document) looks approximately as follows:
---
title: Tutorial 1
format: html
author: Your Name And Student No.
---
  4. In the Positron console, install the two packages required to preview a Quarto document: install.packages('knitr') and install.packages('rmarkdown').
  5. Render your document to HTML to verify that all code executes correctly (click “Preview” in Positron).

Part 1: Teacher Demonstration

A. R Syntax Refresher

Let’s start by reviewing core R syntax concepts that form the foundation for data work:

# 1. Creating vectors (ordered collections of values)
numeric_vec <- c(1, 2, 3, 4, 5)
character_vec <- c("Amsterdam", "Rotterdam", "Utrecht")
logical_vec <- c(TRUE, FALSE, TRUE)

# 2. Accessing elements with indexing (R uses 1-based indexing!)
numeric_vec[1]      # First element: 1
[1] 1
character_vec[2:3]  # Second and third elements
[1] "Rotterdam" "Utrecht"  
logical_vec[-1]     # All elements EXCEPT the first
[1] FALSE  TRUE
logical_vec[-c(1, 2)]  # All elements EXCEPT the first and second
[1] TRUE
# 3. Creating data frames (rectangular/tidy data structures)
students <- data.frame(
  id = 1:3,
  name = c("Maria", "Anton", "Sophie"),
  gpa = c(3.7, 3.9, 3.5),
  has_job = c(TRUE, FALSE, TRUE)
)

# 4. Accessing columns in different ways
students$name        # Using $ operator
[1] "Maria"  "Anton"  "Sophie"
students[["gpa"]]    # Using double brackets
[1] 3.7 3.9 3.5
students[, "has_job"] # Using matrix-style indexing
[1]  TRUE FALSE  TRUE
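
The bracket syntax above also accepts logical vectors, which is how rows or elements are usually selected by condition. The following is a minimal sketch that reuses the students data frame from the demonstration; the names high_gpa, names_high_gpa and employed are only illustrative.

```r
# Recreate the data frame from the demonstration so this block runs on its own
students <- data.frame(
  id = 1:3,
  name = c("Maria", "Anton", "Sophie"),
  gpa = c(3.7, 3.9, 3.5),
  has_job = c(TRUE, FALSE, TRUE)
)

# A comparison returns a logical vector with one TRUE/FALSE per element
high_gpa <- students$gpa > 3.6

# Inside [ ], a logical vector keeps only the positions that are TRUE
names_high_gpa <- students$name[high_gpa]
names_high_gpa

# Data frames take [rows, columns]: rows where has_job is TRUE, two columns
employed <- students[students$has_job, c("name", "gpa")]
employed
```

Here names_high_gpa holds the names of Maria and Anton, because only their GPAs exceed 3.6.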

B. Working with File Paths and Directories

# Check current working directory
getwd()
[1] "/home/bas/Documents/git/iads_website/tutorials"
# List files in current directory
list.files(".")
 [1] "census.txt"          "eurostat.xlsx"       "gdp.csv"            
 [4] "gdp.xlsx"            "legacy.txt"          "population.xlsx"    
 [7] "students.csv"        "survey_results.txt"  "survey.txt"         
[10] "trade.xlsx"          "tutorial1_files"     "tutorial1.qmd"      
[13] "tutorial1.rmarkdown" "tutorial2.html"      "tutorial2.qmd"      
[16] "tutorial3.html"      "tutorial3.qmd"       "tutorial4.html"     
[19] "tutorial4.qmd"       "tutorial5_files"     "tutorial5.html"     
[22] "tutorial5.qmd"       "tutorial6_files"     "tutorial6.html"     
[25] "tutorial6.qmd"       "tutorial7.html"      "tutorial7.qmd"      
[28] "tutorial8_files"     "tutorial8.html"      "tutorial8.qmd"      
# Using the here package for robust path handling (always starts from project root)
library(here)
here() starts at /home/bas/Documents/git/iads_website
here("tutorials", "students.csv")  # Returns the full path regardless of the current working directory
[1] "/home/bas/Documents/git/iads_website/tutorials/students.csv"
# Relative paths demonstration
# If the working directory is /Users/bas/project/
"./tutorials/students.csv"   # Same as /Users/bas/project/tutorials/students.csv
[1] "./tutorials/students.csv"
"../other_project/data.csv" # Moves up one level, then into other_project
[1] "../other_project/data.csv"
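
Typing a relative path as a string only prints the string; the path is resolved when a function actually uses it. The sketch below illustrates how the same relative path succeeds or fails depending on the working directory. It builds a throwaway folder under tempdir(), and the folder name demo_project is hypothetical.

```r
# Create a throwaway project folder so the example does not touch real files
project <- file.path(tempdir(), "demo_project")
dir.create(file.path(project, "tutorials"), recursive = TRUE, showWarnings = FALSE)
writeLines("id;name", file.path(project, "tutorials", "students.csv"))

old_wd <- setwd(project)   # setwd() invisibly returns the previous directory

# Resolved against demo_project/, so the file is found
rel_from_root <- file.exists("./tutorials/students.csv")

setwd("tutorials")         # move one level down

# Now resolved against demo_project/tutorials/, so the file is NOT found
rel_from_sub <- file.exists("./tutorials/students.csv")

# ".." climbs back up to demo_project/ first, so the file is found again
up_one_level <- file.exists("../tutorials/students.csv")

setwd(old_wd)              # restore the original working directory
c(rel_from_root, rel_from_sub, up_one_level)
```

This dependence on the working directory is what here() avoids, since it always builds the path from the project root.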

C. Reading Different File Formats

For this demonstration, consider this project structure:

project_root/
└── tutorials/
    ├── students.csv
    ├── gdp.csv
    ├── gdp.xlsx
    ├── eurostat.xlsx
    ├── population.xlsx
    ├── trade.xlsx
    ├── census.txt
    ├── survey.txt
    └── legacy.txt

Reading CSV files (local and URL)

# Install required packages ONCE in console (not in scripts or .qmd file!)
# install.packages(c("readr", "readxl", "tidyverse"))

library(readr)    # Modern CSV reader (part of tidyverse)
library(tidyverse) # Includes readr plus data manipulation tools

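# Method 1: Local CSV file using readr (fast and modern)
# Note: students.csv uses ';' as delimiter and ',' as decimal mark. With the
# default locale, readr treats ',' as a grouping mark, so gpa is read as 37
# instead of 3.7 (see the output below).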
students_local <- read_delim(here("tutorials", "students.csv"))
students_local
# A tibble: 3 × 4
     id name     gpa has_job
  <dbl> <chr>  <dbl> <lgl>  
1     1 Maria     37 TRUE   
2     2 Anton     39 FALSE  
3     3 Sophie    35 TRUE   
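# Method 2: CSV from URL
# Note: read_csv() expects commas as delimiters, so this semicolon-delimited
# file ends up in a single column (see the output below).
students_url <- read_csv("https://basm92.quarto.pub/intro-to-applied-data-science/tutorials/students.csv")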
students_url
# A tibble: 3 × 1
  `id;name;gpa;has_job`
  <chr>                
1 1;Maria;3,7;TRUE     
2 2;Anton;3,9;FALSE    
3 3;Sophie;3,5;TRUE    
# Method 3: Base R alternative (slower, less informative output)
# Note: without dec = ",", the gpa column is kept as text ("3,7").
students_base <- read.csv(here("tutorials", "students.csv"), stringsAsFactors = FALSE, sep = ";")
students_base
  id   name gpa has_job
1  1  Maria 3,7    TRUE
2  2  Anton 3,9   FALSE
3  3 Sophie 3,5    TRUE
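
None of the three outputs above contains gpa as the number 3.7, because the file combines semicolon delimiters with decimal commas. One possible way to handle this in readr is sketched below. To stay self-contained it writes the same four lines shown in the Method 2 output to a temporary file; the object names students_a and students_b are only illustrative.

```r
library(readr)

# Write a small semicolon-delimited file with decimal commas to a temp location
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id;name;gpa;has_job",
  "1;Maria;3,7;TRUE",
  "2;Anton;3,9;FALSE",
  "3;Sophie;3,5;TRUE"
), tmp)

# Option A: read_csv2() assumes ';' as delimiter and ',' as decimal mark
students_a <- read_csv2(tmp, show_col_types = FALSE)

# Option B: read_delim() with the delimiter and decimal mark spelled out
students_b <- read_delim(
  tmp,
  delim = ";",
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)

students_a$gpa
```

Both options should give a numeric gpa column with the values 3.7, 3.9 and 3.5. The base R counterpart would be read.csv2(), or read.csv() with sep = ";" and dec = ",".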

Reading Excel (.xlsx) files

library(readxl)

# Read first sheet
eurostat_data <- read_excel(here("tutorials", "eurostat.xlsx"))

# Read specific sheet by name or position
trade_data <- read_excel(here("tutorials", "trade.xlsx"), sheet = "2023")
population_data <- read_excel(here("tutorials", "population.xlsx"), sheet = 2)

# Specify column types explicitly (prevents automatic type guessing errors)
gdp_data <- read_excel(
  here("tutorials", "gdp.xlsx"),
  col_types = c("text", "numeric", "numeric", "date")
)
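
If the course files are not at hand, the sheet argument can be tried out on the sample workbook that ships with readxl. This is a small sketch rather than part of the course material: readxl_example() returns the path to a bundled file, and the code assumes that the sheet named "mtcars" is the second sheet in datasets.xlsx.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook shipped with readxl
path <- readxl_example("datasets.xlsx")

# Read the same sheet once by name and once by position
by_name <- read_excel(path, sheet = "mtcars")
by_position <- read_excel(path, sheet = 2)

# TRUE if both calls returned the same data
same_sheet <- identical(by_name, by_position)
same_sheet
```

Selecting by name is usually the more robust choice, since a position silently points at different data when sheets are reordered.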

Reading text (.txt) files

# Tab-delimited text file (common format)
tab_data <- read_delim(
  here("tutorials", "survey.txt"),
  delim = "\t"  # tab delimiter
)
Rows: 15 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): country, education_level, income_bracket, supports_ubi
dbl (4): respondent_id, age, trust_in_government, inflation_concern

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tab_data, 5)
# A tibble: 5 × 8
  respondent_id country       age education_level income_bracket supports_ubi
          <dbl> <chr>       <dbl> <chr>           <chr>          <chr>       
1             1 Netherlands    24 Bachelor        40-60k         Yes         
2             2 Germany        35 Master          80-100k        No          
3             3 Belgium        29 PhD             60-80k         Yes         
4             4 France         42 Bachelor        60-80k         No          
5             5 Netherlands    19 High School     20-40k         Yes         
# ℹ 2 more variables: trust_in_government <dbl>, inflation_concern <dbl>
# Space-delimited text file
space_data <- read_table(here("tutorials", "census.txt"))

── Column specification ────────────────────────────────────────────────────────
cols(
  region = col_character(),
  pop_2020 = col_double(),
  pop_2023 = col_double(),
  density = col_double(),
  urban_pct = col_double()
)
head(space_data, 5)
# A tibble: 5 × 5
  region pop_2020 pop_2023 density urban_pct
  <chr>     <dbl>    <dbl>   <dbl>     <dbl>
1 NL_NH   2830431  2924000    1002      91.2
2 NL_ZH   3649993  3753000    1234      93.5
3 NL_UT   1323707  1334000     958      89.7
4 DE_BY  13124737 13200000     185      75.3
5 DE_NW  17947221 18000000     525      78.9
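# Fixed-width format (columns defined by position)
# Note: legacy.txt starts with a header line. Without skip = 1 that line is
# read as data, and every column becomes character (see the output below).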
fwf_data <- read_fwf(
  here("tutorials", "legacy.txt"),
  fwf_widths(c(10, 5, 8), col_names = c("id", "year", "value"))
)
Rows: 11 Columns: 3
── Column specification ────────────────────────────────────────────────────────

chr (3): id, year, value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(fwf_data, 5)
# A tibble: 5 × 3
  id     year  value   
  <chr>  <chr> <chr>   
1 id     year  value   
2 NL_GDP 1950  12345.67
3 DE_GDP 1950  23456.78
4 BE_GDP 1950  3456.78 
5 FR_GDP 1950  45678.90
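
In the output above, the header line of the file appears as the first data row, which forces all three columns to character. A possible fix is to skip that line, as sketched below. The temporary file is a hypothetical stand-in for legacy.txt, built with the same column widths (10, 5 and 8), and fwf_clean is an illustrative name.

```r
library(readr)

# Hypothetical stand-in for legacy.txt: a header line plus two data lines,
# each field padded to the column widths 10, 5 and 8
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  sprintf("%-10s%-5s%-8s", "id", "year", "value"),
  sprintf("%-10s%-5s%-8s", "NL_GDP", "1950", "12345.67"),
  sprintf("%-10s%-5s%-8s", "DE_GDP", "1950", "23456.78")
), tmp)

fwf_clean <- read_fwf(
  tmp,
  fwf_widths(c(10, 5, 8), col_names = c("id", "year", "value")),
  skip = 1,                 # drop the header line so it is not read as data
  show_col_types = FALSE
)

# Column classes after import
sapply(fwf_clean, class)
```

With the header skipped, readr can guess year and value as numeric columns, while id stays character.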

D. Combining Syntax with Data Import

# Read data using base R's read.csv()
gdp_data <- read.csv(here("tutorials", "gdp.csv"), stringsAsFactors = FALSE)

# Filter rows: keep only Netherlands, Belgium, and Germany
gdp_filtered <- gdp_data[is.element(gdp_data$country, c("Netherlands", "Belgium", "Germany")), ]

# Group by country and calculate mean GDP per capita
gdp_summary <- aggregate(
  x = gdp_filtered$gdp_per_capita,      # Variable to summarize
  by = list(country = gdp_filtered$country),  # Grouping variable
  FUN = mean,                           # Summary function
  na.rm = TRUE                          # Ignore missing values
)

# Rename the output column for clarity
colnames(gdp_summary) <- c("country", "avg_gdp")

# Preview the result
gdp_summary
      country avg_gdp
1     Belgium   49525
2     Germany   47500
3 Netherlands   54780
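
For comparison, the same filter, group and summarise steps can be written with dplyr, which is loaded as part of the tidyverse. This is only a sketch of an alternative, not the approach used in the demonstration. The data frame gdp_toy is a hypothetical stand-in for gdp.csv with made-up values, and the native pipe |> requires R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for gdp.csv (values are made up)
gdp_toy <- data.frame(
  country = c("Netherlands", "Netherlands", "Belgium", "Belgium", "France"),
  gdp_per_capita = c(50000, 52000, 46000, 48000, 44000)
)

gdp_summary_tidy <- gdp_toy |>
  # Keep only the three countries of interest (Germany is absent in the toy data)
  filter(country %in% c("Netherlands", "Belgium", "Germany")) |>
  # One group per country
  group_by(country) |>
  # One row per group with the mean GDP per capita
  summarise(avg_gdp = mean(gdp_per_capita, na.rm = TRUE))

gdp_summary_tidy
```

The result has one row per remaining country with its mean in avg_gdp, so the rename step needed after aggregate() is not required.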

Part 2: Student Practice Questions

Complete the following exercises in your R environment. For questions that require reading files, sample datasets can be downloaded from the course website.

R Syntax Fundamentals

  1. Create a numeric vector called inflation_rates containing these values: 2.1, 3.4, 1.8, 4.2, 2.9. What is the result of inflation_rates[3]?

  2. Create a character vector called countries with elements: “Netherlands”, “Germany”, “France”, “Italy”. Use negative indexing to return all countries except “Germany” (the second element). Then confirm you get the same result using positive indexing. When would negative indexing be more convenient than positive indexing?

  3. Given the vector x <- c(10, 20, 30, 40, 50), what does x[c(2, 4)] return? What does x[x > 25] return?

  4. Create a data frame called cities with three columns:

    • name: “Amsterdam”, “Rotterdam”, “The Hague”
    • population: 872680, 651406, 545838
    • province: “Noord-Holland”, “Zuid-Holland”, “Zuid-Holland”

    How would you access the population of Rotterdam using three different methods?

  5. Explain the difference between these three expressions when applied to a data frame df with columns x and y:

    • df$x
    • df[["x"]]
    • df[, "x"]

    Try running all three on your cities data frame. Is the output always identical?

Working Directories and File Paths

  1. Your project folder structure looks like this:

    my_project/
    ├── data/
    │   └── gdp.csv
    ├── scripts/
    │   └── analysis.R
    └── report.qmd

    If your working directory is my_project/scripts/, write the relative path to access gdp.csv. Then write the equivalent here() call (assuming my_project is your Positron project root).

  2. Using the here package approach, what single command would reliably read gdp.csv regardless of your current working directory (assuming you’ve opened my_project as your Positron project)?

  3. What does the command list.files("../data") do when your working directory is my_project/scripts/?

  4. Why is using here("data", "filename.csv") generally safer than "./data/filename.csv" in scripts?

  5. Write R code to:

    • Check your current working directory
    • List all .csv files in your project’s data/ folder
    • Change your working directory to the project root (without hardcoding the full path)

Reading CSV Files

  1. Download the students.csv file here and put it into a folder named tutorials in your working directory. Read the file using read_delim(). How many rows and columns does the resulting data frame have?

  2. The same students.csv file is available online at https://basm92.quarto.pub/intro-to-applied-data-science/tutorials/students.csv. Write the code to read this remote file directly into R without downloading it first.

  3. When reading a CSV file with read_csv(), you notice that a column containing postal codes (e.g., “1012 AB”) is being converted to numeric. How would you prevent this and keep it as character/text? (Hint: look at the col_types argument.)

  4. What is the key practical difference between read_csv() (from readr) and read.csv() (base R) when importing large datasets? Name at least two advantages of read_csv().

  5. After reading gdp_data <- read_csv(here("tutorials", "gdp.csv")),[2] write code to:

    • View the first 6 rows
    • Get the column names
    • Check the data types of each column

Reading Excel Files

  1. You need to read an Excel file eurostat.xlsx that contains multiple sheets.[3] How would you:
    • List all available sheet names in the file?
    • Read the sheet named “Population_2023”?

Hint: Check out ?read_excel.

  2. When reading an Excel file with read_excel(), you notice that date columns are being imported as character strings instead of proper dates. What parameter would you use to specify the correct column type during import?

  3. An Excel file trade_data.xlsx has column headers starting on row 3 (rows 1-2 contain metadata). How would you skip these first two rows when reading the data?

Reading Text Files and Integration

  1. You receive a tab-delimited text file survey_results.txt where missing values are coded as “NA” and “MISSING”.[4] Write code to read this file while treating both codes as missing values (NA).

  2. Challenge question: Combine multiple skills:

    • Read gdp.csv from your data folder[5]
    • Filter to keep only observations with GDP per capita > 40000
    • Calculate the average GDP per capita for these countries[6]
    • Store the result in an object called high_income_avg

Footnotes

  1. File > New File > Quarto Document.

  2. To be downloaded here.

  3. The file is available here.

  4. The file is available here.

  5. Again, to be downloaded here.

  6. Check out the aggregate function for this, as in the teacher example.