Create a new Quarto document (tutorial8.qmd) in a folder designated for this course.1
For each question, include:
The question number and text
Your R code in a code chunk
Brief explanation of your approach (for conceptual questions)
Make sure your YAML-header (first lines of your .qmd document) look as approximately as follows:
---
title: Tutorial 8
format: html
author: Your Name And Student No.
---
Render your document to HTML to verify all code executes correctly (click on “Preview” in Positron.)
Part 1: Teacher Demonstration
A. Loading and Exploring Vector Spatial Data
We begin with North Carolina county data included in the sf package—a classic dataset for spatial analysis. This demonstrates how spatial objects combine geometry with attribute data.
# Visualize births in 1974ggplot(nc) +geom_sf(aes(fill = BIR74)) +scale_fill_viridis_c(option ="plasma", name ="Births") +theme_minimal() +labs(title ="Births per County in North Carolina (1974)")
Key points:
Spatial data frames contain a special geometry column alongside regular attributes
CRS (Coordinate Reference System) defines how coordinates map to Earth’s surface
Standard dplyr verbs (select(), filter()) work seamlessly with spatial objects
geom_sf() in ggplot2 handles spatial visualization automatically
B. Coordinate Reference Systems and Geometric Operations
CRS management is critical for accurate distance/area calculations. We’ll transform coordinates and create buffers (zones of influence).
# Check current CRS (geographic coordinates in degrees)st_crs(nc)$input
[1] "NAD27"
# Transform to projected CRS for accurate distance calculations (meters)# EPSG:32119 = NAD83 / North Carolina (meters)nc_proj <-st_transform(nc, 32119)print(paste("New CRS:", st_crs(nc_proj)$input))
[1] "New CRS: EPSG:32119"
# Calculate area in square kilometers (requires projected CRS)nc_proj <- nc_proj |>mutate(area_km2 =as.numeric(st_area(geometry)) /1e6)# Create 20km buffer around Wake County (Raleigh)wake <- nc_proj |>filter(NAME =="Wake")wake_buffer <-st_buffer(wake, dist =20000) # 20km = 20,000 meters# Visualize bufferggplot() +geom_sf(data = nc_proj, fill ="lightgray") +geom_sf(data = wake, fill ="steelblue", alpha =0.7) +geom_sf(data = wake_buffer, fill =NA, color ="red", size =1.2) +labs(title ="20km Buffer Zone Around Wake County",subtitle ="Buffer shown in red; county boundary in blue") +theme_minimal()
Always transform to projected CRS before calculating distances/areas/buffers
st_buffer() creates zones of influence—critical for market area analysis
Buffers require distance units matching the CRS (meters for projected systems)
C. Spatial Topology and Neighborhood Analysis
Spatial topology identifies relationships between features (e.g., adjacency). We’ll find counties bordering the highest-birth county.
# Identify county with highest births in 1974highest_birth <- nc_proj |>slice_max(order_by = BIR74, n =1)print(paste("Highest birth county:", highest_birth$NAME))
[1] "Highest birth county: Mecklenburg"
# Find adjacent counties using st_touches()# Returns logical matrix: TRUE where geometries share boundaryadjacent_matrix <-st_touches(highest_birth, nc_proj, sparse =FALSE)# Extract adjacent countiesneighbors <- nc_proj[adjacent_matrix, ]# Visualize resultsggplot() +geom_sf(data = nc_proj, fill ="lightgray", color ="white") +geom_sf(data = highest_birth, fill ="darkgreen", alpha =0.8) +geom_sf(data = neighbors, fill ="gold", alpha =0.6) +geom_sf_label(data = highest_birth, aes(label = NAME), size =3) +labs(title ="Counties Adjacent to Highest-Birth County (Guilford)",subtitle ="Guilford County (green) and its 8 bordering neighbors (gold)") +theme_minimal()
Key points:
st_touches() identifies features sharing boundaries (adjacency)
Spatial predicates return sparse matrices by default (efficient for large datasets)
Topology enables analysis of spatial spillovers (e.g., policy diffusion across borders)
Visualization clarifies spatial relationships for interpretation
D. Spatial Joins and Real-World Application
Spatial joins link datasets without shared IDs by leveraging geometry. We’ll simulate retail stores and join them to county attributes.
# Simulate 50 retail store locations (points) across NCset.seed(123)stores <-st_sample(st_union(nc_proj), size =50) |>st_as_sf() |>mutate(store_id =1:50,revenue =runif(50, 50, 300)) # Simulated daily revenue ($k)# Spatial join: assign each store to its containing countystores_with_county <-st_join(stores, nc_proj)# Analyze revenue by birth rate quartilestores_with_county <- stores_with_county |>mutate(birth_quartile =ntile(BIR74, 4))# Compare average revenue across birth rate quartilesrevenue_by_quartile <- stores_with_county |>st_drop_geometry() |># Remove geometry for standard summarizationgroup_by(birth_quartile) |>summarise(avg_revenue =mean(revenue),n_stores =n())print(revenue_by_quartile)
# Visualize relationshipggplot(stores_with_county) +geom_sf(data = nc_proj, fill ="white", color ="gray80") +geom_sf(aes(color = revenue, size = revenue), alpha =0.7) +scale_color_viridis_c(name ="Daily Revenue ($k)") +scale_size_continuous(range =c(2, 6)) +labs(title ="Simulated Retail Stores Across NC Counties",subtitle ="Store size/color reflect revenue; spatial join linked stores to county attributes") +theme_minimal()
Key points:
st_join() performs attribute transfer based on spatial relationships (default: intersects)
Critical for economics when datasets lack common identifiers (e.g., firms → districts)
Enables analysis of spatial exposure (e.g., “Do stores in high-birth counties earn more?”)
Geometry must be dropped before standard dplyr aggregation
Part 2: Student Practice Questions
Data Fundamentals & CRS Management
Load the nc dataset. What is the difference between st_crs(nc)$epsg and st_crs(nc)$proj4string? Transform the dataset to EPSG:4326 and EPSG:32119. Which transformation changes the shape of counties more dramatically on a map? Explain why.
Calculate the area of Buncombe County (NAME = “Buncombe”) in km² using:
The original CRS (without transformation)
EPSG:32119 (projected) Why do results differ? Which is correct for economic analysis requiring area-weighted variables?
Create two copies of nc:
nc_a transformed to EPSG:32119
nc_b transformed to EPSG:2264 (another NC projection)
Attempt to do st_intersection(nc_a[1,], nc_b[1,]). What error occurs? Fix it and compute the intersection area.
Vector Operations – Buffers, Centroids & Topology
Create a 25km buffer around the county with the highest 1979 birth rate (BIR79). Identify all counties whose centroids fall within this buffer (not just geometric intersection). How many counties qualify? Visualize with buffer (semi-transparent red), high-birth county (blue), and qualifying centroids (gold points).
Building on Q4:
Find all counties that touch (share a border with) the high-birth county
Create a 10km buffer inside the high-birth county boundary (use negative distance: st_buffer(..., dist = -10000))
Calculate what percentage of the high-birth county’s area lies within this “inner buffer” Economic interpretation: How might this represent a “core economic zone” versus peripheral areas?
Calculate centroids for all NC counties. For each county, find its nearest neighbor by centroid distance (excluding itself). Which county pair has the smallest centroid-to-centroid distance? Which has the largest? Plot all counties colored by distance to nearest neighbor.
For Wake County:
Create concentric buffers at 10km, 25km, and 50km
Calculate the population (BIR74) residing in each ring (Ring 1: 0–10km, Ring 2: 10–25km, etc.)
Visualize as a choropleth where ring membership determines color intensity Economic application: Modeling urban decay or agglomeration effects across distance bands.
Identify counties that are completely disjoint (no border contact) with counties having above-median SID74 (SIDS deaths). What geographic pattern emerges? Hypothesize an economic mechanism that could explain health outcome clustering.
Spatial Joins & Real-World Applications
Simulate 200 “pollution sources” as random points across NC. Create 5 circular pollution zones (buffers of varying radii: 5–30km) around randomly selected sources. Which counties intersect multiple pollution zones? Calculate total “exposure score” per county = sum of buffer areas overlapping the county.
Using the rnaturalearth library, retrieve two spatial datasets:
US state boundaries via ne_states(country = "United States of America", returnclass = "sf")
US populated places via ne_download(scale = 110, type = "populated_places", category = "cultural", returnclass = "sf")
Then filter the populated places to those inside the US, perform a spatial join to assign each city to its state, and answer the following:
Which state contains the most populated places in this dataset?
Calculate the average latitude of populated places per state2 and display the results sorted from northernmost to southernmost.
Simulate 150 firm headquarters as random points in NC counties. Assign each firm:
Revenue ~ Uniform($1M, $50M)
Industry (manufacturing/retail/services) with probabilities 0.3/0.4/0.3 Join firms to counties. Test whether manufacturing firms locate in counties with larger land area (st_area(geometry)) using a simple correlation.
Identify all county pairs that touch using st_touches(nc, nc). For each touching pair, calculate:
Absolute difference in BIR74 (births)
Absolute difference in NWBIR74 (non-white births) Which border has the largest demographic discontinuity? Economic relevance: Spatial RD designs often exploit such discontinuities.
Extract and plot Band 4 (near-infrared) separately. Why is this band valuable for vegetation analysis in agricultural economics?
Using st_buffer() and st_point(), create a 500m-radius circle centered at coordinates (294000, 9116000) in the Landsat CRS. Crop the Landsat image to this circle. Calculate mean reflectance per band within the cropped area.
Subset nc to 5 randomly selected counties
Transform to match Landsat CRS (st_transform(nc_sample, st_crs(landsat)))
For Band 1 of Landsat, calculate mean reflectance per county using aggregate()
Join results back to county attributes and map using ggplot2
Advanced Integration & Economics Applications
Using geodata (install if needed):
library(geodata)
Loading required package: terra
terra 1.8.80
temp <-worldclim_country(country ="Netherlands", var ="tavg", path =tempdir())
Convert to stars object and extract January 2020 temperature
Get Dutch province boundaries via rnaturalearth::ne_states("Netherlands")
Calculate mean January temperature per province
Economic question: How would you link this to provincial GDP data to test climate-economy relationships?
For NC counties:
Create a binary contiguity weights matrix where (w_{ij} = 1) if counties (i) and (j) share a border
Convert to row-standardized weights ((j w{ij} = 1))
Plot original vs. lagged births. What does a high lag value imply economically?
Use tidygeocoder to geocode:
“Radboud University, Nijmegen”
“Erasmus University Rotterdam” Convert results to sf points. Calculate straight-line distance between them in kilometers (transform to appropriate UTM zone first). Buffer each point by 2km and identify overlapping area.