Tutorial 8

Instructions

  1. Create a new Quarto document (tutorial8.qmd) in a folder designated for this course.1
  2. For each question, include:
    • The question number and text
    • Your R code in a code chunk
    • Brief explanation of your approach (for conceptual questions)
  3. Make sure your YAML-header (first lines of your .qmd document) look as approximately as follows:
---
title: Tutorial 8
format: html
author: Your Name And Student No.
---
  1. Render your document to HTML to verify all code executes correctly (click on “Preview” in Positron.)

Part 1: Teacher Demonstration

A. Loading and Exploring Vector Spatial Data

We begin with North Carolina county data included in the sf package—a classic dataset for spatial analysis. This demonstrates how spatial objects combine geometry with attribute data.

# Load required packages
library(sf)
library(dplyr)
library(ggplot2)

# Load built-in NC counties dataset (polygons)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Inspect structure: geometry type, CRS, and attributes
print(paste("Geometry type:", st_geometry_type(nc)[1]))
[1] "Geometry type: MULTIPOLYGON"
print(paste("CRS:", st_crs(nc)$input))
[1] "CRS: NAD27"
print(paste("Number of features:", nrow(nc)))
[1] "Number of features: 100"
print("Available attributes:")
[1] "Available attributes:"
names(nc)
 [1] "AREA"      "PERIMETER" "CNTY_"     "CNTY_ID"   "NAME"      "FIPS"     
 [7] "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"    
[13] "SID79"     "NWBIR79"   "geometry" 
# View first 3 counties with key demographic variables
nc |> 
  select(NAME, BIR74, SID74, AREA) |> 
  head(3)
Simple feature collection with 3 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -81.74107 ymin: 36.23388 xmax: -80.43531 ymax: 36.58965
Geodetic CRS:  NAD27
       NAME BIR74 SID74  AREA                       geometry
1      Ashe  1091     1 0.114 MULTIPOLYGON (((-81.47276 3...
2 Alleghany   487     0 0.061 MULTIPOLYGON (((-81.23989 3...
3     Surry  3188     5 0.143 MULTIPOLYGON (((-80.45634 3...
# Visualize births in 1974
ggplot(nc) +
  geom_sf(aes(fill = BIR74)) +
  scale_fill_viridis_c(option = "plasma", name = "Births") +
  theme_minimal() +
  labs(title = "Births per County in North Carolina (1974)")

Key points:

  • Spatial data frames contain a special geometry column alongside regular attributes
  • CRS (Coordinate Reference System) defines how coordinates map to Earth’s surface
  • Standard dplyr verbs (select(), filter()) work seamlessly with spatial objects
  • geom_sf() in ggplot2 handles spatial visualization automatically

B. Coordinate Reference Systems and Geometric Operations

CRS management is critical for accurate distance/area calculations. We’ll transform coordinates and create buffers (zones of influence).

# Check current CRS (geographic coordinates in degrees)
st_crs(nc)$input
[1] "NAD27"
# Transform to projected CRS for accurate distance calculations (meters)
# EPSG:32119 = NAD83 / North Carolina (meters)
nc_proj <- st_transform(nc, 32119)
print(paste("New CRS:", st_crs(nc_proj)$input))
[1] "New CRS: EPSG:32119"
# Calculate area in square kilometers (requires projected CRS)
nc_proj <- nc_proj |> 
  mutate(area_km2 = as.numeric(st_area(geometry)) / 1e6)

# Create 20km buffer around Wake County (Raleigh)
wake <- nc_proj |> filter(NAME == "Wake")
wake_buffer <- st_buffer(wake, dist = 20000)  # 20km = 20,000 meters

# Visualize buffer
ggplot() +
  geom_sf(data = nc_proj, fill = "lightgray") +
  geom_sf(data = wake, fill = "steelblue", alpha = 0.7) +
  geom_sf(data = wake_buffer, fill = NA, color = "red", size = 1.2) +
  labs(title = "20km Buffer Zone Around Wake County",
       subtitle = "Buffer shown in red; county boundary in blue") +
  theme_minimal()

Key points:

  • Geographic CRS (e.g., WGS84/EPSG:4326) uses degrees; projected CRS uses meters/feet
  • Always transform to projected CRS before calculating distances/areas/buffers
  • st_buffer() creates zones of influence—critical for market area analysis
  • Buffers require distance units matching the CRS (meters for projected systems)

C. Spatial Topology and Neighborhood Analysis

Spatial topology identifies relationships between features (e.g., adjacency). We’ll find counties bordering the highest-birth county.

# Identify county with highest births in 1974
highest_birth <- nc_proj |> slice_max(order_by = BIR74, n = 1)
print(paste("Highest birth county:", highest_birth$NAME))
[1] "Highest birth county: Mecklenburg"
# Find adjacent counties using st_touches()
# Returns logical matrix: TRUE where geometries share boundary
adjacent_matrix <- st_touches(highest_birth, nc_proj, sparse = FALSE)

# Extract adjacent counties
neighbors <- nc_proj[adjacent_matrix, ]

# Visualize results
ggplot() +
  geom_sf(data = nc_proj, fill = "lightgray", color = "white") +
  geom_sf(data = highest_birth, fill = "darkgreen", alpha = 0.8) +
  geom_sf(data = neighbors, fill = "gold", alpha = 0.6) +
  geom_sf_label(data = highest_birth, aes(label = NAME), size = 3) +
  labs(title = "Counties Adjacent to Highest-Birth County (Guilford)",
       subtitle = "Guilford County (green) and its 8 bordering neighbors (gold)") +
  theme_minimal()

Key points:

  • st_touches() identifies features sharing boundaries (adjacency)
  • Spatial predicates return sparse matrices by default (efficient for large datasets)
  • Topology enables analysis of spatial spillovers (e.g., policy diffusion across borders)
  • Visualization clarifies spatial relationships for interpretation

D. Spatial Joins and Real-World Application

Spatial joins link datasets without shared IDs by leveraging geometry. We’ll simulate retail stores and join them to county attributes.

# Simulate 50 retail store locations (points) across NC
set.seed(123)
stores <- st_sample(st_union(nc_proj), size = 50) |> 
  st_as_sf() |> 
  mutate(store_id = 1:50,
         revenue = runif(50, 50, 300))  # Simulated daily revenue ($k)

# Spatial join: assign each store to its containing county
stores_with_county <- st_join(stores, nc_proj)

# Analyze revenue by birth rate quartile
stores_with_county <- stores_with_county |> 
  mutate(birth_quartile = ntile(BIR74, 4))

# Compare average revenue across birth rate quartiles
revenue_by_quartile <- stores_with_county |> 
  st_drop_geometry() |>  # Remove geometry for standard summarization
  group_by(birth_quartile) |> 
  summarise(avg_revenue = mean(revenue),
            n_stores = n())

print(revenue_by_quartile)
# A tibble: 4 × 3
  birth_quartile avg_revenue n_stores
           <int>       <dbl>    <int>
1              1        179.       13
2              2        189.       13
3              3        202.       12
4              4        140.       12
# Visualize relationship
ggplot(stores_with_county) +
  geom_sf(data = nc_proj, fill = "white", color = "gray80") +
  geom_sf(aes(color = revenue, size = revenue), alpha = 0.7) +
  scale_color_viridis_c(name = "Daily Revenue ($k)") +
  scale_size_continuous(range = c(2, 6)) +
  labs(title = "Simulated Retail Stores Across NC Counties",
       subtitle = "Store size/color reflect revenue; spatial join linked stores to county attributes") +
  theme_minimal()

Key points:

  • st_join() performs attribute transfer based on spatial relationships (default: intersects)
  • Critical for economics when datasets lack common identifiers (e.g., firms → districts)
  • Enables analysis of spatial exposure (e.g., “Do stores in high-birth counties earn more?”)
  • Geometry must be dropped before standard dplyr aggregation

Part 2: Student Practice Questions

Data Fundamentals & CRS Management

  1. Load the nc dataset. What is the difference between st_crs(nc)$epsg and st_crs(nc)$proj4string? Transform the dataset to EPSG:4326 and EPSG:32119. Which transformation changes the shape of counties more dramatically on a map? Explain why.

  2. Calculate the area of Buncombe County (NAME = “Buncombe”) in km² using:

    • The original CRS (without transformation)
    • EPSG:32119 (projected) Why do results differ? Which is correct for economic analysis requiring area-weighted variables?
  3. Create two copies of nc:

    • nc_a transformed to EPSG:32119
    • nc_b transformed to EPSG:2264 (another NC projection)
    • Attempt to do st_intersection(nc_a[1,], nc_b[1,]). What error occurs? Fix it and compute the intersection area.

Vector Operations – Buffers, Centroids & Topology

  1. Create a 25km buffer around the county with the highest 1979 birth rate (BIR79). Identify all counties whose centroids fall within this buffer (not just geometric intersection). How many counties qualify? Visualize with buffer (semi-transparent red), high-birth county (blue), and qualifying centroids (gold points).

  2. Building on Q4:

    • Find all counties that touch (share a border with) the high-birth county
    • Create a 10km buffer inside the high-birth county boundary (use negative distance: st_buffer(..., dist = -10000))
    • Calculate what percentage of the high-birth county’s area lies within this “inner buffer” Economic interpretation: How might this represent a “core economic zone” versus peripheral areas?
  3. Calculate centroids for all NC counties. For each county, find its nearest neighbor by centroid distance (excluding itself). Which county pair has the smallest centroid-to-centroid distance? Which has the largest? Plot all counties colored by distance to nearest neighbor.

  4. For Wake County:

    • Create concentric buffers at 10km, 25km, and 50km
    • Calculate the population (BIR74) residing in each ring (Ring 1: 0–10km, Ring 2: 10–25km, etc.)
    • Visualize as a choropleth where ring membership determines color intensity Economic application: Modeling urban decay or agglomeration effects across distance bands.
  5. Identify counties that are completely disjoint (no border contact) with counties having above-median SID74 (SIDS deaths). What geographic pattern emerges? Hypothesize an economic mechanism that could explain health outcome clustering.

Spatial Joins & Real-World Applications

  1. Simulate 200 “pollution sources” as random points across NC. Create 5 circular pollution zones (buffers of varying radii: 5–30km) around randomly selected sources. Which counties intersect multiple pollution zones? Calculate total “exposure score” per county = sum of buffer areas overlapping the county.

  2. Using the rnaturalearth library, retrieve two spatial datasets:

  • US state boundaries via ne_states(country = "United States of America", returnclass = "sf")
  • US populated places via ne_download(scale = 110, type = "populated_places", category = "cultural", returnclass = "sf")

Then filter the populated places to those inside the US, perform a spatial join to assign each city to its state, and answer the following:

  1. Which state contains the most populated places in this dataset?

  2. Calculate the average latitude of populated places per state2 and display the results sorted from northernmost to southernmost.

  3. Simulate 150 firm headquarters as random points in NC counties. Assign each firm:

    • Revenue ~ Uniform($1M, $50M)
    • Industry (manufacturing/retail/services) with probabilities 0.3/0.4/0.3 Join firms to counties. Test whether manufacturing firms locate in counties with larger land area (st_area(geometry)) using a simple correlation.
  4. Identify all county pairs that touch using st_touches(nc, nc). For each touching pair, calculate:

    • Absolute difference in BIR74 (births)
    • Absolute difference in NWBIR74 (non-white births) Which border has the largest demographic discontinuity? Economic relevance: Spatial RD designs often exploit such discontinuities.

Raster Data Fundamentals

  1. Load Landsat data:

    library(stars)
    Loading required package: abind
    landsat <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
    • How many bands does it contain?
    • What is the resolution (cell size) in meters?
    • Extract and plot Band 4 (near-infrared) separately. Why is this band valuable for vegetation analysis in agricultural economics?
  2. Using st_buffer() and st_point(), create a 500m-radius circle centered at coordinates (294000, 9116000) in the Landsat CRS. Crop the Landsat image to this circle. Calculate mean reflectance per band within the cropped area.

  3. Subset nc to 5 randomly selected counties

    • Transform to match Landsat CRS (st_transform(nc_sample, st_crs(landsat)))
    • For Band 1 of Landsat, calculate mean reflectance per county using aggregate()
    • Join results back to county attributes and map using ggplot2

Advanced Integration & Economics Applications

  1. Using geodata (install if needed):

    library(geodata)
    Loading required package: terra
    terra 1.8.80
    temp <- worldclim_country(country = "Netherlands", var = "tavg", path = tempdir())
    • Convert to stars object and extract January 2020 temperature
    • Get Dutch province boundaries via rnaturalearth::ne_states("Netherlands")
    • Calculate mean January temperature per province
    • Economic question: How would you link this to provincial GDP data to test climate-economy relationships?
  2. For NC counties:

    • Create a binary contiguity weights matrix where (w_{ij} = 1) if counties (i) and (j) share a border
    • Convert to row-standardized weights ((j w{ij} = 1))
    • Calculate spatially lagged births: ( _i = j w{ij} _j )
    • Plot original vs. lagged births. What does a high lag value imply economically?
  3. Use tidygeocoder to geocode:

    • “Radboud University, Nijmegen”
    • “Erasmus University Rotterdam” Convert results to sf points. Calculate straight-line distance between them in kilometers (transform to appropriate UTM zone first). Buffer each point by 2km and identify overlapping area.
  4. Load climate time-series data:

    climate <- read_stars(system.file("nc/bcsd_obs_1999.nc", package = "stars"), quiet = TRUE)
    • How many time steps does it contain?
    • Extract temperature for January 1, 1999 and January 31, 1999
    • Calculate temperature change per location over January 1999
    • Economics link: How could this data proxy for agricultural shock exposure?
    • Using the nc dataset, designate counties with BIR74 > 80th percentile as “treatment” (received pro-natalist policy)
    • Create 50km buffer around treatment counties = spillover zone
    • Simulate outcome variable: outcome = 100 + 15*treatment + 8*spillover + rnorm(n, 0, 10)
    • Using spatial topology, estimate:
      1. Direct treatment effect (treatment vs. control counties)
      2. Spillover effect (spillover zone vs. pure control)

Footnotes

  1. File > New File > Quarto Document.↩︎

  2. Hint: use the aggregate function.↩︎