Tutorial 8

Answer Model

Data Fundamentals & CRS Management

  1. st_crs(nc)$epsg returns the EPSG code identifier (e.g., 32119), while st_crs(nc)$proj4string returns the full PROJ.4 parameter string defining the coordinate system mathematically. EPSG is a shorthand registry code; proj4string contains explicit projection parameters.
library(sf)
Linking to GEOS 3.14.1, GDAL 3.12.4, PROJ 9.8.1; sf_use_s2() is TRUE
nc <- st_read(system.file("shape/nc.shp", package="sf"))
Reading layer `nc' from data source 
  `/home/bas/R/x86_64-redhat-linux-gnu-library/4.6/sf/shape/nc.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS:  NAD27
st_crs(nc)$epsg
[1] 4267
st_crs(nc)$proj4string
[1] "+proj=longlat +datum=NAD27 +no_defs"
nc_4326 <- st_transform(nc, 4326)
nc_32119 <- st_transform(nc, 32119)

Transforming to EPSG:4326 (WGS84 geographic coordinates) changes county shapes more dramatically than EPSG:32119 (NAD83 / North Carolina projection). Geographic CRS (4326) uses angular units (degrees) that distort shapes at regional scales due to Earth’s curvature, while projected CRS (32119) preserves local geometry using meters with minimal distortion.

  1. A 5 km buffer adds a ring of catchment around Buncombe County. Working in EPSG:32119 (metres) means the number 5000 is honest metres.
library(dplyr)
library(ggplot2)

buncombe <- nc |> filter(NAME == "Buncombe") |> st_transform(32119)
buf <- st_buffer(buncombe, 5000)   # 5000 metres, because EPSG:32119 is in metres

area_county <- as.numeric(st_area(buncombe)) / 1e6   # km²
area_buffer <- as.numeric(st_area(buf))      / 1e6   # km²
extra_catchment <- area_buffer - area_county

area_county
[1] 1691.168
area_buffer
[1] 2721.692
extra_catchment
[1] 1030.524
ggplot() +
  geom_sf(data = buf, fill = "red", alpha = 0.3) +
  geom_sf(data = buncombe, fill = "blue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Buncombe County (blue) and its 5 km buffer (red)")

The CRS rule: st_buffer() interprets its dist argument in the units of the layer’s CRS. EPSG:32119 is measured in metres, so 5000 means 5 km—a meaningful catchment. If the layer were still in geographic (lon/lat) coordinates, the units would be degrees, and 5000 would mean 5000 degrees of longitude/latitude—physically meaningless (the whole globe spans only 360 degrees). Buffering in degrees also distorts the ring, because one degree of longitude covers a different ground distance than one degree of latitude. That is why you must transform to a metre-based projected CRS before buffering.

Economic interpretation: The 5 km buffer is a simple model of the county’s economic catchment—the surrounding area whose firms and workers plausibly interact with it. Getting the radius in honest metres is what makes that catchment comparable across regions.

  1. CRS mismatch error occurs when attempting geometric operations on objects with different coordinate reference systems.
nc_a <- st_transform(nc, 32119)
nc_b <- st_transform(nc, 2264)
st_intersection(nc_a[1,], nc_b[1,]) 
Error in `geos_op2_geom()`:
! st_crs(x) == st_crs(y) is not TRUE
# "st_crs(x) == st_crs(y) is not TRUE"
# Fix: Transform nc_b to match nc_a's CRS
nc_b_fixed <- st_transform(nc_b, st_crs(nc_a))
intersection <- st_intersection(nc_a[1,], nc_b_fixed[1,])
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
intersection_area <- st_area(intersection) / 1e6  # km²

Vector Operations – Buffers, Centroids & Topology

  1. Identify county with highest 1979 birth rate, create buffer, and find counties whose centroids fall within buffer.
library(ggplot2)
high_birth_county <- nc[which.max(nc$BIR79), ]
buffer_25km <- st_buffer(high_birth_county, dist = 25000)
centroids <- st_centroid(nc)
Warning: st_centroid assumes attributes are constant over geometries
qualifying <- centroids[st_within(centroids, buffer_25km, sparse=F), ]

n_qualifying <- nrow(qualifying)
n_qualifying
[1] 5
ggplot() +
  geom_sf(data = nc, fill = "white", color = "gray50") +
  geom_sf(data = buffer_25km, fill = "red", alpha = 0.3) +
  geom_sf(data = high_birth_county, fill = "blue", alpha = 0.7) +
  geom_sf(data = qualifying, color = "gold", size = 2) +
  theme_minimal()

Visualization shows spatial extent of demographic influence.

  1. Border analysis and inner buffer calculation:
touching <- nc[st_touches(high_birth_county, nc, sparse = FALSE)[1, ], ]
inner_buffer <- st_buffer(high_birth_county, dist = -10000)
Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
dist is assumed to be in decimal degrees (arc_degrees).
inner_area_pct <- as.numeric(st_area(inner_buffer) / st_area(high_birth_county)) * 100

The inner buffer represents the county’s “core” excluding peripheral border zones. A high percentage (>60%) suggests compact development where economic activity concentrates centrally; low percentages indicate elongated/rural counties where borders constitute significant economic interfaces. This core-periphery distinction matters for policy targeting (e.g., infrastructure investment).

  1. Nearest neighbor analysis by centroid distance:
centroids <- st_centroid(nc)
Warning: st_centroid assumes attributes are constant over geometries
dist_matrix <- st_distance(centroids, centroids)
diag(dist_matrix) <- Inf  # Exclude self
nearest_idx <- max.col(-dist_matrix, "first")
nearest_dists <- apply(dist_matrix, 1, min)

min_pair <- which.min(nearest_dists)
max_pair <- which.max(nearest_dists)

ggplot(nc) +
  geom_sf(aes(fill = nearest_dists), color = "white") +
  scale_fill_viridis_c(name = "Distance to\nnearest neighbor (m)") +
  theme_minimal()

Smallest distance: Typically adjacent urban counties. Largest distance: Isolated rural counties. Distance heterogeneity reflects settlement patterns affecting labor market integration.

  1. Distance-band analysis using random candidate sites around a Charlotte center:
library(dplyr)

# Center: Charlotte, projected to metres
center <- st_as_sf(data.frame(lon = -80.84, lat = 35.23),
                   coords = c("lon", "lat"), crs = 4326) |>
  st_transform(32119)

# Match CRS for the join (built-in nc loads in NAD27)
nc84 <- st_transform(nc, 4326)
set.seed(42)
bb <- st_bbox(nc84)

# Scatter random candidate sites across the bounding box
pts <- data.frame(lon = runif(800, bb["xmin"], bb["xmax"]),
                  lat = runif(800, bb["ymin"], bb["ymax"])) |>
  st_as_sf(coords = c("lon", "lat"), crs = 4326)

# Keep only points that fall inside North Carolina
inside <- st_join(pts, nc84, join = st_intersects, left = FALSE) |>
  st_transform(32119)

# Distance to center (km) and bin into bands
inside$dist_km <- as.numeric(st_distance(inside, center)) / 1000
inside$band <- cut(inside$dist_km,
                   breaks = c(0, 10, 25, 50, Inf),
                   labels = c("0-10", "10-25", "25-50", ">50"))

band_counts <- inside |>
  st_drop_geometry() |>
  count(band)
band_counts
   band   n
1  0-10   2
2 10-25   7
3 25-50  13
4   >50 381

With uniformly scattered points, the count per band grows with distance simply because outer bands cover more land—this is the catchment-area logic behind gravity and market-access models, where a location’s reach expands with the radius you are willing to consider. It also serves as a spatial “null” baseline: genuine agglomeration would over-represent the inner bands relative to this uniform scatter.

  1. Disjoint counties from high-SIDS areas:
median_sid <- median(nc$SID74)
high_sid <- nc[nc$SID74 > median_sid, ]
disjoint <- nc[!st_touches(nc, st_union(high_sid), sparse = FALSE)[,1], ]

Geographic pattern: Disjoint counties typically cluster in western mountains or northeastern coastal plains—regions physically separated from central Piedmont (where high-SIDS counties concentrate). Economic mechanism: Health outcome clustering may reflect spatial sorting by income (wealthier households self-select into healthier environments) or spillover effects from urban pollution/healthcare access concentrated in central regions.

Spatial Joins & Real-World Applications

  1. Pollution exposure simulation:
set.seed(123)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
sources <- st_sample(st_as_sfc(st_bbox(nc)), 200) |> st_cast("POINT") |> st_sf()
Warning in st_poly_sample(x, size = size, ..., type = type, by_polygon =
by_polygon, : coordinate ranges not computed along great circles; install
package lwgeom to get rid of this warning
Warning in st_poly_sample(x, size = size, ..., type = type, by_polygon =
by_polygon, : coordinate ranges not computed along great circles; install
package lwgeom to get rid of this warning
selected <- sources[sample(nrow(sources), 5), ]
buffers <- st_buffer(selected, dist = runif(5, 5000, 30000))
buffers$zone_id <- 1:5

# County-zone intersections
county_zones <- st_join(nc, buffers, join = st_intersects)
exposure <- aggregate(area ~ NAME, 
                      data = st_intersection(county_zones, buffers) |> 
                        st_as_sf() |> 
                        mutate(area = as.numeric(st_area(geometry))),
                      sum)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
exposure                 
        NAME       area
1      Anson 1386609993
2  Granville    9213896
3 Montgomery   70721395
4     Person   93999085
5   Richmond  383328506
6     Stanly   52438324
7      Union  251781693
8      Vance   12279012
9     Warren  441185980

Counties intersecting multiple zones show cumulative exposure risk. Total exposure score enables environmental justice analysis (e.g., regressing health outcomes on exposure scores).

  1. City-state spatial join:
library(rnaturalearth)
library(sf)
library(dplyr)

# 1. Get US state boundaries (admin level 1)
us_states <- ne_states(country = "United States of America", returnclass = "sf")

# 2. Download populated places (scale 110 = low-res global layer, ~1000 cities worldwide)
places <- ne_download(
  scale = 110,
  type  = "populated_places",
  category = "cultural",
  returnclass = "sf"
)
Reading 'ne_110m_populated_places.zip' from naturalearth...
# 3. Filter to US places only (use the SOV0NAME or ADM0NAME column)
us_places <- places |>
  filter(ADM0NAME == "United States of America")

# 4. Make sure both layers share the same CRS
us_states  <- st_transform(us_states,  crs = 4326)
us_places  <- st_transform(us_places,  crs = 4326)

# 5. Spatial join: assign each city to the state polygon it falls within
joined <- st_join(us_places, us_states["name"], join = st_within)

# 6. Which state has the most cities?
city_counts <- joined |>
  st_drop_geometry() |>
  count(name, sort = TRUE) |>
  rename(state = name, n_cities = n)

cat("State with most populated places:\n")
State with most populated places:
print(head(city_counts, 5))
                 state n_cities
1           California        2
2             Colorado        1
3 District of Columbia        1
4              Florida        1
5              Georgia        1
# 7. Average latitude per state (northernmost -> southernmost)
avg_lat <- joined |>
  mutate(lat = st_coordinates(geometry)[, 2]) |>
  st_drop_geometry() |>
  group_by(state = name) |>
  summarise(avg_lat = mean(lat, na.rm = TRUE),
            n_cities = n()) |>
  arrange(desc(avg_lat))

cat("\nAverage city latitude per state (N ? S):\n")

Average city latitude per state (N ? S):
print(avg_lat)
# A tibble: 8 × 3
  state                avg_lat n_cities
  <chr>                  <dbl>    <int>
1 Illinois                41.8        1
2 New York                40.7        1
3 Colorado                39.7        1
4 District of Columbia    38.9        1
5 California              35.9        2
6 Georgia                 33.7        1
7 Texas                   29.7        1
8 Florida                 25.8        1

California typically contains the most cities. Average latitude per state reveals north-south settlement gradients affecting climate-dependent economic activity.

  1. Firm location simulation and correlation test:
library(sf)
library(units)
udunits database from /usr/share/udunits/udunits2.xml
# Load data (assuming this is already in your environment)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

# 1. Calculate the area of the COUNTIES first, adding it as a column to 'nc'
nc$area_km2 <- st_area(nc) |> set_units("km^2")

set.seed(456)
firms <- st_sample(st_union(nc), 150) |> st_cast("POINT") |> st_sf()
Warning in st_poly_sample(x, size = size, ..., type = type, by_polygon =
by_polygon, : coordinate ranges not computed along great circles; install
package lwgeom to get rid of this warning
firms$revenue <- runif(150, 1, 50)
firms$industry <- sample(c("manufacturing", "retail", "services"), 150, 
                         prob = c(0.3, 0.4, 0.3), replace = TRUE)

# 2. Join firms with nc (including the new area column)
firms_joined <- st_join(firms, nc[, c("NAME", "area_km2")])

# 3. Filter for manufacturing
manufacturing <- firms_joined[firms_joined$industry == "manufacturing", ]

# 4. Run correlation using the columns from the joined data
cor.test(as.numeric(manufacturing$area_km2), 
         manufacturing$revenue)

    Pearson's product-moment correlation

data:  as.numeric(manufacturing$area_km2) and manufacturing$revenue
t = 0.80257, df = 38, p-value = 0.4272
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1900476  0.4235791
sample estimates:
     cor 
0.129105 

Positive correlation would suggest manufacturing firms locate in larger counties (more land for factories), while negative correlation might indicate urban concentration despite smaller land area.

  1. Border demographic discontinuities:
touch_matrix <- st_touches(nc, nc, sparse = FALSE)
diffs <- data.frame()
for (i in 1:nrow(nc)) {
  for (j in which(touch_matrix[i, ])) {
    if (i < j) {  # Avoid duplicates
      diffs <- rbind(diffs, data.frame(
        county1 = nc$NAME[i], county2 = nc$NAME[j],
        bir_diff = abs(nc$BIR74[i] - nc$BIR74[j]),
        nwbir_diff = abs(nc$NWBIR74[i] - nc$NWBIR74[j])
      ))
    }
  }
}
largest_bir <- diffs[which.max(diffs$bir_diff), ]

Largest discontinuity often occurs at urban-rural borders (e.g., Mecklenburg vs. adjacent rural counties). Economic relevance: Such sharp borders enable spatial regression discontinuity designs to estimate policy effects (e.g., school funding differences across county lines).

Raster Data Fundamentals

  1. Landsat band analysis:
library(stars)
Loading required package: abind
landsat <- read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))
dim(landsat)[3]  # 6 bands
band 
   6 
st_dimensions(landsat)$x$delta  # Resolution: 28.5m
[1] 28.5
band4 <- landsat[,,,4]
plot(band4, main = "Band 4: Near-Infrared")

Band 4 (NIR) is valuable because healthy vegetation strongly reflects NIR radiation—enabling NDVI calculation for crop health monitoring, yield prediction, and agricultural subsidy targeting.

  1. Circular crop and zonal statistics:
library(stars)
library(sf)

pt <- st_point(c(294000, 9116000)) |> 
  st_sfc(crs = st_crs(landsat)) |> 
  st_buffer(dist = 500)

cropped <- st_crop(landsat, pt)

# FIX: Use st_apply() instead of apply()
band_means <- st_apply(cropped, MARGIN = 3, FUN = mean, na.rm = TRUE)

# View result
print(band_means)
stars object with 1 dimensions and 1 attribute
attribute(s):
          Min.  1st Qu.  Median     Mean  3rd Qu.     Max.
mean  57.92547 59.89674 66.6972 69.60559 73.89984 92.38302
dimension(s):
     from to
band    1  6

Mean reflectance per band characterizes land cover within the circle (e.g., high Band 4 = vegetation).

  1. Summarising a raster down to polygons—mean July temperature per county:
library(stars)
climate <- read_stars(system.file("nc/bcsd_obs_1999.nc", package = "stars"), quiet = TRUE)
st_crs(climate) <- 4326                     # the file ships without a CRS
nc4326 <- st_transform(nc, 4326)            # match the raster's CRS before aggregating

# Select the temperature attribute and the July 1999 layer (7th time step)
july_tas <- climate["tas"][ , , , 7]

# Mean July temperature per county (ocean cells are NA)
county_temp <- aggregate(july_tas, nc4326, FUN = mean, na.rm = TRUE)
nc4326$july_tas <- as.numeric(county_temp[[1]])   # drop units for plotting

ggplot(nc4326) +
  geom_sf(aes(fill = july_tas)) +
  scale_fill_viridis_c(name = "Mean July\ntemperature") +
  theme_minimal() +
  labs(title = "Mean July 1999 Temperature per NC County")

The hottest counties are the low-lying coastal and southeastern plains; the cooler counties lie in the western Appalachian mountains, where elevation depresses temperature. Economic interpretation: This per-county summer temperature is a measure of climate exposure—a candidate explanatory variable for outcomes like agricultural yields, energy demand, or labour productivity across counties.

Advanced Integration & Economics Applications

  1. Climate-province integration:
library(geodata)
Loading required package: terra
terra 1.9.27
temp <- worldclim_country(country = "Netherlands", var = "tavg", path = tempdir())
Cached as: /tmp/Rtmp11STQ1/climate/wc2.1_country/NLD_wc2.1_30s_tavg.tif
temp_stars <- st_as_stars(temp)
jan_2020 <- temp_stars[,,,1]  # Assuming first layer is Jan 2020

provinces <- ne_states("Netherlands", returnclass = "sf")
provinces$temp <- aggregate(jan_2020, by = provinces, FUN = mean, na.rm = TRUE)[[1]]

# Economic linkage approach:
# 1. Obtain provincial GDP data (e.g., from CBS Statistics Netherlands)
# 2. Merge: provinces <- merge(provinces, gdp_data, by = "province_name")
# 3. Model: lm(log_gdp_pc ~ temp + controls, data = provinces)
# 4. Address endogeneity via instrumental variables (e.g., coastal proximity)

Critical considerations: Control for urbanization (confounder), use panel data to exploit temporal variation, and account for spatial autocorrelation in errors.

  1. Neighbour-average comparison via st_touches():
nc_proj <- st_transform(nc, 32119)

# County with the highest BIR74 (Mecklenburg, home to Charlotte)
focal <- nc_proj |> slice_max(order_by = BIR74, n = 1)
focal$NAME
[1] "Mecklenburg"
# Counties that share a border with the focal county
touch <- st_touches(focal, nc_proj, sparse = FALSE)
neighbours <- nc_proj[touch[1, ], ]

mean_neighbours <- mean(neighbours$BIR74)
focal$BIR74        # focal county's own births
[1] 21588
mean_neighbours    # average of its bordering neighbours
[1] 4676.6
focal$BIR74 - mean_neighbours   # how much it towers over its local environment
[1] 16911.4
# Map: focal (green), neighbours (gold), rest of state (light backdrop)
ggplot() +
  geom_sf(data = nc_proj, fill = "lightgray", color = "white") +
  geom_sf(data = neighbours, fill = "gold", alpha = 0.7) +
  geom_sf(data = focal, fill = "darkgreen", alpha = 0.8) +
  theme_minimal() +
  labs(title = sprintf("%s and its bordering neighbours", focal$NAME))

The neighbour-average is a simple, intuitive measure of a county’s local environment, and the gap between a county and its neighbours is a transparent stand-in for spatial spillover—no weights matrix or matrix algebra required. Mecklenburg towers over its neighbours, the signature of a single dominant urban core surrounded by smaller counties.

  1. University geocoding and distance:
library(tidygeocoder)
library(sf)

unis <- tibble::tibble(
  name    = c("Radboud University, Nijmegen", "Erasmus University Rotterdam"),
  address = c("Radboud University, Nijmegen", "Erasmus University Rotterdam")
) |> geocode(address, method = "osm")

ru  <- st_as_sf(unis[1,], coords = c("long", "lat"), crs = 4326)
eur <- st_as_sf(unis[2,], coords = c("long", "lat"), crs = 4326)

ru_utm  <- st_transform(ru,  32631)
eur_utm <- st_transform(eur, 32631)
distance_km <- as.numeric(st_distance(ru_utm, eur_utm)) / 1000

buffer_ru  <- st_buffer(ru_utm,  2000)
buffer_eur <- st_buffer(eur_utm, 2000)
overlap <- st_intersection(buffer_ru, buffer_eur)

Distance ≈ 85 km. Overlapping buffer area represents potential student mobility zone or competitive labor market for academic talent.

  1. Climate time-series analysis:
library(stars)
climate <- read_stars(system.file("nc/bcsd_obs_1999.nc", package = "stars"), quiet = TRUE)
dim(climate)[3]
time 
  12 
jan1  <- climate[,,,1]
jan12 <- climate[,,,12]

#adrop() removes the length-1 time dimension entirely, leaving both objects with only x and y, so the subtraction works cleanly.
temp_change <- adrop(jan12) - adrop(jan1)

# Economics link: 
# - Temperature volatility proxies for crop failure risk
# - Use as instrument for agricultural income shocks
# - Regress household consumption on temp_change to estimate risk coping

Daily temperature changes capture short-term climate volatility affecting planting/harvest decisions—critical for modeling farmer risk exposure in developing economies.

  1. Spatial treatment effect estimation:
library(sf)
library(spData)

# Project to a suitable planar CRS (US Albers, metres)
nc_proj <- st_transform(nc, crs = 5070)

nc_proj$treatment <- as.integer(nc_proj$BIR74 > quantile(nc_proj$BIR74, 0.8))

treat_buffer <- st_buffer(nc_proj[nc_proj$treatment == 1, ], dist = 50000)  # 50 km ✓

nc_proj$spillover <- as.integer(
  rowSums(st_intersects(nc_proj, treat_buffer, sparse = FALSE)) > 0 &
    nc_proj$treatment == 0
)

nc_proj$outcome <- 100 + 15*nc_proj$treatment + 8*nc_proj$spillover +
                   rnorm(nrow(nc_proj), 0, 10)

# a) Direct effect (pure control vs treated, no spillover units)
direct_sub <- nc_proj[nc_proj$spillover == 0, ]
stopifnot(length(unique(direct_sub$treatment)) == 2)  # sanity check
direct_model <- lm(outcome ~ treatment, data = direct_sub)
summary(direct_model)$coefficients["treatment", ]
    Estimate   Std. Error      t value     Pr(>|t|) 
1.792513e+01 4.162123e+00 4.306727e+00 1.633613e-04 
# b) Spillover effect (among non-treated only)
spillover_sub <- nc_proj[nc_proj$treatment == 0, ]
spillover_model <- lm(outcome ~ spillover, data = spillover_sub)
summary(spillover_model)$coefficients["spillover", ]
    Estimate   Std. Error      t value     Pr(>|t|) 
1.300168e+01 2.736072e+00 4.751952e+00 9.022925e-06