For each of these, the question is: vector or raster?
A list of every Albert Heijn supermarket in the Netherlands.
A nighttime-lights satellite image used as a proxy for economic activity.
The borders of all Dutch municipalities.
A map of average rainfall across Europe.
Answers:
Vector (points).
Raster (a grid of light values).
Vector (polygons).
Raster (a continuous surface).
The R Toolkit: sf and stars
Two R packages do almost everything in spatial analysis:
sf (Simple Features): for vector data (points, lines, polygons).
stars (Spatiotemporal Arrays): for raster data (grids).
Both are the modern standard and both work together with the tidyverse.
The older rgdal, rgeos, and maptools packages were removed from CRAN in 2023, and sp is now maintenance-only; you will not need any of them.
Along the way we use a few tidyverse tools: the dplyr verbs select(), filter(), and mutate(), and ggplot2 for plotting.
We only need a handful, and we explain each one as it appears.
Vector Data with sf
Loading Data: Municipalities of Utrecht
Let’s get the municipalities (gemeenten) of the Netherlands from Statistics Netherlands using the cbsodataR package you met in the API lecture.
We then keep only the municipalities of the province of Utrecht.
library(sf); library(dplyr); library(ggplot2); library(cbsodataR)# The municipalities of the province of Utrecht (2022)utrecht_names <-c("Amersfoort","Baarn","De Bilt","Bunnik","Bunschoten","Eemnes","Houten","De Ronde Venen","Lopik","Montfoort","Nieuwegein","Oudewater","Renswoude","Rhenen","Soest","Stichtse Vecht","Utrecht","Utrechtse Heuvelrug","Veenendaal","Vijfheerenlanden","Wijk bij Duurstede","Woerden","Woudenberg","IJsselstein","Zeist")# All Dutch municipalities as polygons, then keep the province of Utrechtgemeenten <-cbs_get_sf("gemeente", 2022)utrecht <- gemeenten |>filter(statnaam %in% utrecht_names)ggplot(utrecht) +geom_sf(fill ="grey90", colour ="white") +theme_void()
The Most Important Idea
An sf object is just a data frame with a geometry column.
It has ordinary columns (like statnaam, the municipality name) plus one special column called geometry that stores the shape.
The only new ingredient is geom_sf(), which reads the geometry column and draws the shapes:
ggplot(utrecht) +# 1. the sf data framegeom_sf(aes(fill = area_km2)) +# 2. draw geometry, fill by a columnscale_fill_viridis_c(name ="Area") +# 3. a colour scaletheme_void() +# 4. drop the grey gridlabs(title ="...") # 5. titles
Everything else is the ordinary grammar of ggplot2.
To fill by a different variable (say population), you would simply write aes(fill = population): the same call with a different column.
Coordinate Reference Systems
Why Coordinates Need a System
We have made maps without asking a sneaky question: what do the coordinates actually mean?
The Earth is round, but a map is flat. To flatten it, we use a Coordinate Reference System (CRS).
A CRS is a rule for turning positions on the globe into x/y numbers.
Think of peeling an orange and pressing the peel flat on a table: you cannot do it without tearing or stretching.
Every flat map is a compromise. It distorts area, or shape, or distance.
Two Types of CRS
There are two kinds you must be able to tell apart:
Geographic CRS: longitude/latitude in degrees. Like a global address system: great for saying where something is.
The famous one is EPSG:4326 (the GPS / WGS84 system).
Projected CRS: a flat grid measured in metres. Like a local blueprint: flat and measurable with a ruler, but only honest over a limited area.
For the Netherlands the standard is EPSG:28992 (the “Rijksdriehoek” national grid).
The key trap: a degree is not a metre.
One degree of latitude is about 111 km; one degree of longitude shrinks from 111 km at the equator to 0 at the poles.
The Golden Rule: Match the CRS
Two layers must share the same CRS before you combine them.
It is a ritual: before any operation that touches two spatial datasets, check both.
st_crs() checks the CRS.
st_transform() converts coordinates to another CRS (it actually moves the points).
st_crs(utrecht)$input # what CRS are we in?
[1] "EPSG:28992"
For measuring (distances, buffers) we transform to a metre-based CRS first.
A Common Pitfall: Degrees vs. Metres
What happens if you ignore the rule and measure distance in degrees instead of metres?
You get a number, but it is meaningless: “0.34 degrees” is not a distance you can act on.
Worse, it stretches differently north–south than east–west.
A modern note: since sf version 1.0, areas and distances on lon/lat are computed on the sphere and come out correct in metres.
But buffers still misbehave on lon/lat, so for the next steps we transform to metres first.
Where Is It? Spatial Joins
The Problem: No Common ID
Suppose we have a handful of places (points) in the province, and we want to know which municipality each one falls in.
The two datasets share no common ID column.
The points only have coordinates; the municipalities only have shapes.
A normal left_join() cannot help us here.
We need a spatial join: match rows by where they are, not by a shared key.
Example: Joining Points to Municipalities
Example: Spatial Join
First, build a small, fully reproducible set of well-known places as an sf object with st_as_sf(). Note the ritual: we set the points to EPSG:4326, then transform them to match utrecht.
“How far is each town from Utrecht Centraal? And which town is nearest to which?”
These are continuous questions (a number in metres), unlike the yes/no questions we will see next.
The tool is st_distance().
Give it two sets of points and it returns a distance matrix: every pairwise distance.
Example: Distance Matrix
Because our data is in a metre-based CRS, the distances come back in metres:
dist_matrix <-st_distance(places, places)# distances from Utrecht CS (row 1) to the others, in kmround(as.numeric(dist_matrix[1, ]) /1000, 1)
[1] 0.0 20.4 8.4 31.5 15.6
Row i, column j is the distance from place i to place j.
The first number is 0: that is Utrecht CS to itself. In general the diagonal is zero.
Nearest Neighbour
Which place is closest to each one? For each row we want the smallest distance other than the self-distance of zero.
The clean trick is to set the diagonal to Inf first, then take the minimum:
d <- dist_matrixdiag(d) <-Inf# ignore each place's distance to itselfclosest_idx <-apply(d, 1, which.min) # index of nearest place per rowtibble::tibble(place = places$name,nearest = places$name[closest_idx])
Simple feature collection with 1 feature and 0 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 5.082626 ymin: 52.08417 xmax: 5.082626 ymax: 52.08417
Geodetic CRS: WGS 84
geometry
1 POINT (5.082625 52.08417)
This is how you bootstrap a spatial analysis from a plain list of place names, connecting directly to the scraping and API skills from earlier lectures.
For full street-address geocoding, the tidygeocoder package is the dedicated tool.
Raster Data with stars
Back to the Grid
Recall that a raster is a grid of values, like a geo-located photo.
The stars package (spatiotemporal arrays) handles rasters and works together with sf and the tidyverse.
library(stars)
We will look at one satellite image to see what a raster is, then do the one raster job you are most likely to need: summarizing a raster over polygons.
Reading Raster Data
read_stars() loads raster files (GeoTIFF, NetCDF, and more).
Here is a small Landsat satellite scene that ships with stars:
stars object with 3 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max.
L7_ETMs.tif 1 54 69 68.91242 86 255
dimension(s):
from to offset delta refsys point x/y
x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE [x]
y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE [y]
band 1 6 NA NA NA NA
The printout shows the attribute (the values), the dimensions (x, y, and band), and the grid’s extent.
Understanding Bands
A band is one “layer” of light the satellite sensor records.
Just as your phone camera has separate sensors for red, green, and blue to build a colour photo, a satellite captures several bands.
These include light our eyes cannot see, such as infrared, which is useful for measuring vegetation.
plot(landsat[ , , , 1], main ="Band 1") # the 1st band only
Plotting All Bands
By default, plot() shows every band side by side:
plot(landsat, axes =TRUE)
Subsetting and Cropping
You can slice a raster by index (like an array) or crop it to a spatial shape:
# By index: top-left 100×100 cells, first 3 bandssmall <- landsat[ , 1:100, 1:100, 1:3]dim(small)
x y band
100 100 3
# By shape: keep only what falls inside a circle.# The point c(293750, 9115745) sits inside the scene, in the image's own CRS.circle <-st_buffer(st_point(c(293750, 9115745)), 400) |>st_sfc(crs =st_crs(landsat))landsat_crop <- landsat[circle]plot(landsat_crop[ , , , 1], main ="Cropped to a 400 m circle (band 1)")
From Raster to Polygons
The raster job you will actually use most: summarizing a continuous surface to administrative areas.
For example, average rainfall per municipality.
The tool is aggregate(raster, polygons, fn): it summarizes the cells inside each polygon.
This is the natural bridge between the two halves of the lecture: a raster goes in, and one number per polygon comes out, ready to map.
Example: Precipitation per Municipality
Example: Aggregating a Raster over Polygons
We pull daily precipitation over part of the Netherlands from easyclimate, then average it to municipalities.
library(easyclimate); library(cbsodataR)# A region of the Netherlands as polygons (in lon/lat for the climate API)region_nl <-cbs_get_sf("nuts1", 2016) |>st_transform(4326) |> dplyr::filter(statcode =="NL2")
Get the raster of daily precipitation:
precip <-get_daily_climate( region_nl,climatic_var ="Prcp",period =c("2022-01-01", "2022-01-05"),output ="raster") |>st_as_stars()names(precip) <-"prcp"# name the value: precipitationprecip
stars object with 3 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
prcp 1.56 2.9 3.7 4.530697 5.06 15.89 27886
dimension(s):
from to offset delta refsys values x/y
x 1 249 5 0.008333 WGS 84 NULL [x]
y 1 135 52.86 -0.008333 WGS 84 NULL [y]
band 1 2 NA NA NA 2022-01-01, 2022-01-05
The five days, side by side:
plot(precip) # the five days, side by side
Now aggregate to municipalities. Averaging the cells inside each polygon only summarizes values; it never measures a distance or area, so we do not need a metre-based CRS here, and we can keep everything in lon/lat to match the raster.
gemeenten16 <-cbs_get_sf("gemeente", 2016) |>st_transform(4326)# index the time dimension by POSITION (the 1st day) — robust to how the# date labels are stored, unlike slicing by a "2022-01-01" character labelavg_precip <-aggregate(precip[ , , , 1], gemeenten16, mean)# aggregate() keeps the rows in the same order as the `by` polygons,# so we can drop the number straight back onto the vector layergemeenten16$prcp <- avg_precip$prcp
We now have one number per municipality, derived from the raster, which we can map like any other attribute:
ggplot(gemeenten16) +geom_sf(aes(fill = prcp), colour =NA) +scale_fill_viridis_c(name ="mm") +theme_void() +labs(title ="Avg. precipitation per municipality, 1 Jan 2022")
A raster, summarized into a vector map: the workflow you will reach for again and again.
Wrapping Up
Where to Find Spatial Data
A short, curated starter list of R packages:
cbsodataR: Dutch official statistics with municipality and region geometries.
rnaturalearth: country, state, and province boundaries worldwide.
osmdata: points of interest from OpenStreetMap (shops, cafés, stations).
geodata: climate, elevation, land use, and administrative boundaries.