Key Teaching Points:
- HTML uses nested tags to create a hierarchical Document Object Model (DOM)
- Elements have relationships: parent/child (<main> contains <div class="product">), siblings (the two product divs), ancestors/descendants
- Attributes like id, class, and custom data-* attributes provide hooks for selection
- Right-click → “Inspect” in browsers reveals this structure visually
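The relationships listed above can also be explored from R. The following is a minimal sketch, assuming rvest (and its dependency xml2) is installed. The HTML string is a hypothetical stand-in invented for this example, not the page used in class.

```r
library(rvest)

# Hypothetical page, invented for this sketch (not the class example)
doc <- minimal_html(paste0(
  '<main>',
  '<div class="product" id="book1"><h2>Book One</h2></div>',
  '<div class="product" id="book2"><h2>Book Two</h2></div>',
  '</main>'
))

first_product <- html_element(doc, "#book1")

# Parent: the element that directly contains the node
parent_name <- html_name(xml2::xml_parent(first_product))

# Children: the elements directly inside the node
child_names <- html_name(html_children(first_product))

# Siblings: the other elements that share the same parent
sibling_ids <- html_attr(xml2::xml_siblings(first_product), "id")

# Hooks: all attributes of the node as a named character vector
hooks <- html_attrs(first_product)

print(parent_name)
print(child_names)
print(sibling_ids)
print(hooks)
```

Here the first product div should report <main> as its parent, a single <h2> child, and the second product div as its only sibling.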
B. CSS Selector Fundamentals
| Selector Pattern | Matches | Example from HTML Above |
|---|---|---|
| h2 | All <h2> elements | Both book titles |
| .price | All elements with class="price" | Both price spans |
| #book1 | Element with id="book1" | First book container only |
| div.product | <div> elements with class “product” | Both book containers |
| span.stock[data-available="true"] | Stock spans with available=true | Only first book’s stock indicator |
| main > div | Direct child divs of <main> | Both product divs (not nested deeper elements) |
| h2 + span | Span immediately following an h2 | Each price span (direct sibling after title) |
| h1 ~ p | All paragraphs after h1 at same level | Footer copyright paragraph |
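The selector patterns in the table can be checked by counting how many elements each one matches. This is a rough sketch, assuming rvest is installed. The HTML below is a hypothetical reconstruction written to be consistent with the table (two product divs inside <main>, plus a heading and a footer paragraph); it is not the original example page.

```r
library(rvest)

# Hypothetical page, written only to exercise the selectors in the table
page <- minimal_html('
  <h1>Bookstore</h1>
  <main>
    <div class="product" id="book1">
      <h2>Book One</h2>
      <span class="price">$10.00</span>
      <span class="stock" data-available="true">In stock</span>
    </div>
    <div class="product" id="book2">
      <h2>Book Two</h2>
      <span class="price">$12.50</span>
      <span class="stock" data-available="false">Sold out</span>
    </div>
  </main>
  <p class="copyright">Copyright notice</p>
')

selectors <- c(
  "h2", ".price", "#book1", "div.product",
  'span.stock[data-available="true"]',
  "main > div", "h2 + span", "h1 ~ p"
)

# Count the matches for each selector; the names of the result are the selectors
match_counts <- vapply(
  selectors,
  function(sel) length(html_elements(page, sel)),
  integer(1)
)

print(match_counts)
```

A selector that returns more matches than expected is usually too broad; one that returns zero usually has a typo or assumes a structure the page does not have.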
Interactive Exercise:
Open browser developer tools on any news website
Right-click a headline → “Inspect” to see its HTML
In the Console tab, run document.querySelectorAll("h2") to see all matches
Refine the selector (change “h2” to something more specific) until it captures ONLY the desired elements.
C. Basic Scraping Workflow with rvest
library(rvest)
library(stringr)
library(dplyr)
# STEP 1: Read HTML from URL
page <- read_html("https://books.toscrape.com/catalogue/page-1.html")

# STEP 2: Select elements using CSS selectors
titles <- page |>
  html_elements("article.product_pod h3 a") |>
  html_attr("title")  # Extract title attribute directly

prices <- page |>
  html_elements("p.price_color") |>
  html_text() |>
  trimws()

availability <- page |>
  html_elements("p.instock.availability") |>
  html_text() |>
  str_trim()

# STEP 3: Extract star ratings (stored as class names like "star-rating Three")
ratings <- page |>
  html_elements("article.product_pod p.star-rating") |>
  html_attr("class") |>
  str_replace("star-rating ", "")  # Extract rating word

# STEP 4: Combine into structured data frame
books_df <- data.frame(
  title = titles,
  price = prices,
  stock_info = availability,
  rating = ratings,
  stringsAsFactors = FALSE
)

# Clean price column (remove £ symbol and convert to numeric)
books_df$price_numeric <- as.numeric(gsub("[£]", "", books_df$price))
Key points:
- Pipe operator (|>) creates readable, sequential workflow
- html_text() extracts visible content; html_attr() extracts attribute values
- Always clean extracted data (remove currency symbols, whitespace)
- Test selectors on small sample before scaling up
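To make the difference between html_text() and html_attr() concrete, here is a small offline sketch, assuming rvest is installed. The snippet imitates the shape of one book entry but is invented for illustration; the href value in particular is hypothetical.

```r
library(rvest)

# Hypothetical snippet shaped like one book entry (the href is made up)
snippet <- minimal_html(paste0(
  '<h3><a href="book_1/index.html" title="A Light in the Attic">',
  'A Light in the ...</a></h3>',
  '<p class="price_color">  \u00a351.77\n</p>'
))

link <- html_element(snippet, "h3 a")

# html_text() returns what the reader sees (here, a truncated title)
visible_text <- html_text(link)

# html_attr() returns the value stored in a named attribute
full_title <- html_attr(link, "title")

# Raw text keeps the surrounding whitespace and the currency symbol
raw_price <- html_text(html_element(snippet, "p.price_color"))

# Cleaning: drop everything except digits and the decimal point
clean_price <- as.numeric(gsub("[^0-9.]", "", raw_price))

print(visible_text)
print(full_title)
print(clean_price)
```

Trying a selector on a tiny snippet like this one is a cheap way to test it before running it against a full page.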
D. Advanced Scraping Patterns with Example
library(rvest)
library(dplyr)
library(stringr)

# SCENARIO: Multi-page scraping from books.toscrape.com
# This site is EXPLICITLY designed for learning web scraping (ethical & safe to use)
# Structure: 50 pages of books at https://books.toscrape.com/catalogue/page-{n}.html
scrape_books <- function(max_pages = 3) {
  all_books <- list()
  base_url <- "https://books.toscrape.com/catalogue/page-%d.html"

  for (page_num in 1:max_pages) {
    # Construct URL without trailing spaces (critical!)
    page_url <- sprintf(base_url, page_num)
    cat("Scraping page", page_num, ":", page_url, "\n")

    # Fetch page with error handling
    page <- tryCatch({
      read_html(page_url)
    }, error = function(e) {
      cat("ERROR on page", page_num, ":", e$message, "\n")
      return(NULL)
    })

    # Skip if page failed to load
    if (is.null(page)) {
      next
    }

    # Extract book containers
    book_articles <- html_elements(page, "article.product_pod")

    # Process each book on the page
    page_data <- lapply(book_articles, function(book) {
      list(
        title = html_attr(html_element(book, "h3 a"), "title", default = NA),
        price = html_text(html_element(book, "p.price_color"), trim = TRUE),
        stock = html_text(html_element(book, "p.instock.availability"), trim = TRUE),
        rating = html_attr(html_element(book, "p.star-rating"), "class"),
        url = paste0(
          "https://books.toscrape.com/catalogue/",
          html_attr(html_element(book, "h3 a"), "href", default = NA)
        )
      )
    })

    # Convert to data frame
    page_df <- bind_rows(page_data) |>
      mutate(
        price_numeric = as.numeric(str_remove(price, "£")),
        rating = str_replace(rating, "star-rating ", ""),
        page_scraped = page_num
      )

    all_books[[page_num]] <- page_df

    # BE POLITE: Wait 1-2 seconds between requests
    Sys.sleep(runif(1, 1, 2))
  }

  # Combine all pages and return
  if (length(all_books) > 0) {
    bind_rows(all_books)
  } else {
    data.frame()
  }
}

# Execute scraping (limit to 3 pages for classroom demonstration)
books_data <- scrape_books(max_pages = 3)
# A tibble: 6 × 4
  title                                 price  stock    rating
  <chr>                                 <chr>  <chr>    <chr>
1 A Light in the Attic                  £51.77 In stock Three
2 Tipping the Velvet                    £53.74 In stock One
3 Soumission                            £50.10 In stock One
4 Sharp Objects                         £47.82 In stock Four
5 Sapiens: A Brief History of Humankind £54.23 In stock Five
6 The Requiem Red                       £22.65 In stock One
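The rating column above holds words rather than numbers, which makes sorting and averaging awkward. One possible follow-up cleaning step is a lookup vector, sketched below in base R. The vector rating_words simply copies the six ratings printed above, and the lookup assumes the site uses only the words One through Five.

```r
# Ratings copied from the six rows shown above
rating_words <- c("Three", "One", "One", "Four", "Five", "One")

# Lookup vector: names are the rating words, values are the numbers
# (assumes the only words in use are One through Five)
rating_lookup <- c(One = 1L, Two = 2L, Three = 3L, Four = 4L, Five = 5L)

# Index the lookup by name; any unrecognised word becomes NA
rating_numeric <- unname(rating_lookup[rating_words])

print(rating_numeric)
print(mean(rating_numeric))
```

Because unknown words turn into NA instead of raising an error, a quick check such as anyNA(rating_numeric) shows whether the page contained a rating format the lookup did not anticipate.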
Part 2: Student Practice Questions
HTML and DOM Fundamentals
In the following HTML snippet, identify the parent element of the <span class="price"> element:
Write a selector that matches ONLY the price span (not other spans on the page).
Explain the difference between these two selectors: div p versus div > p. Provide an HTML example where they would return different results.
How would you select the third <li> element within an unordered list using CSS selectors?
Write a selector to find all elements with a data-category attribute that contains the value “electronics” (e.g., data-category="home-electronics").
Given this HTML structure, write a selector to target ONLY the “Out of Stock” message:
<div class="product">
  <h2>Laptop</h2>
  <span class="stock available">In Stock</span>
</div>
<div class="product">
  <h2>Tablet</h2>
  <span class="stock unavailable">Out of Stock</span>
</div>
rvest Implementation
What is the purpose of the trimws() function when used after html_text()? Provide an example where it would change the extracted result.
When would you use html_element() (singular) versus html_elements() (plural) in rvest? What happens if the selector doesn’t match anything in each case?
You’re trying to extract image URLs from a webpage. The images appear as <img src="photo.jpg" alt="Product">. Which rvest function would you use to extract the URLs, and what argument would you specify?
After running tables <- page |> html_elements("table") |> html_table(), what R data structure does tables contain? How would you access the second table in this collection?
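When a selector matches nothing, html_element() returns a missing node, and functions such as html_text() and html_attr() then return NA for it. Why is it important to check for these NA values when scraping production websites? What problems might occur downstream if you do not?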
Workflow and Best Practices
You need to scrape 50 product pages where URLs follow the pattern https://store.com/item?id=1, https://store.com/item?id=2, etc. Write R code to generate these URLs programmatically using paste0().
Why should you include Sys.sleep(2) between scraping requests? What ethical and practical problems does this prevent?
Before scraping a new website, what three checks should you perform to ensure your scraping is legal and ethical?
You’ve scraped raw price data as character strings: c("$19.99", "$24.50", "$9.75"). Write R code to convert this to a numeric vector of values (19.99, 24.50, 9.75).
After successfully scraping data, why is it important to save the raw extracted data (before cleaning) using write_csv() or write.csv()? Describe a scenario where this practice would save significant time.