Tutorial 3

Instructions

Open the folder you created for this course in Positron using Open > Open Folder… > Select your Folder.
Inside this folder, create a new Quarto document (tutorial3.qmd).¹
For each question, include:
- The question number and text
- Your R code in a code chunk
- Brief explanation of your approach (for conceptual questions)
Make sure your YAML header (the first lines of your .qmd document) looks approximately as follows:

---
title: Tutorial 3
format: html
author: Your Name And Student No.
---

Render your document to HTML to verify that all code executes correctly (click “Preview” in Positron).

Part 1: Teacher Demonstration

A. Understanding HTML Structure

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Book Store</title>
</head>
<body>
    <header>
        <h1>Online Bookstore</h1>
        <nav>
            <ul>
                <li><a href="/fiction">Fiction</a></li>
                <li><a href="/nonfiction">Non-Fiction</a></li>
            </ul>
        </nav>
    </header>
    
    <main>
        <div class="product" id="book1">
            <h2 class="title">Data Science Handbook</h2>
            <span class="price">$34.99</span>
            <p class="description">Comprehensive guide to modern data science</p>
            <span class="stock" data-available="true">In Stock</span>
        </div>
        
        <div class="product" id="book2">
            <h2 class="title">Web Scraping Basics</h2>
            <span class="price">$29.99</span>
            <p class="description">Learn to extract data from websites</p>
            <span class="stock" data-available="false">Out of Stock</span>
        </div>
    </main>
    
    <footer>
        <p>© 2026 Bookstore Inc.</p>
    </footer>
</body>
</html>

Key Teaching Points:
- HTML uses nested tags to create a hierarchical Document Object Model (DOM)
- Elements have relationships: parent/child (<main> contains <div class="product">), siblings (the two product divs), ancestors/descendants
- Attributes like id, class, and custom data-* attributes provide hooks for selection
- Right-click → “Inspect” in browsers reveals this structure visually
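The relationships listed above can also be checked from R. The following is a minimal sketch, assuming rvest (version 1.0 or later, for minimal_html()) and xml2 are installed. It parses a shortened copy of the bookstore markup from a string, so no network access is needed. The object names (doc, book1, parent_tag, and so on) are illustrative only.

```r
library(rvest)
library(xml2)

# Shortened copy of the bookstore markup, parsed from a string
doc <- minimal_html('
  <main>
    <div class="product" id="book1">
      <h2 class="title">Data Science Handbook</h2>
      <span class="price">$34.99</span>
    </div>
    <div class="product" id="book2">
      <h2 class="title">Web Scraping Basics</h2>
      <span class="price">$29.99</span>
    </div>
  </main>')

book1 <- html_element(doc, "#book1")

# Parent: the element that directly contains book1
parent_tag <- html_name(xml_parent(book1))

# Children: the elements directly inside book1
child_tags <- html_name(html_children(book1))

# Siblings: other elements that share the same parent
sibling_ids <- html_attr(xml_siblings(book1), "id")

print(parent_tag)
print(child_tags)
print(sibling_ids)
```

Walking up and down the tree like this is a quick way to confirm what the browser's “Inspect” view shows before writing a selector.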

B. CSS Selector Fundamentals

Selector Pattern	Matches	Example from HTML Above
`h2`	All `<h2>` elements	Both book titles
`.price`	All elements with `class="price"`	Both price spans
`#book1`	Element with `id="book1"`	First book container only
`div.product`	`<div>` elements with class “product”	Both book containers
`span.stock[data-available="true"]`	Stock spans whose `data-available` attribute equals "true"	Only first book’s stock indicator
`main > div`	Direct child divs of `<main>`	Both product divs (not nested deeper elements)
`h2 + span`	Span immediately following an h2	Each price span (adjacent sibling after title)
`h2 ~ span`	All span siblings after an h2 (same parent)	Both the price and stock spans in each product
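To see how the selectors in the table behave, they can be run against the bookstore markup in R. This is a minimal sketch, assuming rvest 1.0 or later. The markup is a hand-copied version of the <main> block from Part A, and the names shop, selectors, and match_counts are illustrative.

```r
library(rvest)

# The <main> block from the bookstore example, parsed from a string
shop <- minimal_html('
  <main>
    <div class="product" id="book1">
      <h2 class="title">Data Science Handbook</h2>
      <span class="price">$34.99</span>
      <p class="description">Comprehensive guide to modern data science</p>
      <span class="stock" data-available="true">In Stock</span>
    </div>
    <div class="product" id="book2">
      <h2 class="title">Web Scraping Basics</h2>
      <span class="price">$29.99</span>
      <p class="description">Learn to extract data from websites</p>
      <span class="stock" data-available="false">Out of Stock</span>
    </div>
  </main>')

# Selectors from the table (single quotes where the selector contains ")
selectors <- c(
  "h2", ".price", "#book1", "div.product",
  'span.stock[data-available="true"]',
  "main > div", "h2 + span", "h2 ~ span"
)

# Count how many elements each selector matches
match_counts <- vapply(
  selectors,
  function(sel) length(html_elements(shop, sel)),
  integer(1)
)
print(match_counts)

# Text of the element(s) matched by the attribute selector
available_text <- html_text(
  html_elements(shop, 'span.stock[data-available="true"]')
)
print(available_text)
```

Comparing the counts with the “Example” column of the table is a useful sanity check. In particular, `h2 + span` and `h2 ~ span` should return different numbers of matches.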

Interactive Exercise:

Open the browser developer tools on any news website.
Right-click a headline → “Inspect” to see its HTML.
In the Console tab, run document.querySelectorAll("h2") to see all matches.
Refine the selector (change "h2" to something else) until it captures ONLY the desired elements.

C. Basic Scraping Workflow with `rvest`

library(rvest)
library(stringr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# STEP 1: Read HTML from URL
page <- read_html("https://books.toscrape.com/catalogue/page-1.html")

# STEP 2: Select elements using CSS selectors
titles <- page |> 
  html_elements("article.product_pod h3 a") |> 
  html_attr("title")  # Extract title attribute directly

prices <- page |> 
  html_elements("p.price_color") |> 
  html_text() |> 
  trimws()

availability <- page |> 
  html_elements("p.instock.availability") |> 
  html_text() |>
  str_trim()

# STEP 3: Extract star ratings (stored as class names like "star-rating Three")
ratings <- page |> 
  html_elements("article.product_pod p.star-rating") |> 
  html_attr("class") |> 
  str_replace("star-rating ", "") # Extract rating word

# STEP 4: Combine into structured data frame
books_df <- data.frame(
  title = titles,
  price = prices,
  stock_info = availability,
  rating = ratings,
  stringsAsFactors = FALSE
)

# Clean price column (remove £ symbol and convert to numeric)
books_df$price_numeric <- as.numeric(gsub("[£]", "", books_df$price))

Key points:
- Pipe operator (|>) creates readable, sequential workflow
- html_text() extracts visible content; html_attr() extracts attribute values
- Always clean extracted data (remove currency symbols, whitespace)
- Test selectors on small sample before scaling up
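The difference between html_text() and html_attr(), and the effect of cleaning, can be tried offline on a single book card. This is a minimal sketch, assuming rvest 1.0 or later. The markup is hand-written to match the selectors used above and is not copied from the live site, and all object names are illustrative.

```r
library(rvest)

# One hand-written book card that matches the selectors used above
# (\u00a3 is the pound sign, written as an escape to avoid encoding issues)
card <- minimal_html('
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="a-light-in-the-attic_1000/index.html"
           title="A Light in the Attic">A Light in the ...</a></h3>
    <p class="price_color">\u00a351.77</p>
    <p class="instock availability">
        In stock
    </p>
  </article>')

link <- html_element(card, "h3 a")

# html_text() returns the visible (here: truncated) link text
visible_title <- html_text(link)

# html_attr() returns the value of the named attribute
full_title <- html_attr(link, "title")

# Raw text keeps the surrounding line breaks and spaces until trimmed
raw_stock <- html_text(html_element(card, "p.instock.availability"))
clean_stock <- trimws(raw_stock)

# Keep only digits and the decimal point, then convert to numeric
raw_price <- html_text(html_element(card, "p.price_color"))
price_value <- as.numeric(gsub("[^0-9.]", "", raw_price))

print(visible_title)
print(full_title)
print(nchar(raw_stock) > nchar(clean_stock))
print(price_value)
```

Dropping everything except digits and the decimal point is one possible design choice. It avoids typing the currency symbol, but it would also silently discard a minus sign or a thousands separator, so check the raw values first.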

D. Advanced Scraping Patterns with Example

library(rvest)
library(dplyr)
library(stringr)

# SCENARIO: Multi-page scraping from books.toscrape.com
# This site is EXPLICITLY designed for learning web scraping (ethical & safe to use)
# Structure: 50 pages of books at https://books.toscrape.com/catalogue/page-{n}.html

scrape_books <- function(max_pages = 3) {
  all_books <- list()
  base_url <- "https://books.toscrape.com/catalogue/page-%d.html"
  
  for (page_num in 1:max_pages) {
    # Construct URL without trailing spaces (critical!)
    page_url <- sprintf(base_url, page_num)
    
    cat("Scraping page", page_num, ":", page_url, "\n")
    
    # Fetch page with error handling
    page <- tryCatch({
      read_html(page_url)
    }, error = function(e) {
      cat("ERROR on page", page_num, ":", e$message, "\n")
      return(NULL)
    })
    
    # Skip if page failed to load
    if (is.null(page)) {
      next
    }
    
    # Extract book containers
    book_articles <- html_elements(page, "article.product_pod")
    
    # Process each book on the page
    page_data <- lapply(book_articles, function(book) {
      list(
        title = html_attr(html_element(book, "h3 a"), "title", default = NA),
        price = html_text(html_element(book, "p.price_color"), trim = TRUE),
        stock = html_text(html_element(book, "p.instock.availability"), trim = TRUE),
        rating = html_attr(html_element(book, "p.star-rating"), "class"),
        url = paste0("https://books.toscrape.com/catalogue/", 
                     html_attr(html_element(book, "h3 a"), "href", default = NA))
      )
    })

    # Convert to data frame
    page_df <- bind_rows(page_data) |>
      mutate(
        price_numeric = as.numeric(str_remove(price, "£")),
        rating = str_replace(rating, "star-rating ", ""),
        page_scraped = page_num
      )
    
    all_books[[page_num]] <- page_df
    
    # BE POLITE: wait 1-2 seconds between requests to avoid overloading the server
    Sys.sleep(runif(1, 1, 2))
  }
  
  # Combine all pages and return
  if (length(all_books) > 0) {
    bind_rows(all_books)
  } else {
    data.frame()
  }
}

# Execute scraping (limit to 3 pages for classroom demonstration)
books_data <- scrape_books(max_pages = 3)

Scraping page 1 : https://books.toscrape.com/catalogue/page-1.html 
Scraping page 2 : https://books.toscrape.com/catalogue/page-2.html 
Scraping page 3 : https://books.toscrape.com/catalogue/page-3.html

# View results
cat("\nSuccessfully scraped", nrow(books_data), "books from", 
    length(unique(books_data$page_scraped)), "pages\n")


Successfully scraped 60 books from 3 pages

head(books_data[, c("title", "price", "stock", "rating")])

# A tibble: 6 × 4
  title                                 price  stock    rating
  <chr>                                 <chr>  <chr>    <chr> 
1 A Light in the Attic                  £51.77 In stock Three 
2 Tipping the Velvet                    £53.74 In stock One   
3 Soumission                            £50.10 In stock One   
4 Sharp Objects                         £47.82 In stock Four  
5 Sapiens: A Brief History of Humankind £54.23 In stock Five  
6 The Requiem Red                       £22.65 In stock One
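The rating column above still holds words rather than numbers. One possible follow-up cleaning step, sketched here with base R only, is a named lookup vector. The vector rating_words simply re-types the six ratings printed above, and the names rating_levels and rating_numeric are illustrative.

```r
# Ratings from the six rows printed above
rating_words <- c("Three", "One", "One", "Four", "Five", "One")

# Lookup table: rating word -> number of stars
rating_levels <- c(One = 1L, Two = 2L, Three = 3L, Four = 4L, Five = 5L)

# Index the lookup table by name, then drop the names
rating_numeric <- unname(rating_levels[rating_words])
print(rating_numeric)
```

A word that is not in the lookup table would come back as NA, which makes unexpected values easy to spot.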

Part 2: Student Practice Questions

HTML and DOM Fundamentals

In the following HTML snippet, identify the parent element of the <span class="price"> element:

<div class="product">
  <h2>Book Title</h2>
  <span class="price">$24.99</span>
</div>

What is the difference between an element’s id attribute and its class attribute in HTML? When would you use each for web scraping?

Describe the relationship between these elements in the DOM tree:

<body>
  <div id="container">
    <p>First paragraph</p>
    <section>
      <p>Second paragraph</p>
    </section>
  </div>
</body>

Specifically, what is the relationship between the two <p> elements?

Why can’t the rvest package scrape content that loads dynamically via JavaScript? What alternative tools might handle this scenario?

CSS Selector Syntax

Write a CSS selector that would match all <a> elements that have an href attribute starting with “https://”.

Given this HTML:

<div class="product featured">
  <h3>Special Item</h3>
  <span class="price sale">$19.99</span>
</div>

Write a selector that matches ONLY the price span (not other spans on the page).

Explain the difference between these two selectors: div p versus div > p. Provide an HTML example where they would return different results.
How would you select the third <li> element within an unordered list using CSS selectors?
Write a selector to find all elements with a data-category attribute that contains the value “electronics” (e.g., data-category="home-electronics").

Given this HTML structure, write a selector to target ONLY the “Out of Stock” message:

<div class="product">
  <h2>Laptop</h2>
  <span class="stock available">In Stock</span>
</div>
<div class="product">
  <h2>Tablet</h2>
  <span class="stock unavailable">Out of Stock</span>
</div>

`rvest` Implementation

What is the purpose of the trimws() function when used after html_text()? Provide an example where it would change the extracted result.²
When would you use html_element() (singular) versus html_elements() (plural) in rvest? What happens if the selector doesn’t match anything in each case?
You’re trying to extract image URLs from a webpage. The images appear as <img src="photo.jpg" alt="Product">. Which rvest function would you use to extract the URLs, and what argument would you specify?
After running tables <- page |> html_elements("table") |> html_table(), what R data structure does tables contain? How would you access the second table in this collection?
In the teacher demonstration (Part 1D), the book titles are extracted with html_attr(html_element(book, "h3 a"), "title", default = NA). What does the default argument do, and what value do you get back if a particular book happens to have no <a> element with a title attribute? Why is this helpful when scraping many products at once?

Workflow and Best Practices

You need to scrape 50 product pages where URLs follow the pattern https://store.com/item?id=1, https://store.com/item?id=2, etc. Write R code to generate these URLs programmatically using paste0().
Why should you include Sys.sleep(2) between scraping requests? What ethical and practical problems does this prevent?
Before scraping a new website, what three checks should you perform to ensure your scraping is legal and ethical?
You’ve scraped raw price data as character strings: c("$19.99", "$24.50", "$9.75"). Write R code to convert this to a numeric vector of values (19.99, 24.50, 9.75).
After successfully scraping data, why is it important to save the raw extracted data (before cleaning) using write_csv() or write.csv()? Describe a scenario where this practice would save significant time.

Footnotes

¹ File > New File > Quarto Document.
² Hint: check out ?trimws().