Tutorial 3

Instructions

  1. Create a new Quarto document (tutorial3.qmd) in a folder designated for this course.1
  2. For each question, include:
    • The question number and text
    • Your R code in a code chunk
    • Brief explanation of your approach (for conceptual questions)
  3. Make sure your YAML header (the first lines of your .qmd document) looks approximately as follows:
---
title: Tutorial 3
format: html
author: Your Name And Student No.
---
  4. Render your document to HTML to verify that all code executes correctly (click “Preview” in Positron).

Part 1: Teacher Demonstration

A. Understanding HTML Structure

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Book Store</title>
</head>
<body>
    <header>
        <h1>Online Bookstore</h1>
        <nav>
            <ul>
                <li><a href="/fiction">Fiction</a></li>
                <li><a href="/nonfiction">Non-Fiction</a></li>
            </ul>
        </nav>
    </header>
    
    <main>
        <div class="product" id="book1">
            <h2 class="title">Data Science Handbook</h2>
            <span class="price">$34.99</span>
            <p class="description">Comprehensive guide to modern data science</p>
            <span class="stock" data-available="true">In Stock</span>
        </div>
        
        <div class="product" id="book2">
            <h2 class="title">Web Scraping Basics</h2>
            <span class="price">$29.99</span>
            <p class="description">Learn to extract data from websites</p>
            <span class="stock" data-available="false">Out of Stock</span>
        </div>
    </main>
    
    <footer>
        <p>© 2026 Bookstore Inc.</p>
    </footer>
</body>
</html>

Key Teaching Points:

  • HTML uses nested tags to create a hierarchical Document Object Model (DOM)
  • Elements have relationships: parent/child (<main> contains <div class="product">), siblings (the two product divs), ancestors/descendants
  • Attributes like id, class, and custom data-* attributes provide hooks for selection
  • Right-click → “Inspect” in browsers reveals this structure visually
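These relationships can also be explored programmatically. A small sketch using rvest (plus xml2 for tree navigation) on a fragment of the bookstore page; minimal_html() builds a complete document around the fragment:

```r
library(rvest)
library(xml2)

# Parse a small fragment of the bookstore page
doc <- minimal_html('
  <div class="product" id="book1">
    <h2 class="title">Data Science Handbook</h2>
    <span class="price">$34.99</span>
  </div>')

price <- html_element(doc, "span.price")

html_text(price)                    # "$34.99"
html_name(xml_parent(price))        # "div": the parent element
html_attr(xml_parent(price), "id")  # "book1"
```

xml_parent() walks one step up the DOM tree, which mirrors what you see when you click the parent element in the browser inspector.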

B. CSS Selector Fundamentals

| Selector Pattern | Matches | Example from HTML Above |
|---|---|---|
| h2 | All <h2> elements | Both book titles |
| .price | All elements with class="price" | Both price spans |
| #book1 | Element with id="book1" | First book container only |
| div.product | <div> elements with class "product" | Both book containers |
| span.stock[data-available="true"] | Stock spans with data-available="true" | Only first book's stock indicator |
| main > div | Direct child <div> elements of <main> | Both product divs (not elements nested deeper) |
| h2 + span | <span> immediately following an <h2> | Each price span (direct sibling after title) |
| h1 ~ p | All <p> siblings after an <h1>, within the same parent | No matches here: the footer's <p> is not a sibling of <h1> (they have different parents) |
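You can verify each pattern in R before ever touching a live site. A sketch that reuses the bookstore HTML from section A and counts matches per selector:

```r
library(rvest)

# Rebuild the relevant part of the demo page as an in-memory document
doc <- minimal_html('
  <main>
    <div class="product" id="book1">
      <h2 class="title">Data Science Handbook</h2>
      <span class="price">$34.99</span>
      <span class="stock" data-available="true">In Stock</span>
    </div>
    <div class="product" id="book2">
      <h2 class="title">Web Scraping Basics</h2>
      <span class="price">$29.99</span>
      <span class="stock" data-available="false">Out of Stock</span>
    </div>
  </main>')

length(html_elements(doc, "h2"))        # 2: both titles
length(html_elements(doc, "#book1"))    # 1: an id is unique
length(html_elements(doc, "main > div"))  # 2: direct children only

# Attribute selector picks out exactly one stock indicator
html_text(html_elements(doc, 'span.stock[data-available="true"]'))  # "In Stock"

# Adjacent-sibling selector: the span right after each h2
html_text(html_elements(doc, "h2 + span"))  # "$34.99" "$29.99"
```

Counting matches this way is the R equivalent of running document.querySelectorAll() in the browser console.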

Interactive Exercise:

  1. Open browser developer tools on any news website
  2. Right-click a headline → “Inspect” to see its HTML
  3. In Console tab, test: document.querySelectorAll("h2") to see all matches
  4. Refine selector (change “h2” to something else) until it captures ONLY the desired elements.

C. Basic Scraping Workflow with rvest

library(rvest)
library(stringr)
library(dplyr)

# STEP 1: Read HTML from URL
page <- read_html("https://books.toscrape.com/catalogue/page-1.html")

# STEP 2: Select elements using CSS selectors
titles <- page |> 
  html_elements("article.product_pod h3 a") |> 
  html_attr("title")  # Extract title attribute directly

prices <- page |> 
  html_elements("p.price_color") |> 
  html_text() |> 
  trimws()

availability <- page |> 
  html_elements("p.instock.availability") |> 
  html_text() |>
  str_trim()

# STEP 3: Extract star ratings (stored as class names like "star-rating Three")
ratings <- page |> 
  html_elements("article.product_pod p.star-rating") |> 
  html_attr("class") |> 
  str_replace("star-rating ", "") # Extract rating word

# STEP 4: Combine into structured data frame
books_df <- data.frame(
  title = titles,
  price = prices,
  stock_info = availability,
  rating = ratings,
  stringsAsFactors = FALSE
)

# Clean price column (remove £ symbol and convert to numeric)
books_df$price_numeric <- as.numeric(gsub("[£]", "", books_df$price))

Key points:

  • The pipe operator (|>) creates a readable, sequential workflow
  • html_text() extracts visible content; html_attr() extracts attribute values
  • Always clean extracted data (remove currency symbols, whitespace)
  • Test selectors on a small sample before scaling up
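The price-cleaning step generalises beyond a single currency symbol. A short sketch of two equivalent approaches; readr::parse_number() is an alternative not used in the code above:

```r
library(readr)

prices_raw <- c("£51.77", "£53.74", " £50.10 ")

# The gsub() approach from above, generalised to strip any non-numeric character
as.numeric(gsub("[^0-9.]", "", prices_raw))  # 51.77 53.74 50.10

# readr::parse_number() drops currency symbols and whitespace directly
parse_number(prices_raw)                     # 51.77 53.74 50.10
```

parse_number() is the more robust choice when prices mix symbols, thousands separators, or stray whitespace.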

D. Advanced Scraping Patterns with Example

library(rvest)
library(dplyr)
library(stringr)

# SCENARIO: Multi-page scraping from books.toscrape.com
# This site is EXPLICITLY designed for learning web scraping (ethical & safe to use)
# Structure: 50 pages of books at https://books.toscrape.com/catalogue/page-{n}.html

scrape_books <- function(max_pages = 3) {
  all_books <- list()
  base_url <- "https://books.toscrape.com/catalogue/page-%d.html"
  
  for (page_num in 1:max_pages) {
    # Fill the page number into the URL template
    page_url <- sprintf(base_url, page_num)
    
    cat("Scraping page", page_num, ":", page_url, "\n")
    
    # Fetch page with error handling
    page <- tryCatch({
      read_html(page_url)
    }, error = function(e) {
      cat("ERROR on page", page_num, ":", e$message, "\n")
      return(NULL)
    })
    
    # Skip if page failed to load
    if (is.null(page)) {
      next
    }
    
    # Extract book containers
    book_articles <- html_elements(page, "article.product_pod")
    
    # Process each book on the page
    page_data <- lapply(book_articles, function(book) {
      list(
        title = html_attr(html_element(book, "h3 a"), "title", default = NA),
        price = html_text(html_element(book, "p.price_color"), trim = TRUE),
        stock = html_text(html_element(book, "p.instock.availability"), trim = TRUE),
        rating = html_attr(html_element(book, "p.star-rating"), "class"),
        url = paste0("https://books.toscrape.com/catalogue/", 
                     html_attr(html_element(book, "h3 a"), "href", default = NA))
      )
    })

    # Convert to data frame
    page_df <- bind_rows(page_data) |>
      mutate(
        price_numeric = as.numeric(str_remove(price, "£")),
        rating = str_replace(rating, "star-rating ", ""),
        page_scraped = page_num
      )
    
    all_books[[page_num]] <- page_df
    
    # BE POLITE: wait 1-2 seconds between requests (standard scraping etiquette)
    Sys.sleep(runif(1, 1, 2))
  }
  
  # Combine all pages and return
  if (length(all_books) > 0) {
    bind_rows(all_books)
  } else {
    data.frame()
  }
}

# Execute scraping (limit to 3 pages for classroom demonstration)
books_data <- scrape_books(max_pages = 3)
Scraping page 1 : https://books.toscrape.com/catalogue/page-1.html 
Scraping page 2 : https://books.toscrape.com/catalogue/page-2.html 
Scraping page 3 : https://books.toscrape.com/catalogue/page-3.html 
# View results
cat("\nSuccessfully scraped", nrow(books_data), "books from", 
    length(unique(books_data$page_scraped)), "pages\n")

Successfully scraped 60 books from 3 pages
head(books_data[, c("title", "price", "stock", "rating")])
# A tibble: 6 × 4
  title                                 price  stock    rating
  <chr>                                 <chr>  <chr>    <chr> 
1 A Light in the Attic                  £51.77 In stock Three 
2 Tipping the Velvet                    £53.74 In stock One   
3 Soumission                            £50.10 In stock One   
4 Sharp Objects                         £47.82 In stock Four  
5 Sapiens: A Brief History of Humankind £54.23 In stock Five  
6 The Requiem Red                       £22.65 In stock One   
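Once the pages are combined, the usual dplyr verbs apply. A quick sketch; the tibble below is a tiny stand-in so the snippet runs offline (replace it with the real scrape_books() result, whose numbers depend on the live site):

```r
library(dplyr)

# Stand-in for books_data; substitute the output of scrape_books()
books_data <- tibble(
  title         = c("A Light in the Attic", "Tipping the Velvet", "Soumission"),
  price_numeric = c(51.77, 53.74, 50.10),
  rating        = c("Three", "One", "One")
)

# How are ratings distributed?
books_data |>
  count(rating, sort = TRUE)

# Basic price summary
books_data |>
  summarise(
    n_books   = n(),
    avg_price = mean(price_numeric),
    max_price = max(price_numeric)
  )
```

Summaries like these are also a useful sanity check that the scrape returned the expected number of rows and that price cleaning produced sensible numbers.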

Part 2: Student Practice Questions

HTML and DOM Fundamentals

  1. In the following HTML snippet, identify the parent element of the <span class="price"> element:

    <div class="product">
      <h2>Book Title</h2>
      <span class="price">$24.99</span>
    </div>
  2. What is the difference between an element’s id attribute and its class attribute in HTML? When would you use each for web scraping?

  3. Describe the relationship between these elements in the DOM tree:

    <body>
      <div id="container">
        <p>First paragraph</p>
        <section>
          <p>Second paragraph</p>
        </section>
      </div>
    </body>

    Specifically, what is the relationship between the two <p> elements?

  4. Why can’t the rvest package scrape content that loads dynamically via JavaScript? What alternative tools might handle this scenario?

CSS Selector Syntax

  1. Write a CSS selector that would match all <a> elements that have an href attribute starting with “https://”.

  2. Given this HTML:

    <div class="product featured">
      <h3>Special Item</h3>
      <span class="price sale">$19.99</span>
    </div>

    Write a selector that matches ONLY the price span (not other spans on the page).

  3. Explain the difference between these two selectors: div p versus div > p. Provide an HTML example where they would return different results.

  4. How would you select the third <li> element within an unordered list using CSS selectors?

  5. Write a selector to find all elements with a data-category attribute that contains the value “electronics” (e.g., data-category="home-electronics").

  6. Given this HTML structure, write a selector to target ONLY the “Out of Stock” message:

    <div class="product">
      <h2>Laptop</h2>
      <span class="stock available">In Stock</span>
    </div>
    <div class="product">
      <h2>Tablet</h2>
      <span class="stock unavailable">Out of Stock</span>
    </div>

rvest Implementation

  1. What is the purpose of the trimws() function when used after html_text()? Provide an example where it would change the extracted result.2

  2. When would you use html_element() (singular) versus html_elements() (plural) in rvest? What happens if the selector doesn’t match anything in each case?

  3. You’re trying to extract image URLs from a webpage. The images appear as <img src="photo.jpg" alt="Product">. Which rvest function would you use to extract the URLs, and what argument would you specify?

  4. After running tables <- page |> html_elements("table") |> html_table(), what R data structure does tables contain? How would you access the second table in this collection?

  5. When its selector matches nothing, html_element() returns a missing value (NA) rather than raising an error. Why is this behaviour important when scraping production websites, where some items may lack a field? What misalignment problem could occur if you instead extracted each field separately with html_elements() across the whole page?

Workflow and Best Practices

  1. You need to scrape 50 product pages where URLs follow the pattern https://store.com/item?id=1, https://store.com/item?id=2, etc. Write R code to generate these URLs programmatically using paste0().

  2. Why should you include Sys.sleep(2) between scraping requests? What ethical and practical problems does this prevent?

  3. Before scraping a new website, what three checks should you perform to ensure your scraping is legal and ethical?

  4. You’ve scraped raw price data as character strings: c("$19.99", "$24.50", "$9.75"). Write R code to convert this to a numeric vector of values (19.99, 24.50, 9.75).

  5. After successfully scraping data, why is it important to save the raw extracted data (before cleaning) using write_csv or write.csv()? Describe a scenario where this practice would save significant time.

Footnotes

  1. File > New File > Quarto Document.↩︎

  2. Hint: check out ?trimws().↩︎