Introduction to Applied Data Science

Lecture 3: Getting Data

Bas Machielsen

Overview

Course Schedule

Event Date Subject
Lecture 1 21-04 Introduction to Data and Data Science
Lecture 2 28-04 Getting Data: APIs and Databases
Lecture 3 07-05 Getting Data: Web Scraping
Lecture 4 26-05 Text as Data
Lecture 5 27-05 Introduction to LLMs
Lecture 6 09-06 Prompt Engineering and Structured Data
Lecture 7 16-06 Spatial Data and Geocomputation

Outline Today

  • First part: an introduction to web scraping, HTML, and CSS selectors.
  • Second part: web scraping implemented in R through the rvest package.

Web Scraping with R: From HTML to Data

Introduction: What is Web Scraping?

  • Web scraping is the process of automatically extracting information from websites.
    • Think of it as teaching your computer to read websites and copy specific information into a spreadsheet, just like you might manually copy prices from an online store.
    • Every day, businesses use web scraping to track competitor prices, monitor news, collect real estate listings, or gather social media data.
    • Instead of clicking through hundreds of pages and copying information by hand, we write code that does this work in seconds.

The Internet: A Simple Model

  • When you visit a website, your computer (the client) sends a request to another computer (the server) asking for a webpage.
  • The server responds by sending back a document written in HTML, which is the language that describes what the webpage should look like.
    • This conversation between your computer and the server follows rules called the HTTP protocol, which stands for Hypertext Transfer Protocol.
    • Think of HTTP as the postal system of the internet: you send a letter (request) asking for information, and you receive a package (response) containing the webpage.

HTTP Requests and Responses

  • When you type a URL into your browser and press Enter, you’re sending an HTTP GET request.
    • A GET request is like asking “Can I please see this page?”
    • The server then sends back an HTTP response, which includes a status code and the actual content.
    • A status code of 200 means “Success! Here’s what you asked for,” while 404 means “Sorry, I can’t find that page.” Other important status codes include 403 (forbidden, you don’t have permission) and 500 (server error, something went wrong on their end).
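Status codes can be checked directly from R before scraping. A minimal sketch using the httr package (an assumption: httr is a separate package from rvest and may need installing first; the URL is just an illustration):

```r
# install.packages("httr")  # if not yet installed
library(httr)

# Send a GET request for the page, just as a browser would
response <- GET("https://example.com")

# 200 signals success; 403, 404, and 500 signal problems
status_code(response)
```

rvest's read_html() performs this request for you behind the scenes, but checking the status code explicitly is useful when a page refuses to load.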

What is HTML?

  • When you navigate to a particular website, you are essentially sending a GET request for an HTML document.
  • HTML stands for Hypertext Markup Language, and it’s the skeleton of every webpage.
  • HTML is written as plain text with special markers called tags that tell the browser how to display content.
    • A simple HTML document looks like a nested set of boxes, where each box contains either text or more boxes.
    • Tags come in pairs, with an opening tag like <p> and a closing tag like </p>, and the content goes between them.

HTML Structure: A Document Example

  • Every HTML document starts with <html> and ends with </html>, which wraps everything.
  • Inside, there are two main sections: the <head> contains metadata like the page title, and the <body> contains the visible content.
    • Headings are marked with tags like <h1> for the main title, <h2> for subtitles, and so on.
    • Paragraphs use <p> tags, links use <a> tags, and images use <img> tags.
    • Lists can be ordered (numbered) using <ol> and <li> tags, or unordered (bulleted) using <ul> and <li>.
    • Tables are built with <table>, <tr> (table row), <th> (table header), and <td> (table data) tags.

Example: HTML Page

Example: HTML Page

The following example contains the HTML code for a very basic web page. Try to copy this code in a text document, name it site.html, and open it with your browser.

<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <p>First Paragraph</p>
    <div>Second Paragraph</div>
    <ol>
      <li>Coffee</li>
      <li>Tea</li>
      <li>Milk</li>
    </ol>
  </body>
</html>

Example: HTML Page with Table

Example: HTML Table

The following HTML code contains a table with data inside.

<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Simple Table</title>
</head>
<body>
    <h1>Fruit Prices</h1>
    
    <table border="1">
        <thead>
            <tr>
                <th>Fruit</th>
                <th>Price</th>
                <th>In Stock</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Apple</td>
                <td>$1.50</td>
                <td>Yes</td>
            </tr>
            <tr>
                <td>Banana</td>
                <td>$0.75</td>
                <td>Yes</td>
            </tr>
            <tr>
                <td>Orange</td>
                <td>$2.00</td>
                <td>No</td>
            </tr>
        </tbody>
    </table>
</body>
</html>

How Browsers Render HTML

  • When your browser receives HTML from a server, it reads through the document and interprets the tags.
  • The browser builds a tree-like structure in memory called the DOM (Document Object Model), which represents the page’s hierarchy.
    • Think of the DOM as a family tree: the <html> tag is the great-grandparent, <body> is a grandparent, and nested tags like <p> and <span> are children and grandchildren.
    • The browser then displays this structure visually, showing text, images, and interactive elements according to the HTML instructions.

CSS and CSS Selectors

What is CSS?

  • As you might have noticed from opening the .html file in your browser, the simple web page doesn’t look very pretty.
  • CSS stands for Cascading Style Sheets, and it’s the language that makes websites look attractive.
  • While HTML provides the structure and content, CSS controls colors, fonts, spacing, layouts, and visual design.
    • Think of HTML as the blueprint of a house (where the rooms and walls go) and CSS as the interior design (paint colors, furniture, decorations).
    • CSS rules can be embedded in HTML or loaded from separate files, and they tell the browser “make all headings blue” or “add space around paragraphs.”

CSS Selectors: Finding Elements

  • CSS selectors are patterns that identify specific elements in an HTML document.
  • The most basic selector is the tag name itself: p selects all paragraphs, h1 selects all main headings, and table selects all tables.
    • Elements can have an id attribute (a unique name for one element) selected with a hash: #main-title finds the element with id="main-title".
    • Elements can also have a class attribute (a name shared by multiple elements) selected with a dot: .product-price finds all elements with class="product-price".
    • You can combine selectors to be more specific: div.important means “find <div> elements that have class ‘important’.”
    • Descendant selectors use spaces: table tr td means “find <td> elements that are inside <tr> elements that are inside <table> elements.”

Why CSS Selectors Matter for Web Scraping

  • When we scrape a website, we don’t want all the HTML—we want specific pieces of information.
  • CSS selectors let us precisely target the data we need, like prices, product names, or dates.
    • Imagine a webpage with 100 paragraphs, but you only want the paragraph with class “author-bio”—CSS selectors make this possible.
    • By inspecting a website’s HTML (using browser developer tools), we can discover which selectors will extract our desired data.

The HTML Family Tree: Understanding Relationships

  • HTML elements are nested inside each other, creating a family tree structure that we call the Document Object Model (DOM).
    • When one element is directly inside another, we call the outer element the “parent” and the inner element the “child.”
    • For example, in <div><p>Hello</p></div>, the <div> is the parent and the <p> is the child.
  • Elements that share the same parent are called “siblings”—they’re at the same level in the tree.
  • An element that contains another element (directly or through multiple levels) is called an “ancestor,” while the contained element is a “descendant.”
  • Think of it like a real family: your parents are your ancestors, your children are your descendants, and your brothers and sisters are your siblings.

Basic CSS Selectors: The Foundation

  • The tag selector targets all elements of a specific type: p selects all paragraphs, div selects all divs, and a selects all links.
  • The class selector uses a dot and targets elements with a specific class attribute: .price selects all elements with class="price".
  • The id selector uses a hash and targets the one element with a specific id: #header selects the element with id="header".
  • The universal selector is an asterisk * that selects every element on the page (rarely used in scraping, but good to know).
    • You can combine tag and class selectors for more specificity: p.important means “paragraphs that have class ‘important’,” not just any element with that class.
    • Similarly, div#main means “the div element with id ‘main’,” though this is redundant since ids should be unique anyway.

Example: HTML Code and CSS Selectors

Example: HTML Code and CSS Selectors

How would you select all <p> elements? And how would you select all elements with class price?

<!DOCTYPE html>
<html>
<body>
    <div id="header">Header</div>
    <p>Regular paragraph</p>
    <p class="important">Important paragraph</p>
    <span class="price">$10</span>
    <a href="#">Link</a>
</body>
</html>
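These selectors can already be tried out in R with the rvest package introduced in the second part of this lecture; here the snippet above is passed to read_html() as an inline string (a sketch, previewing functions explained later):

```r
library(rvest)

doc <- read_html('<body>
    <div id="header">Header</div>
    <p>Regular paragraph</p>
    <p class="important">Important paragraph</p>
    <span class="price">$10</span>
    <a href="#">Link</a>
</body>')

html_elements(doc, "p") |> html_text()        # both paragraphs
html_elements(doc, ".price") |> html_text()   # "$10"
html_elements(doc, "#header") |> html_text()  # "Header"
```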

Relationship Selectors

  • The descendant selector uses a space and finds elements nested anywhere inside another: div p means “any <p> that is somewhere inside a <div>,” no matter how deep.
  • The child selector uses > and finds only direct children: div > p means “any <p> that is directly inside a <div>,” not grandchildren or deeper.
  • The adjacent sibling selector uses + and finds an element immediately after another: h1 + p means “a <p> that comes right after an <h1>.”
  • The general sibling selector uses ~ and finds all siblings after an element: h1 ~ p means “all <p> elements that come after an <h1> at the same level.”
    • These relationship selectors help when classes or ids aren’t available, or when you need to be very precise about which elements you want.

Example: Relationship Selectors

Example: Relationship Selectors

div p matches all 4 paragraphs inside divs (including the two nested inside a <section>).

div > p matches only the 2 paragraphs that are direct children of a div, not the ones nested deeper.

h1 + p only matches the first p after h1.

h1 ~ p matches all 3 paragraphs after h1.

<!DOCTYPE html>
<html>
<body>
    <div>
        <p>Descendant: div p (any p inside div)</p>
        <section>
            <p>Also matches div p (nested deeper)</p>
        </section>
    </div>
    
    <div>
        <p>Child: div > p (direct child only)</p>
        <section><p>NOT a child of div (it's grandchild)</p></section>
    </div>
    
    <h1>Title</h1>
    <p>Adjacent sibling: h1 + p (right after h1)</p>
    <p>General sibling: h1 ~ p (any p after h1)</p>
    <p>Also matches: h1 ~ p</p>
</body>
</html>
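You can verify these counts with rvest, using a condensed version of the page above:

```r
library(rvest)

doc <- read_html('<body>
  <div><p>A</p><section><p>B</p></section></div>
  <div><p>C</p><section><p>D</p></section></div>
  <h1>Title</h1>
  <p>E</p><p>F</p><p>G</p>
</body>')

length(html_elements(doc, "div p"))    # 4: descendants at any depth
length(html_elements(doc, "div > p"))  # 2: direct children only
length(html_elements(doc, "h1 + p"))   # 1: the paragraph right after h1
length(html_elements(doc, "h1 ~ p"))   # 3: every later sibling paragraph
```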

Attribute Selectors: When Class and ID Aren’t Enough

  • Attribute selectors let you target elements based on any of their attributes, not just class and id.
  • The basic form [attribute] selects elements that have that attribute: [href] selects all elements with an href attribute (usually links).
    • You can match exact values with [attribute="value"]: [type="submit"] selects input elements with type submit (submit buttons).
    • The [attribute*="value"] form matches if the value appears anywhere: [href*="products"] finds links containing “products” in the URL.
    • The [attribute^="value"] form matches if the value starts with something: [href^="https"] finds all secure https links.
    • The [attribute$="value"] form matches if the value ends with something: [src$=".jpg"] finds all images with jpg extension.
    • These are powerful when scraping sites that use data attributes like data-price="29.99" or when filtering by URL patterns.

Example: Attribute Selectors

Example: Attribute Selectors

[href] matches all 3 <a> elements (they all have href). [type="submit"] matches only the submit button. [href*="products"] matches 2 links (Shoes and Shirts contain “products”). [href^="https"] matches 2 links (Shoes and Shirts start with “https”). [src$=".jpg"] matches only the photo.jpg image (it ends with .jpg). [data-price] matches both product divs (they have a data-price attribute).

<!DOCTYPE html>
<html>
<body>
    <a href="https://example.com/products/shoes">Shoes</a>
    <a href="http://example.com/about">About</a>
    <a href="https://example.com/products/shirts">Shirts</a>
    
    <input type="text" placeholder="Name">
    <input type="submit" value="Submit">
    
    <img src="photo.jpg" alt="Photo">
    <img src="logo.png" alt="Logo">
    
    <div data-price="29.99">Product A</div>
    <div data-price="49.99">Product B</div>
</body>
</html>
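A short rvest sketch of attribute selectors, using a trimmed version of the snippet above:

```r
library(rvest)

doc <- read_html('<body>
  <a href="https://example.com/products/shoes">Shoes</a>
  <a href="http://example.com/about">About</a>
  <a href="https://example.com/products/shirts">Shirts</a>
  <img src="photo.jpg" alt="Photo">
  <img src="logo.png" alt="Logo">
</body>')

# Links whose URL starts with https: Shoes and Shirts
html_elements(doc, "[href^='https']") |> html_attr("href")

# Images whose source ends in .jpg: only photo.jpg
html_elements(doc, "[src$='.jpg']") |> html_attr("src")
```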

Combining Selectors: Building Complex Queries

  • You can chain multiple conditions together without spaces to make selectors more specific: div.product.featured means “divs that have both ‘product’ and ‘featured’ classes.”
  • You can combine relationship selectors: table.prices > tbody > tr > td.amount navigates through the table structure to find specific cells.
    • Comma-separated selectors work like “or”: h1, h2, h3 means “select all h1 OR h2 OR h3 elements.”
    • The :nth-child() pseudo-class lets you select by position: tr:nth-child(2) selects the second row, li:nth-child(odd) selects odd-numbered list items.
    • The :first-child and :last-child pseudo-classes select the first or last child of a parent: tr:first-child gets the first row in a table.
    • The :not() pseudo-class excludes elements: p:not(.advertisement) means “all paragraphs except those with class ‘advertisement’.”
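A sketch of these pseudo-classes on a small, made-up price table:

```r
library(rvest)

doc <- read_html('<table>
  <tr><td>Apple</td><td>$1.50</td></tr>
  <tr><td>Banana</td><td>$0.75</td></tr>
  <tr><td>Orange</td><td>$2.00</td></tr>
</table>')

# Second cell of every row: the prices
html_elements(doc, "tr td:nth-child(2)") |> html_text()

# All cells of the first row only
html_elements(doc, "tr:first-child td") |> html_text()
```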

Other Examples: From Webpage to Selector

Example: CSS Selectors

Example 1: You want to scrape product prices from a list where each price is in <span class="price">$29.99</span>—use .price or span.price.

Example 2: You want the main article text, which is in <div id="article-content"><p>Text here...</p></div>—use #article-content p to get all paragraphs inside.

Example 3: You want table data only from the second column of each row: table tr td:nth-child(2) selects the second <td> in every row.

Example 4: You want all external links (those starting with http): a[href^="http"] finds them, excluding internal relative links.

Example 5: You want author names that appear right after article titles: h2.article-title + p.author uses the adjacent sibling selector.

The key is to inspect the HTML structure, identify unique patterns in tags, classes, ids, or relationships, and build the most specific selector that captures only what you need.

Browser Developer Tools: Your X-Ray Vision

  • Every modern browser (Chrome, Firefox, Brave) has built-in developer tools that let you inspect HTML and CSS.
    • You can right-click on any element on a webpage and select “Inspect” or “Inspect Element” to see its HTML code.
    • The inspector shows you the element’s tag name, its attributes (like id and class), and its position in the DOM tree.
    • This is essential for web scraping because it tells you exactly which CSS selector to use to extract that element’s data.

Web Scraping in R

Introduction to R and the rvest Package

  • R has an excellent library for web scraping, rvest, which lets you extract large amounts of data precisely with only a few lines of code.
    • The name “rvest” is a play on “harvest”—we’re harvesting data from the web.
    • You install rvest once with install.packages("rvest") and then load it in your script with library(rvest).
#install.packages('rvest')
library(rvest)

Basic rvest Workflow: Four Steps

  1. Read the HTML from a URL using the read_html() function, which fetches the page and parses the HTML.
  2. Select the elements you want using html_elements() (for multiple elements) or html_element() (for a single element) with a CSS selector.
  3. Extract the data you need using functions like html_text() to get text content, html_attr() to get attribute values, or html_table() to extract tables.
  4. Store or process the extracted data in R, often converting it to a data frame for analysis.

A Simple Example: Scraping a Title

Example: Scraping a Title

Let’s say we want to scrape the main heading from a webpage at “https://example.com”. We start by reading the page: page <- read_html("https://example.com"). Then we select the h1 element: title_element <- html_element(page, "h1"). Finally, we extract the text: title_text <- html_text(title_element). Now title_text contains the heading as a character string we can use in R:

library(rvest)
page <- read_html("https://example.com")
title_element <- html_element(page, "h1")
title_text <- html_text(title_element)
cat(title_text)
Example Domain

The Pipe Operator

Scraping Multiple Elements

  • Often we want to extract multiple items, like all product names on a page.
    • If each product name has class “product-title”, we use: names <- page |> html_elements(".product-title") |> html_text().
    • The pipe operator |> (or %>% in older R) passes the result of one function to the next, making code more readable.
    • This gives us a vector of all product names, which we can then convert to a data frame column.
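As a self-contained illustration (the product HTML here is invented), scraping all elements of one class returns a character vector:

```r
library(rvest)

doc <- read_html('<body>
  <h2 class="product-title">Laptop</h2>
  <h2 class="product-title">Phone</h2>
  <h2 class="product-title">Tablet</h2>
</body>')

# One chain: select every element with the class, then extract its text
doc |> html_elements(".product-title") |> html_text()
```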

Example: Previous Example With Pipe

Here is the previous example with the pipe.

read_html("https://example.com") |> 
  html_elements("h1") |>
  html_text()
[1] "Example Domain"

What is the Pipe Operator?

  • The pipe operator in R takes the output of one function and “pipes” it as the first input to the next function, creating a readable chain of operations.
  • R has two pipe operators: the native pipe |> (introduced in R 4.1.0) and the magrittr pipe %>% (from the tidyverse)—they work almost identically for our purposes.
    • Think of the pipe like an assembly line in a factory: raw materials (HTML) enter, pass through different processing stations (functions), and finished products (clean data) come out.
    • Without pipes, code becomes nested and hard to read: html_text(html_element(read_html(url), "h1")) forces you to read from inside-out.
    • With pipes, the same code reads naturally from left to right: read_html(url) |> html_element("h1") |> html_text() shows the logical flow of operations.

Pipes Are Perfect for Web Scraping

  • Web scraping involves a natural sequence of steps: fetch the page, find elements, extract data, clean data, and store results.
  • Pipes let you express this sequence exactly as you think about it, making code that reads like instructions: “take this URL, then read the HTML, then find all product divs, then extract the text.”
    • Without pipes, you’d need to create intermediate variables for each step: page <- read_html(url), then elements <- html_elements(page, ".product"), then text <- html_text(elements)—this clutters your workspace.
    • With pipes, you write one continuous flow: url |> read_html() |> html_elements(".product") |> html_text() keeps your environment clean and your intention clear.

Best Practices with Pipes in Scraping

  • Keep each pipe chain focused on one logical task: fetch and parse, or extract and clean, but don’t create overly long chains that do everything.
    • If a pipe chain gets longer than 5-6 steps, consider breaking it into named intermediate steps for clarity and debugging.
  • Use indentation to make long pipes readable: put each pipe operation on its own line, especially in scripts you’ll share or return to later.
    • Remember that pipes pass the result forward, so if a function returns NULL or an empty result, the rest of the chain may fail—add error checking when needed.
  • The pipe is a tool for clarity, not a requirement: if code is clearer without pipes (like when working with multiple parallel operations), don’t force it.
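A sketch of the error-checking advice above: break the chain at a natural point and guard against an empty selection (the selector and HTML are invented for the example):

```r
library(rvest)

page <- read_html("<body><p>No products on this page</p></body>")

# Fetch and select first, so the intermediate result can be inspected
products <- page |> html_elements(".product")

if (length(products) == 0) {
  warning("Selector '.product' matched nothing; check the page structure")
} else {
  product_text <- products |> html_text()
}
```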

Special Cases in rvest

Extracting Attributes

  • Sometimes the data we want isn’t in the visible text but in an HTML attribute.
    • For example, links are stored in the href attribute of <a> tags: <a href="https://example.com">Click here</a>.
    • We extract this with: links <- page |> html_elements("a") |> html_attr("href").
    • Similarly, image URLs are in the src attribute of <img> tags: html_attr("src").

Example: Attribute Selectors

Consider the following html code. We extract the href attribute within the <a> tags.

library(rvest)

html_code <- r"(
<html>
<body>
    <a href="https://example.com/page1">Visit Page 1</a>
    <a href="https://example.com/page2">Visit Page 2</a>
    
    <img src="https://example.com/photo1.jpg" alt="Photo 1">
    <img src="https://example.com/photo2.jpg" alt="Photo 2">
    
    <div data-price="29.99">Product</div>
</body>
</html>
)"

read_html(html_code) |>
  html_elements("a") |>
  html_attr("href")
[1] "https://example.com/page1" "https://example.com/page2"

Scraping Tables

  • Many websites display data in HTML tables, which are particularly easy to scrape.
    • The html_table() function automatically converts HTML tables into R data frames.
    • Example: tables <- page |> html_elements("table") |> html_table() gives you a list of all tables on the page.
    • If there’s only one table you want, you can access it with tables[[1]], and you immediately have a clean data frame.

Example: HTML Table

library(rvest)
read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)") |>
  html_element("table:nth-of-type(2)") |>
  html_table()
# A tibble: 209 × 4
   `Country or territory` `IMF(2026)[a][5]` `World Bank(2023–24)[b][6]`
   <chr>                  <chr>             <chr>                      
 1 World                  219,220,067       197,428,072                
 2 China[n 1][n 2]        43,491,520        38,190,085                 
 3 United States          31,821,293        29,184,890                 
 4 India                  19,143,371        19,143,371                 
 5 Russia                 7,340,795         7,340,795                  
 6 Japan                  6,923,253         6,923,253                  
 7 Germany                6,323,531         6,037,852                  
 8 Indonesia              5,358,279         4,662,888                  
 9 Brazil                 5,161,140         4,734,651                  
10 France                 4,657,190         4,201,560                  
# ℹ 199 more rows
# ℹ 1 more variable: `CIA(2023–24)[c][7][8][9]` <chr>

Building a Data Frame from Scraped Data

  • After scraping multiple pieces of information, we typically want to organize them into a structured data frame.
  • For example, if we scraped product names, prices, and ratings separately, we combine them with data.frame() or tibble().
    • Example: products <- data.frame(name = names, price = prices, rating = ratings).
    • Now we have a tidy dataset ready for analysis, visualization, or export to CSV.
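The steps above, as a base-R sketch with invented vectors standing in for the results of three scraping chains:

```r
# Invented vectors, standing in for scraped results
names   <- c("Apple", "Banana", "Orange")
prices  <- c(1.50, 0.75, 2.00)
ratings <- c(4.5, 4.2, 3.9)

# Combine into one tidy data frame
products <- data.frame(name = names, price = prices, rating = ratings)
products

# Export to CSV for later analysis
# write.csv(products, "products.csv", row.names = FALSE)
```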

Scraping Multiple Pages

  • Real-world scraping often involves collecting data from many pages, not just one.
    • For example, an online store might have 50 pages of product listings, and we want data from all of them.
  • The key insight is that these pages usually have predictable URLs that follow a pattern.
  • Page 1 might be “https://store.com/products?page=1”, page 2 is “?page=2”, and so on.
  • We can generate a sequence of page numbers using R’s colon operator: pages <- 1:10 creates a vector from 1 to 10.
    • We then construct URLs programmatically: url <- paste0("https://store.com/products?page=", page_number).
    • The paste0() function concatenates strings without spaces, building our full URL.
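The URL construction can be seen in isolation (store.com is a placeholder domain):

```r
# Build the URLs for the first three pages of a hypothetical store
pages <- 1:3
urls <- paste0("https://store.com/products?page=", pages)

# A character vector of three complete URLs, one per page
urls
```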

Using a Loop to Scrape Multiple Pages

  • A for loop lets us repeat the same scraping process for each page.
  • Basic structure: for (page_number in 1:10) { ... } runs the code inside the braces 10 times, once for each page.
    • Inside the loop, we build the URL, scrape the page, extract data, and store results.
    • We typically create an empty list or data frame before the loop and append results during each iteration.

Example: Loop Structure for Scraping

Example: Loop for Scraping

This shows the complete workflow from loop setup to final data combination.

# Create empty list to store results
all_products <- list()

# Loop through pages
for (i in 1:5) {
  # Build URL
  url <- paste0("https://store.com/products?page=", i)
  
  # Scrape page
  page <- read_html(url)
  products <- page |> html_elements(".product") |> html_text()
  
  # Store results
  all_products[[i]] <- products
  
  # Be polite: wait between requests
  Sys.sleep(2)
}

# Combine all results
final_data <- unlist(all_products)

Real-Life Example

Real-Life Example with Loop

  • We will scrape data about politicians from the website of the Dutch House of Representatives (tweedekamer.nl).
    • After visiting the website and inspecting the HTML code (right click > inspect), we found that:
    • Each Lower House member is located in a <div> of class m-card__content:
    • CSS selector: div.m-card__content
    • Inside this <div>, the name is located in an <h3> and the political party in a <span>
    • That <span> carries a small-text utility class; the scraping code selects it with span.u-text-sm
    • The remaining data is stored inside a <table> object
  • We will extract these data on the basis of these selectors for each politician on the page

Scraping Code

Example: Scraping Code

The following code scrapes the data of the parliamentarians and puts it in a data.frame.

library(rvest); library(dplyr)

# Scrape the webpage
url <- "https://www.tweedekamer.nl/kamerleden_en_commissies/alle_kamerleden"
webpage <- read_html(url)
out <- list()

# Extract all the tables containing the data pieces
politician_tables <- html_elements(webpage, "div.m-card__content")

# Implement a for-loop over these elements
for (table in politician_tables) {
  name_pol <- html_elements(table, 'h3') |> html_text() |> trimws()
  party_pol <- html_elements(table, 'span.u-text-sm') |> html_text2()
  
  table_data <- html_elements(table, 'table') |> html_table()
  table_df <- table_data[[1]]
  
  # Reshape from long to wide
  rearranged_table <- data.frame(t(table_df$X2))
  colnames(rearranged_table) <- table_df$X1
  
  # Add name and party
  rearranged_table$name <- name_pol
  rearranged_table$party <- party_pol
  
  # Append to list
  out[[length(out) + 1]] <- rearranged_table
}

# Bind the results together using dplyr's bind_rows()
data <- bind_rows(out)

Inspect Output

  • This is what the dataset looks like:
data |> head(10)
     Woonplaats: Leeftijd: Anciënniteit:                name           party
1        Utrecht   42 jaar     812 dagen    Ismail el Abassi            DENK
2      Amsterdam   34 jaar     105 dagen        Fatihya Abdi GroenLinks-PvdA
3       Maarssen   44 jaar     105 dagen       Elles van Ark             CDA
4      Groningen   27 jaar     105 dagen         Etkin Armut             CDA
5  's-Gravenhage   47 jaar     105 dagen    Robert van Asten             D66
6      Rotterdam   34 jaar    1792 dagen  Stephan van Baarle            DENK
7      Eindhoven   40 jaar     812 dagen      Mpanzu Bamenga             D66
8      Wassenaar   40 jaar    3037 dagen        Bente Becker             VVD
9      Groningen   42 jaar    3261 dagen    Sandra Beckerman              SP
10     Rotterdam   50 jaar     105 dagen Fatimazhra Belhirch             D66

Further Considerations

Bringing It Together

  • In rvest, you pass CSS selectors as strings to the html_elements() or html_element() functions.
    • Example: page |> html_elements("div.product > h3.title") finds all product titles that are direct children of product divs.
    • Example: page |> html_element("#main-table tbody tr:nth-child(1) td:nth-child(3)") gets the first row, third column of a specific table.
  • If your selector returns nothing, either the elements don’t exist or your selector is wrong—use browser developer tools to test selectors.
  • Browser consoles let you test CSS selectors directly: open developer tools, go to the Console tab, and type document.querySelectorAll("your-selector") to see what it matches.
  • Start with simple selectors and make them more specific only when needed—overly complex selectors are fragile and break easily when websites change.

Being a Polite Scraper: Rate Limiting

  • When scraping, we’re making requests to someone else’s server, which costs them money and resources.
  • It’s important to add delays between requests so we don’t overwhelm the server or get blocked.
  • The Sys.sleep(2) function pauses R for 2 seconds before continuing—this is being a good internet citizen.
  • Many websites have a “robots.txt” file (at https://example.com/robots.txt) that specifies rules for automated access.
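You can read a robots.txt straight from R with base functions; a sketch (Wikipedia is used only as an example of a site with a real robots.txt):

```r
# Fetch the first lines of a site's robots.txt before scraping it
robots_url <- "https://en.wikipedia.org/robots.txt"
readLines(robots_url, n = 10)
```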

Common Challenges and Troubleshooting

  • Some websites load content dynamically with JavaScript after the initial HTML loads, which rvest cannot handle (you’d need tools like RSelenium).
  • CSS selectors might break if a website redesigns and changes its HTML structure.
  • Websites might block your requests if they detect automated access—this is why being polite and following rules matters.
  • Always test your scraping code on a small sample (like 2-3 pages) before running it on hundreds of pages.

Saving Your Scraped Data

  • Once you’ve collected data, save it to avoid having to re-scrape if something goes wrong.
  • Use write.csv(data, "scraped_data.csv") to save as CSV, which works well with other tools.
  • For R-specific formats that preserve data types, use saveRDS(data, "scraped_data.rds").
  • Always save raw scraped data before cleaning or processing it, so you can go back if needed.

Next Steps and Resources

  • The rvest documentation and vignettes provide detailed examples and advanced techniques.
  • Websites like SelectorGadget (a browser extension) can help you find the right CSS selectors interactively.
  • Practice on simple, scraper-friendly websites before tackling complex ones.
  • In future lectures: combine your new scraping skills with R’s data analysis tools (dplyr, ggplot2) to extract insights from web data.

Recapitulation

Recapitulation

  • We’ve learned that websites are HTML documents sent over HTTP that browsers render into visual pages.
  • CSS selectors give us a precise language for identifying specific elements in the HTML structure.
  • The rvest package in R makes it straightforward to fetch pages, select elements, and extract data.
  • Loops allow us to scale our scraping to multiple pages efficiently.
  • With these tools, you can now transform web content into structured data for economic analysis and research.