Introduction to Applied Data Science

Lecture 2: Getting Data

Bas Machielsen

Overview

Course Schedule

Event      Date   Subject
Lecture 1  21-04  Introduction to Data and Data Science
Lecture 2  28-04  Getting Data: APIs and Databases
Lecture 3  07-05  Getting Data: Web Scraping
Lecture 4  26-05  Text as Data
Lecture 5  27-05  Introduction to LLMs
Lecture 6  09-06  Prompt Engineering and Structured Data
Lecture 7  16-06  Spatial Data and Geocomputation

Outline Today

  • First part: focus on the basic tools of R programming. Once you have read a local file (an .xlsx or .csv file, perhaps) into R, how do you work with it and access the relevant parts?
  • Second part: with these tools in hand, how do we access and leverage larger databases from the Internet? How do we communicate with these servers through R?

Basics of R Programming

Getting Started

  • At its core, R is a calculator on steroids.
  • We can type an arithmetic expression into our script, run it in the console, and receive a result:
2+2
[1] 4
  • R has a huge range of mathematical functions; some of the most useful are log, exp, and sqrt:
sqrt(4)
[1] 2
  • When you run code as we’ve done above, the result of the code (or value) is only displayed in the console.
  • It is usually much more practical to store the value(s) in an object.

Objects

  • An object in R is simply a container that holds data, whether that’s a single number, a text string, a table of data, or even the results of a complex analysis.
  • You create objects using the assignment operator, which looks like this: <-
  • When you assign something to a name, like writing my_number <- 42, you’re storing that value in your computer’s memory under that name.
  • Everything you work with in R – numbers, datasets, graphs, statistical models – exists as an object that you can name, modify, and reuse.

Example Object Creation

Example: Creating an Object and Calling It

The following code creates an object called my_numbers that holds 5 numbers. You can call the object by typing my_numbers in the R console or by referring to it by name in your code.

my_numbers <- c(1,2,5,6,7)
my_numbers
[1] 1 2 5 6 7

The following code creates an object my_text containing a string: a value that R treats as literal text. Strings are delimited by quotation marks to distinguish them from the names of objects.

my_text <- "Hello, World!"
my_text
[1] "Hello, World!"

Functions

  • Functions are the verbs of R: they perform actions on objects.
  • A function takes inputs (called arguments), does something with them, and returns an output.
  • You recognize functions because they’re always followed by parentheses, like mean(my_numbers) or read.csv("data.csv").
  • R comes with hundreds of built-in functions, and you can also load additional functions from packages or write your own.
  • Understanding that functions transform inputs into outputs is crucial for programming: you feed objects into functions, and functions give you new objects back.

Example Function

Example: Function Applied to Object

This code creates an object called my_numbers, and then uses several built-in functions to compute summary statistics of that object.

my_numbers <- c(1,2,3,4,5)
mean(my_numbers)
[1] 3
var(my_numbers)
[1] 2.5
median(my_numbers)
[1] 3
length(my_numbers)
[1] 5
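
Writing your own function follows the same input-to-output pattern. The sketch below is only an illustration: the name range_width and what it computes are made up for this example and are not part of base R.

```r
# A hypothetical helper: the distance between the largest and smallest value
range_width <- function(x) {
  max(x) - min(x)
}

my_numbers <- c(1, 2, 3, 4, 5)

# The function takes an object as input and returns a new value,
# which we store in a new object
width <- range_width(my_numbers)
width
```

The value the function returns is an ordinary object, so it can be stored, printed, or passed on to another function.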

Memory

  • When you create objects in R, they live in your computer’s RAM, which is temporary memory that exists only while R is running.
  • You can see all the objects currently in memory by looking at your Environment panel in Positron or by typing ls().
  • This memory is fast and convenient for analysis, but it’s also temporary: when you close R, everything in memory disappears unless you explicitly save it.
  • This is actually a feature, not a bug—it means each time you start R, you begin with a clean slate and can reproduce your work from scratch.
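
A minimal sketch of inspecting and clearing memory, assuming a fresh session; the object names temp_number and temp_text are invented for this illustration.

```r
# Create two objects; they now live in memory
temp_number <- 10
temp_text <- "temporary text"

# ls() lists the names of the objects currently in memory
objects_before <- ls()
objects_before

# rm() removes an object from memory
rm(temp_text)
objects_after <- ls()
objects_after
```

After rm(), the name temp_text no longer appears in the output of ls(), while temp_number is still there.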

Permanent Storage

  • To preserve your work permanently, you need to save things to your hard drive.
  • There are two main things worth saving: your R script (the code itself, saved as a .R file or a .qmd file) and your data or results (often saved as .csv, .RData, or other file formats).
  • The script is more important than saving your workspace because it contains the instructions to recreate everything—this is the essence of reproducible research.
  • When you write code that loads data, transforms it, and produces results, you can run that script anytime to get the same outcome, which is far more valuable than saving a single snapshot of your memory.
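
As a sketch of saving results to disk and reading them back, the code below writes a small data frame to a CSV file. It uses tempfile() so that it runs anywhere; in a real project you would presumably use a path inside your project folder instead. The data are made up.

```r
# A small made-up result we want to keep
results <- data.frame(id = 1:3, score = c(7.5, 8.2, 6.8))

# tempfile() gives a throwaway path; replace it with e.g. "results.csv"
# to keep the file in your working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(results, csv_path, row.names = FALSE)

# Later, or in a fresh R session, the data can be read back from disk
results_from_disk <- read.csv(csv_path)
results_from_disk
```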

Data Structures

  • R has several fundamental types of objects that you’ll use constantly.
    • Vectors are the simplest: they’re one-dimensional sequences of values, all of the same type, like a list of numbers or a list of words.
    • Data frames are the workhorses of data analysis: they’re rectangular tables where each column is a vector, similar to a spreadsheet in Excel.
    • Lists are flexible containers that can hold anything—numbers, vectors, data frames, even other lists—making them useful for storing complex or heterogeneous information.
    • Matrices are like data frames but require all values to be of the same type, which makes certain mathematical operations faster.

Data Types

  • Understanding data types is just as important as understanding object structures.
    • Numbers in R can be integers or doubles (numbers with decimal points), and R usually handles the distinction automatically.
    • Character strings are text values, always enclosed in quotes, like “hello” or “2024”.
    • Logical values are simply TRUE or FALSE, and they’re fundamental for filtering data and controlling program flow.
    • Factors are R’s way of representing categorical data, like “male/female” or “low/medium/high”, and they behave differently from plain character strings in important ways.
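
The data types above can be checked with class(). The following sketch uses made-up values and object names; it also shows one way in which a factor differs from a character string, namely that it stores integer codes with labels.

```r
whole_number <- 5L       # the L suffix creates an integer
decimal_number <- 5.5    # a double
text_value <- "2024"     # a character string, even though it looks like a number
is_adult <- TRUE         # a logical value
income_level <- factor(c("low", "high", "medium", "low"),
                       levels = c("low", "medium", "high"))

class(whole_number)      # "integer"
class(decimal_number)    # "numeric"
class(text_value)        # "character"
class(is_adult)          # "logical"
class(income_level)      # "factor"

levels(income_level)     # the categories: "low" "medium" "high"
as.integer(income_level) # the underlying codes: 1 3 2 1

# Text has to be converted before you can calculate with it
as.numeric(text_value) + 1   # 2025
```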

Examples: Data Structures

Example: Various Data Structures

The following code creates a vector containing various numbers (5:16 generates all whole numbers from 5 up to and including 16).

vector <- c(1, 2, 3, 5:16)
vector
 [1]  1  2  3  5  6  7  8  9 10 11 12 13 14 15 16

The following code creates a data.frame. A data.frame is a container of columns (vectors) that all have the same length but do not have to be of the same type; it is analogous to an Excel spreadsheet.

data.frame(
  a=1:2,
  b=c("Person 1", "Person 2"), 
  age=c(20, 25), 
  class=rep("Applied Data Science", 2)
  )
  a        b age                class
1 1 Person 1  20 Applied Data Science
2 2 Person 2  25 Applied Data Science

Examples: Data Structures

Example: Various Data Structures (Continued)

The following code creates a list. Like a data.frame, a list is a container of other objects, but its elements need not have the same length or the same type.

my_list <- list(
  a = 1,
  b = c("Hello", "World"),
  df = data.frame(name=c("A", "B"), age=c(20, 25)),
  d = list(a=1:10)
)

my_list
$a
[1] 1

$b
[1] "Hello" "World"

$df
  name age
1    A  20
2    B  25

$d
$d$a
 [1]  1  2  3  4  5  6  7  8  9 10
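
Matrices, the remaining structure from the overview, can be sketched in the same way. The values below are arbitrary; note how mixing numbers and text forces every value to become a character string.

```r
# matrix() fills column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)     # 2 rows, 3 columns
m[2, 3]    # row 2, column 3: 6
m[, 2]     # the entire second column: 3 4

# All values must share one type, so the numbers here are converted to text
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
class(mixed[1, 1])   # "character"
```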

Debugging and Error Messages

  • Error messages in R can look intimidating, but they’re actually your friends—they’re R’s way of telling you exactly what went wrong.
  • When R encounters an error, it stops executing your code and prints a message that describes the problem.
  • The most important part of an error message is usually at the end, where R tells you what type of error occurred, like “object not found” or “unexpected symbol”.
  • Learning to read error messages carefully, rather than panicking when you see red text, is one of the most valuable skills in programming.
    • Many beginners ignore the actual content of error messages, but these messages often contain the exact information you need to fix the problem.

Common Errors

  • Common errors have common causes that you’ll learn to recognize quickly:
    • “Object not found” means you’re trying to use something that doesn’t exist in memory—maybe you misspelled the name, forgot to create it, or forgot to load the package it comes from.
    • “Unexpected symbol” usually means you have a syntax error, like a missing comma, an unclosed parenthesis, or a quote mark you forgot to close.
    • “Cannot open file” means R can’t find the file you’re trying to load, which typically indicates a working directory problem or an incorrect file path.
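    • “Subscript out of bounds” means you’re trying to access an element that doesn’t exist, like asking for the 100th row of a matrix that only has 50 rows, or the fifth element of a three-element list. (Asking a data frame for a row that doesn’t exist returns a row of NA values instead of an error.)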
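
To see some of these messages without stopping a script, one option is to wrap the offending code in tryCatch() and keep the message as text. This is a sketch, assuming an English-language R session; the object and file names are deliberately ones that do not exist.

```r
# Using a name that was never created
msg_object <- tryCatch(
  undefined_object + 1,
  error = function(e) conditionMessage(e)
)

# Asking for the third element of a one-element list
msg_bounds <- tryCatch(
  list(a = 1)[[3]],
  error = function(e) conditionMessage(e)
)

# Reading a file that is not there (the accompanying warning is silenced)
msg_file <- tryCatch(
  suppressWarnings(read.csv("file_that_does_not_exist.csv")),
  error = function(e) conditionMessage(e)
)

msg_object
msg_bounds
msg_file
```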

Strategies for Debugging

  • The first rule of debugging is to isolate the problem by running your code line by line or section by section.
    • Don’t try to run a long script all at once when something’s wrong—instead, execute each line individually to see exactly where the error occurs.
    • In Positron, you can place your cursor on a line and press Ctrl+Enter (or Cmd+Enter on Mac) to run just that line. Once you know which line causes the error, you can focus your attention on understanding what that specific line is trying to do and why it’s failing.
  • Checking your objects is one of the most powerful debugging techniques.
    • Use functions like str() to examine the structure of an object, head() to see the first few rows of a dataset, class() to check what type of object you have, and dim() to see the dimensions of a data frame or matrix.
    • Often, errors occur because an object isn’t what you think it is—maybe your data frame is actually a list, or your numeric column got read in as character strings.
    • Print intermediate results to the console to verify that each step of your code is doing what you expect.
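
The inspection functions mentioned above can be tried on a small made-up dataset. In this sketch, the age column was (hypothetically) read in as text because one entry is "n/a", which is the kind of surprise that str() and class() reveal.

```r
survey <- data.frame(
  respondent = 1:4,
  age = c("23", "31", "n/a", "45"),   # text, because of the "n/a" entry
  stringsAsFactors = FALSE
)

str(survey)          # structure: column names, types, first values
class(survey$age)    # "character", not "numeric"
dim(survey)          # 4 rows, 2 columns
head(survey, 2)      # first two rows

# Convert to numbers; "n/a" cannot be converted and becomes NA
age_numeric <- suppressWarnings(as.numeric(survey$age))
mean(age_numeric, na.rm = TRUE)   # 33
```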

Getting Help

  • R has built-in documentation for every function, which you can access by typing ?function_name or help(function_name).
    • This documentation shows you what arguments the function takes, what it returns, and usually includes examples at the bottom.
    • The examples are particularly valuable because they show you working code that you can run and modify. While R’s documentation can be technical, learning to read it will make you much more self-sufficient as a programmer.
  • When built-in help isn’t enough, the internet is full of resources—but you need to search effectively.
    • Copy the code and the key part of your error message into ChatGPT or Google along with “R” and the package name if relevant.
    • Stack Overflow is a question-and-answer site where millions of R questions have already been answered, and searching there often leads you directly to solutions.
    • When asking for help online or from classmates, always provide a minimal reproducible example: the smallest piece of code that demonstrates your problem, including any necessary data.
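
One way to include the necessary data in a minimal reproducible example is dput(), a base R function that prints the code needed to recreate an object. The data frame below is made up for illustration.

```r
small_data <- data.frame(name = c("A", "B"), age = c(20, 25))

# dput() prints a structure(...) call that anyone can paste into
# their own R session to rebuild small_data
dput(small_data)
```

Pasting the printed output after `small_data <-` in a question lets others run your example without needing your files.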

Example: Debugging

When you create a text string in R, you must enclose it in quotes, like this: sentence <- "Hello, my name is Bas". The quotes tell R that you want to store this literal text, not refer to something already in memory.

If you forget the quotes and type sentence <- Hello, R will produce an error message that says “object ‘Hello’ not found”.

This error occurs because R interprets anything written without quotes as the name of an object that should already exist in memory.

This distinction between quoted and unquoted text is fundamental to how R works. When you type Hello without quotes, you’re asking R to look up the value stored in an object called Hello. When you type “Hello” with quotes, you’re providing the literal text “Hello” itself.

This is why you can do something like greeting <- "Hello" to store the text, and then later type greeting without quotes to retrieve it—the first “Hello” is a value you’re storing, while the second greeting is the name of the object you’re retrieving.

Example Debugging (2)

Example: Debugging (2)

Computer languages, including R, are very particular about spaces—or more accurately, about where you can and cannot use them. You cannot use spaces in object names: my object <- 5 will cause an error, while my_object <- 5 or myObject <- 5 work perfectly.

This is because R uses spaces to separate different elements of your code, so when it sees my object, it thinks you’re referring to two different things. The same rule applies to function names, variable names, and anything else you create in R—if you want multiple words, connect them with underscores, dots, or capital letters (a convention called camelCase), but never spaces.
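
The error can be reproduced safely with parse(), which checks the syntax of a piece of code without running it. This is a sketch, assuming an English-language R session.

```r
# parse() only checks syntax, so we can capture the error as text
space_error <- tryCatch(
  parse(text = "my object <- 5"),
  error = function(e) conditionMessage(e)
)
space_error   # the message mentions an "unexpected symbol"

# Valid alternatives for multi-word names
my_object <- 5   # underscore
myObject <- 5    # camelCase
my.object <- 5   # dot
```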

Example Debugging (3)

Example: Overwriting

One of the most important things to understand about R is that when you assign a value to an object name that already exists, R will silently overwrite the old value without warning you.

If you create data <- 10 and then later write data <- 20, the original value of 10 is gone forever—there’s no undo button, no recycle bin, and R won’t ask “are you sure?” before replacing it. This is by design: R assumes you know what you’re doing and executes your commands exactly as written. This means you need to be intentional about your object names and aware of what already exists in your environment.

Overwriting becomes particularly dangerous when working with datasets. Imagine you load a dataset with my_data <- read.csv("survey_results.csv") and then spend time cleaning it: removing missing values, filtering rows, creating new columns.

If you accidentally run my_data <- read.csv("survey_results.csv") again, all your cleaning work disappears instantly because you’ve replaced the modified dataset with the original file. This is why experienced programmers often create new object names for each major transformation, like my_data_clean or my_data_filtered, rather than constantly overwriting the same object.
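
A minimal sketch of the safer pattern, using a small made-up dataset in place of survey_results.csv: the cleaned version gets its own name, so the original stays available.

```r
# Stand-in for the data loaded from a file
my_data <- data.frame(id = 1:4, answer = c(3, NA, 5, 4))

# Store the cleaned version under a new name instead of overwriting my_data
my_data_clean <- my_data[!is.na(my_data$answer), ]

nrow(my_data)         # 4: the original is untouched
nrow(my_data_clean)   # 3: the row with the missing answer is gone
```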

Accessing Objects and Indexing

  • Once you have data stored in R, you need ways to access specific pieces of it rather than always working with the entire object.
    • R uses square brackets [] for indexing, which means extracting specific elements based on their position or characteristics.
    • For a simple vector, you can access individual elements by their position: if numbers <- c(10, 20, 30, 40), then numbers[2] gives you 20, the second element.
    • You can also access multiple elements at once using numbers[c(1, 3)] to get the first and third elements, or numbers[2:4] to get elements two through four.
    • Understanding that R counts starting from 1 (not 0 like some other languages) is important for avoiding off-by-one errors.

Accessing Objects and Indexing (2)

  • Data frames require two-dimensional indexing because they have both rows and columns.
    • The syntax is dataframe[row, column], where dataframe is the name of your data.frame object, and you specify which rows and which columns you want.
    • For example, my_data[5, 3] gives you the value in the fifth row and third column, while my_data[5, ] gives you the entire fifth row (notice the comma with nothing after it).
    • Similarly, my_data[, 3] gives you the entire third column.
    • You can use numbers for positions, but you can also use column names: my_data[5, "age"] gets the age value from the fifth row, which is often clearer and less error-prone than remembering column numbers.

Examples: Indexing

Example: Indexing Using Various Data Structures

In this example, we use three data structures (a vector called ages, a data.frame called students, and a list called course_info) to see how various sub-objects can be retrieved.

# Creating example data
# A simple vector
ages <- c(22, 25, 19, 31, 28, 23)

# A data frame (like a table)
students <- data.frame(
  name = c("A", "B", "C", "D", "E", "F"),
  age = c(22, 25, 19, 31, 28, 23),
  grade = c(7.5, 8.2, 6.8, 9.1, 7.0, 8.5),
  country = c("Netherlands", "Netherlands", "China", "USA", "Belgium", "Belgium")
)

# A list (can hold different types of things)
course_info <- list(
  course_name = "Programming in R",
  students_enrolled = 45,
  grades = c(7.5, 8.2, 6.8, 9.1, 7.0),
  passed = TRUE
)
# ===== INDEXING VECTORS =====
ages[1]              # Get first element: 22
[1] 22
ages[3]              # Get third element: 19
[1] 19
ages[c(1, 3, 5)]     # Get first, third, and fifth elements
[1] 22 19 28
ages[2:4]            # Get elements 2 through 4
[1] 25 19 31
# ===== INDEXING DATA FRAMES =====
# By position [row, column]
students[2, 3]       # Second row, third column (B's grade): 8.2
[1] 8.2
students[2, ]        # Entire second row (all info about B)
  name age grade     country
2    B  25   8.2 Netherlands
students[, 3]        # Entire third column (all grades)
[1] 7.5 8.2 6.8 9.1 7.0 8.5
students[1:3, ]      # First three rows
  name age grade     country
1    A  22   7.5 Netherlands
2    B  25   8.2 Netherlands
3    C  19   6.8       China
# By column name
students[2, "grade"]           # B's grade using column name
[1] 8.2
students[, "age"]              # All ages
[1] 22 25 19 31 28 23
students[1:3, c("name", "age")] # First 3 rows, only name and age columns
  name age
1    A  22
2    B  25
3    C  19
# Extracting columns (three different ways)
students$age                   # Using $ - returns a vector
[1] 22 25 19 31 28 23
students[["age"]]              # Using [[ ]] - returns a vector
[1] 22 25 19 31 28 23
students["age"]                # Using [ ] - returns a data frame
  age
1  22
2  25
3  19
4  31
5  28
6  23
# Check the difference
class(students$age)            # "numeric" - it's a vector
[1] "numeric"
class(students["age"])         # "data.frame" - it's a data frame
[1] "data.frame"
# ===== LOGICAL INDEXING =====
# Find students older than 24
students[students$age > 24, ]
  name age grade     country
2    B  25   8.2 Netherlands
4    D  31   9.1         USA
5    E  28   7.0     Belgium
# Find Dutch students
students[students$country == "Netherlands", ]
  name age grade     country
1    A  22   7.5 Netherlands
2    B  25   8.2 Netherlands
# Find Dutch students older than 24
students[students$age > 24 & students$country == "Netherlands", ]
  name age grade     country
2    B  25   8.2 Netherlands
# Find students with grades above 8
students[students$grade > 8, ]
  name age grade     country
2    B  25   8.2 Netherlands
4    D  31   9.1         USA
6    F  23   8.5     Belgium
# Get only names of students with grades above 8
students[students$grade > 8, "name"]
[1] "B" "D" "F"
# ===== INDEXING LISTS =====
# Single bracket - returns a list
course_info[1]                 # A list containing the course name
$course_name
[1] "Programming in R"
class(course_info[1])          # "list"
[1] "list"
# Double bracket - returns the actual element
course_info[[1]]               # The actual course name: "Programming in R"
[1] "Programming in R"
class(course_info[[1]])        # "character"
[1] "character"
# By name with $
course_info$course_name        # "Programming in R"
[1] "Programming in R"
course_info$students_enrolled  # 45
[1] 45
# By name with [[]]
course_info[["grades"]]        # The vector of grades
[1] 7.5 8.2 6.8 9.1 7.0
# Chaining indexing for nested structures
course_info[["grades"]][2]     # Second grade from the grades vector: 8.2
[1] 8.2
# ===== MODIFYING VALUES =====
# Change a specific value
students[2, "grade"] <- 8.5    # Change B's grade to 8.5

# Change an entire column
students$grade <- students$grade + 0.5  # Add 0.5 to all grades (generous grading!)

# Add a new column
students$passed <- students$grade >= 6  # Create a column showing who passed

# View the modified data frame
students
  name age grade     country passed
1    A  22   8.0 Netherlands   TRUE
2    B  25   9.0 Netherlands   TRUE
3    C  19   7.3       China   TRUE
4    D  31   9.6         USA   TRUE
5    E  28   7.5     Belgium   TRUE
6    F  23   9.0     Belgium   TRUE

Getting Data from the Internet: APIs

What is an API?

  • API stands for Application Programming Interface, but that technical definition doesn’t tell you much about what it actually does.
    • Think of an API as a waiter in a restaurant: you (the customer) want food from the kitchen, but you can’t just walk into the kitchen and grab it yourself.
    • Instead, you tell the waiter what you want from the menu, the waiter takes your order to the kitchen, and then brings back your food.
    • An API works the same way: it’s an intermediary that takes your request for data, sends it to a server (the kitchen), and brings back the data you requested.
    • The key advantage is that APIs provide a controlled, organized way to access data—just like a restaurant menu limits what you can order and how you order it.

Why communicate through an API?

  • So far, we’ve focused on the fundamentals of R: creating objects, manipulating data, and reading files from your computer.
    • But in the real world of data science, your data often doesn’t live in a tidy CSV file sitting in your project folder.
    • Modern data is increasingly stored in databases—large, structured collections maintained on servers—or made available through APIs, which are interfaces that let programs talk to each other over the internet.
    • Learning to access these sources transforms R from a tool for analyzing data you already have into a tool for gathering data from the wider world.
  • The skills you’ve just learned—understanding objects, functions, file paths, and data structures—form the foundation for everything we’ll do next.
    • When you query a database or make an API request, you’re still creating objects in R’s memory, still using functions with specific arguments, and still working with data frames and lists.
    • The main difference is that instead of typing read.csv("my_file.csv") to load a local file, you’ll use functions that reach out across the internet to fetch data from remote sources. The programming principles remain the same; we’re simply expanding where your data can come from.

Why Do We Use APIs Instead of Downloading Files?

  • You might wonder: why not just download a CSV file with all the data you need?
    • For many real-world data sources, downloading everything simply isn’t practical—imagine trying to download all of Twitter’s tweets or all weather measurements ever recorded.
    • APIs let you request only the specific data you need: perhaps just today’s weather in Amsterdam, or tweets from the last hour containing a specific hashtag.
    • Data accessed through APIs is also typically current: when you make a request, you get the latest information the provider has published, not a file that was created last week.
  • Many organizations use APIs to control access to their data, ensuring that users can access what they’re authorized to see without giving them the entire database.
  • APIs also reduce the burden on servers because you’re only requesting small amounts of data at a time, rather than forcing the server to prepare massive files for download.

The Client-Server Model: A Conversation Between Computers

  • When you use an API, you’re engaging in a conversation between two computers: your computer (the client) and a remote computer (the server).
    • Your R code acts as the client, sending a request across the internet to a server that holds the data you want.
    • The server receives your request, processes it, retrieves the relevant data, and sends back a response.
    • This is happening constantly on the internet: every time you visit a website, your browser (the client) requests the page from a web server, which sends back the HTML, images, and other content.
  • With APIs, instead of requesting web pages designed for humans to read, you’re requesting data in formats designed for programs to process.
  • Understanding this back-and-forth communication is essential: you send a request, you wait, and then you receive a response.

HTTP: The Language of the Internet

  • HTTP stands for Hypertext Transfer Protocol, and it’s the standardized way that clients and servers communicate on the internet.
    • Think of HTTP as the rules of a formal conversation: there are specific ways to ask for things, specific ways to respond, and everyone on the internet follows these same rules.
    • When you type a URL into your browser or when R makes an API request, you’re using HTTP to communicate.
  • An HTTP request has several components: the URL (where you’re sending the request), the method (what you want to do), and sometimes additional information like parameters or authentication.
  • The server responds with an HTTP response that includes a status code (did it work?), headers (metadata about the response), and the actual data you requested.
    • Understanding HTTP isn’t just academic—when things go wrong with an API, the HTTP status codes and error messages will tell you exactly what the problem is.

HTTP Methods: Different Types of Requests

  • HTTP defines several “methods” or “verbs” that describe what you want to do with the data.
  • The most common method is GET, which means “please give me this data”—this is what you’ll use most often when retrieving information from an API.
    • POST means “please accept this data I’m sending you,” which you might use when submitting a form or uploading information.
    • PUT and DELETE are used for updating and removing data, respectively, though many public APIs don’t allow these operations for security reasons.
    • For this course, we’ll focus almost entirely on GET requests because we’re interested in retrieving data, not modifying it on the server.
    • When you use R to call an API, you’ll typically be making GET requests to ask for specific pieces of information.

Anatomy of an API Request: URLs and Parameters

  • An API request starts with a base URL that points to the API server, like https://api.weather.com.
  • After the base URL, you add an “endpoint,” which is like a specific department in the kitchen—it tells the API what type of data you want.
    • For example, https://api.weather.com/current might be the endpoint for current weather conditions.
    • You pass additional details through “parameters,” which are key-value pairs added to the URL after a question mark, like ?city=Amsterdam&units=metric.
  • Parameters let you customize your request: you’re saying “I want current weather, specifically for Amsterdam, and please give me the temperature in Celsius.”
  • Multiple parameters are separated by ampersands (&), allowing you to build complex, specific requests: ?city=Amsterdam&units=metric&language=en.
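
The URL structure described above can be assembled in base R. This sketch only builds the text of the URL and does not send a request; the base URL and parameter names are the placeholder examples from the bullets, not a documented API.

```r
base_url <- "https://api.weather.com"   # placeholder base URL from the text
endpoint <- "/current"                  # placeholder endpoint
params <- list(city = "Amsterdam", units = "metric", language = "en")

# Escape special characters in each value (a space becomes %20, for example)
encoded_values <- vapply(
  params,
  function(v) utils::URLencode(as.character(v), reserved = TRUE),
  character(1)
)

# Join key=value pairs with ampersands, then attach them after a question mark
query_string <- paste0(names(params), "=", encoded_values, collapse = "&")
request_url <- paste0(base_url, endpoint, "?", query_string)
request_url

utils::URLencode("Den Haag", reserved = TRUE)   # "Den%20Haag"
```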

Example: Breaking Down a Real API Request

  • Let’s look at a complete API request: https://api.openweathermap.org/data/2.5/weather?q=Amsterdam&units=metric&appid=YOUR_KEY.
    • The protocol is https://, which means we’re using secure HTTP (the ‘s’ stands for secure).
    • The domain is api.openweathermap.org, which is the server we’re communicating with.
    • The endpoint is /data/2.5/weather, which tells the API we want current weather data.
    • The parameters start after the ?: q=Amsterdam specifies the city, units=metric requests Celsius instead of Fahrenheit, and appid=YOUR_KEY provides authentication.
  • When R sends this request, the server processes it and sends back the current weather data for Amsterdam in a structured format.
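
The same breakdown can be done in code. The sketch below takes the URL apart with base R string functions; it is a simple illustration for this one URL, not a general-purpose URL parser.

```r
request_url <- "https://api.openweathermap.org/data/2.5/weather?q=Amsterdam&units=metric&appid=YOUR_KEY"

# Split into the address (before the ?) and the query (after the ?)
parts <- strsplit(request_url, "?", fixed = TRUE)[[1]]
address <- parts[1]
query <- parts[2]

protocol <- sub("://.*$", "", address)                  # "https"
domain <- sub("^[a-z]+://([^/]+).*$", "\\1", address)   # "api.openweathermap.org"
endpoint <- sub("^[a-z]+://[^/]+", "", address)         # "/data/2.5/weather"

# Turn each "key=value" pair into an element of a named list
pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
params <- lapply(pairs, function(p) p[2])
names(params) <- vapply(pairs, function(p) p[1], character(1))

params$q       # "Amsterdam"
params$units   # "metric"
```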

HTTP Status Codes: Did Your Request Succeed?

  • Every HTTP response includes a status code—a three-digit number that tells you whether your request succeeded or failed.
    • Status codes in the 200s mean success: 200 means “OK, here’s your data,” which is what you want to see.
    • Status codes in the 400s mean client errors—you did something wrong: 400 means “bad request” (your request was malformed), 404 means “not found” (the endpoint doesn’t exist), 401 means “unauthorized” (you need to provide authentication).
    • Status codes in the 500s mean server errors—the server had a problem: 500 means “internal server error” (something crashed on their end), 503 means “service unavailable” (the server is overloaded or down for maintenance).
  • Learning these codes helps you debug: if you get a 401, you know to check your API key; if you get a 404, you know to check your URL.
  • R will often show you these status codes when API requests fail, and understanding them will save you hours of frustration.
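
As a memory aid, the status code ranges can be written as a small function. describe_status is a hypothetical helper written for this slide, not a function from any package.

```r
# Hypothetical helper: translate a status code into a rough diagnosis
describe_status <- function(code) {
  if (code >= 200 && code < 300) {
    "success"
  } else if (code >= 400 && code < 500) {
    "client error: check your URL, parameters, or API key"
  } else if (code >= 500 && code < 600) {
    "server error: the problem is on the server's side, try again later"
  } else {
    "other (for example, a redirect in the 300s)"
  }
}

describe_status(200)
describe_status(404)
describe_status(503)
```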

JSON: The Language of API Data

  • Most modern APIs return data in JSON format, which stands for JavaScript Object Notation.
  • JSON is a text-based format for representing structured data—it’s human-readable but also easy for computers to parse.
    • A JSON object looks like this: {"name": "Anna", "age": 22, "city": "Amsterdam"}, with key-value pairs enclosed in curly braces.
    • JSON arrays are lists of values in square brackets: [10, 20, 30, 40] or ["Amsterdam", "Brussels", "Chennai"].
  • JSON can nest objects and arrays inside each other to represent complex data structures: {"students": [{"name": "Anna", "age": 22}, {"name": "Bill", "age": 25}]}.
  • The beauty of JSON is that it maps naturally to R’s data structures: JSON objects become lists, JSON arrays become vectors, and JSON arrays of objects become data frames.

Example: JSON Data

This example uses { curly braces } for objects (things with labels), and [ square brackets ] for arrays (things in a sequence).

{
  "country": "United States",
  "code": "US",
  "economy": {
    "gdp": 23000000000000,
    "currency": "USD"
  },
  "years": [2020, 2021, 2022]
}

Converting JSON to R Data Structures

  • When you receive JSON data from an API, R doesn’t automatically know how to work with it—it arrives as plain text.
    • You need to parse the JSON, which means converting it from text into R objects that you can manipulate.
    • R has packages like jsonlite that do this conversion automatically with a function called fromJSON().
    • A simple JSON object like {"temperature": 15, "humidity": 80} becomes an R list with elements you can access using $.
    • A JSON array of objects representing a table becomes an R data frame, which is exactly what you want for most data analysis.
  • Understanding this conversion process helps you anticipate what kind of R object you’ll get back from an API call.
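
A small sketch of this conversion, parsing JSON written directly as text rather than fetched from an API. It assumes the jsonlite package is installed; the second JSON string is a shortened, made-up variant of the country example.

```r
library(jsonlite)

# A simple JSON object becomes a named list
weather <- fromJSON('{"temperature": 15, "humidity": 80}')
class(weather)        # "list"
weather$temperature   # 15

# Nested objects become nested lists; arrays of numbers become vectors
country <- fromJSON(
  '{"code": "US", "economy": {"currency": "USD"}, "years": [2020, 2021, 2022]}'
)
country$economy$currency   # "USD"
country$years              # 2020 2021 2022
```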

JSON to Data Frame

  • Imagine an API returns this JSON: [{"name": "Anna", "score": 85}, {"name": "Bill", "score": 92}, {"name": "Chen", "score": 78}].
    • This is a JSON array (the outer square brackets) containing three objects (the curly braces), each with the same structure.
    • When you parse this with jsonlite::fromJSON(), R recognizes the regular structure and automatically creates a data frame.
  • The resulting data frame will have two columns (name and score) and three rows (one for each student).
    • This is why APIs and data frames work so well together: APIs often return lists of records, which are exactly what data frames are designed to hold.
    • Once you have your data frame, you can use all the indexing, filtering, and analysis techniques you’ve already learned.
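
The JSON from this slide can be parsed directly as text to check the result. This sketch assumes the jsonlite package is installed.

```r
library(jsonlite)

json_text <- '[{"name": "Anna", "score": 85}, {"name": "Bill", "score": 92}, {"name": "Chen", "score": 78}]'

scores <- fromJSON(json_text)
class(scores)   # "data.frame"
dim(scores)     # 3 rows, 2 columns

# The usual indexing tools now apply
scores[scores$score > 80, "name"]   # "Anna" "Bill"
```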

Example: JSON to Data Frame

Example: JSON Data in R

In this example, we use the jsonlite package to parse the response to a query to the World Bank API.

# Install the package (do this once)
# install.packages("jsonlite")
library(jsonlite)

# The URL (The address of the data) - Click this link in your browser and select "raw data" to compare!
# Let's get Brazil's GDP from the World Bank
url <- "http://api.worldbank.org/v2/country/br/indicator/NY.GDP.MKTP.CD?format=json"

# Fetch and Convert
raw_data <- fromJSON(url)

# Look at what we got
str(raw_data)
List of 2
 $ :List of 6
  ..$ page       : int 1
  ..$ pages      : int 2
  ..$ per_page   : int 50
  ..$ total      : int 66
  ..$ sourceid   : chr "2"
  ..$ lastupdated: chr "2026-04-08"
 $ :'data.frame':   50 obs. of  8 variables:
  ..$ indicator      :'data.frame': 50 obs. of  2 variables:
  .. ..$ id   : chr [1:50] "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" ...
  .. ..$ value: chr [1:50] "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
  ..$ country        :'data.frame': 50 obs. of  2 variables:
  .. ..$ id   : chr [1:50] "BR" "BR" "BR" "BR" ...
  .. ..$ value: chr [1:50] "Brazil" "Brazil" "Brazil" "Brazil" ...
  ..$ countryiso3code: chr [1:50] "BRA" "BRA" "BRA" "BRA" ...
  ..$ date           : chr [1:50] "2025" "2024" "2023" "2022" ...
  ..$ value          : num [1:50] NA 2.19e+12 2.19e+12 1.95e+12 1.67e+12 ...
  ..$ unit           : chr [1:50] "" "" "" "" ...
  ..$ obs_status     : chr [1:50] "" "" "" "" ...
  ..$ decimal        : int [1:50] 0 0 0 0 0 0 0 0 0 0 ...

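The fromJSON() function often returns a list rather than a data frame, so you usually have to look inside the list to find the actual data. Note also what the metadata says here: 66 observations in total, spread over 2 pages of 50, so this single call returned only the first 50 rows.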

# The World Bank API returns a list of length 2.
# The metadata is in item [[1]], the actual data is in item [[2]].
economics_df <- raw_data[[2]]

# Look at the first few rows
head(economics_df)
    indicator.id   indicator.value country.id country.value countryiso3code
1 NY.GDP.MKTP.CD GDP (current US$)         BR        Brazil             BRA
2 NY.GDP.MKTP.CD GDP (current US$)         BR        Brazil             BRA
3 NY.GDP.MKTP.CD GDP (current US$)         BR        Brazil             BRA
4 NY.GDP.MKTP.CD GDP (current US$)         BR        Brazil             BRA
5 NY.GDP.MKTP.CD GDP (current US$)         BR        Brazil             BRA
6 NY.GDP.MKTP.CD GDP (current US$)         BR        Brazil             BRA
  date        value unit obs_status decimal
1 2025           NA                       0
2 2024 2.185822e+12                       0
3 2023 2.191132e+12                       0
4 2022 1.951924e+12                       0
5 2021 1.670647e+12                       0
6 2020 1.476107e+12                       0
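
The str() output above shows that the indicator and country columns are themselves small data frames, and that date arrives as text. The sketch below shows one possible way to tidy such a result. It is an illustration under stated assumptions: to stay runnable offline it uses a hand-written miniature of the response (mini_json, with rounded values copied from the output above) instead of a live call, and it relies on jsonlite::flatten() to unnest the columns.

```r
library(jsonlite)

# A hand-written miniature of the World Bank response shape (not a live call)
mini_json <- '[
  {"page": 1, "pages": 1, "per_page": 50, "total": 3},
  [
    {"indicator": {"id": "NY.GDP.MKTP.CD", "value": "GDP (current US$)"},
     "country": {"id": "BR", "value": "Brazil"},
     "date": "2025", "value": null},
    {"indicator": {"id": "NY.GDP.MKTP.CD", "value": "GDP (current US$)"},
     "country": {"id": "BR", "value": "Brazil"},
     "date": "2024", "value": 2185822000000},
    {"indicator": {"id": "NY.GDP.MKTP.CD", "value": "GDP (current US$)"},
     "country": {"id": "BR", "value": "Brazil"},
     "date": "2023", "value": 2191132000000}
  ]
]'

mini <- fromJSON(mini_json)

# Item [[2]] holds the data; flatten() turns the nested columns
# into ordinary ones named indicator.id, country.value, and so on
gdp <- jsonlite::flatten(mini[[2]])

# The year arrives as text, so convert it to a number
gdp$date <- as.integer(gdp$date)

# Drop years without a value, keep the columns of interest, sort by year
gdp <- gdp[!is.na(gdp$value), c("country.value", "date", "value")]
gdp <- gdp[order(gdp$date), ]
print(gdp)
```

The same steps should apply to economics_df from the live call, since it has the same nested layout.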

Authentication: Proving Who You Are

  • Many APIs require authentication, which means you need to prove your identity before accessing their data.
  • Authentication serves several purposes: it allows the API provider to track who’s using their service, enforce usage limits, and restrict access to paying customers or approved users.
    • Think of authentication like showing your student ID to enter the university building—it proves you’re allowed to be there and helps track who’s inside.
    • Without authentication, anyone could make unlimited requests to an API, potentially overloading the server or accessing data they shouldn’t see.
    • For public data APIs, authentication is often free—you just need to register for an account to get your credentials.
  • For commercial or sensitive data, you might need to pay for access or be specifically authorized by the organization.

API Keys: Your Personal Password

  • The most common form of authentication is an API key, which is essentially a long, random string of characters that acts as your password.
  • A typical API key looks like this: a3f7d9e2b4c8f1a5e9d3c7b2a6f4e8d1, and it uniquely identifies you to the API.
    • You obtain an API key by registering on the API provider’s website, creating an account, and generating a key through their dashboard.
    • You then include this key in every API request you make, usually as a query parameter (?api_key=a3f7d9e2b4c8f1a5e9d3c7b2a6f4e8d1) or, for some APIs, in a request header.
    • The server checks your key with every request: if it’s valid, you get your data; if not, you receive a 401 Unauthorized error.
    • API keys allow providers to monitor your usage, enforce rate limits (like “1000 requests per day”), and revoke access if necessary.

Protecting Your API Keys

  • API keys are sensitive information—they’re like passwords and should be treated with the same security.
  • Never include your API key directly in your R script if you plan to share that script with others or upload it to GitHub.
    • If someone gets your API key, they can make requests as you, potentially using up your quota or accessing data you’ve paid for.
    • Best practice is to store API keys in a separate file that you don’t share, or in environment variables that R can read without exposing the key in your code.
  • In R, you can use a .Renviron file to store keys like WEATHER_API_KEY=your_key_here, then access it in your code with Sys.getenv("WEATHER_API_KEY").
  • Many beginners accidentally publish their API keys publicly—being careful about this from the start will save you from security problems later.
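
The environment-variable approach described above can be sketched as follows. As an assumption for demonstration purposes, the variable is set inside the session with Sys.setenv() so that the code runs as-is; in real use that line would not appear in your script, because the key would live in your .Renviron file.

```r
# For this demonstration only, we set the variable inside the session.
# In practice the line WEATHER_API_KEY=your_key_here lives in .Renviron,
# and this Sys.setenv() call is left out of the script entirely.
Sys.setenv(WEATHER_API_KEY = "your_key_here")

# Read the key without writing it into the code
api_key <- Sys.getenv("WEATHER_API_KEY")

# Sys.getenv() returns an empty string when the variable is not set
if (api_key == "") {
  stop("WEATHER_API_KEY is not set. Add it to your .Renviron file and restart R.")
}

# Check that a key was found without printing the key itself
nchar(api_key) > 0
```

Failing early with a clear message saves you from sending requests that the server would reject anyway.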

Rate Limiting: Playing Nice with APIs

  • Most APIs implement rate limiting, which restricts how many requests you can make in a given time period.
    • For example, a free API might allow 60 requests per hour or 1000 requests per day.
    • Rate limiting prevents abuse: without it, someone could overwhelm the server with millions of requests, making the service unavailable for everyone else.
    • If you exceed the rate limit, you’ll typically receive a 429 “Too Many Requests” status code, and your request will be rejected.
    • This is why you should be thoughtful about your API calls—don’t request the same data repeatedly when you could save it once, and avoid putting API calls inside loops that run thousands of times.
    • Understanding rate limits helps you design efficient code: batch your requests, cache results when possible, and respect the API provider’s resources.
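
One possible way to put this advice into practice is to remember results you have already fetched and to pause briefly before each new request. The sketch below is a hypothetical illustration: fake_api_call() is a stand-in for a real request so that the code runs offline, and the names get_weather, cache, and pause are invented for this example.

```r
# A stand-in for a real API call, so the sketch runs without network access.
# It counts how often it is called.
n_calls <- 0
fake_api_call <- function(city) {
  n_calls <<- n_calls + 1
  paste("weather for", city)
}

# Results already fetched are kept here, keyed by city name
cache <- list()

get_weather <- function(city, pause = 0.1) {
  if (!is.null(cache[[city]])) {
    return(cache[[city]])   # served from the cache: no request is sent
  }
  Sys.sleep(pause)          # wait briefly before each new request
  result <- fake_api_call(city)
  cache[[city]] <<- result  # remember the result for next time
  result
}

# Five lookups, but only three distinct cities
cities <- c("Amsterdam", "Utrecht", "Amsterdam", "Utrecht", "Rotterdam")
results <- vapply(cities, get_weather, character(1))

# Number of requests actually sent
n_calls
```

Here five lookups lead to only three calls, because the repeated cities are answered from the cache. The right pause length depends on the limits the API provider documents.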

Making API Calls in R: The httr Package

  • R has several packages for making HTTP requests, but httr is one of the most popular and user-friendly.
  • The basic function is GET(), which sends an HTTP GET request to a URL and returns the response.
    • A simple API call looks like this: response <- GET("https://api.example.com/data"), which stores the server’s response in an object.
    • You can add parameters using the query argument: GET("https://api.example.com/data", query = list(city = "Amsterdam", limit = 10)).
  • The httr package handles authentication too: you can pass your API key as a query parameter, send it in a header with add_headers(), or use authenticate() for username-and-password logins.
  • Once you receive the response, you can check the status code with status_code(response) and extract the content with content(response).
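
To see what the query argument does, it can help to build the full URL without sending anything. This is a small sketch, assuming httr is installed; it uses httr's modify_url() helper and the placeholder address api.example.com from the bullets above, which is not a real service.

```r
library(httr)

# modify_url() assembles the full URL from a base address and a named list
# of parameters. No request is sent.
full_url <- modify_url(
  "https://api.example.com/data",
  query = list(city = "Amsterdam", limit = 10)
)

# Inspect the address that a GET() call with the same query list would request
print(full_url)
```

Printing the URL this way is a handy debugging step: you can paste it into a browser to check that the parameters are spelled the way the API expects.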

Example: httr in Action

Example: The httr Package

This example uses a weather API that does not require authentication to retrieve information about the real-time weather in Amsterdam.

# Example: Getting Weather Data from Open-Meteo API
# Open-Meteo provides free weather data without requiring an API key
library(httr)
library(jsonlite)

# Step 1: Build the API URL
base_url <- "https://api.open-meteo.com/v1/forecast"

# Step 2: Define the parameters we want to send
# We need to specify latitude and longitude for Amsterdam
params <- list(
  latitude = 52.37,      # Amsterdam's latitude
  longitude = 4.89,      # Amsterdam's longitude
  current = "temperature_2m,relative_humidity_2m,wind_speed_10m",
  timezone = "Europe/Amsterdam"
)

# Step 3: Make the API request
response <- GET(base_url, query = params)

# Step 4: Check if the request was successful
status_code(response)  # Should be 200 if successful
[1] 200
# Step 5: Extract and parse the JSON content
weather_data <- content(response, as = "text", encoding = "UTF-8")  # Get raw JSON as text
weather_parsed <- fromJSON(weather_data)        # Parse JSON to R object

# Step 6: Look at the structure of what we received
str(weather_parsed)
List of 9
 $ latitude             : num 52.4
 $ longitude            : num 4.9
 $ generationtime_ms    : num 0.0446
 $ utc_offset_seconds   : int 7200
 $ timezone             : chr "Europe/Amsterdam"
 $ timezone_abbreviation: chr "GMT+2"
 $ elevation            : num 11
 $ current_units        :List of 5
  ..$ time                : chr "iso8601"
  ..$ interval            : chr "seconds"
  ..$ temperature_2m      : chr "°C"
  ..$ relative_humidity_2m: chr "%"
  ..$ wind_speed_10m      : chr "km/h"
 $ current              :List of 5
  ..$ time                : chr "2026-04-15T14:15"
  ..$ interval            : int 900
  ..$ temperature_2m      : num 17.7
  ..$ relative_humidity_2m: int 39
  ..$ wind_speed_10m      : num 16.9
# Step 7: Extract the specific information we want
current_temp <- weather_parsed$current$temperature_2m
current_humidity <- weather_parsed$current$relative_humidity_2m
current_wind <- weather_parsed$current$wind_speed_10m

# Print the results
cat("Current weather in Amsterdam:\n")
Current weather in Amsterdam:
cat("Temperature:", current_temp, "°C\n")
Temperature: 17.7 °C
cat("Humidity:", current_humidity, "%\n")
Humidity: 39 %
cat("Wind Speed:", current_wind, "km/h\n")
Wind Speed: 16.9 km/h

Getting Data from the Internet: Databases

Databases: Structured Storage for Large Datasets

  • While APIs are great for accessing external data over the internet, databases are designed for storing and querying large amounts of structured data.
    • Think of a database as a sophisticated, high-performance version of a spreadsheet that can handle millions of rows and complex relationships between tables.
    • Databases live on servers (either local or remote) and use specialized software called a Database Management System (DBMS) to store and retrieve data efficiently.
    • The most common type of database is a relational database, which stores data in tables (similar to data frames) with rows and columns.
    • What makes databases powerful is their ability to store multiple related tables and quickly find exactly the data you need, even from billions of records.
    • Many organizations store their data in databases rather than files because databases provide better performance, security, and concurrent access for multiple users.

SQL: The Language of Databases

  • SQL (Structured Query Language) is the standardized language for communicating with relational databases.
    • Just as HTTP is the protocol for talking to web servers, SQL is the language for talking to databases: you write SQL queries to ask a database for data.
    • A SQL query is a statement that describes what data you want, and the database figures out the most efficient way to retrieve it.
    • The most basic SQL query is SELECT, which retrieves data from a table: SELECT name, age FROM students WHERE age > 20.
    • This query says: “From the students table, give me the name and age columns, but only for rows where age is greater than 20.”
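
If you are comfortable with data frame indexing, it may help to see the same request written in R. The sketch below uses a small made-up students data frame (the names and ages are invented for illustration) and selects the same rows and columns that the SQL query describes.

```r
# A small made-up table standing in for the students table
students <- data.frame(
  name = c("Anna", "Bill", "Chen"),
  age  = c(19, 22, 24)
)

# SQL:  SELECT name, age FROM students WHERE age > 20
# R:    rows where age > 20, columns name and age
older <- students[students$age > 20, c("name", "age")]
print(older)
```

The WHERE clause plays the role of the row condition and the SELECT list plays the role of the column selection. The difference is that a database applies them on the server, before any data reaches your computer.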

Connecting R to Databases

  • R can connect directly to databases, allowing you to send SQL queries and receive results as data frames.
  • The DBI package provides a standard interface for connecting to databases, regardless of which specific database system you’re using.
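    • You first establish a connection with something like: con <- dbConnect(RMySQL::MySQL(), host = "database.server.com", user = "username", password = "password", dbname = "mydata"). In real code, read the password from an environment variable rather than typing it into the script, just as you would with an API key.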
    • Once connected, you can send SQL queries using dbGetQuery(con, "SELECT * FROM students WHERE grade > 8"), which returns the results as a data frame.
    • This is incredibly powerful: you can work with datasets far too large to fit in your computer’s memory by querying just the portions you need.
    • The combination of R’s analytical capabilities and databases’ storage and querying power makes it possible to analyze data at scales impossible with CSV files alone.
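
The connect-query-disconnect pattern can be tried without a database server. The sketch below swaps in an in-memory SQLite database through the RSQLite package, which is an assumption made here for convenience and not the MySQL setup from the bullets above; the students table and its values are invented for illustration.

```r
library(DBI)

# Connect to a temporary SQLite database that lives only in memory
# (an assumption for this sketch; a real server needs host, user, password)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a small made-up table to query
students <- data.frame(
  name  = c("Anna", "Bill", "Chen"),
  age   = c(19, 22, 24),
  grade = c(8.5, 7.0, 9.1)
)
dbWriteTable(con, "students", students)

# Send a SQL query; the result comes back as a data frame
older <- dbGetQuery(con, "SELECT name, age FROM students WHERE age > 20")
print(older)

# Values can be passed separately from the query text using a placeholder
top <- dbGetQuery(con, "SELECT * FROM students WHERE grade > ?", params = list(8))
print(top)

# Always close the connection when you are done
dbDisconnect(con)
```

Because DBI provides a common interface, the dbGetQuery() and dbDisconnect() calls should look the same for other database systems; mainly the dbConnect() line changes.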

Workflow

Practical Workflow: From API/Database to Analysis

  • A typical data science workflow starts with identifying where your data lives: Is it available through an API? Stored in a database? Available as a download?
  • You then write R code to connect to that source: constructing API requests with proper authentication, or establishing database connections with the right credentials.
    • Once connected, you retrieve the specific data you need, being mindful of rate limits for APIs or query efficiency for databases.
    • The data arrives in R as a data frame (or can be converted to one).
    • Often you’ll save the retrieved data locally (as a CSV or RDS file) so you don’t need to re-query the API or database every time you run your analysis.
  • This workflow—connect, retrieve, analyze, save—is fundamental to modern data science and turns R into a tool for accessing data from anywhere in the world.
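
The "save locally" step above can be sketched as a simple check: if a saved copy exists, read it; otherwise retrieve the data and save it. In this hypothetical illustration, fetch_data() stands in for a real API or database call so that the code runs offline, and the file is written to R's temporary directory; in a real project you would choose a folder inside your project instead.

```r
# Where the local copy is kept (a temporary folder for this demonstration)
cache_file <- file.path(tempdir(), "gdp_brazil.rds")

# A stand-in for the real retrieval step (API call or database query).
# The two values are taken from the World Bank output shown earlier.
fetch_data <- function() {
  data.frame(
    date  = c(2023, 2024),
    value = c(2.191132e12, 2.185822e12)
  )
}

if (file.exists(cache_file)) {
  # A saved copy exists: read it and skip the request
  gdp <- readRDS(cache_file)
} else {
  # No saved copy yet: retrieve the data and save it for next time
  gdp <- fetch_data()
  saveRDS(gdp, cache_file)
}

print(gdp)
```

Delete the saved file whenever you want fresh data, and the next run will retrieve it again.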

Why This Matters for Economics Students

  • As economists, you’ll work with data from central banks, statistical agencies, financial markets, and research databases—most of which provide APIs.
  • Understanding APIs means you can build analyses that update automatically with new data, rather than manually downloading files every week.
    • Many economic indicators (GDP, unemployment, inflation, stock prices) are available through APIs, allowing you to create real-time dashboards and analyses.
    • Research in economics increasingly involves large datasets stored in databases that you’ll need to query efficiently.
    • These skills aren’t just for programming specialists—they’re becoming standard tools for empirical research in economics.
  • Learning to access data programmatically, rather than through point-and-click interfaces, makes your research reproducible, scalable, and professional.

Recapitulation

  • Basics of R: data is stored in objects (such as vectors, data frames, and lists) and manipulated using functions.
  • We covered techniques for accessing specific data through indexing (using [], $, and [row, column] syntax).
  • We introduced APIs (Application Programming Interfaces) as the primary method for programs to communicate over the internet.
    • We talked about the client-server model, where R uses the HTTP protocol to fetch up-to-date, specific data rather than downloading massive static files.
  • We examined the technical details of API requests, including endpoints, parameters, and JSON. We also highlighted the importance of authentication (via API keys) and of respecting rate limits.
  • Finally, we touched upon databases as structured storage for large-scale data that R can access directly.