Introduction to Applied Data Science

Lecture 6: Prompt Engineering and Structured Data

Bas Machielsen

Overview

Course Schedule

Event Date Subject
Lecture 1 21-04 Introduction to Data and Data Science
Lecture 2 28-04 Getting Data: APIs and Databases
Lecture 3 07-05 Getting Data: Web Scraping
Lecture 4 26-05 Text as Data
Lecture 5 27-05 Introduction to LLMs
Lecture 6 09-06 Prompt Engineering and Structured Data
Lecture 7 16-06 Spatial Data and Geocomputation

Outline Today

  • First part: introduction to prompts and prompt engineering.
  • Second part: using LLMs to extract structured data from unstructured text.

LLM Fundamentals

The ellmer Package

  • First, if you haven’t done so already, install and load the ellmer package in R.
  • Create a chat session by specifying which LLM provider and model you want to use.
  • Send prompts to the model using the chat function and receive responses.
  • You can see how many tokens you’ve used and the cost of your conversation by printing the chat object.
library(ellmer)
chat <- chat_ollama(model="gemma3")
chat$chat("What is the capital of the Netherlands?")
The capital of the Netherlands is **Amsterdam**. 

It’s the largest city and the political, cultural, and economic center of the 
country. 😊 

Do you want to know anything more about Amsterdam or the Netherlands?
print(chat$get_tokens())
# A tibble: 1 × 5
  input output cached_input cost       input_preview                            
  <dbl>  <dbl>        <dbl> <ellmr_dl> <chr>                                    
1    17     47            0 NA         Text[What is the capital of the Netherla…

Why Use LLMs in Data Science?

  • LLMs can help you write code faster, even if you’re just learning to program.
  • They excel at extracting structured information from messy, unstructured text data.
  • They can explain code you don’t understand or help debug errors.
  • They allow you to quickly prototype solutions to problems that might otherwise require specialized tools.
  • Even a 90% correct solution from an LLM can save you significant time compared to starting from scratch.
  • They can extract structured data (.json, .csv, etc.) from unstructured text (articles, reviews, documents).

How to Interact with LLMs?

  • It all starts with a prompt, which is the text (typically a question or a request) that you send to the LLM.
  • This starts a conversation, a sequence of turns that alternate between user prompts and model responses.
    • Inside the model, both the prompt and response are represented by a sequence of tokens, which represent either individual words or subcomponents of a word.
    • The tokens are used to compute the cost of using a model and to measure the size of the context, the combination of the current prompt and any previous prompts and responses used to generate the next response.

Providers and Models

  • The ellmer package is organized on the basis of functions: one function for each provider. Within each function, you can specify which model to use.
    • A provider is a web API that gives access to one or more models.
    • The distinction is a bit subtle because providers are often synonymous with a model, like OpenAI and GPT, Anthropic and Claude, and Google and Gemini.
    • But other providers, like Ollama, can host many different models, typically open source models like LLaMa and Mistral.
    • Still other providers support both open and closed models, typically by partnering with a company that provides a popular closed model.
    • For example, Azure OpenAI offers both open source models and OpenAI’s GPT, while AWS Bedrock offers both open source models and Anthropic’s Claude.

Prompts and Conversations

  • A prompt is the text (usually a question or request) that you send to the LLM.
    • A conversation is a sequence of turns that alternate between your prompts and the model’s responses.
    • Each request to the LLM includes not only your current prompt but also all previous prompts and responses in the conversation.
    • This means conversations get more expensive as they get longer, so it’s best to keep them focused and concise.
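
The growing-context behaviour can be seen in a short sketch (assuming a local Ollama install with the gemma3 model, as used elsewhere in these slides):

```r
library(ellmer)

# Assumes a local Ollama install with the gemma3 model pulled
chat <- chat_ollama(model = "gemma3")

# Turn 1: this prompt and its response become part of the context
chat$chat("Name one province of the Netherlands.")

# Turn 2: "its" only makes sense because turn 1 is resent along with it
chat$chat("What is its capital?")

# Token counts grow with every turn, and so does the cost
print(chat$get_tokens())
```

Starting a fresh chat object resets the context (and the running cost) to zero.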

Tokens

  • LLMs don’t process words directly—they convert text into tokens (which represent words or parts of words).
  • On average, one English word equals about 1.5 tokens.
  • A typical page of text might use 375-400 tokens, while a complete book could use 75,000-150,000 tokens.
  • Tokens matter because they determine the cost of using an LLM and how much context the model can process at once.
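
As a rough illustration of the 1.5 tokens-per-word heuristic above (real tokenizers differ by model, so treat this purely as a back-of-envelope estimate):

```r
# Back-of-envelope token estimate using the ~1.5 tokens/word rule of thumb
estimate_tokens <- function(text) {
  n_words <- length(strsplit(text, "\\s+")[[1]])
  ceiling(n_words * 1.5)
}

estimate_tokens("What is the capital of the Netherlands?")  # 7 words -> ~11 tokens
```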

System Prompts

  • A system prompt is set when you start a conversation and affects every response from the model.
    • It’s used to give the LLM additional instructions that shape its behavior.
    • For example, you might ask it to always respond in a specific programming style or to assume you’re a beginner.
  • When programming with LLMs, you should work on improving your system prompt to get better results.

Example: System Prompt

  • The following example shows how the system prompt influences the LLM’s output.

Example: System Prompts and Conversations

library(ellmer)
chat <- chat_ollama(
  model = "gemma3",
  system_prompt = "You are an expert R programmer who writes clean, efficient, and well-commented code. Return only code, no explanations."
)
chat$chat("Write a function that calculates the Fibonacci sequence.")
```R
#' Calculate the Fibonacci sequence up to a specified length.
#'
#' This function generates a vector containing the Fibonacci sequence up to a
#' specified number of terms.
#'
#' @param n The number of terms in the Fibonacci sequence to generate.
#'          Must be a non-negative integer.
#' @param start_index An integer indicating the starting index (1-based) of the
#'                    sequence.
#'                    Defaults to 1.
#'
#' @return A numeric vector containing the Fibonacci sequence.  Returns an
#'         empty vector if n is 0.
#'
#' @examples
#' fibonacci_sequence(5)
#' fibonacci_sequence(10, start_index = 2)
#'
#' @export
fibonacci_sequence <- function(n, start_index = 1) {
  # Input validation
  if (!is.numeric(n) || n <= 0 || n != round(n)) {
    stop("n must be a positive integer.")
  }

  if (!is.numeric(start_index) || start_index != round(start_index)) {
    stop("start_index must be a integer")
  }
  

  # Initialize the sequence
  if (n == 0) {
    return(numeric(0))
  }

  sequence <- vector(length = n, mode = "numeric")

  # Base cases
  if (start_index == 1) {
    sequence[1] <- 0
    sequence[2] <- 1
  } else {
    sequence[1] <- 1
    sequence[2] <- 1
  }

  # Generate the rest of the sequence
  for (i in 3:n) {
    sequence[i] <- sequence[i - 1] + sequence[i - 2]
  }

  return(sequence)
}
```

User Prompt

  • A user prompt is the specific request or question you send to the LLM during a conversation.
    • This is where you describe the task you want the model to perform.
    • The quality of your user prompt greatly influences the quality of the model’s response.
    • Effective user prompts are clear, specific, and provide necessary context.

Treat AI like an infinitely patient new coworker who forgets everything you tell them each new conversation, one that comes highly recommended but whose actual abilities are not that clear. Two parts of this are analogous to working with humans (being new on the job and being a coworker) and two of them are very alien (forgetting everything and being infinitely patient). We should start with where AIs are closest to humans, because that is the key to good-enough prompting.

Local vs. Cloud LLMs

Local vs. Cloud LLMs

  • So far, we have used the chat_ollama() function from the ellmer package, allowing us to deploy local LLMs hosted on our own machine using the Ollama app.
  • Local LLMs have several advantages:
    • No data leaves your machine, enhancing privacy and security.
    • No ongoing costs—once you have the model, you can use it without paying per token.
  • However, for some tasks, cloud-based LLMs (like OpenAI’s GPT-4 or Anthropic’s Claude) may offer better performance or capabilities.
  • The ellmer package supports both local and cloud LLMs, allowing you to choose the best option for your needs.

Setting Up Cloud LLMs

  • To use cloud-based LLMs, you’ll typically need to set up an account with the provider (e.g., OpenAI, Anthropic).
    • You must generate credentials on the OpenAI website.
    • Note that the API is a developer product and is billed separately from a standard “ChatGPT Plus” subscription.
    • You usually need to buy “credits” (pre-paid billing) to use the API.
    1. Go to the Platform: Navigate to platform.openai.com in your web browser.
    2. Sign Up/Log In: Create a new account or log in with your existing OpenAI credentials.
    3. Set Up Billing:
      • On the left sidebar (or under the Settings gear icon), look for Billing.
      • Add a payment method and purchase a small amount of credits (e.g., $5 or $10) to get started.
      • Without credits, the API will likely return an error.

Creating API Key

  1. Create the Key:
    • Go to Dashboard or click the API Keys section (often found under “Your Profile” or “Project” settings on the left).
    • Click + Create new secret key.
    • Give it a name (e.g., “R-ellmer-project”).
    • Important: Copy the key string (it starts with sk-...) immediately. You will not be able to see it again once you close that window.

Setting Up API Key in R

  • The usethis package provides a helper function to edit your .Renviron file.
    • This is a text file that runs every time R starts, allowing you to save secrets (like API keys) securely so you don’t have to type them into your scripts.
    • Run the following command in your R console: usethis::edit_r_environ()
    • A text editor will open displaying your .Renviron file (it might be empty).
    • Add a new line with the following format, pasting the key you copied from the website:
    • OPENAI_API_KEY=sk-proj-123456789...
    • Save and close the file.
    • Restart your R session to load the new environment variable.
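
To check that the key was picked up after the restart, you can read it back with Sys.getenv() (printing only a prefix so the full key never ends up in your console history):

```r
# Confirm the environment variable is visible to the current R session
key <- Sys.getenv("OPENAI_API_KEY")
if (nzchar(key)) {
  cat("Key loaded:", substr(key, 1, 7), "...\n")
} else {
  cat("OPENAI_API_KEY not found - check .Renviron and restart R.\n")
}
```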

Using Cloud LLMs in ellmer

  • The ellmer package is designed to look for specific environment variables automatically.
    • Because you named your variable OPENAI_API_KEY in the previous step, ellmer will find it without you needing to copy-paste it into your code.
    • You simply initialize the chat object. You do not need to provide the key as an argument.
library(ellmer)

# Create a chat object
# ellmer automatically finds OPENAI_API_KEY in the background
chat <- chat_openai(model = "gpt-4o")

# Send a message
chat$chat("Hello, tell me a short fun fact about the R programming language.")
The R programming language was named partly as a play on the name of its 
predecessors, "S" and "S-PLUS." The creators of R, Ross Ihaka and Robert 
Gentleman, both have first names starting with "R," which also influenced the 
naming.

Prompt Engineering

What is Prompt Engineering?

  • Prompt engineering is the art and science of writing effective prompts to get the best responses from an LLM.
  • A well-designed prompt is clear, detailed, and gives the model enough context to understand what you need.
  • Good prompts often include examples of what you want (showing the model what “good” looks like).
  • Iterating and refining your prompts is key—your first attempt rarely produces the perfect result.

Best Practices

  • Be specific and detailed about what you want the LLM to do.
  • Provide examples when possible (e.g., “convert this code to use tidyverse style, like this example…”).
  • Break complex tasks into smaller steps and ask the LLM to work through them one at a time.
  • Specify the format you want for the output (e.g., “respond with only R code” or “give me a bullet-point list”).
  • Don’t be afraid to experiment—if a prompt doesn’t work, try rephrasing it or adding more context.
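
To make these practices concrete, here is a sketch of a vague prompt rewritten into a specific one (the function in the prompt is a made-up example; assumes a local Ollama install with gemma3):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

# Vague: the model has to guess the task, the style, and the output format
vague <- "Fix my code."

# Specific: states the task, the constraints, and the desired output format
specific <- paste(
  "Rewrite the R function below in tidyverse style.",
  "Keep the same behaviour. Respond with only R code, no explanations.",
  "Function: mean_by_group <- function(df) aggregate(y ~ g, df, mean)"
)
chat$chat(specific)
```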

Structured Data

Structured Data Extraction: The Problem

  • Much of the world’s data exists in unstructured text—emails, customer reviews, social media posts, documents.
  • Traditional data analysis requires data in structured formats (like tables with rows and columns).
  • Converting unstructured text to structured data manually is extremely time-consuming.
  • This is where LLMs can be incredibly powerful—they’re excellent at finding patterns and extracting information from text.

Why Not Prompt Engineering?

  • When using an LLM to extract data from text or images, you can ask the chatbot to format it in JSON or any other format that you like.
    • This works well most of the time, but there’s no guarantee that you’ll get the exact format you want.
    • In particular, if you’re trying to get JSON, you’ll find that it’s typically surrounded in ```json, and you’ll occasionally get text that isn’t valid JSON.
    • To avoid these problems, you can use a recent LLM feature: structured data (aka structured output).
    • With structured data, you supply the type specification that defines the object structure you want and the LLM ensures that’s what you’ll get back.

Key Examples of Structured Data

  • LLMs are highly effective at converting unstructured data into structured formats.
    • Although they aren’t infallible, they automate the heavy lifting of information extraction, drastically cutting down on manual processing.

Examples: Structured Data

  1. Article summaries: Extract key points from lengthy reports or articles to create concise summaries for decision-makers.
  2. Entity recognition: Identify and extract entities such as names, dates, and locations from unstructured text to create structured datasets.
  3. Sentiment analysis: Extract sentiment scores and associated entities from customer reviews or social media posts to gain insights into public opinion.
  4. Classification: Classify text into predefined categories, such as spam detection or topic classification.
  5. Image/PDF input: Extract data from images or PDFs, such as tables or forms, to automate data entry processes.

Structured Data Basics

  • To extract structured data, call $chat_structured() instead of $chat().
    • You’ll also need to define a type specification that describes the structure of the data that you want (more on that shortly). Here’s a simple example that extracts two specific values from a string:
library(ellmer)
chat <- chat_ollama(model="gemma3")
chat$chat_structured(
  "My name is Susan and I'm 13 years old",
  type = type_object(
    name = type_string(),
    age = type_number()
  )
)
$name
[1] "Susan"

$age
[1] 13

Structured Data Extraction: How It Works

  • You provide the LLM with unstructured text and tell it exactly what structured information you want to extract.
  • The LLM reads the text and returns the information in the format you specified (e.g., a data.frame or list).
  • In ellmer, you can request data in specific formats that are easy to work with in R.
  • Even if the LLM isn’t 100% accurate, it often gets you 80-90% of the way there, which you can then verify or correct.

Structured Data Schemas

  • A type specification tells the LLM exactly what you’re looking for when dealing with structured data.
  • It guarantees that the LLM will only return JSON, and that the JSON will have the fields that you expect.
    • ellmer will convert it into an R data structure (like a data.frame)
  • The type_() functions define your desired data schema, and they fall into three main categories: scalars, arrays, and objects. We start with scalars.
    • Scalars represent single values and include type_boolean(), type_integer(), type_number(), type_string(), and type_enum().
    • These correspond to single logical, integer, double, string, and factor values in R respectively.
    • Each type function accepts a description argument that tells the LLM what the data represents, which helps guide extraction accuracy.
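
A brief sketch combining scalar types with descriptions (the review text is invented; assumes a local Ollama install with gemma3; type_enum() can be used similarly to restrict a field to fixed levels):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

# The description strings steer what the model extracts for each field
chat$chat_structured(
  "Terrible battery life, but the screen is gorgeous. 2 out of 5 stars.",
  type = type_object(
    sentiment = type_string("Overall tone: positive, negative, or mixed"),
    stars = type_integer("Star rating mentioned in the text, from 1 to 5")
  )
)
```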

Arrays: Vectors and Lists

  • Arrays represent collections of values of the same type and are created with type_array().
    • These collections can be of undetermined length, meaning the LLM can return an arbitrary number of “matches” found in the input data.
    • The item argument specifies what type each element should be, allowing you to create structures similar to R’s atomic vectors.
    • For example, type_array(type_string()) creates a character vector, while type_array(type_integer()) creates an integer vector.
    • You can also nest arrays within arrays to create list-like structures with well-defined types.
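
A minimal sketch of an array extraction (the sentence is invented; assumes a local Ollama install with gemma3):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

# An array of strings comes back as a character vector in R
chat$chat_structured(
  "We visited Utrecht, Groningen, and Maastricht last summer.",
  type = type_object(
    cities = type_array(type_string())
  )
)
```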

Objects: Named Lists

  • Objects represent collections of named values and are created with type_object().
  • Objects can contain any combination of scalars, arrays, and other objects, making them similar to named lists in R.
  • For example, type_object(name = type_string(), age = type_integer(), hobbies = type_array(type_string())) defines a person object with a name, age, and list of hobbies.
    • The first argument to type_object() should be a description of what the object represents.

Creating Data Frames

  • To extract data frames from a single prompt, use type_array(type_object(...)) where the object defines each row.
    • For example, extracting multiple people from text requires type_array(type_object(name = type_string(), age = type_integer())) to ensure all fields align into a proper data frame.
    • This row-oriented approach matches how most languages handle tabular data.

Example: Data Frame

library(ellmer)
prompt <- r"(
* John Smith. Age: 30. Height: 180 cm. Weight: 80 kg.
* Jane Doe. Age: 25. Height: 5'5". Weight: 110 lb.
* Jose Rodriguez. Age: 40. Height: 190 cm. Weight: 90 kg.
* June Lee | Age: 35 | Height 175 cm | Weight: 70 kg
)"

type_people <- type_array(
  type_object(
    name = type_string(),
    age = type_integer(),
    height = type_number("in m"),
    weight = type_number("in kg")
  )
)

chat <- chat_ollama(model="gemma3")
chat$chat_structured(prompt, type = type_people)
# A tibble: 4 × 4
  name             age height weight
  <chr>          <int>  <dbl>  <dbl>
1 John Smith        30  180       80
2 Jane Doe          25    5.5    110
3 Jose Rodriguez    40  190       90
4 June Lee          35  175       70

Multi-Prompt Extraction

  • If you need to extract data from multiple prompts, you can use parallel_chat_structured(). It takes the same arguments as $chat_structured() with two exceptions:
    • It needs a chat object since it’s a standalone function, not a method, and it can take a vector of prompts.

Example: Multi-item Extraction

library(ellmer)
prompts <- list(
  "I go by Alex. 42 years on this planet and counting.",
  "Pleased to meet you! I'm Jamal, age 27.",
  "They call me Li Wei. Nineteen years young.",
  "Fatima here. Just celebrated my 35th birthday last week.",
  "The name's Robert - 51 years old and proud of it.",
  "Kwame here - just hit the big 5-0 this year."
)
type_person <- type_object(
  name = type_string(),
  age = type_number()
)
chat <- chat_ollama(model = "gemma3")
parallel_chat_structured(chat, prompts, type = type_person)
# A tibble: 6 × 2
  name     age
  <chr>  <dbl>
1 Alex      42
2 Jamal     27
3 Li Wei    19
4 Fatima    35
5 Robert    51
6 Kwame     50

Handling Missing Values

  • By default, type_ functions use required = TRUE, which can lead to hallucinations when data doesn’t exist in the source.
    • Setting required = FALSE allows fields to be missing, resulting in NA values in R when the LLM cannot find the requested information.
    • This is crucial for robust extraction where not all fields may be present in every input.
    • Always test with both positive examples (containing the data) and negative examples (missing the data) to validate your extraction logic.
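
A sketch of the positive/negative test pattern described above (invented example texts; assumes a local Ollama install with gemma3):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

type_person <- type_object(
  name = type_string(),
  # Age may simply not be mentioned; required = FALSE lets the model
  # leave the field empty instead of inventing a value
  age = type_number(required = FALSE)
)

# Positive example: age is present in the text
chat$chat_structured("I'm Noor and I'm 29 years old.", type = type_person)

# Negative example: age is absent, so it should come back as NA
chat$chat_structured("Hi there, my name is Pieter.", type = type_person)
```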

Tool Calling

Introduction to Tool Calling

  • Tool calling allows chat models to request the execution of external functions that you define and provide.
    • When making a chat request, you advertise one or more tools (defined by name, description, and arguments), and the model can respond with tool call requests.
    • These are requests from the model to you to execute a function with given arguments, not direct executions by the model itself.
    • You execute the functions and return results by submitting another chat request with the conversation history plus the results, allowing the model to use those results or make additional tool calls.

What Does Tool Calling Do?

  • The chat model does not directly execute any external tools.
    • The model only makes requests for the caller to execute them on its behalf.
  • The correct flow is:
    • The user sends a request.
    • The assistant determines a tool is needed and requests that the user run it.
    • The user executes the tool and returns the result.
    • The assistant uses that result to generate the final answer.
  • The value the chat model provides is knowing when to call a tool, what arguments to pass, and how to interpret the results in formulating its response.

Creating Tools

  • To create a tool, you first write a regular R function, then wrap it with tool() to provide metadata the model needs.
    • The tool() function requires a name, description, and an arguments list that uses type functions like type_string(), type_integer(), etc.

Example: Creating Tools

get_current_time <- function(tz = "UTC") {
  format(Sys.time(), tz = tz, usetz = TRUE)
}

get_current_time <- tool(get_current_time, 
  name = "get_current_time", 
  description = "Returns the current time.", 
  arguments = list(tz = type_string("Time zone to display...", required = FALSE))
  )

Automatic Tool Generation

  • Writing tool definitions manually can be tedious, so ellmer provides create_tool_def() to automatically generate the tool() call for you.
    • This function uses an LLM to analyze your function and generate the appropriate metadata.
    • While create_tool_def() is a significant time-saver, you must review the generated code before using it to ensure accuracy.
    • Once created, a tool is still just a special type of function that you can call directly.
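
A sketch of how this might look (the exact arguments of create_tool_def() may differ between ellmer versions, so check ?create_tool_def, and always review the generated code before using it):

```r
library(ellmer)

# A plain R function we want to expose as a tool
get_current_time <- function(tz = "UTC") {
  format(Sys.time(), tz = tz, usetz = TRUE)
}

# Ask an LLM to draft the tool() wrapper; the output is code to review,
# copy into your script, and correct where needed - not to run blindly
create_tool_def(get_current_time, chat = chat_ollama(model = "gemma3"))
```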

Registering Tools

  • To give your chat object access to a tool, use the $register_tool() method with your tool function.
    • Once registered, the model automatically decides when to use the tool based on user queries, with no further guidance needed.
    • For example, after registering a get_current_time tool, asking “How long ago did Neil Armstrong land on the moon?” will cause the model to call the tool to get the current date before calculating the answer.
    • You can inspect the chat history to see where and how the model used the tool.
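
Putting registration together with the moon-landing question from the slide (assumes a local Ollama install with gemma3, and reuses the get_current_time tool defined earlier in these slides):

```r
library(ellmer)

# The tool from the earlier slide, wrapped inline
get_current_time <- tool(
  function(tz = "UTC") format(Sys.time(), tz = tz, usetz = TRUE),
  name = "get_current_time",
  description = "Returns the current time.",
  arguments = list(tz = type_string("Time zone to display...", required = FALSE))
)

chat <- chat_ollama(model = "gemma3")
chat$register_tool(get_current_time)

# The model decides on its own to call the tool for the current date
chat$chat("How long ago did Neil Armstrong land on the moon?")

# The printed history shows the tool request and tool result turns
print(chat)
```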

Tool Engineering

  • When designing tools, include guidance in the description to optimize model behavior, such as “For efficiency, request all weather updates using a single tool call” to encourage batch processing.
  • Tools can perform various actions: calling APIs, reading/writing databases, running simulations, calling other AI models.
  • The key is keeping the interface simple while enabling powerful capabilities, allowing the model to orchestrate complex workflows through multiple tool calls when needed.

Example Tool Calling

  • The following is an example using the 2025 Africa Cup of Nations (a football tournament played in December 2025 and January 2026), which was won by Senegal.
    • Most LLMs have a knowledge cut-off somewhere in 2025, meaning they do not “know” the winner yet.

Example: Tool Calling

library(ellmer)
library(rvest)
library(httr2)

# First, let's try WITHOUT a tool - the LLM won't know the answer
chat_no_tool <- chat_ollama(model = "llama3.2:3b")
chat_no_tool$chat("Who won the Africa Cup of Nations in 2025-2026?")
I don't have information on the winner of the Africa Cup of Nations for the 
2025-2026 tournament as my knowledge cutoff is 01 March 2023, and I do not have
real-time information. Is there something else I can help you with?
# Tool that searches Wikipedia first, then fetches the best match
fetch_wikipedia <- tool(
  function(search_query) {
    # Step 1: Use Wikipedia search API to find the correct page title
    search_url <- paste0(
      "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=",
      URLencode(search_query),
      "&format=json&srlimit=1"
    )
    search_result <- request(search_url) |>
      req_perform() |>
      resp_body_json()

    if (length(search_result$query$search) == 0) {
      return("No Wikipedia page found for that query.")
    }

    # Get the top result's title
    page_title <- search_result$query$search[[1]]$title

    # Step 2: Fetch the actual Wikipedia page
    url <- paste0("https://en.wikipedia.org/wiki/", URLencode(page_title, reserved = TRUE))
    paragraphs <- read_html(url) |>
      html_elements("#mw-content-text p") |>
      html_text2()

    # Get the first 5 substantial paragraphs (filter out very short ones)
    substantial_paragraphs <- paragraphs[nchar(paragraphs) > 50]
    content <- paste(head(substantial_paragraphs, 5), collapse = "\n\n")

    return(paste0("Page found: '", page_title, "'\n\n", content))
  },
  name = "fetch_wikipedia",
  description = "Searches Wikipedia and fetches content from the best matching page. Returns the main text content from the article.",
  arguments = list(
    search_query = type_string(
      "A natural language search query to find the right Wikipedia page (e.g., '2025 Africa Cup of Nations'). Do NOT use underscores — write it naturally."
    )
  )
)

# Now create a chat WITH the tool
chat_with_tool <- chat_ollama(
  model = "llama3.1",
  system_prompt = "You have access to a tool called fetch_wikipedia that searches Wikipedia and fetches content. Use this tool whenever you need to look up factual information. Use it as few times as possible."
)
chat_with_tool$register_tool(fetch_wikipedia)
chat_with_tool$chat("Who won the Africa Cup of Nations in 2025-2026?")
The winner of the Africa Cup of Nations in 2025-2026 was Senegal. They defeated
hosts Morocco 1–0 in the final after extra time.
# Examine the chat history to see the tool call
print(chat_with_tool)
<Chat Ollama/llama3.1 turns=5 input=1029 output=62>
── system ──────────────────────────────────────────────────────────────────────
You have access to a tool called fetch_wikipedia that searches Wikipedia and fetches content. Use this tool whenever you need to look up factual information. Use it as few times as possible.
── user ────────────────────────────────────────────────────────────────────────
Who won the Africa Cup of Nations in 2025-2026?
── assistant [input=250 output=27] ─────────────────────────────────────────────

[tool request (call_t8i7l6a2)]: fetch_wikipedia(search_query = "2025-26 Africa Cup of Nations winner")
── user ────────────────────────────────────────────────────────────────────────
[tool result  (call_t8i7l6a2)]:
Page found: '2025 Africa Cup of Nations'

The 2025 Africa Cup of Nations, known in short as the 2025 AFCON or CAN 2025 and for sponsorship purposes as the TotalEnergies 2025 Africa Cup of Nations, was the 35th edition of the biennial Africa Cup of Nations tournament organised by the Confederation of African Football (CAF).[2] It was the second edition hosted by Morocco after 1988.[3] Morocco was originally scheduled to host the 2015 edition, but withdrew due to fears stemming from the Western African Ebola virus epidemic.[4]

Due to FIFA expanding its Club World Cup competition to 32 teams and having it scheduled for June and July 2025, this edition of the tournament was played between 21 December 2025 and 18 January 2026.[5] It was the first time that the tournament was played over the Christmas and New Year period. The situation was further complicated by the addition of two extra match days scheduled for the last two weeks of January in the expanded 2025–26 UEFA Champions League season.[6]

This edition of the tournament was scheduled to be the second after 2019 to take place during the northern hemisphere's summer (June–July), in order to reduce scheduling conflicts with European club teams and competitions;[7] the previous 2023 edition was moved to January and February 2024 owing to the adverse summer weather conditions in Ivory Coast.[8]Guinea was originally set as hosts for this edition of the tournament, but had its hosting rights stripped after affirming its inadequacy of hosting preparations.[9][10] After a second bidding process,[11] Morocco was named as the new hosts on 27 September 2023.[12]

Defending champions Ivory Coast were eliminated in the quarter-finals by Egypt.[13]Senegal secured their second title after defeating hosts Morocco 1–0 in the final after extra time.[14]

CAF stripped Cameroon from hosting the 2019 edition of the tournament on 30 November 2018 due to lack of speed of progress in preparations,[15] but accepted former CAF president Ahmad Ahmad's request to stage the next edition in 2021. Consequently, the original hosts of 2021, Ivory Coast, became hosts of the 2023 edition with Guinea instead hosting the 2025 edition, which until then had no hosts.[16] The CAF President confirmed the timetable shift after a meeting with Ivorian president Alassane Ouattara in Abidjan, Ivory Coast on 30 January 2019.[17] On 30 September 2022, current CAF president Patrice Motsepe announced that Guinea had been stripped as host for the 2025 edition due to inadequacy and speed of progress in hosting preparations.[9] Consequently, a new process was re-opened for a replacement host bidder.[11][18] On 27 September 2023, the 2025 edition was awarded to Morocco[12] and the 2027 edition to Kenya, Tanzania, and Uganda.[19][20]
── assistant [input=779 output=35] ─────────────────────────────────────────────
The winner of the Africa Cup of Nations in 2025-2026 was Senegal. They defeated hosts Morocco 1–0 in the final after extra time.

Retrieval-Augmented Generation

Hallucination

  • Large language models produce fluent and confident responses, but they often generate information that is plausible yet incorrect.
    • These “hallucinations” occur because LLMs operate on text patterns without a concept of factual truth or falsehood.
    • LLMs generate text based solely on similarity to patterns in their training data, with no awareness of whether the output is accurate.
    • This fundamental limitation makes LLMs unreliable for tasks where accuracy matters.

RAG as a Solution

  • RAG addresses hallucinations by retrieving relevant excerpts from trusted, vetted sources before generating responses.
    • Instead of generating from memory, the LLM is asked to summarize, paraphrase, or answer questions using only the retrieved material.
    • This grounds responses in known content and significantly reduces hallucination risk.
    • RAG shifts the LLM’s role from open-ended generation to working with provided source material.
    • While RAG greatly reduces hallucinations, it doesn’t eliminate them entirely, so presenting links back to original sources allows users to verify details and check context.

Creating a Knowledge Store

  • The first stage of RAG is preparing a knowledge store—a database of processed content with embeddings.
  • This is done with the help of the ragnar library in R.
  • Use ragnar_store_create() to initialize a store, specifying the location and embedding function.
    • You can use open-source models via embed_ollama(), commercial providers via embed_openai(), embed_google_vertex(), embed_bedrock(), or embed_databricks().
    • Once created, the embedding provider is fixed for that store, but you can always create a new store with a different provider if needed.

Converting and Collecting Documents

  • Start by collecting all documents you want to include using list.files() for local files or ragnar_find_links() for websites.
  • Convert each document to markdown format using read_as_markdown(), which handles various formats including PDF, DOCX, PPTX, HTML, and more.
    • Markdown is preferred because it’s plain text, keeps token counts low, and works well for both humans and LLMs.
    • The function returns a MarkdownDocument object containing normalized markdown text with origin metadata.

Building the Store Index
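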

  • Insert processed chunks into the store using ragnar_store_insert(), which automatically generates embeddings using the function specified during store creation.
  • Repeat the read-chunk-insert pipeline for every document in your collection.
  • Once all documents are processed, call ragnar_store_build_index() to finalize the store and build the search index.
  • At this point, the store is ready for retrieval operations.
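The read-chunk-insert pipeline described above can be sketched as a loop over a local folder (the folder name and file pattern are illustrative assumptions):

```r
# Assumes `store` was created earlier with ragnar_store_create()
paths <- list.files("documents/", pattern = "\\.pdf$", full.names = TRUE)

for (path in paths) {
  chunks <- read_as_markdown(path) |>   # convert each document to markdown
    markdown_chunk()                    # cut it into chunks
  ragnar_store_insert(store, chunks)    # embed and insert the chunks
}

ragnar_store_build_index(store)         # finalize the store and build the index
```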

Retrieving Content with Dual Search Methods

  • Use ragnar_retrieve() to search the store using both vector similarity search (VSS) and BM25 keyword matching.
    • VSS retrieves chunks with embeddings similar to the query embedding, enabling semantic search for conceptually related content even with different words.
    • BM25 uses conventional text search with techniques like stemming and term frequency to find content containing specific words or phrases.
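Retrieval can also be called directly, outside of a chat session. A minimal sketch, assuming a store built as in the steps above (the query and the top_k value are illustrative):

```r
# Returns the best-matching chunks, combining VSS and BM25 results
relevant <- ragnar_retrieve(store, "What is attenuation bias?", top_k = 3)
relevant$text   # inspect the retrieved chunk texts
```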

Registering RAG as an LLM Tool

  • Finally, register ragnar_retrieve() as a tool with ellmer’s Chat object using ragnar_register_tool_retrieve().
    • This allows the LLM to rephrase unclear questions, ask follow-up questions, or search multiple times if needed.
  • The registered tool is intentionally simple, requiring only a query string to minimize opportunities for LLM errors.
  • This tool-based approach leverages the LLM’s ability to orchestrate complex information-seeking workflows.

Example: RAG in Action

Example: RAG

set.seed(123)
library(ragnar)
# Define a store by a location and an embeddings model
store_location <- "bas.ragnar.duckdb"
store <- ragnar_store_create(
  store_location,
  embed = \(x) ragnar::embed_ollama(x, model = "embeddinggemma")
)

# Add an item to the store
path <- "http://basmachielsen.nl/posts/measurement_error"

## Read the webpage and cut it into chunks
chunks <- read_as_markdown(path) |>
  markdown_chunk()

## Insert these chunks in the store
ragnar_store_insert(store, chunks)

## Write the store
ragnar_store_build_index(store)

# Initialize the LLM
client <- ellmer::chat_ollama(
  model = "llama3.1", 
  system_prompt = "You answer questions about econometrics based on a knowledge store. Always call the knowledge store, and answer concisely, in two or three sentences in text.",
  params = ellmer::params(temperature = 0))  # ellmer:: needed since only ragnar is attached

# Register the tool with the LLM
ragnar_register_tool_retrieve(
  client, store, top_k = 1,
  description = "Knowledge about measurement error."
)

# Ask a question to be answered on the basis of the knowledge in the store
client$chat("What is measurement error?")
Measurement error is a type of error that occurs when the independent variable 
in a regression analysis is not perfectly measured. It can be represented as X 
= X* + e, where X is the observed variable, X* is the true but unobservable 
variable, and e is the measurement error. This error can lead to attenuation 
bias, which causes the estimated slope coefficient to be biased towards zero.

Practical Considerations

Practical Considerations: Cost

  • LLMs charge per token used (both input tokens you send and output tokens they return).
    • State-of-the-art models cost around $2-3 per million input tokens and $10-15 per million output tokens.
    • Cheaper models can cost as little as $0.10-$0.40 per million tokens.
  • Even $10 of credit gives you plenty of room for experimentation, especially with cheaper models.
  • Because each turn resends the full conversation history as input, costs grow roughly quadratically with conversation length, so shorter conversations save money.
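To make these rates concrete, a back-of-the-envelope cost calculation in R (the token counts and per-million prices are illustrative assumptions, not quotes from any provider):

```r
# Hypothetical batch job: 500 documents, ~2,000 input and ~300 output tokens each
input_tokens  <- 500 * 2000   # 1,000,000 input tokens
output_tokens <- 500 * 300    #   150,000 output tokens

# Assumed rates: $2.50 per million input tokens, $10 per million output tokens
cost <- input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10
cost
#> [1] 4
```

Even a sizable extraction job like this one stays in the single-digit-dollar range, which is why modest credit goes a long way.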

Practical Considerations: Accuracy

  • LLMs are not perfect—they can make mistakes or “hallucinate” incorrect information.
    • Avoid using LLMs where 100% accuracy is critical without verification.
    • However, for many tasks, an 80% accurate solution is still extremely valuable and saves time.
  • Always review and verify the output, especially for important analyses or decisions.
  • Think of LLMs as helpful assistants that need supervision, not as infallible experts.

Getting Started with Your First Prompt

  • Start simple: ask the LLM to explain a basic concept or help you write simple code.
    • Gradually increase complexity as you become more comfortable with how the model responds.
    • Pay attention to what works and what doesn’t—this will help you write better prompts over time.
  • Don’t hesitate to start a new conversation if you’re not getting good results—sometimes a fresh start with a better prompt works better than a long back-and-forth.
    • Remember: prompt engineering is a skill that improves with practice.

Recapitulation

Recapitulation

  • As economics students, you’ll work with diverse data sources—many in unstructured formats.
  • LLMs can help you transform text data (news articles, policy documents, survey responses) into analyzable datasets.
  • They can accelerate your learning of R by explaining code, debugging errors, and suggesting solutions.
  • Combining traditional statistical skills with LLM capabilities gives you a powerful toolkit for modern economic analysis.
  • The ability to engineer effective prompts and extract structured data from text is becoming an essential skill in data-driven fields.