Introduction to Applied Data Science

Lecture 6: Prompt Engineering and Structured Data

Bas Machielsen

Overview

Course Schedule

Event Date Subject
Lecture 1 21-04 Introduction to Data and Data Science
Lecture 2 28-04 Getting Data: APIs and Databases
Lecture 3 07-05 Getting Data: Web Scraping
Lecture 4 26-05 Text as Data
Lecture 5 27-05 Introduction to LLMs
Lecture 6 09-06 Prompt Engineering and Structured Data
Lecture 7 16-06 Spatial Data and Geocomputation

Outline Today

  • First part: introduction to prompts and prompt engineering.
  • Second part: using LLMs to extract structured data from unstructured text.

LLM Fundamentals

The ellmer Package

  • First, if you haven’t done so already, install and load the ellmer package in R.
  • Create a chat session by specifying which LLM provider and model you want to use.
  • Send prompts to the model using the chat function and receive responses.
  • You can see how many tokens you’ve used and the cost of your conversation by printing the chat object.
library(ellmer)
chat <- chat_ollama(model="gemma3")
chat$chat("What is the capital of the Netherlands?")
The capital of the Netherlands is **Amsterdam**. 

It’s the largest city and the political, cultural, and economic center of the 
country. 😊 

Do you want to know anything more about Amsterdam or the Netherlands?
print(chat$get_tokens())
# A tibble: 1 × 5
  input output cached_input cost       input_preview                            
  <dbl>  <dbl>        <dbl> <ellmr_dl> <chr>                                    
1    17     47            0 NA         Text[What is the capital of the Netherla…

Why Use LLMs in Data Science?

  • LLMs can help you write code faster, even if you’re just learning to program.
  • They excel at extracting structured information from messy, unstructured text data.
  • They can explain code you don’t understand or help debug errors.
  • They allow you to quickly prototype solutions to problems that might otherwise require specialized tools.
  • Even a 90% correct solution from an LLM can save you significant time compared to starting from scratch.
  • They can extract structured data (.json, .csv, etc.) from unstructured text (articles, reviews, documents).

How to Interact with LLMs?

  • It all starts with a prompt, which is the text (typically a question or a request) that you send to the LLM.
  • This starts a conversation, a sequence of turns that alternate between user prompts and model responses.
    • Inside the model, both the prompt and response are represented by a sequence of tokens, which represent either individual words or subcomponents of a word.
    • The tokens are used to compute the cost of using a model and to measure the size of the context, the combination of the current prompt and any previous prompts and responses used to generate the next response.

Providers and Models

  • The ellmer package is organized on the basis of functions: one function for each provider. Within each function, you can specify which model to use.
    • A provider is a web API that gives access to one or more models.
    • The distinction is a bit subtle because providers are often synonymous with a model, like OpenAI and GPT, Anthropic and Claude, and Google and Gemini.
    • But other providers, like Ollama, can host many different models, typically open source models like LLaMa and Mistral.
    • Still other providers support both open and closed models, typically by partnering with a company that provides a popular closed model.
    • For example, Azure OpenAI offers both open source models and OpenAI’s GPT, while AWS Bedrock offers both open source models and Anthropic’s Claude.

Prompts and Conversations

  • A prompt is the text (usually a question or request) that you send to the LLM.
    • A conversation is a sequence of turns that alternate between your prompts and the model’s responses.
    • Each request to the LLM includes not only your current prompt but also all previous prompts and responses in the conversation.
    • This means conversations get more expensive as they get longer, so it’s best to keep them focused and concise.
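
The growing-context behaviour can be seen in a short sketch (assuming a local Ollama install with the gemma3 model, as used elsewhere in these slides):

```r
library(ellmer)

# Assumes a local Ollama install with the gemma3 model pulled
chat <- chat_ollama(model = "gemma3")

# Turn 1: this prompt and its response become part of the context
chat$chat("Name one province of the Netherlands.")

# Turn 2: "its" only makes sense because turn 1 is resent along with it
chat$chat("What is its capital?")

# Token counts grow with every turn, and so does the cost
print(chat$get_tokens())
```

Starting a fresh chat object resets the context (and the running cost) to zero.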

Tokens

  • LLMs don’t process words directly—they convert text into tokens (which represent words or parts of words).
  • On average, one English word equals about 1.5 tokens.
  • A typical page of text might use 375-400 tokens, while a complete book could use 75,000-150,000 tokens.
  • Tokens matter because they determine the cost of using an LLM and how much context the model can process at once.
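
As a rough illustration of the 1.5 tokens-per-word heuristic above (real tokenizers differ by model, so treat this purely as a back-of-envelope estimate):

```r
# Back-of-envelope token estimate using the ~1.5 tokens/word rule of thumb
estimate_tokens <- function(text) {
  n_words <- length(strsplit(text, "\\s+")[[1]])
  ceiling(n_words * 1.5)
}

estimate_tokens("What is the capital of the Netherlands?")  # 7 words -> ~11 tokens
```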

System Prompts

  • A system prompt is set when you start a conversation and affects every response from the model.
    • It’s used to give the LLM additional instructions that shape its behavior.
    • For example, you might ask it to always respond in a specific programming style or to assume you’re a beginner.
  • When programming with LLMs, you should work on improving your system prompt to get better results.

Example: System Prompt

  • The following example shows how the system prompt influences the LLM’s output.

Example: System Prompts and Conversations

library(ellmer)
chat <- chat_ollama(
  model = "gemma3",
  system_prompt = "You are an expert R programmer who writes clean, efficient, and well-commented code. Return only code, no explanations."
)
chat$chat("Write a function that calculates the Fibonacci sequence.")
```R
#' Calculate the Fibonacci sequence up to a specified length.
#'
#' This function generates a vector containing the Fibonacci sequence up to a
#' specified number of terms.
#'
#' @param n The number of terms in the Fibonacci sequence to generate.
#'          Must be a non-negative integer.
#' @param start_index An integer indicating the starting index (1-based) of the
#'                    sequence.
#'                    Defaults to 1.
#'
#' @return A numeric vector containing the Fibonacci sequence.  Returns an
#'         empty vector if n is 0.
#'
#' @examples
#' fibonacci_sequence(5)
#' fibonacci_sequence(10, start_index = 2)
#'
#' @export
fibonacci_sequence <- function(n, start_index = 1) {
  # Input validation
  if (!is.numeric(n) || n <= 0 || n != round(n)) {
    stop("n must be a positive integer.")
  }

  if (!is.numeric(start_index) || start_index != round(start_index)) {
    stop("start_index must be a integer")
  }
  

  # Initialize the sequence
  if (n == 0) {
    return(numeric(0))
  }

  sequence <- vector(length = n, mode = "numeric")

  # Base cases
  if (start_index == 1) {
    sequence[1] <- 0
    sequence[2] <- 1
  } else {
    sequence[1] <- 1
    sequence[2] <- 1
  }

  # Generate the rest of the sequence
  for (i in 3:n) {
    sequence[i] <- sequence[i - 1] + sequence[i - 2]
  }

  return(sequence)
}
```

User Prompt

  • A user prompt is the specific request or question you send to the LLM during a conversation.
    • This is where you describe the task you want the model to perform.
    • The quality of your user prompt greatly influences the quality of the model’s response.
    • Effective user prompts are clear, specific, and provide necessary context.

Treat AI like an infinitely patient new coworker who forgets everything you tell them each new conversation, one that comes highly recommended but whose actual abilities are not that clear. Two parts of this are analogous to working with humans (being new on the job and being a coworker) and two of them are very alien (forgetting everything and being infinitely patient). We should start with where AIs are closest to humans, because that is the key to good-enough prompting.

Local vs. Cloud LLMs

Local vs. Cloud LLMs

  • So far, we have used the chat_ollama() function from the ellmer package, allowing us to deploy local LLMs hosted on our own machine using the Ollama app.
  • Local LLMs have several advantages:
    • No data leaves your machine, enhancing privacy and security.
    • No ongoing costs—once you have the model, you can use it without paying per token.
  • However, for some tasks, cloud-based LLMs (like OpenAI’s GPT-4 or Anthropic’s Claude) may offer better performance or capabilities.
  • The ellmer package supports both local and cloud LLMs, allowing you to choose the best option for your needs.

Setting Up Cloud LLMs

  • To use cloud-based LLMs, you’ll typically need to set up an account with the provider (e.g., OpenAI, Anthropic).
    • You must generate credentials on the OpenAI website.
    • Note that the API is a developer product and is billed separately from a standard “ChatGPT Plus” subscription.
    • You usually need to buy “credits” (pre-paid billing) to use the API.
    1. Go to the Platform: Navigate to platform.openai.com in your web browser.
    2. Sign Up/Log In: Create a new account or log in with your existing OpenAI credentials.
    3. Set Up Billing:
      • On the left sidebar (or under the Settings gear icon), look for Billing.
      • Add a payment method and purchase a small amount of credits (e.g., $5 or $10) to get started.
      • Without credits, the API will likely return an error.

Creating API Key

  1. Create the Key:
    • Go to Dashboard or click the API Keys section (often found under “Your Profile” or “Project” settings on the left).
    • Click + Create new secret key.
    • Give it a name (e.g., “R-ellmer-project”).
    • Important: Copy the key string (it starts with sk-...) immediately. You will not be able to see it again once you close that window.

Setting Up API Key in R

  • The usethis package provides a helper function to edit your .Renviron file.
    • This is a text file that runs every time R starts, allowing you to save secrets (like API keys) securely so you don’t have to type them into your scripts.
    • Run the following command in your R console: usethis::edit_r_environ()
    • A text editor will open displaying your .Renviron file (it might be empty).
    • Add a new line with the following format, pasting the key you copied from the website:
    • OPENAI_API_KEY=sk-proj-123456789...
    • Save and close the file.
    • Restart your R session to load the new environment variable.
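
To check that the key was picked up after the restart, you can read it back with Sys.getenv() (printing only a prefix so the full key never ends up in your console history):

```r
# Confirm the environment variable is visible to the current R session
key <- Sys.getenv("OPENAI_API_KEY")
if (nzchar(key)) {
  cat("Key loaded:", substr(key, 1, 7), "...\n")
} else {
  cat("OPENAI_API_KEY not found - check .Renviron and restart R.\n")
}
```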

Using Cloud LLMs in ellmer

  • The ellmer package is designed to look for specific environment variables automatically.
    • Because you named your variable OPENAI_API_KEY in the previous step, ellmer will find it without you needing to copy-paste it into your code.
    • You simply initialize the chat object. You do not need to provide the key as an argument.
library(ellmer)

# Create a chat object
# ellmer automatically finds OPENAI_API_KEY in the background
chat <- chat_openai(model = "gpt-4o")

# Send a message
chat$chat("Hello, tell me a short fun fact about the R programming language.")
The R programming language was named partly as a play on the name of its 
predecessors, "S" and "S-PLUS." The creators of R, Ross Ihaka and Robert 
Gentleman, both have first names starting with "R," which also influenced the 
naming.

Prompt Engineering

What is Prompt Engineering?

  • Prompt engineering is the art and science of writing effective prompts to get the best responses from an LLM.
  • A well-designed prompt is clear, detailed, and gives the model enough context to understand what you need.
  • Good prompts often include examples of what you want (showing the model what “good” looks like).
  • Iterating and refining your prompts is key—your first attempt rarely produces the perfect result.

Best Practices

  • Be specific and detailed about what you want the LLM to do.
  • Provide examples when possible (e.g., “convert this code to use tidyverse style, like this example…”).
  • Break complex tasks into smaller steps and ask the LLM to work through them one at a time.
  • Specify the format you want for the output (e.g., “respond with only R code” or “give me a bullet-point list”).
  • Don’t be afraid to experiment—if a prompt doesn’t work, try rephrasing it or adding more context.
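
To make these practices concrete, here is a sketch of a vague prompt rewritten into a specific one (the function in the prompt is a made-up example; assumes a local Ollama install with gemma3):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

# Vague: the model has to guess the task, the style, and the output format
vague <- "Fix my code."

# Specific: states the task, the constraints, and the desired output format
specific <- paste(
  "Rewrite the R function below in tidyverse style.",
  "Keep the same behaviour. Respond with only R code, no explanations.",
  "Function: mean_by_group <- function(df) aggregate(y ~ g, df, mean)"
)
chat$chat(specific)
```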

Structured Data

Structured Data Extraction: The Problem

  • Much of the world’s data exists in unstructured text—emails, customer reviews, social media posts, documents.
  • Traditional data analysis requires data in structured formats (like tables with rows and columns).
  • Converting unstructured text to structured data manually is extremely time-consuming.
  • This is where LLMs can be incredibly powerful—they’re excellent at finding patterns and extracting information from text.

Why Not Prompt Engineering?

  • When using an LLM to extract data from text or images, you can ask the chatbot to format it in JSON or any other format that you like.
    • This works well most of the time, but there’s no guarantee that you’ll get the exact format you want.
    • In particular, if you’re trying to get JSON, you’ll find that it’s typically surrounded in ```json, and you’ll occasionally get text that isn’t valid JSON.
    • To avoid these problems, you can use a recent LLM feature: structured data (aka structured output).
    • With structured data, you supply the type specification that defines the object structure you want and the LLM ensures that’s what you’ll get back.

Key Examples of Structured Data

  • LLMs are highly effective at converting unstructured data into structured formats.
    • Although they aren’t infallible, they automate the heavy lifting of information extraction, drastically cutting down on manual processing.

Examples: Structured Data

  1. Article summaries: Extract key points from lengthy reports or articles to create concise summaries for decision-makers.
  2. Entity recognition: Identify and extract entities such as names, dates, and locations from unstructured text to create structured datasets.
  3. Sentiment analysis: Extract sentiment scores and associated entities from customer reviews or social media posts to gain insights into public opinion.
  4. Classification: Classify text into predefined categories, such as spam detection or topic classification.
  5. Image/PDF input: Extract data from images or PDFs, such as tables or forms, to automate data entry processes.

Structured Data Basics

  • To extract structured data, call $chat_structured() instead of $chat().
    • You’ll also need to define a type specification that describes the structure of the data that you want (more on that shortly). Here’s a simple example that extracts two specific values from a string:
library(ellmer)
chat <- chat_ollama(model="gemma3")
chat$chat_structured(
  "My name is Susan and I'm 13 years old",
  type = type_object(
    name = type_string(),
    age = type_number()
  )
)
$name
[1] "Susan"

$age
[1] 13

Structured Data Extraction: How It Works

  • You provide the LLM with unstructured text and tell it exactly what structured information you want to extract.
  • The LLM reads the text and returns the information in the format you specified (e.g., a data.frame or list).
  • In ellmer, you can request data in specific formats that are easy to work with in R.
  • Even if the LLM isn’t 100% accurate, it often gets you 80-90% of the way there, which you can then verify or correct.

Structured Data Schemas

  • A type specification tells the LLM exactly what you’re looking for when dealing with structured data.
  • It guarantees that the LLM will only return JSON, and that the JSON will have the fields that you expect.
    • ellmer will convert it into an R data structure (like a data.frame)
  • The type_() functions define your desired data schema, and they fall into three main categories: scalars, arrays, and objects. We start with scalars.
    • Scalars represent single values and include type_boolean(), type_integer(), type_number(), type_string(), and type_enum().
    • These correspond to single logical, integer, double, string, and factor values in R respectively.
    • Each type function accepts a description argument that tells the LLM what the data represents, which helps guide extraction accuracy.
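
A brief sketch combining scalar types with descriptions (the review text is invented; assumes a local Ollama install with gemma3; type_enum() can be used similarly to restrict a field to fixed levels):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

# The description strings steer what the model extracts for each field
chat$chat_structured(
  "Terrible battery life, but the screen is gorgeous. 2 out of 5 stars.",
  type = type_object(
    sentiment = type_string("Overall tone: positive, negative, or mixed"),
    stars = type_integer("Star rating mentioned in the text, from 1 to 5")
  )
)
```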

Arrays: Vectors and Lists

  • Arrays represent collections of values of the same type and are created with type_array().
    • These collections can be of undetermined length, meaning the LLM can return an arbitrary number of “matches” found in the input data.
    • The item argument specifies what type each element should be, allowing you to create structures similar to R’s atomic vectors.
    • For example, type_array(type_string()) creates a character vector, while type_array(type_integer()) creates an integer vector.
    • You can also nest arrays within arrays to create list-like structures with well-defined types.
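
A minimal sketch of an array extraction (the sentence is invented; assumes a local Ollama install with gemma3):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

# An array of strings comes back as a character vector in R
chat$chat_structured(
  "We visited Utrecht, Groningen, and Maastricht last summer.",
  type = type_object(
    cities = type_array(type_string())
  )
)
```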

Objects: Named Lists

  • Objects represent collections of named values and are created with type_object().
  • Objects can contain any combination of scalars, arrays, and other objects, making them similar to named lists in R.
  • For example, type_object(name = type_string(), age = type_integer(), hobbies = type_array(type_string())) defines a person object with a name, age, and list of hobbies.
    • The first argument to type_object() should be a description of what the object represents.

Creating Data Frames

  • To extract data frames from a single prompt, use type_array(type_object(...)) where the object defines each row.
    • For example, extracting multiple people from text requires type_array(type_object(name = type_string(), age = type_integer())) to ensure all fields align into a proper data frame.
    • This row-oriented approach matches how most languages handle tabular data.

Example: Data Frame

library(ellmer)
prompt <- r"(
* John Smith. Age: 30. Height: 180 cm. Weight: 80 kg.
* Jane Doe. Age: 25. Height: 5'5". Weight: 110 lb.
* Jose Rodriguez. Age: 40. Height: 190 cm. Weight: 90 kg.
* June Lee | Age: 35 | Height 175 cm | Weight: 70 kg
)"

type_people <- type_array(
  type_object(
    name = type_string(),
    age = type_integer(),
    height = type_number("in m"),
    weight = type_number("in kg")
  )
)

chat <- chat_ollama(model="gemma3")
chat$chat_structured(prompt, type = type_people)
# A tibble: 4 × 4
  name             age height weight
  <chr>          <int>  <dbl>  <dbl>
1 John Smith        30  180       80
2 Jane Doe          25    5.5    110
3 Jose Rodriguez    40  190       90
4 June Lee          35  175       70

Multi-Prompt Extraction

  • If you need to extract data from multiple prompts, you can use parallel_chat_structured(). It takes the same arguments as $chat_structured() with two exceptions:
    • It needs a chat object since it’s a standalone function, not a method, and it can take a vector of prompts.

Example: Multi-item Extraction

library(ellmer)
prompts <- list(
  "I go by Alex. 42 years on this planet and counting.",
  "Pleased to meet you! I'm Jamal, age 27.",
  "They call me Li Wei. Nineteen years young.",
  "Fatima here. Just celebrated my 35th birthday last week.",
  "The name's Robert - 51 years old and proud of it.",
  "Kwame here - just hit the big 5-0 this year."
)
type_person <- type_object(
  name = type_string(),
  age = type_number()
)
chat <- chat_ollama(model = "gemma3")
parallel_chat_structured(chat, prompts, type = type_person)
# A tibble: 6 × 2
  name     age
  <chr>  <dbl>
1 Alex      42
2 Jamal     27
3 Li Wei    19
4 Fatima    35
5 Robert    51
6 Kwame     50

Handling Missing Values

  • By default, type_ functions use required = TRUE, which can lead to hallucinations when data doesn’t exist in the source.
    • Setting required = FALSE allows fields to be missing, resulting in NA values in R when the LLM cannot find the requested information.
    • This is crucial for robust extraction where not all fields may be present in every input.
    • Always test with both positive examples (containing the data) and negative examples (missing the data) to validate your extraction logic.
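
A sketch of the positive/negative test pattern described above (invented example texts; assumes a local Ollama install with gemma3):

```r
library(ellmer)
chat <- chat_ollama(model = "gemma3")

type_person <- type_object(
  name = type_string(),
  # Age may simply not be mentioned; required = FALSE lets the model
  # leave the field empty instead of inventing a value
  age = type_number(required = FALSE)
)

# Positive example: age is present in the text
chat$chat_structured("I'm Noor and I'm 29 years old.", type = type_person)

# Negative example: age is absent, so it should come back as NA
chat$chat_structured("Hi there, my name is Pieter.", type = type_person)
```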

Tool Calling

Introduction to Tool Calling

  • Tool calling allows chat models to request the execution of external functions that you define and provide.
    • When making a chat request, you advertise one or more tools (defined by name, description, and arguments), and the model can respond with tool call requests.
    • These are requests from the model to you to execute a function with given arguments, not direct executions by the model itself.
    • You execute the functions and return results by submitting another chat request with the conversation history plus the results, allowing the model to use those results or make additional tool calls.

What Does Tool Calling Do?

  • The chat model does not directly execute any external tools.
    • The model only makes requests for the caller to execute them on its behalf.
  • The correct flow is:
    • The user sends a request.
    • The assistant determines a tool is needed and requests that the user run it.
    • The user executes the tool and returns the result.
    • The assistant uses that result to generate the final answer.
  • The value the chat model provides is knowing when to call a tool, what arguments to pass, and how to interpret the results in formulating its response.

Creating Tools

  • To create a tool, you first write a regular R function, then wrap it with tool() to provide metadata the model needs.
    • The tool() function requires a name, description, and an arguments list that uses type functions like type_string(), type_integer(), etc.

Example: Creating Tools

get_current_time <- function(tz = "UTC") {
  format(Sys.time(), tz = tz, usetz = TRUE)
}

get_current_time <- tool(get_current_time, 
  name = "get_current_time", 
  description = "Returns the current time.", 
  arguments = list(tz = type_string("Time zone to display...", required = FALSE))
  )

Automatic Tool Generation

  • Writing tool definitions manually can be tedious, so ellmer provides create_tool_def() to automatically generate the tool() call for you.
    • This function uses an LLM to analyze your function and generate the appropriate metadata.
    • While create_tool_def() is a significant time-saver, you must review the generated code before using it to ensure accuracy.
    • Once created, a tool is still just a special type of function that you can call directly.
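
A sketch of how this might look (the exact arguments of create_tool_def() may differ between ellmer versions, so check ?create_tool_def, and always review the generated code before using it):

```r
library(ellmer)

# A plain R function we want to expose as a tool
get_current_time <- function(tz = "UTC") {
  format(Sys.time(), tz = tz, usetz = TRUE)
}

# Ask an LLM to draft the tool() wrapper; the output is code to review,
# copy into your script, and correct where needed - not to run blindly
create_tool_def(get_current_time, chat = chat_ollama(model = "gemma3"))
```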

Registering Tools

  • To give your chat object access to a tool, use the $register_tool() method with your tool function.
    • Once registered, the model automatically decides when to use the tool based on user queries, with no further guidance needed.
    • For example, after registering a get_current_time tool, asking “How long ago did Neil Armstrong land on the moon?” will cause the model to call the tool to get the current date before calculating the answer.
    • You can inspect the chat history to see where and how the model used the tool.
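
Putting registration together with the moon-landing question from the slide (assumes a local Ollama install with gemma3, and reuses the get_current_time tool defined earlier in these slides):

```r
library(ellmer)

# The tool from the earlier slide, wrapped inline
get_current_time <- tool(
  function(tz = "UTC") format(Sys.time(), tz = tz, usetz = TRUE),
  name = "get_current_time",
  description = "Returns the current time.",
  arguments = list(tz = type_string("Time zone to display...", required = FALSE))
)

chat <- chat_ollama(model = "gemma3")
chat$register_tool(get_current_time)

# The model decides on its own to call the tool for the current date
chat$chat("How long ago did Neil Armstrong land on the moon?")

# The printed history shows the tool request and tool result turns
print(chat)
```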

Tool Engineering

  • When designing tools, include guidance in the description to optimize model behavior, such as “For efficiency, request all weather updates using a single tool call” to encourage batch processing.
  • Tools can perform various actions: calling APIs, reading/writing databases, running simulations, calling other AI models.
  • The key is keeping the interface simple while enabling powerful capabilities, allowing the model to orchestrate complex workflows through multiple tool calls when needed.

Example Tool Calling

  • The following is an example using the 2025 Africa Cup of Nations (a football tournament played in December 2025 and January 2026), which was won by Senegal.
    • Most LLMs have a knowledge cut-off somewhere in 2025, meaning they do not “know” the winner yet.

Example: Tool Calling

library(ellmer)
library(rvest)
library(httr2)

# First, let's try WITHOUT a tool - the LLM won't know the answer
chat_no_tool <- chat_ollama(model = "llama3.2:3b")
chat_no_tool$chat("Who won the Africa Cup of Nations in 2025-2026?")
I don't have information on the winner of the Africa Cup of Nations for the 
2025-2026 tournament as my knowledge cutoff is 01 March 2023, and I do not have
real-time information. Is there something else I can help you with?
# Tool that searches Wikipedia first, then fetches the best match
fetch_wikipedia <- tool(
  function(search_query) {
    # Step 1: Use Wikipedia search API to find the correct page title
    search_url <- paste0(
      "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=",
      URLencode(search_query),
      "&format=json&srlimit=1"
    )
    search_result <- request(search_url) |>
      req_perform() |>
      resp_body_json()

    if (length(search_result$query$search) == 0) {
      return("No Wikipedia page found for that query.")
    }

    # Get the top result's title
    page_title <- search_result$query$search[[1]]$title

    # Step 2: Fetch the actual Wikipedia page
    url <- paste0("https://en.wikipedia.org/wiki/", URLencode(page_title, reserved = TRUE))
    paragraphs <- read_html(url) |>
      html_elements("#mw-content-text p") |>
      html_text2()

    # Get the first 5 substantial paragraphs (filter out very short ones)
    substantial_paragraphs <- paragraphs[nchar(paragraphs) > 50]
    content <- paste(head(substantial_paragraphs, 5), collapse = "\n\n")

    return(paste0("Page found: '", page_title, "'\n\n", content))
  },
  name = "fetch_wikipedia",
  description = "Searches Wikipedia and fetches content from the best matching page. Returns the main text content from the article.",
  arguments = list(
    search_query = type_string(
      "A natural language search query to find the right Wikipedia page (e.g., '2025 Africa Cup of Nations'). Do NOT use underscores — write it naturally."
    )
  )
)

# Now create a chat WITH the tool
chat_with_tool <- chat_ollama(
  model = "llama3.1",
  system_prompt = "You have access to a tool called fetch_wikipedia that searches Wikipedia and fetches content. Use this tool whenever you need to look up factual information. Use it as few times as possible."
)
chat_with_tool$register_tool(fetch_wikipedia)
chat_with_tool$chat("Who won the Africa Cup of Nations in 2025-2026?")
The winner of the Africa Cup of Nations in 2025-2026 was Senegal. They defeated
hosts Morocco 1–0 in the final after extra time.
# Examine the chat history to see the tool call
print(chat_with_tool)
<Chat Ollama/llama3.1 turns=5 input=1029 output=62>
── system ──────────────────────────────────────────────────────────────────────
You have access to a tool called fetch_wikipedia that searches Wikipedia and fetches content. Use this tool whenever you need to look up factual information. Use it as few times as possible.
── user ────────────────────────────────────────────────────────────────────────
Who won the Africa Cup of Nations in 2025-2026?
── assistant [input=250 output=27] ─────────────────────────────────────────────

[tool request (call_t8i7l6a2)]: fetch_wikipedia(search_query = "2025-26 Africa Cup of Nations winner")
── user ────────────────────────────────────────────────────────────────────────
[tool result  (call_t8i7l6a2)]:
Page found: '2025 Africa Cup of Nations'

The 2025 Africa Cup of Nations, known in short as the 2025 AFCON or CAN 2025 and for sponsorship purposes as the TotalEnergies 2025 Africa Cup of Nations, was the 35th edition of the biennial Africa Cup of Nations tournament organised by the Confederation of African Football (CAF).[2] It was the second edition hosted by Morocco after 1988.[3] Morocco was originally scheduled to host the 2015 edition, but withdrew due to fears stemming from the Western African Ebola virus epidemic.[4]

Due to FIFA expanding its Club World Cup competition to 32 teams and having it scheduled for June and July 2025, this edition of the tournament was played between 21 December 2025 and 18 January 2026.[5] It was the first time that the tournament was played over the Christmas and New Year period. The situation was further complicated by the addition of two extra match days scheduled for the last two weeks of January in the expanded 2025–26 UEFA Champions League season.[6]

This edition of the tournament was scheduled to be the second after 2019 to take place during the northern hemisphere's summer (June–July), in order to reduce scheduling conflicts with European club teams and competitions;[7] the previous 2023 edition was moved to January and February 2024 owing to the adverse summer weather conditions in Ivory Coast.[8]Guinea was originally set as hosts for this edition of the tournament, but had its hosting rights stripped after affirming its inadequacy of hosting preparations.[9][10] After a second bidding process,[11] Morocco was named as the new hosts on 27 September 2023.[12]

Defending champions Ivory Coast were eliminated in the quarter-finals by Egypt.[13]Senegal secured their second title after defeating hosts Morocco 1–0 in the final after extra time.[14]

CAF stripped Cameroon from hosting the 2019 edition of the tournament on 30 November 2018 due to lack of speed of progress in preparations,[15] but accepted former CAF president Ahmad Ahmad's request to stage the next edition in 2021. Consequently, the original hosts of 2021, Ivory Coast, became hosts of the 2023 edition with Guinea instead hosting the 2025 edition, which until then had no hosts.[16] The CAF President confirmed the timetable shift after a meeting with Ivorian president Alassane Ouattara in Abidjan, Ivory Coast on 30 January 2019.[17] On 30 September 2022, current CAF president Patrice Motsepe announced that Guinea had been stripped as host for the 2025 edition due to inadequacy and speed of progress in hosting preparations.[9] Consequently, a new process was re-opened for a replacement host bidder.[11][18] On 27 September 2023, the 2025 edition was awarded to Morocco[12] and the 2027 edition to Kenya, Tanzania, and Uganda.[19][20]
── assistant [input=779 output=35] ─────────────────────────────────────────────
The winner of the Africa Cup of Nations in 2025-2026 was Senegal. They defeated hosts Morocco 1–0 in the final after extra time.

Retrieval-Augmented Generation

Hallucination

  • Large language models produce fluent and confident responses, but they often generate information that is plausible yet incorrect.
    • These “hallucinations” occur because LLMs operate on text patterns without a concept of factual truth or falsehood.
    • LLMs generate text based solely on similarity to patterns in their training data, with no awareness of whether the output is accurate.
    • This fundamental limitation makes LLMs unreliable for tasks where accuracy matters.

RAG as a Solution

  • RAG addresses hallucinations by retrieving relevant excerpts from trusted, vetted sources before generating responses.
    • Instead of generating from memory, the LLM is asked to summarize, paraphrase, or answer questions using only the retrieved material.
    • This grounds responses in known content and significantly reduces hallucination risk.
    • RAG shifts the LLM’s role from open-ended generation to working with provided source material.
    • While RAG greatly reduces hallucinations, it doesn’t eliminate them entirely, so presenting links back to original sources allows users to verify details and check context.

Creating a Knowledge Store

  • The first stage of RAG is preparing a knowledge store—a database of processed content with embeddings.
  • This is done with the help of the ragnar library in R.
  • Use ragnar_store_create() to initialize a store, specifying the location and embedding function.
    • You can use open-source models via embed_ollama(), commercial providers via embed_openai(), embed_google_vertex(), embed_bedrock(), or embed_databricks().
    • Once created, the embedding provider is fixed for that store, but you can always create a new store with a different provider if needed.

Converting and Collecting Documents

  • Start by collecting all documents you want to include using list.files() for local files or ragnar_find_links() for websites.
  • Convert each document to markdown format using read_as_markdown(), which handles various formats including PDF, DOCX, PPTX, HTML, and more.
    • Markdown is preferred because it’s plain text, keeps token counts low, and works well for both humans and LLMs.
    • The function returns a MarkdownDocument object containing normalized markdown text with origin metadata.

Building the Store Index
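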

  • Insert processed chunks into the store using ragnar_store_insert(), which automatically generates embeddings using the function specified during store creation.
  • Repeat the read-chunk-insert pipeline for every document in your collection.
  • Once all documents are processed, call ragnar_store_build_index() to finalize the store and build the search index.
  • At this point, the store is ready for retrieval operations.
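The read-chunk-insert pipeline described above can be sketched as a loop over a local folder (the folder name and file pattern are illustrative assumptions):

```r
# Assumes `store` was created earlier with ragnar_store_create()
paths <- list.files("documents/", pattern = "\\.pdf$", full.names = TRUE)

for (path in paths) {
  chunks <- read_as_markdown(path) |>   # convert each document to markdown
    markdown_chunk()                    # cut it into chunks
  ragnar_store_insert(store, chunks)    # embed and insert the chunks
}

ragnar_store_build_index(store)         # finalize the store and build the index
```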

Retrieving Content with Dual Search Methods

  • Use ragnar_retrieve() to search the store using both vector similarity search (VSS) and BM25 keyword matching.
    • VSS retrieves chunks with embeddings similar to the query embedding, enabling semantic search for conceptually related content even with different words.
    • BM25 uses conventional text search with techniques like stemming and term frequency to find content containing specific words or phrases.
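Retrieval can also be called directly, outside of a chat session. A minimal sketch, assuming a store built as in the steps above (the query and the top_k value are illustrative):

```r
# Returns the best-matching chunks, combining VSS and BM25 results
relevant <- ragnar_retrieve(store, "What is attenuation bias?", top_k = 3)
relevant$text   # inspect the retrieved chunk texts
```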

Registering RAG as an LLM Tool

  • Finally, register ragnar_retrieve() as a tool with ellmer’s Chat object using ragnar_register_tool_retrieve().
    • This allows the LLM to rephrase unclear questions, ask follow-up questions, or search multiple times if needed.
  • The registered tool is intentionally simple, requiring only a query string to minimize opportunities for LLM errors.
  • This tool-based approach leverages the LLM’s ability to orchestrate complex information-seeking workflows.

Example: RAG in Action

Example: RAG

set.seed(123)
library(ragnar)
# Define a store by a location and an embeddings model
store_location <- "bas.ragnar.duckdb"
store <- ragnar_store_create(
  store_location,
  embed = \(x) ragnar::embed_ollama(x, model = "embeddinggemma")
)

# Add an item to the store
path <- "http://basmachielsen.nl/posts/measurement_error"

## Read the webpage and cut it into chunks
chunks <- read_as_markdown(path) |>
  markdown_chunk()

## Insert these chunks in the store
ragnar_store_insert(store, chunks)

## Write the store
ragnar_store_build_index(store)

# Initialize the LLM
client <- ellmer::chat_ollama(
  model = "llama3.1", 
  system_prompt = "You answer questions about econometrics based on a knowledge store. Always call the knowledge store, and answer concisely, in two or three sentences in text.",
  params = ellmer::params(temperature = 0))  # ellmer:: needed since only ragnar is attached

# Register the tool with the LLM
ragnar_register_tool_retrieve(
  client, store, top_k = 1,
  description = "Knowledge about measurement error."
)

# Ask a question to be answered on the basis of the knowledge in the store
client$chat("What is measurement error?")
Measurement error is a type of error that occurs when the independent variable 
in a regression analysis is not perfectly measured. It can be represented as X 
= X* + e, where X is the observed variable, X* is the true but unobservable 
variable, and e is the measurement error. This error can lead to attenuation 
bias, which causes the estimated slope coefficient to be biased towards zero.

Practical Considerations

Practical Considerations: Cost

  • LLMs charge per token used (both input tokens you send and output tokens they return).
    • State-of-the-art models cost around $2-3 per million input tokens and $10-15 per million output tokens.
    • Cheaper models can cost as little as $0.10-$0.40 per million tokens.
  • Even $10 of credit gives you plenty of room for experimentation, especially with cheaper models.
  • Because each turn resends the full conversation history as input, costs grow roughly quadratically with conversation length, so shorter conversations save money.
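To make these rates concrete, a back-of-the-envelope cost calculation in R (the token counts and per-million prices are illustrative assumptions, not quotes from any provider):

```r
# Hypothetical batch job: 500 documents, ~2,000 input and ~300 output tokens each
input_tokens  <- 500 * 2000   # 1,000,000 input tokens
output_tokens <- 500 * 300    #   150,000 output tokens

# Assumed rates: $2.50 per million input tokens, $10 per million output tokens
cost <- input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10
cost
#> [1] 4
```

Even a sizable extraction job like this one stays in the single-digit-dollar range, which is why modest credit goes a long way.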

Practical Considerations: Accuracy

  • LLMs are not perfect—they can make mistakes or “hallucinate” incorrect information.
    • Avoid using LLMs where 100% accuracy is critical without verification.
    • However, for many tasks, an 80% accurate solution is still extremely valuable and saves time.
  • Always review and verify the output, especially for important analyses or decisions.
  • Think of LLMs as helpful assistants that need supervision, not as infallible experts.

Getting Started with Your First Prompt

  • Start simple: ask the LLM to explain a basic concept or help you write simple code.
    • Gradually increase complexity as you become more comfortable with how the model responds.
    • Pay attention to what works and what doesn’t—this will help you write better prompts over time.
  • Don’t hesitate to start a new conversation if you’re not getting good results—sometimes a fresh start with a better prompt works better than a long back-and-forth.
    • Remember: prompt engineering is a skill that improves with practice.

Recapitulation

Recapitulation

  • As economics students, you’ll work with diverse data sources—many in unstructured formats.
  • LLMs can help you transform text data (news articles, policy documents, survey responses) into analyzable datasets.
  • They can accelerate your learning of R by explaining code, debugging errors, and suggesting solutions.
  • Combining traditional statistical skills with LLM capabilities gives you a powerful toolkit for modern economic analysis.
  • The ability to engineer effective prompts and extract structured data from text is becoming an essential skill in data-driven fields.