Introduction to Applied Data Science

Lecture 5: Introduction to LLMs

Bas Machielsen

Overview

Course Schedule

Event Date Subject
Lecture 1 21-04 Introduction to Data and Data Science
Lecture 2 28-04 Getting Data: APIs and Databases
Lecture 3 07-05 Getting Data: Web Scraping
Lecture 4 26-05 Text as Data
Lecture 5 27-05 Introduction to LLMs
Lecture 6 09-06 Prompt Engineering and Structured Data
Lecture 7 16-06 Spatial Data and Geocomputation

Outline Today

  • First part: introduction to embeddings on a more technical level.
  • Second part: transformers, attention mechanisms, and interaction with LLMs through R.

Introduction to LLMs

What Are Large Language Models?

  • Large Language Models (LLMs) are neural networks trained to predict and generate text.
    • They learn patterns from vast amounts of text data, such as books, articles, and websites.
    • Unlike traditional text mining methods that count words or find patterns, LLMs understand context and relationships between words.
    • They can complete sentences, answer questions, translate languages, and even write code.
    • Examples include GPT-4, Claude, and Gemini, which power many AI applications you use daily.

Predicting the Next Word

  • Imagine you read: “The capital of France is ___“.
  • Your brain immediately thinks “Paris” because you’ve learned this relationship.
  • LLMs work similarly but at a massive scale.
    • They are trained to predict the next word given all previous words in a sentence.
    • During training, the model sees millions of examples like “The capital of France is Paris” and learns patterns.
    • This simple task of next-word prediction is powerful enough to enable translation, summarization, and conversation.

Tokenization

  • Computers cannot directly process text, so we convert words into numbers.
    • This process is called tokenization, where text is split into tokens (words or subwords).
    • Each unique token gets assigned a numerical ID from a vocabulary.
    • For example, the sentence “I love R programming” might become tokens: [“I”, “love”, “R”, “programming”]. These tokens are then converted to IDs: [245, 1342, 876, 3421].
    • The model works entirely with these numerical representations.
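As a sketch, the word-to-ID mapping can be written in a few lines of R; the vocabulary and IDs below are the illustrative ones from the slide, not those of a real tokenizer (which would split text into subwords):

```r
# A minimal word-level tokenizer: map words to IDs via a (hypothetical) vocabulary.
vocab <- c("I" = 245, "love" = 1342, "R" = 876, "programming" = 3421)

tokenize <- function(text, vocab) {
  tokens <- strsplit(text, " ")[[1]]  # split on spaces
  unname(vocab[tokens])               # look up each token's ID
}

tokenize("I love R programming", vocab)
# [1]  245 1342  876 3421
```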

Word Embeddings: Capturing Meaning

  • Each token ID is converted into a vector (a list of numbers) called an embedding.
  • These embeddings capture the meaning and relationships between words.
    • For example, if we use 4-dimensional embeddings: “king” might be [0.8, 0.3, 0.1, 0.5] and “queen” might be [0.7, 0.4, 0.1, 0.6].
    • Words with similar meanings have similar vectors.
    • The famous example: king - man + woman \(\approx\) queen works because of these mathematical relationships.
    • In real LLMs, embeddings have hundreds or thousands of dimensions.
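A quick R sketch shows how vector similarity captures meaning; the “king” and “queen” vectors are the toy ones from the slide, and the “car” vector is made up here purely for contrast:

```r
# Toy 4-dimensional embeddings; real models use hundreds of dimensions.
king  <- c(0.8, 0.3, 0.1, 0.5)
queen <- c(0.7, 0.4, 0.1, 0.6)
car   <- c(0.1, 0.9, 0.8, 0.0)  # unrelated word, invented for contrast

# Cosine similarity: 1 means identical direction, 0 means unrelated.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(king, queen)  # close to 1: similar meanings
cosine(king, car)    # much smaller: dissimilar meanings
```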

Mathematics Behind LLMs

Neural Networks: The Foundation

  • A neural network is a mathematical function that transforms inputs into outputs through layers.
  • Each layer applies a transformation: \(\text{output} = \text{activation}(\text{weights} \times \text{input} + \text{bias})\).
  • The weights are parameters that the model learns during training.
    • Multiple layers are stacked, allowing the network to learn complex patterns.
    • For text, the input is token embeddings and the output is predictions for the next token.
    • The network learns which patterns in the input predict which outputs.

In-depth Look at Word2Vec (CBOW)

  • One of the simplest and most influential models for learning word embeddings is Word2Vec, specifically the Continuous Bag of Words (CBOW) variant.
  • Before we can input words into the network, we must represent them numerically.
  • One-hot encoding represents each word as a vector of length \(V\) (vocabulary size). For word \(i\), the vector is all zeros except position i, which is 1.
    • If “economy” is word 457 in our vocabulary of 10,000 words, it becomes a 10,000-dimensional vector: \([0, 0, ..., 0, 1, 0, ..., 0]\) where the 1 is in position 457.
    • This is extremely sparse but allows us to index into our weight matrices.
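The one-hot vector for word 457 can be built directly in R:

```r
# One-hot encoding: a length-V vector of zeros with a single 1 at position i.
one_hot <- function(i, V) {
  x <- numeric(V)  # all zeros
  x[i] <- 1
  x
}

x <- one_hot(457, 10000)
sum(x)          # 1: exactly one non-zero entry
which(x == 1)   # 457: the position of the word in the vocabulary
```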

Weight Matrices

  • CBOW uses two weight matrices that we learn during training.
  • \(W_{in}\) is a \(V \times N\) matrix where \(V\) is vocabulary size and \(N\) is embedding dimension (typically 100-300).
  • Each row of \(W_{in}\) represents the input embedding for a word.
  • \(W_{out}\) is an \(N \times V\) matrix where each column represents the output embedding for a word.
    • These matrices are initialized randomly and then updated through something called backpropagation.
    • After training, we typically use \(W_{in}\) as our final word embeddings, though some implementations average \(W_{in}\) and \(W_{out}^T\).

Forward Pass

  • Given a context window of \(C\) words around our target, we input \(C\) one-hot vectors: \(x_1, x_2, \dots, x_C\).
  • Each \(x_c\) is a \(V\)-dimensional one-hot vector.
    • We multiply each by \(W_{in}\) to get word vectors: \(v_c = W_{in}^T x_c\).
    • We average these context vectors to get the hidden layer: \(h = (1/C) \sum_c v_c\).
    • This hidden layer \(h\) is an N-dimensional vector representing the average context.
    • The averaging is why we call it a “bag of words”—word order in the context doesn’t matter.
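The averaging step can be sketched in R with toy sizes (a made-up vocabulary of V = 6 words and N = 3 dimensions, with random weights as before training); selecting rows of \(W_{in}\) is equivalent to the one-hot multiplications above:

```r
# CBOW hidden layer: average the input embeddings of the context words.
set.seed(1)
V <- 6; N <- 3
W_in <- matrix(rnorm(V * N), nrow = V, ncol = N)  # row i = input embedding of word i

context_ids <- c(2, 4, 5)                # C = 3 context words
v <- W_in[context_ids, , drop = FALSE]   # row selection = multiplying by one-hot vectors
h <- colMeans(v)                         # h = (1/C) * sum of context vectors
h  # N-dimensional; the order of context_ids does not matter ("bag of words")
```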

From Hidden to Output

  • From the hidden layer \(h\), we compute scores for every word in the vocabulary.
  • From \(h\) we compute the score vector \(u = W_{out}^T h\); the score for word \(w\) is its \(w\)-th entry \(u_w\) (the dot product of \(h\) with the \(w\)-th column of \(W_{out}\)). This gives us \(V\) scores, one for each possible output word.
    • We convert these scores to probabilities using the softmax function: \(P(w \mid \text{context}) = \exp(u_w) / \sum_{w'=1}^V \exp(u_{w'})\).
    • The softmax ensures all probabilities sum to 1. The denominator requires computing scores for all \(V\) words, which is computationally expensive for large vocabularies.
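A minimal R sketch of the hidden-to-output step, with toy sizes and random weights (real values would come from training):

```r
# From hidden layer to probabilities: scores u = W_out^T h, then softmax.
set.seed(2)
V <- 6; N <- 3
W_out <- matrix(rnorm(N * V), nrow = N, ncol = V)  # column w = output embedding of word w
h <- rnorm(N)                                      # a stand-in hidden layer

u <- as.vector(t(W_out) %*% h)                     # V scores, one per word
softmax <- function(u) exp(u) / sum(exp(u))
p <- softmax(u)

sum(p)  # 1: a valid probability distribution over the vocabulary
```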

The Objective and Updating

  • We want to maximize the probability of observing the actual target word given its context.
  • For a single training example with target word \(w_{target}\) and context words \(c_1, c_2, \dots, c_C\), our objective is:
    • Maximize \(\log P(w_{target} | c_1, c_2, \dots, c_C)\).
    • Equivalently, we minimize the negative log likelihood: \(L = -\log P(w_{target} | \text{context}) = -u_{w_{target}} + \log \sum_{w=1}^V \exp(u_w)\).
    • Across the entire corpus of \(T\) training examples, we minimize: \(J = -(1/T) \sum_t \log P(w_{target}^{(t)} | \text{context}^{(t)})\).

Schematic Overview

  • Here is a schematic overview of the Word2Vec model.

Word2Vec in R

Example: Word2Vec in R

We use the real corpus of Jane Austen books to check whether Word2Vec can recover a relationship such as Mother \(\approx\) Father - Man + Woman.

library(word2vec)
library(janeaustenr)
library(dplyr)
library(stringr)

# 1. Prepare a corpus (Jane Austen's books)
# We collapse all lines into one long character string per book, then clean it slightly
austen_data <- austen_books() |>
  group_by(book) |>
  summarise(text = paste(text, collapse = " ")) |>
  mutate(text = tolower(text)) |>
  mutate(text = str_remove_all(text, "[[:punct:]]")) |>
  pull(text)

# 2. Train the model
# We use a smaller dimension (dim=15) because the data size is small (~13k unique words)
model <- word2vec(x = austen_data, type = "cbow", dim = 15, iter = 20)

# 3. Test a relationship that actually exists in this specific text
# (Austen writes about "father", "mother", "brother", "sister" often)
emb <- as.matrix(model)

# Try: Father - Man + Woman = Mother
vector <- emb["father", , drop = FALSE] - 
          emb["man", , drop = FALSE] + 
          emb["woman", , drop = FALSE]

predict(model, newdata = vector, type = "nearest", top_n = 5)
$father
     term similarity rank
1  father  0.9879450    1
2 brother  0.9684168    2
3  mother  0.9630506    3
4   uncle  0.9554353    4
5  sister  0.9410847    5

Training LLMs

The Training Process: Learning from Data

  • Training an LLM involves showing it millions of text examples.

  • For each example, the model predicts the next word and compares it to the actual next word.

  • We measure the error using a loss function, typically cross-entropy loss:

    \[L = - \sum y_i \log(\hat{y}_i),\]

  • where \(y_i\) is the true probability and \(\hat{y}_i\) is the predicted probability.

  • Using algorithms like backpropagation and gradient descent, we adjust the model’s weights to minimize this loss.

  • This process repeats for billions of examples until the model learns to predict accurately.

The Cross-Entropy Loss

Example: Cross-Entropy Loss

Suppose we want to predict the next word after “I love”.

The vocabulary has three possible words: “R” (ID: 0), “Python” (ID: 1), “Java” (ID: 2).

The correct answer is “R”, so the true distribution is y = [1, 0, 0].

Our model predicts probabilities \(\hat{y}\) = [0.7, 0.2, 0.1].

The cross-entropy loss is: \(L = -(1\times\log(0.7) + 0\times\log(0.2) + 0\times\log(0.1)) = -\log(0.7) \approx 0.357\). If the model predicted \(\hat{y}\) = [0.3, 0.5, 0.2], the loss would be \(-\log(0.3) \approx 1.204\), which is worse. Lower loss means better predictions.
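The same numbers can be checked in R:

```r
# Cross-entropy loss: -sum(y * log(y_hat)), with y the true one-hot distribution.
cross_entropy <- function(y, y_hat) -sum(y * log(y_hat))

y <- c(1, 0, 0)                     # correct answer is "R" (ID 0, first position)
cross_entropy(y, c(0.7, 0.2, 0.1))  # -log(0.7), approx. 0.357
cross_entropy(y, c(0.3, 0.5, 0.2))  # -log(0.3), approx. 1.204: a worse prediction
```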

Gradient Descent: Finding the Minimum

  • Gradient descent is an optimization algorithm that finds the best parameters for our model.
  • Imagine you’re standing on a hill in fog and want to reach the valley—you’d take steps downhill.
    • The loss function is like a landscape, and we want to find its lowest point.
    • The gradient tells us which direction is uphill, so we move in the opposite direction.
    • Mathematically, we update each weight using: \(w_{new} = w_{old} - \alpha \times \frac{\partial L}{\partial w}\), where \(\alpha\) is the learning rate (step size).
    • The gradient \(\frac{\partial L}{\partial w}\) tells us how much the loss changes when we change weight \(w\).
    • We repeat this process thousands or millions of times until the loss stops decreasing.

Learning Rate

  • The learning rate \(\alpha\) controls how big our steps are during optimization.
    • If \(\alpha\) is too large, we might overshoot the minimum and bounce around erratically.
    • If \(\alpha\) is too small, training takes forever and we might get stuck in local minima.
  • Typical learning rates are small numbers like 0.001 or 0.0001.
    • If the gradient says “you’re climbing steeply,” the learning rate tells you “take a 10-centimeter step down.”
  • Modern optimizers like Adam adapt the learning rate automatically during training.
    • Choosing the right learning rate is crucial—it’s often the most important hyperparameter to tune.

Example: Gradient Descent

Example: Gradient Descent

Let’s minimize a simple loss function: \(L(w) = (w - 3)^2\).

The true minimum is at \(w = 3\) where \(L = 0\).

Suppose we start with \(w = 0\) and use learning rate \(\alpha = 0.1\). The gradient is \(\frac{\partial L}{\partial w} = 2(w - 3)\).

Iteration 1: \(\text{gradient} = 2(0 - 3) = -6\), so \(w_{new} = 0 - 0.1\times(-6) = 0.6\), and \(L = (0.6-3)^2 = 5.76\).

Iteration 2: \(\text{gradient} = 2(0.6 - 3) = -4.8\), so \(w_{new} = 0.6 - 0.1\times(-4.8) = 1.08\), and \(L = (1.08-3)^2 \approx 3.69\).

Iteration 3: \(\text{gradient} = 2(1.08 - 3) = -3.84\), so \(w_{new} = 1.08 - 0.1\times(-3.84) = 1.464\), and \(L \approx 2.36\). We continue this process, and \(w\) gradually approaches 3 while \(L\) approaches 0.

Gradient Descent in R

Example: Gradient Descent

# Simple gradient descent to minimize L(w) = (w - 3)^2
gradient_descent <- function(start_w, learning_rate, n_iterations) {
  w <- start_w
  history <- data.frame(iteration = integer(), w = numeric(), L = numeric())
  for (i in 1:n_iterations) {
    gradient <- 2 * (w - 3)  # Derivative of L(w)
    w <- w - learning_rate * gradient  # Update rule
    L <- (w - 3)^2  # Compute loss
    history <- rbind(history, data.frame(iteration = i, w = w, L = L))
  }
  return(history)
}

# Run gradient descent
result <- gradient_descent(start_w = 0, learning_rate = 0.1, n_iterations = 20)
print(result)
   iteration        w           L
1          1 0.600000 5.760000000
2          2 1.080000 3.686400000
3          3 1.464000 2.359296000
4          4 1.771200 1.509949440
5          5 2.016960 0.966367642
6          6 2.213568 0.618475291
7          7 2.370854 0.395824186
8          8 2.496684 0.253327479
9          9 2.597347 0.162129587
10        10 2.677877 0.103762935
11        11 2.742302 0.066408279
12        12 2.793842 0.042501298
13        13 2.835073 0.027200831
14        14 2.868059 0.017408532
15        15 2.894447 0.011141460
16        16 2.915558 0.007130535
17        17 2.932446 0.004563542
18        18 2.945957 0.002920667
19        19 2.956765 0.001869227
20        20 2.965412 0.001196305

Backpropagation

  • Backpropagation is the algorithm for computing gradients in more complex neural networks.
  • A neural network is a composition of functions: \(y = f_n(...f_2(f_1(x))...)\).
  • To find how the loss \(L\) depends on early weights, we need the chain rule from calculus.
  • The chain rule says: \(\partial L/\partial w_1 = (\partial L/\partial y) \times (\partial y/\partial f_2) \times (\partial f_2/\partial f_1) \times (\partial f_1/\partial w_1)\).
  • Backpropagation computes these derivatives by working backwards from the output to the input.
    • We start with \(\partial L/\partial y\) (how wrong our prediction was), then multiply by derivatives at each layer going backward. This “backward pass” efficiently computes all gradients in one sweep through the network.

Example: Backpropagation

Example: Backpropagation

Consider a tiny network: input \(x\), one hidden layer with \(h = 2x + 1\), and output \(\hat{y} = 3h\).

The true output is \(y = 20\), and input is \(x = 2\). Forward pass: \(h = 2(2) + 1 = 5\), then \(\hat{y} = 3(5) = 15\). The loss is \(L = (\hat{y} - y)^2 = (15 - 20)^2 = 25\).

Now we compute gradients backward. First, \(\partial L/\partial \hat{y} = 2(\hat{y} - y) = 2(15 - 20) = -10\) (how loss changes with output).

Next, \(\partial \hat{y}/\partial h = 3\) (from \(\hat{y} = 3h\)), so \(\partial L/\partial h = (\partial L/\partial \hat{y}) \times (\partial \hat{y}/\partial h) = -10 \times 3 = -30\). This tells us that if we increase \(h\) slightly, the loss decreases by 30 times that increase. This is the chain rule in action.
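The forward and backward passes of this tiny network can be replicated in R (the last line continues the chain one step further, back to the input \(x\)):

```r
# Tiny network from the example: h = 2x + 1, y_hat = 3h, L = (y_hat - y)^2.
x <- 2; y <- 20

# Forward pass
h     <- 2 * x + 1      # 5
y_hat <- 3 * h          # 15
L     <- (y_hat - y)^2  # 25

# Backward pass (chain rule, output to input)
dL_dyhat <- 2 * (y_hat - y)  # -10
dL_dh    <- dL_dyhat * 3     # -30, since dy_hat/dh = 3
dL_dx    <- dL_dh * 2        # -60, since dh/dx = 2

c(L = L, dL_dyhat = dL_dyhat, dL_dh = dL_dh, dL_dx = dL_dx)
```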

The Limitation of Simple Neural Networks

  • Early neural networks processed words one at a time or in fixed windows.
  • They struggled with long-range dependencies: understanding how words far apart relate to each other.
    • For example, in “The economist who studied inflation for decades finally published”, the model needs to remember “economist” to predict “published”.
  • Simple networks forget information from earlier in the sequence.
  • They also process words sequentially, which is slow and cannot capture complex relationships.
  • This is where transformers revolutionized the field.

Transformers

The Transformer Architecture: A Breakthrough

  • The transformer, introduced in 2017, processes all words simultaneously rather than sequentially.
  • It uses an attention mechanism to identify which words are relevant to each other.
  • Unlike previous architectures, transformers can directly connect any two words in a sentence.
    • They consist of multiple layers, each with attention and feed-forward components.
    • This parallel processing makes them faster to train and better at capturing context.
    • All modern LLMs, including GPT and Claude, are based on transformers.

Self-Attention

  • Self-attention allows each word to “look at” all other words and decide which ones are important.
    • For each word, the model computes attention scores with every other word in the sequence.
    • Higher scores mean stronger relationships between words.
    • The model then uses these scores to create a weighted combination of all words.
    • This weighted combination becomes the new representation of each word.
    • Self-attention enables the model to understand context dynamically.

Attention Mechanism: The Mathematics

  • For each word, we create three vectors: Query (\(Q\)), Key (\(K\)), and Value (\(V\)).
  • These are computed by multiplying the word embedding by learned weight matrices.
    • The attention score between two words is: \(\text{score} = QK^T / \sqrt{d_k}\), where \(d_k\) is the dimension of the key vector.
    • We apply softmax to convert scores to probabilities: \(\text{attention weights} = \mathrm{softmax}(\text{scores})\).
    • Finally, we compute the output as a weighted sum: \(\text{output} = \text{attention weights} \times V\). This process determines which words to focus on.

Attention Example

“The animal didn’t cross the street because it was too tired”

  • Let’s analyze what “it” refers to in this sentence.
    • The word “it” creates a query vector that asks “what does ‘it’ refer to?”.
    • Every word in the sentence has a key vector representing what it is.
    • The query of “it” is compared with keys of ["The", "animal", "didn't", "cross", "the", "street", "because", "it", "was", "too", "tired"].
    • The attention mechanism computes high scores for “animal” (0.6) and lower scores for other words like “street” (0.1) or “tired” (0.05).
    • These scores become weights, and “it” gets a new representation heavily influenced by “animal”.
    • This is how the model understands “it” refers to “the animal”.

Attention Example: Numerical Calculation

Example: Attention Mechanism (Numerical)

Suppose we have simplified 3-dimensional vectors.

“it” has query \(Q_{it} = [1, 0, 2]\). “animal” has key \(K_{animal} = [0.9, 0.1, 1.8]\) and “street” has key \(K_{street} = [0.2, 0.3, 0.4]\).

We compute dot products: \(Q_{it} \cdot K_{animal} = 1\times 0.9 + 0\times 0.1 + 2\times 1.8 = 4.5\) and \(Q_{it} \cdot K_{street} = 1\times 0.2 + 0\times 0.3 + 2\times 0.4 = 1.0\).

We divide both numbers by \(\sqrt{3} \approx 1.73\): \(\text{score}_{animal} = 4.5/1.73 \approx 2.60\), \(\text{score}_{street} = 1.0/1.73 \approx 0.58\).

After softmax, \(\text{attention weight}_{animal} \approx 0.88\) and \(\text{attention weight}_{street} \approx 0.12\). The model focuses 88% on “animal” and 12% on “street” when processing “it”.
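These numbers can be reproduced with a small R implementation of scaled dot-product attention:

```r
# Scaled dot-product attention for the "it" example above.
softmax <- function(x) exp(x) / sum(exp(x))

Q_it <- c(1, 0, 2)
K <- rbind(animal = c(0.9, 0.1, 1.8),   # one key vector per row
           street = c(0.2, 0.3, 0.4))
d_k <- length(Q_it)

scores  <- drop(K %*% Q_it) / sqrt(d_k)  # animal: ~2.60, street: ~0.58
weights <- softmax(scores)
round(weights, 2)  # animal: 0.88, street: 0.12
```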

Multi-Head Attention: Multiple Perspectives

  • Instead of one attention mechanism, transformers use multiple attention “heads” in parallel.
    • Each head learns different types of relationships between words.
    • One head might focus on syntactic relationships (subject-verb), another on semantic similarities.
    • Each head has its own \(Q, K, V\) weight matrices, so they compute different attention patterns.
    • The outputs from all heads are concatenated and transformed through another weight matrix.
    • Typically, models use 8-16 attention heads per layer, and GPT-3 uses 96 heads per layer.

Multi-Head Attention Example

Example: Multi-Head Attention

“The cat sat on the mat”.

Head 1 might learn to connect “cat” with “sat” (subject-verb relationship).

When processing “sat”, Head 1 assigns high attention to “cat” (0.7) and low to “mat” (0.1).

Head 2 might learn spatial relationships, connecting “sat” with “mat” (location).

When processing “sat”, Head 2 assigns high attention to “mat” (0.6) and moderate to “on” (0.3).

By combining insights from both heads, the model understands both what is sitting and where it’s sitting. This multi-perspective approach makes transformers powerful.

Positional Encoding: Preserving Word Order

  • Attention processes all words simultaneously, but word order matters: “dog bites man” differs from “man bites dog”.
  • To preserve order information, we add positional encodings to word embeddings.
    • These are vectors that represent each word’s position in the sequence.
    • For position \(t\) and embedding dimension \(i\), we use: \(PE(t,i) = \sin(t / 10000^{2i/d})\) for even \(i\), and \(PE(t,i) = \cos(t / 10000^{2i/d})\) for odd \(i\). These encodings allow the model to distinguish “I love R” from “R love I”.
    • The positional information is embedded directly into the word representations.
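A sketch of the encoding in R, using the common convention in which dimension pairs 2k and 2k+1 share a frequency (implementations differ slightly in indexing details):

```r
# Sinusoidal positional encoding for position t in a d-dimensional model.
positional_encoding <- function(t, d) {
  i <- 0:(d - 1)
  angle <- t / 10000^((2 * (i %/% 2)) / d)  # paired dimensions share a frequency
  ifelse(i %% 2 == 0, sin(angle), cos(angle))  # sin on even dims, cos on odd
}

# Each position gets a distinct vector that is added to the word embedding:
round(positional_encoding(t = 0, d = 4), 3)  # 0 1 0 1
round(positional_encoding(t = 1, d = 4), 3)
```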

Overview

  • Here is an overview of the multi-head attention mechanism.

Stacking Transformer Layers

  • A single transformer layer applies attention and then a feed-forward network.
    • Modern LLMs stack dozens or hundreds of these layers.
    • Each layer refines the representation from the previous layer.
    • Early layers might capture syntax and grammar.
    • Middle layers might capture semantic relationships and facts.
    • Later layers might capture complex reasoning and abstract concepts.
  • GPT-3 has 96 layers, and larger models have even more. Deeper networks can capture more complex patterns.

The Complete Forward Pass

Example: From Text to Prediction

Let’s trace “The economist studies” through a transformer.

Step 1: Tokenize to [“The”, “economist”, “studies”] with IDs [123, 456, 789].

Step 2: Convert to embeddings and add positional encodings.

Step 3: Pass through Layer 1 attention, which identifies “economist” and “studies” are related.

Step 4: Pass through Layer 1 feed-forward network for transformation.

Step 5: Repeat through all subsequent layers (e.g., 12 layers).

Step 6: The final layer outputs a probability distribution over all vocabulary words.

Step 7: The model predicts “inflation” has probability 0.15, “data” has 0.12, etc.

Why Are LLMs So Large?

  • GPT-3 has 175 billion parameters (weights and biases in the neural network).
    • Each parameter is a number that the model learns during training.
    • More parameters allow the model to memorize more patterns and relationships.
  • The embedding matrix alone might store 50,000 tokens \(\times\) 12,000 dimensions = 600 million parameters.
  • Each transformer layer has millions of parameters in its attention and feed-forward components.
  • Larger models generally perform better but require more computational resources.
    • Training GPT-3 cost approximately $4-5 million in compute resources.

Scaling Laws: Bigger Is Better

  • Research shows that model performance improves predictably with size.
  • If you increase model size, training data, and computation, the loss decreases following a power law.
  • Doubling the model size reduces loss by a consistent percentage.
  • This has driven the trend toward ever-larger models: GPT-2 (1.5B parameters), GPT-3 (175B), GPT-4 (estimated 1+ trillion).
    • However, there are diminishing returns, and very large models face challenges. The relationship between size and capability is an active research area.

Training Data: Learning from the Internet

  • LLMs are trained on massive text datasets scraped from the internet. GPT-3 was trained on approximately 45TB of text data, including books, Wikipedia, web pages, and code repositories.
  • The training data is filtered to remove low-quality content and duplicates.
  • The model sees trillions of words during training. Data quality matters: biases in the training data appear in the model’s outputs.
  • Diverse, high-quality data leads to more capable and robust models. The phrase “garbage in, garbage out” applies to LLMs.

Emergence: Surprising Capabilities

  • As LLMs grow larger, they develop capabilities not explicitly trained for.
    • This phenomenon is called emergence. Small models can complete sentences, but large models can reason, solve math problems, and write code.
    • For example, GPT-3 can perform arithmetic, translate languages, and answer questions without being explicitly trained for these tasks.
    • These abilities emerge from the general pattern-learning during next-word prediction.
  • Researchers don’t fully understand why or when emergence occurs, making it a fascinating area of study.

Temperature and Sampling: Controlling Randomness

  • When generating text, the model predicts probability distributions over words.
  • However, we don’t always pick the highest-probability word: always doing so makes the output repetitive.
  • Temperature controls randomness: at temperature \(T\), we adjust probabilities as: \(p_i^{new} = \exp(\log(p_i)/T) / \sum_j \exp(\log(p_j)/T)\).
    • Temperature = 1 uses the original probabilities.
    • Temperature < 1 (e.g., 0.5) makes the model more confident and deterministic. Temperature > 1 (e.g., 1.5) makes the model more random and creative.
    • For factual tasks, use low temperature; for creative writing, use high temperature.

Temperature Example

Example: Predicting Capital of France

Suppose the model predicts probabilities: Paris (0.7), Lyon (0.15), Marseille (0.1), London (0.05).

At temperature = 1, these probabilities are unchanged, and Paris is chosen 70% of the time.

At temperature = 0.5, we get: Paris (0.93), Lyon (0.04), Marseille (0.02), London (0.005), making Paris even more likely.

At temperature = 2, we get: Paris (0.47), Lyon (0.22), Marseille (0.18), London (0.13), increasing variety.

Lower temperature gives more predictable, correct answers; higher temperature gives more creative, varied outputs.

Implementation in R
change_probabilities <- function(probs, temperature) {
  log_probs <- log(probs)
  adjusted_log_probs <- log_probs / temperature
  exp_probs <- exp(adjusted_log_probs)
  return(exp_probs / sum(exp_probs))
}

change_probabilities(c(0.7, 0.15, 0.1, 0.05), temperature = 2)
## [1] 0.4743528 0.2195827 0.1792885 0.1267761

Limitations and Challenges

Fine-Tuning: Adapting to Specific Tasks

  • After pre-training on general text, LLMs can be fine-tuned for specific applications.
  • Fine-tuning continues training on a smaller, task-specific dataset.
    • For example, fine-tune on medical texts to create a medical chatbot. Fine-tune on code to improve programming assistance.
  • Fine-tuning adjusts the pre-trained weights rather than training from scratch.
    • This is much faster and requires less data than full training. The model retains general knowledge while specializing in the target domain.

Reinforcement Learning from Human Feedback (RLHF)

  • To make LLMs more helpful and safer, human feedback is also used.
  • Human evaluators rank different model outputs for the same prompt.
    • A reward model is trained to predict which outputs humans prefer.
    • The LLM is then trained to maximize this reward using reinforcement learning.
    • This process aligns the model with human values and preferences.
    • RLHF is why modern chatbots are helpful, harmless, and honest.
    • Without RLHF, models might generate toxic, false, or unhelpful content.

LLM Challenges

  • LLMs have significant limitations despite their capabilities.
    • They can generate false information confidently (hallucination).
    • They lack true understanding and rely on pattern matching.
    • They struggle with precise arithmetic and logical reasoning.
    • They can perpetuate biases present in training data.
    • They require enormous computational resources and energy.
    • They cannot access real-time information unless connected to external tools.
  • Understanding these limitations is crucial for responsible use.

Hallucination

  • Hallucination occurs when LLMs generate plausible-sounding but incorrect information.
    • For example, an LLM might confidently cite a research paper that doesn’t exist.
    • This happens because the model predicts what sounds likely, not what is true.
    • The model doesn’t “know” facts; it recognizes patterns.
  • Hallucination is more common for obscure topics with limited training data.
  • Users must verify critical information from LLM outputs.
    • Techniques like retrieval-augmented generation help reduce hallucination.

Prompting: How to Interact with LLMs

  • The way you phrase your input (the prompt) significantly affects the output.
  • Clear, specific prompts yield better results than vague ones.
    • “Explain transformers” gives a general answer, while “Explain how self-attention works in transformers using a numerical example” gives a focused response.
  • Few-shot prompting provides examples in the prompt to guide the model.
  • Chain-of-thought prompting asks the model to reason step-by-step.
  • Effective prompting is a skill that improves with practice.

LLMs in Economics Research

  • LLMs are becoming tools for economic research and analysis.
    • They can analyze large volumes of text data, such as news articles or earnings calls, more efficiently than traditional methods.
    • They can summarize economic reports and extract key information.
    • They can assist in coding tasks, such as writing R scripts for data analysis.
    • They can generate hypotheses and literature reviews.
  • However, researchers must critically evaluate LLM outputs and verify important claims.
  • LLMs augment, but don’t replace, human expertise.

Using LLMs with R

The ellmer package

  • The ellmer package provides an interface to interact with LLMs directly from R.
  • It supports multiple providers like OpenAI and Anthropic.
  • You can send prompts and receive responses seamlessly.
    • The package handles authentication, request formatting, and response parsing.
    • It allows you to experiment with different models and settings (e.g., temperature, max tokens).
  • You can integrate LLMs into your data analysis workflows, automate tasks, and explore new research avenues.

Ollama

  • Ollama is a local LLM platform that allows you to run models on your own machine without needing internet access.
  • Download Ollama from the official website and install it on your computer.
  • It supports various open-source models and provides an API similar to cloud providers.
    • Using Ollama with R, you can leverage LLM capabilities while maintaining data privacy and reducing costs.
    • It’s particularly useful for sensitive data or when internet connectivity is limited.
    • The ellmer package can interface with Ollama, making it easy to switch between local and cloud-based models.

Open Source LLMs

  • Ollama supports many open-source LLMs that you can run locally.
    • Examples include LLaMA, GPT-J, and Falcon.
  • These models vary in size and capabilities, allowing you to choose one that fits your hardware and needs.
  • This also means that you don’t have to pay for API usage, making it cost-effective for experimentation.
  • For the next lecture, please download Ollama and install a particular model locally.
    • Open your command prompt / terminal and run: ollama pull gemma3. This downloads the Gemma3 model to your machine.

Example: LLMs in R

  • In the following example, we use the ellmer package and the gemma3 model (downloaded through ollama) to classify texts by sentiment.
  • Traditional sentiment analysis uses word lists or trained classifiers.
  • With an LLM, you can simply prompt: “Classify the sentiment of this text as positive, negative, or neutral: [text]”.
    • The LLM uses its understanding of language to classify without explicit training.
    • For economic news, you might prompt: “Does this article suggest economic growth, recession, or stability?”.
    • The advantage is flexibility—you can adapt the task without retraining.
    • The disadvantage is cost and potential inconsistency compared to specialized models.

Example: LLMs in R (2)

Example: Sentiment Analysis with LLMs

library(ellmer)
chat <- chat_ollama(
    system_prompt = "You are a helpful assistant that classifies text sentiment. Return only 1 word: 'positive', 'negative', or 'neutral'",
    model = "gemma3"
)

chat$chat("Classify the sentiment of this text as positive, negative, or neutral: 'The economy is showing signs of recovery after a long downturn.'")
positive

Ethical Considerations

  • LLMs raise important ethical questions.
    • They can generate misinformation and deepfakes.
    • They can amplify biases related to race, gender, and culture.
    • They threaten jobs that involve text generation or analysis.
    • They consume significant energy, raising environmental concerns.
    • Privacy issues arise when models are trained on personal data.
  • As economists and researchers, we must use these tools responsibly.
  • We should be transparent about LLM use and carefully evaluate outputs for bias and accuracy.

The Future of LLMs

  • LLMs are rapidly evolving with new capabilities emerging regularly.
  • Contemporary models integrate multiple modalities: text, images, audio, and video.
    • They may better handle reasoning, planning, and mathematical tasks.
    • Efficiency improvements may make powerful models accessible on personal devices.
  • Specialized models for domains like economics, medicine, and law are being developed.
  • The line between human and AI-generated content will blur.
    • Understanding these models is crucial for participating in the AI-augmented economy.

Recapitulation

Recapitulation

  • LLMs predict text by learning patterns from massive datasets.
  • Transformers use attention mechanisms to capture relationships between words.
    • Attention computes weighted combinations based on query, key, and value vectors.
    • Multiple attention heads capture different types of relationships.
  • Models are trained by minimizing prediction error using cross-entropy loss.
  • Larger models with more data generally perform better.
    • LLMs can hallucinate, so outputs must be verified.
    • Effective prompting is essential for getting good results.