Introduction to Applied Data Science

Lecture 4: Text as Data

Bas Machielsen

Overview

Course Schedule

Event Date Subject
Lecture 1 21-04 Introduction to Data and Data Science
Lecture 2 28-04 Getting Data: APIs and Databases
Lecture 3 07-05 Getting Data: Web Scraping
Lecture 4 26-05 Text as Data
Lecture 5 27-05 Introduction to LLMs
Lecture 6 09-06 Prompt Engineering and Structured Data
Lecture 7 16-06 Spatial Data and Geocomputation

Outline Today

  • First part: introduction to text mining, traditional text mining methods, and sentiment analysis.
  • Second part: introduction to embeddings-based approaches and pre-LLM era models of text.

Introduction to Text As Data

Why Should Economists Care About Text?

  • Economic data isn’t just numbers anymore.
    • Companies’ quarterly reports, central bank announcements, consumer reviews, social media posts, and news articles all contain valuable economic information.
    • Traditional economic analysis focused on structured numerical data like GDP, prices, and employment rates.
    • Text mining allows us to extract insights from millions of documents that would be impossible to read manually.
    • For example, we can measure consumer sentiment from product reviews or predict market movements from news articles.

The Fundamental Challenge

  • Computers work with numbers, but text is made of words.
  • How do we convert Shakespeare, a tweet, or an earnings report into something a computer can analyze?
    • This conversion process is called “text representation” and it’s the foundation of all text analysis.
    • We need mathematical structures that preserve meaningful information while enabling computation.
    • The journey from words to numbers involves several clever techniques we’ll explore today.

Document-Term Matrices

What is a Document-Term Matrix?

  • A document-term matrix (DTM) is the most basic way to represent text as numbers.
  • Imagine a giant spreadsheet where each row represents a document and each column represents a unique word.
  • Each cell contains a count of how many times that word appears in that document.
    • If we have 1,000 news articles and our vocabulary contains 10,000 unique words, our matrix would be 1,000 rows by 10,000 columns.
    • Most cells will be zero because any single document only uses a small fraction of all possible words.

Example: DTM

Example: DTM

Let’s say we have three very short documents. Document 1: “The economy is strong.” Document 2: “The economy is weak.” Document 3: “The market is volatile.”

Our vocabulary contains seven unique words: the, economy, is, strong, weak, market, volatile. Each document becomes a row of seven numbers counting word occurrences.

Document 1 would be represented as [1, 1, 1, 1, 0, 0, 0] corresponding to the word counts. Document 2 would be [1, 1, 1, 0, 1, 0, 0].
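The toy corpus above can be assembled into a DTM with a few lines of base R. This is a hand-rolled sketch for illustration only; later slides show the standard tidytext workflow.

```r
# Toy corpus from the slide (lowercased, punctuation dropped)
docs <- c(doc1 = "the economy is strong",
          doc2 = "the economy is weak",
          doc3 = "the market is volatile")

# Fix the vocabulary order to match the slide
vocab <- c("the", "economy", "is", "strong", "weak", "market", "volatile")

# Count each vocabulary word in each document
tokens <- strsplit(docs, " ")
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
dtm  # doc1 row: 1 1 1 1 0 0 0, as on the slide
```

Each row is a document, each column a vocabulary word, matching the representation described above.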

The Mathematics of the DTM

  • Formally, we define a document-term matrix as \(D\) where \(D(i,j)\) represents the count of term \(j\) in document \(i\).
  • If we have \(n\) documents and \(m\) unique terms, \(D\) is an \(n \times m\) matrix.
  • We can write this as \(D \in \mathbb{R}^{n \times m}\) where each entry \(D(i,j) \geq 0\) is a non-negative integer.
    • This matrix is typically very sparse, meaning most entries are zero.
    • The sparsity arises because individual documents use only a small subset of the total vocabulary.

Obtaining a DTM in R

  • Whereas our raw data usually comes in the form of text strings, to create a DTM, we need to separate that into individual words.
  • Tokenization is the process of breaking raw text into individual units called “tokens.”
    • Tokens are typically words, but can also be sentences, characters, or subwords depending on your analysis goals.
    • This is the first and most fundamental step in any text analysis pipeline.
    • The quality of your tokenization directly affects all downstream analysis.

Example: Tokenization in R

  • The simplest approach splits text wherever there’s whitespace (spaces, tabs, newlines).
    • This is fast and language-agnostic, but it treats punctuation as part of words: “economy.” and “economy” become different tokens.
    • It doesn’t handle contractions intelligently, and most production systems use more sophisticated methods.

Example: Tokenization in R

documents <- data.frame(
  doc_id = c("doc1", "doc2", "doc3", "doc4"),
  text = c(
    "The U.S. economy grew by 2.5% in Q3 2024. Unemployment remains low at 3.8%.",
    "Dr. Smith's research shows that inflation-targeting works. It's state-of-the-art!",
    "Don't underestimate the market's volatility. Tech stocks aren't performing well.",
    "Central banks face a dilemma: raise rates or risk inflation? The answer isn't clear."
  ),
  author = c("Fed Report", "Academic", "Market Analysis", "Policy Brief"),
  stringsAsFactors = FALSE
)

documents
  doc_id
1   doc1
2   doc2
3   doc3
4   doc4
                                                                                  text
1          The U.S. economy grew by 2.5% in Q3 2024. Unemployment remains low at 3.8%.
2    Dr. Smith's research shows that inflation-targeting works. It's state-of-the-art!
3     Don't underestimate the market's volatility. Tech stocks aren't performing well.
4 Central banks face a dilemma: raise rates or risk inflation? The answer isn't clear.
           author
1      Fed Report
2        Academic
3 Market Analysis
4    Policy Brief
# Split on whitespace using base R
simple_tokens <- strsplit(documents$text, "\\s+")
names(simple_tokens) <- documents$doc_id
df <- stack(simple_tokens)
df
                values  ind
1                  The doc1
2                 U.S. doc1
3              economy doc1
4                 grew doc1
5                   by doc1
6                 2.5% doc1
7                   in doc1
8                   Q3 doc1
9                2024. doc1
10        Unemployment doc1
11             remains doc1
12                 low doc1
13                  at doc1
14               3.8%. doc1
15                 Dr. doc2
16             Smith's doc2
17            research doc2
18               shows doc2
19                that doc2
20 inflation-targeting doc2
21              works. doc2
22                It's doc2
23   state-of-the-art! doc2
24               Don't doc3
25       underestimate doc3
26                 the doc3
27            market's doc3
28         volatility. doc3
29                Tech doc3
30              stocks doc3
31              aren't doc3
32          performing doc3
33               well. doc3
34             Central doc4
35               banks doc4
36                face doc4
37                   a doc4
38            dilemma: doc4
39               raise doc4
40               rates doc4
41                  or doc4
42                risk doc4
43          inflation? doc4
44                 The doc4
45              answer doc4
46               isn't doc4
47              clear. doc4

Tokenization with unnest_tokens()

  • The tidytext package has a function, unnest_tokens(), that tokenizes a text automatically while retaining the metadata.
  • This function has three main arguments:
    • tbl: the raw text data.frame.
    • output: the variable name that contains the tokens in the output data.frame.
    • input: the variable name that contains the untokenized text in the input data.frame.
library(tidytext)
tokenized_df <- unnest_tokens(tbl=documents, output=word, input=text)
head(tokenized_df, 5)
  doc_id     author    word
1   doc1 Fed Report     the
2   doc1 Fed Report     u.s
3   doc1 Fed Report economy
4   doc1 Fed Report    grew
5   doc1 Fed Report      by

Example: DTM in R

Example: DTM in R

library(tidytext)
library(dplyr)

# 0. Setup: Rename columns for clarity (stack() gives 'values' and 'ind')
colnames(df) <- c("word", "doc_id")

# 1. Calculate Counts, TF, IDF, and TF-IDF
# We first count how often a token appears per document
# Per doc_id, count each token in the data.frame
token_count_per_doc <- count(df, doc_id, word, sort = TRUE)

# 2. Construct the DTM using the cast_dtm function
dtm <- cast_dtm(token_count_per_doc, document = doc_id, term = word, value = n)

# Inspect the DTM
print(dtm)
<<DocumentTermMatrix (documents: 4, terms: 46)>>
Non-/sparse entries: 47/137
Sparsity           : 74%
Maximal term length: 19
Weighting          : term frequency (tf)
as.matrix(dtm)[1:4, 1:5] # View a small slice
      Terms
Docs   2.5% 2024. 3.8%. Q3 The
  doc1    1     1     1  1   1
  doc2    0     0     0  0   0
  doc3    0     0     0  0   0
  doc4    0     0     0  0   1

Problems with Simple Counting

  • Raw word counts have serious limitations.
    • Common words like “the,” “is,” and “a” appear frequently but carry little meaningful information.
    • These are called “stop words.”
  • A document that’s twice as long will have roughly twice the counts, but isn’t necessarily twice as “important” or different.
  • Rare words that appear in only a few documents might actually be more informative than common words. We need a smarter weighting scheme.

Introducing TF-IDF

  • TF-IDF stands for “Term Frequency - Inverse Document Frequency.”
    • It’s a weighting scheme that addresses the problems with raw counts.
    • The basic idea: a word is important if it appears frequently in a document (high term frequency) but rarely across all documents (high inverse document frequency).
    • Words that appear everywhere are downweighted. Words that are distinctive to specific documents are upweighted.

Term Frequency (TF)

  • Term frequency measures how often a word appears in a specific document.
  • The simplest version is just the raw count:
    • \(TF(t,d) = \text{count of term } t \text{ in document } d\).
  • A better version normalizes by document length:
    • \(TF(t,d) = (\text{ count of term } t \text{ in document } d) / (\text{total number of terms in document }d)\).
    • This prevents longer documents from having artificially inflated scores.
    • Some implementations use logarithmic scaling: \(TF(t,d) = \log(1 + \text{count of term } t \text{ in document } d)\).
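The three TF variants can be compared side by side. A small base-R sketch with made-up counts (the counts and document length below are illustrative assumptions):

```r
# Hypothetical word counts for one document of 20 tokens
counts <- c(economy = 3, the = 5, growth = 2)
total  <- 20

raw_tf        <- counts          # raw counts
normalized_tf <- counts / total  # length-normalized
log_tf        <- log(1 + counts) # log-scaled

normalized_tf[["economy"]]  # 3 / 20 = 0.15
```

Normalization makes TF comparable across documents of different lengths; the log variant dampens the effect of very frequent words.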

Inverse Document Frequency (IDF)

  • Inverse document frequency measures how rare or common a word is across all documents.
    • \(IDF(t) = \log(N / DF(t))\) where \(N\) is the total number of documents and \(DF(t)\) is the number of documents containing term \(t\).
    • If a word appears in every document, \(DF(t) = N\), so \(IDF(t) = \log(1) = 0\), giving it zero weight.
    • If a word appears in only one document, \(DF(t) = 1\), so \(IDF(t) = \log(N)\), the maximum weight.
    • The logarithm prevents extremely rare words from dominating.
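A quick numeric check of the boundary cases, assuming a corpus of \(N = 100\) documents:

```r
N <- 100  # total number of documents (assumed for illustration)

idf <- function(df_t, N) log(N / df_t)

idf(100, N)  # word in every document -> log(1) = 0
idf(1,   N)  # word in one document   -> log(100), the maximum
idf(10,  N)  # word in 10 documents   -> log(10), in between
```

Note how the log keeps the rare-word weight at log(100) ≈ 4.6 rather than letting it grow linearly with rarity.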

Combining TF and IDF

  • The TF-IDF score combines both components through multiplication: \(\text{TF-IDF}(t,d) = TF(t,d) \times IDF(t)\).
  • A word scores high if it appears frequently in a specific document but rarely in others.
    • Common words like “the” appear in almost every document, so their IDF approaches zero, making their TF-IDF score near zero.
    • Technical terms or distinctive words get high scores. This transforms our document-term matrix into a TF-IDF weighted matrix.

TF-IDF in R

  • To compute TF-IDF scores from such a tokenized data.frame, we can use the bind_tf_idf() function from the tidytext package:

Example: Computing TF-IDF

library(dplyr)
library(tidytext)

# 1. Calculate Counts, TF, IDF, and TF-IDF
# We first count how often a token appears per document
# Per doc_id, count each token in the data.frame
tf_idf_data <- df |>
  count(doc_id, word, sort = TRUE) |> 
  bind_tf_idf(term = word, document = doc_id, n = n)

# View the calculated metrics
print(head(tf_idf_data))
  doc_id  word n         tf       idf     tf_idf
1   doc1  2.5% 1 0.07142857 1.3862944 0.09902103
2   doc1 2024. 1 0.07142857 1.3862944 0.09902103
3   doc1 3.8%. 1 0.07142857 1.3862944 0.09902103
4   doc1    Q3 1 0.07142857 1.3862944 0.09902103
5   doc1   The 1 0.07142857 0.6931472 0.04951051
6   doc1  U.S. 1 0.07142857 1.3862944 0.09902103
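The first row of this output can be reproduced by hand: doc1 contains 14 whitespace tokens, "2.5%" appears once, and the term occurs in 1 of the 4 documents. bind_tf_idf() uses the length-normalized TF and natural-log IDF defined earlier:

```r
tf  <- 1 / 14      # "2.5%" appears once among doc1's 14 tokens
idf <- log(4 / 1)  # the term occurs in 1 of the 4 documents
tf * idf           # ~0.0990, matching the tf_idf column above
```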

Pre-processing

Why Preprocessing Matters

  • Before we can count words, we need to clean and standardize our text.
  • Raw text contains capitalization, punctuation, different word forms, and irrelevant content.
    • “Economy,” “economy,” and “ECONOMY” are the same word but would be counted separately.
    • “Running,” “runs,” and “ran” are different forms of the same concept.
  • Proper preprocessing reduces vocabulary size and improves analysis quality.

Stemming

  • Stemming is a crude but fast method for reducing words to their root form.
  • It works by chopping off common suffixes using simple rules.
    • “Running” becomes “run,” “jumps” becomes “jump,” “companies” becomes “compani.”
  • The Porter Stemmer is the most famous algorithm, using about 60 rules.
  • Stemming often creates non-words (“compani” isn’t a real word) but that’s okay for analysis.
  • The goal is to group related words, not create linguistically correct forms.
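In R, the Porter stemmer is available through the SnowballC package (assumed installed); its wordStem() function reproduces the examples above:

```r
library(SnowballC)

# Porter-stem the example words from the slide
wordStem(c("running", "jumps", "companies"), language = "english")
# "run" "jump" "compani"
```

Note "compani" is not a real word, which, as stated above, is acceptable because the goal is grouping, not linguistic correctness.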

Math Behind Stemming

  • Stemming can be thought of as a function \(S: V \rightarrow V'\) where \(V\) is the original vocabulary and \(V'\) is the reduced vocabulary of stems.
  • Multiple words map to the same stem: S(“running”) = S(“runs”) = S(“run”) = “run.”
    • This function is many-to-one but not one-to-one.
    • Stemming effectively creates equivalence classes of words.
    • This reduces the dimensionality of our document-term matrix by merging columns that represent related word forms.

Lemmatization

  • Lemmatization is a more sophisticated approach that uses linguistic knowledge.
    • It reduces words to their dictionary form (called a “lemma”) using vocabulary and morphological analysis.
    • “Running” becomes “run,” “better” becomes “good,” “was” becomes “be.”
    • Unlike stemming, lemmatization always produces real words.
    • It requires understanding part-of-speech: “meeting” as a noun stays “meeting,” but “meeting” as a verb becomes “meet.”
    • Lemmatization is slower but more accurate than stemming.
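One R option is the textstem package (an assumption: it must be installed, and it relies on a built-in lemma dictionary rather than full part-of-speech tagging):

```r
library(textstem)

# Dictionary-based lemmatization of the slide's examples
lemmatize_words(c("running", "better", "was"))
```

Unlike the stemmer, the output consists of real dictionary forms ("run", and so on).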

Stemming vs. Lemmatization Trade-offs

  • Stemming is fast, simple, and works across languages with rule adjustments.
    • However, it can be over-aggressive (“university” and “universe” both become “univers”) or under-aggressive (missing some related forms).
  • Lemmatization is linguistically correct and more accurate.
    • However, it requires language-specific dictionaries and part-of-speech tagging, making it slower.
  • For large-scale analysis where speed matters, researchers often prefer stemming; for smaller, high-precision tasks, lemmatization is better.

Stop Words

  • Words such as "the", "it", and "and", i.e. stop words, carry little relevant information.
  • An easy way (in English) to remove them from your corpus is provided by the tidytext package in the dataset stop_words.
  • The function anti_join() removes all rows in the "left" dataset whose words also exist in the "right" dataset.
library(tidytext)
clean_df <- anti_join(df, stop_words, by = "word")
  • Verify whether words have been taken out by comparing nrow(clean_df) with nrow(df).

Sentiment Analysis

Introduction to Sentiment Analysis

  • A frequency list or a tf-idf-based frequency list can summarize a text by way of its most prominent words.
  • A more nuanced and productive way of analyzing text data is called sentiment analysis.
  • In sentiment analysis, text is classified as positive or negative.
  • The basics are really simple: we map a word to a number, e.g.:
    • Positive number: positive sentiment
    • Negative number: negative sentiment
  • Some important obvious drawbacks are that:
    • Not every word is in the lexicon because many words are pretty neutral.
    • The methods do not take into account qualifiers before a word, such as in “no good” or “not true”.

Sentiment Maps

  • The tidytext package contains three automatic sentiment maps:
    • The bing lexicon categorizes words in a binary fashion into positive and negative categories.
    • The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
    • The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
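A quick look at the lexicons. Note that "bing" ships with tidytext, while "afinn" and "nrc" are fetched through the textdata package and require a one-time interactive download:

```r
library(tidytext)

bing <- get_sentiments("bing")
head(bing, 3)          # columns: word, sentiment (positive/negative)
table(bing$sentiment)  # roughly 2,000 positive vs. 4,800 negative words

# get_sentiments("afinn")  # numeric scores from -5 to 5 (needs textdata)
# get_sentiments("nrc")    # ten emotion categories     (needs textdata)
```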

Example TF-IDF Based Sentiment Analysis

1: Tokenize and Pre-Process the Data

Example: TF-IDF Analysis

In the following example, we will scrape the text of a Wikipedia article and treat its paragraphs as separate documents. We will construct a document-term matrix on this basis.

library(rvest)
library(tidytext)
link <- "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup"

# Go to the wikipedia page, right click, look at the structure of the page
# Extract all p's in a <div> with class "mw-parser-output"
wc2022 <- read_html(link) |> 
  html_elements("div.mw-parser-output p") |> 
  html_text()

data <- tibble(section = seq(wc2022), text = wc2022) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words)

print(head(data))
# A tibble: 6 × 2
  section word 
    <int> <chr>
1       2 2022 
2       2 fifa 
3       2 world
4       2 cup  
5       2 22nd 
6       2 fifa 

2: Add Sentiment Scores

  • We use the get_sentiments() function from the tidytext package to retrieve the sentiments and put them together with the tokenized data.
bing <- get_sentiments("bing")
# Merge sentiment with your data with inner_join
sentiment_per_section <- data |> 
  inner_join(bing) |>
  count(section, sentiment) 

counts <- xtabs(n ~ section + sentiment, data = sentiment_per_section)
sect_sent <- data.frame(section = as.integer(rownames(counts)), unclass(counts))
head(sect_sent)
  section negative positive
2       2        0        1
3       3        1        1
4       4        2        4
5       5        0       10
6       6        6        2
7       7        1        1

3: Calculate Sentiment Score

  • Let’s now calculate a sentiment score.
    • For each section, we compute the percentage of positive words out of the total number of words that are associated with a sentiment.
    • Which sections are most “positive”, and which are most “negative”?
# Calculate a sentiment score: the share of positive words
sect_sent$score <- sect_sent$positive / (sect_sent$positive + sect_sent$negative)
head(arrange(sect_sent, score), 4)
    section negative positive score
42       42        1        0     0
98       98        6        0     0
99       99        1        0     0
105     105        2        0     0
tail(arrange(sect_sent, score), 4)
    section negative positive score
86       86        0        2     1
87       87        0        3     1
93       93        0        1     1
115     115        0        1     1

4: TF-IDF Weighted Sentiment Score

  • This score implicitly weighs all words equally.
    • By using a tf-idf-weighted version of this analysis, we can accord greater weights to words (and their associated sentiments), the more they are unique compared to other sections, and the more frequent they are within a section.

Example: TF-IDF-weighted Sentiment Analysis

# 0. Look at the tokenized data:
print(head(data))
# A tibble: 6 × 2
  section word 
    <int> <chr>
1       2 2022 
2       2 fifa 
3       2 world
4       2 cup  
5       2 22nd 
6       2 fifa 
# 1. Prepare sentiment scores (+1/-1) vector
bing <- tidytext::get_sentiments("bing")
bing$sentiment <- ifelse(bing$sentiment == "positive", 1, -1)

# 2. Create the document-term matrix
dtm <- data |>
  count(section, word, sort = FALSE) |> 
  bind_tf_idf(term = word, document = section, n = n)

# 3. Match the dtm with the sentiment scores
dtm_with_sentiment <- inner_join(dtm, bing)

# 4. Compute the weighted sentiment and aggregate by section
dtm_with_sentiment$weighted_sentiment <- dtm_with_sentiment$tf_idf * dtm_with_sentiment$sentiment
final_scores <- aggregate(weighted_sentiment ~ section, data = dtm_with_sentiment, mean)

head(final_scores)
  section weighted_sentiment
1       2         0.07366890
2       3         0.00000000
3       4         0.00725028
4       5         0.07608900
5       6        -0.04013362
6       7        -0.01307872

Embeddings

Beyond Word Counts - The Semantic Problem

  • Document-term matrices have a fundamental limitation: they ignore word meaning and relationships.
    • “King” and “queen” are treated as completely unrelated despite being semantically connected.
    • “Strong economy” and “robust economy” mean similar things but have no overlapping roots.
    • The DTM treats each word as an independent dimension with no relationship to other words.
    • We need methods that capture semantic relationships between words.

Word Embeddings - The Big Idea

  • Word embeddings represent words as vectors of numbers in a continuous space.
  • Instead of representing “king” as a single column in a sparse matrix, we represent it as a dense vector like [0.2, -0.4, 0.7, ...].
  • Words with similar meanings have similar vectors.
  • The vector space itself has geometric properties that capture semantic relationships. This is one of the most important innovations in modern text analysis.

Word2Vec - Learning Word Relationships

  • Word2Vec is an algorithm that learns word embeddings from large text corpora.
  • It’s based on the distributional hypothesis: “words that appear in similar contexts have similar meanings.”
    • The algorithm creates vectors such that words appearing in similar contexts have similar vector representations.
    • Word2Vec comes in two flavors: Continuous Bag of Words (CBOW) and Skip-gram.
  • Both use neural networks to learn these representations, but you don’t need to understand neural networks to use the results.

The Skip-gram Model

  • Skip-gram tries to predict context words from a target word.
    • Given the word “economy,” can we predict words that appear nearby like “strong,” “growth,” “inflation”?
  • The model learns word vectors by adjusting them to maximize prediction accuracy.
  • Words that are useful for predicting similar contexts end up with similar vectors.
  • Mathematically, we’re learning a function that maps each word to a vector in \(\mathbb{R}^d\) where \(d\) is typically 100-300 dimensions.
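The word2vec package on CRAN provides an implementation of both flavors. A minimal sketch, with the caveat that the tiny corpus below is purely illustrative (real embeddings need corpora of millions of tokens):

```r
library(word2vec)

# Toy corpus; real training data should be vastly larger
txt <- c("the economy is strong and growth is high",
         "the economy is weak and growth is low",
         "inflation hurts the economy and slows growth")

model <- word2vec(x = txt, type = "skip-gram", dim = 10,
                  window = 3, iter = 20, min_count = 1)

emb <- as.matrix(model)  # one row per word, dim columns
predict(model, "economy", type = "nearest", top_n = 3)
```

On a corpus this small the "nearest" words are essentially noise; the point is the workflow, not the output.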

The CBOW Model

  • Continuous Bag of Words (CBOW) works in reverse: predict the target word from context words.
  • Given surrounding words like “the ___ is strong,” can we predict the missing word might be “economy”?
    • CBOW is faster to train than skip-gram.
    • Skip-gram works better for rare words.
    • Both produce similar results on common words. The choice depends on your corpus size and computational resources.

Word Arithmetic

  • Word2Vec produces vectors with remarkable properties.
    • Vector(“king”) - Vector(“man”) + Vector(“woman”) \(\approx\) Vector(“queen”).
    • Vector(“Paris”) - Vector(“France”) + Vector(“Italy”) \(\approx\) Vector(“Rome”).
  • These algebraic relationships emerge naturally from the training process.
  • This means we can do “reasoning” with words by manipulating their vectors.
  • For economics, Vector(“recession”) - Vector(“negative”) + Vector(“positive”) might approximate Vector(“growth”).
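With embeddings loaded as a matrix, the analogy reduces to vector arithmetic plus cosine similarity. A sketch with tiny made-up 3-dimensional vectors, chosen by hand so the analogy works (real learned embeddings have 100-300 dimensions):

```r
# Hand-picked toy vectors; real ones are learned from a corpus
vecs <- rbind(king  = c(0.9, 0.8, 0.1),
              man   = c(0.1, 0.9, 0.0),
              woman = c(0.1, 0.1, 0.9),
              queen = c(0.9, 0.0, 1.0))

# Cosine similarity: dot product over the product of vector norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

target <- vecs["king", ] - vecs["man", ] + vecs["woman", ]

# Which word's vector is closest to king - man + woman?
sims <- apply(vecs, 1, cosine, b = target)
names(which.max(sims))  # "queen", by construction of this toy example
```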

Other Embeddings Models

Global Vectors

  • GloVe (Global Vectors for Word Representation) is an alternative to Word2Vec.
  • Instead of predicting context words, GloVe directly analyzes word co-occurrence statistics.
  • It constructs a matrix \(X\) where \(X(i,j)\) represents how often word \(i\) appears in the context of word \(j\) across the entire corpus.
  • The algorithm factorizes this co-occurrence matrix to produce word vectors.
  • GloVe argues this global statistical approach is more efficient than Word2Vec’s local context windows.

Mathematics of GloVe

  • GloVe learns vectors by minimizing a weighted least squares objective.
  • For each word pair \((i,j)\), we want: \(w_i^T w_j + b_i + b_j \approx \log(x_{ij})\).
    • Here \(w_i\) and \(w_j\) are the word vectors we’re learning, \(b_i\) and \(b_j\) are bias terms, and \(x_{ij}\) is the co-occurrence count. The model weighs frequent co-occurrences more heavily than rare ones.
    • By minimizing the difference between the dot product of word vectors and the logarithm of co-occurrence counts, we capture word relationships.
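Putting the pieces together, the full objective GloVe minimizes is the weighted sum of these squared differences over all word pairs, with a weighting function \(f\) that caps the influence of very frequent pairs:

```latex
J = \sum_{i,j=1}^{V} f(x_{ij}) \left( w_i^T w_j + b_i + b_j - \log x_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```

The original paper uses \(x_{\max} = 100\) and \(\alpha = 3/4\); pairs with \(x_{ij} = 0\) are skipped, which keeps the computation tractable on sparse co-occurrence matrices.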

Word2Vec vs. GloVe

  • Word2Vec uses local context windows and makes predictions word-by-word during training.
  • GloVe uses global corpus statistics and trains on aggregated co-occurrence information.
    • Word2Vec is typically faster to train on very large corpora.
    • GloVe often performs better on word analogy tasks and similarity judgments.
    • In practice, both produce high-quality embeddings, and the choice often depends on implementation convenience.
    • Modern researchers often use pre-trained versions of both.

Topic Modeling

The Topic Modeling Problem

  • Imagine you have 10,000 news articles but want to understand what topics they discuss.
  • Reading them all is impossible.
  • Word frequencies tell you which words are common but not how they cluster into themes.
  • Topic modeling automatically discovers abstract “topics” that run through a collection of documents.
  • Each topic is a probability distribution over words. Each document is a mixture of topics.

What is Latent Dirichlet Allocation?

  • Latent Dirichlet Allocation (LDA) is the most popular topic modeling algorithm.
    • “Latent” means the topics are hidden variables we infer from observed words.
    • “Dirichlet” refers to the statistical distribution used for probabilities.
    • “Allocation” means assigning words to topics.
    • LDA assumes each document is a mixture of topics, and each topic is a mixture of words.
    • The algorithm discovers both the topics and how they combine in each document.

Interpreting LDA Topics

  • Once LDA runs, each topic is a distribution over words.
  • We typically examine the highest-probability words to interpret what each topic represents.
    • Topic 1 might have high probability for: market, stocks, trading, investors, equity (a “financial markets” topic).
    • Topic 2 might emphasize: president, election, vote, campaign, poll (a “politics” topic).
  • Interpretation requires human judgment—LDA discovers statistical patterns, but we decide what they mean.
  • Each document gets a distribution over topics showing its mixture.

Choosing the Number of Topics

  • LDA requires specifying \(K\), the number of topics, in advance.
    • Too few topics and you’ll miss important themes; too many and topics become redundant or incoherent.
  • Researchers use several approaches: perplexity (a statistical measure of model fit on held-out data), coherence scores (measuring how semantically related top words are), and human evaluation of interpretability.
  • There’s no single “correct” number—it depends on your corpus and research question.
  • Common practice is to try several values of \(K\) and compare results.
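With topicmodels, the perplexity comparison can be sketched as follows. The tiny synthetic corpus below is an illustrative assumption; in practice you would fit on a realistic corpus and evaluate perplexity on held-out documents:

```r
library(topicmodels)
library(tidytext)
library(dplyr)

# Tiny synthetic corpus with two themes: markets and politics
toy <- data.frame(
  doc  = rep(1:4, each = 4),
  word = c("stocks", "market", "trading", "equity",
           "market", "stocks", "investors", "trading",
           "election", "vote", "campaign", "president",
           "vote", "poll", "president", "campaign")
)

dtm <- toy |> count(doc, word) |> cast_dtm(doc, word, n)

# Fit LDA for several values of K and compare perplexity (lower is better)
for (k in 2:3) {
  fit <- LDA(dtm, k = k, control = list(seed = 1234))
  cat("k =", k, "perplexity =", perplexity(fit), "\n")
}
```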

LDA in R

  • In R, we can do Latent Dirichlet Allocation using the package topicmodels.
    • The input in LDA is a document-term data.frame.
  • Remember:
    • Every document is a mixture of topics. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
    • Every topic is a mixture of words.
    • We could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.”
    • The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”.

Output

  • Mathematically, LDA outputs two objects, \(\gamma\) and \(\beta\):
    • \(\gamma\) assigns each of the \(D\) documents a probability distribution over the \(K\) topics.
    • \(\beta\) assigns each of the \(V\) vocabulary words a probability within each of the \(K\) topics.
  • The input is a document-term matrix; we can use the cast_dtm() function to convert a token-frequency data.frame into one.
  • The syntax is: cast_dtm(frequency_df, section_var, word_var, count_var)

Example LDA in R

LDA in R

  • Conducting LDA in R is actually quite simple.
    1. Convert the tokenized data to a document term matrix using the function cast_dtm() from the tidytext package.
    2. Use the function LDA from the topicmodels package to run LDA

Example: LDA in R

# 0. Load the topicmodels library for LDA
library(topicmodels)
# 1. Inspect the data
data |> head(5)
# A tibble: 5 × 2
  section word 
    <int> <chr>
1       2 2022 
2       2 fifa 
3       2 world
4       2 cup  
5       2 22nd 
# 2. Convert the data into DTM format
dtm <- data |> 
  count(section, word) |> 
  cast_dtm(section, word, n)

# 3. Conduct LDA with k=2
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

Inspecting Results (\(\beta\))

  • Each word gets a \(\beta\)-coefficient for each topic, interpretable as the probability of that word being generated by the topic.
result_beta <- tidy(lda, matrix = "beta") 
head(result_beta, 5)
# A tibble: 5 × 3
  topic term      beta
  <int> <chr>    <dbl>
1     1 18    0.000537
2     2 18    0.000997
3     1 20    0.000904
4     2 20    0.00184 
5     1 2002  0.000381
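A common next step is to pull out the highest-\(\beta\) words per topic. The sketch below runs on a small hand-made tibble in the same shape as tidy()'s output (the beta values are invented for illustration; your own result_beta would drop in directly):

```r
library(dplyr)

# Hand-made example shaped like tidy(lda, matrix = "beta")
result_beta <- tibble::tribble(
  ~topic, ~term,       ~beta,
  1,      "market",    0.040,
  1,      "stocks",    0.035,
  1,      "president", 0.002,
  2,      "president", 0.050,
  2,      "election",  0.030,
  2,      "market",    0.001
)

# Top 2 words per topic, sorted within topic by beta
top_terms <- result_beta |>
  group_by(topic) |>
  slice_max(beta, n = 2) |>
  arrange(topic, -beta)

top_terms  # "market"/"stocks" label topic 1; "president"/"election" topic 2
```

These top-word lists are what you would inspect (and use your judgment on) to give each topic a human-readable label.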

Inspecting Results (\(\gamma\))

  • Each document gets a \(\gamma\)-coefficient for each topic, interpretable as the estimated proportion of the document's words generated by that topic.
result_gamma <- tidy(lda, matrix = "gamma") 
head(result_gamma, 5)
# A tibble: 5 × 3
  document topic   gamma
  <chr>    <int>   <dbl>
1 2            1 0.00145
2 3            1 0.0689 
3 4            1 0.999  
4 5            1 0.999  
5 6            1 0.00192

Context of Text Mining

Applications in Economics

  • Text mining has numerous economic applications.
    • Measuring economic policy uncertainty by analyzing newspaper articles.
    • Predicting stock returns from earnings call transcripts and analyst reports.
    • Analyzing central bank communication to understand policy intentions.
    • Tracking firm innovation through patent texts.
    • Measuring consumer sentiment from product reviews.
    • Understanding labor market dynamics through job postings.
  • All these applications start with the techniques we’ve discussed: converting text to numbers, weighting terms appropriately, and discovering patterns.

From Pre-LLM to Modern Methods

  • The techniques we’ve covered—DTM, TF-IDF, Word2Vec, GloVe, and LDA—dominated text analysis until around 2018.
  • They remain valuable because they’re interpretable, computationally efficient, and work well on smaller datasets.
  • Modern large language models (LLMs) like BERT and GPT build on these foundations but use much more complex neural architectures.
    • However, understanding these classical methods is essential for understanding what LLMs do and when simpler methods might be more appropriate.
    • Many economic applications still use these pre-LLM techniques successfully.

Practical Considerations

  • Text analysis requires careful preprocessing decisions that affect results.
    • Should you remove stop words or keep them for certain tasks? Stemming or lemmatization or neither? How do you handle numbers, punctuation, and special characters? What minimum word frequency threshold should you use?
    • These decisions depend on your research question and data.
    • There’s no universal “best practice”—you should experiment and validate that your preprocessing choices improve performance on your specific task.
    • Document your decisions so your analysis is reproducible.

Limitations and Cautions

  • Text mining has important limitations you should understand.
    • Language is subtle and context-dependent—algorithms miss sarcasm, irony, and nuance.
    • Bag-of-words approaches ignore word order and grammar.
    • Word embeddings can capture and amplify biases present in training data.
    • Topic models find statistical patterns but don’t guarantee those patterns are meaningful.
    • Results depend heavily on preprocessing choices.
    • Always validate your text analysis against human judgment and domain expertise. Text mining is a powerful tool but not a replacement for careful thinking.

Building Up Complexity

  • The techniques we’ve covered build on each other logically.
  • Document-term matrices provide the basic representation. TF-IDF weights terms by importance. Stemming and lemmatization reduce vocabulary while preserving meaning.
  • Word embeddings capture semantic relationships that DTMs miss. Topic models discover latent structure in document collections.
  • Each method addresses limitations of simpler approaches while introducing its own assumptions and trade-offs.
  • Understanding this progression helps you choose the right tool for your analysis.

Recapitulation

Recapitulation

  • Text can be represented as numbers through document-term matrices.
  • TF-IDF weighting identifies distinctive and informative terms.
  • Preprocessing through stemming and lemmatization standardizes text and reduces dimensionality.
  • Word embeddings like Word2Vec and GloVe capture semantic relationships in continuous vector spaces.
  • Topic models like LDA discover latent themes in document collections.
    • These pre-LLM methods remain valuable for their interpretability, efficiency, and effectiveness on many tasks.
    • The foundation you’ve learned today prepares you to understand modern advances in natural language processing and to apply text analysis in economic research.