Tutorial 5

Instructions

  1. Create a new Quarto document (tutorial5.qmd) in a folder designated for this course.¹
  2. For each question, include:
    • The question number and text
    • Your R code in a code chunk
    • Brief explanation of your approach (for conceptual questions)
  3. Make sure your YAML header (the first lines of your .qmd document) looks approximately as follows:
---
title: Tutorial 5
format: html
author: Your Name And Student No.
---
  4. Render your document to HTML to verify that all code executes correctly (click “Preview” in Positron).

Part 1: Teacher Demonstration

A. Building a Document-Term Matrix from Scratch

library(tidytext)
library(dplyr)
library(tm)

# Create three sample economic documents
documents <- data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c(
    "The economy is strong and growing",
    "The economy is weak and contracting",
    "The market is volatile and unpredictable"
  )
)

# Step 1: Tokenize using tidytext
tokenized <- documents |>
  unnest_tokens(output = word, input = text)

# Step 2: Count word occurrences per document
word_counts <- tokenized |>
  count(doc_id, word, sort = TRUE)

# Step 3: Convert to document-term matrix
dtm <- cast_dtm(word_counts, document = doc_id, term = word, value = n)

# Step 4: Inspect the matrix
print(dtm)
<<DocumentTermMatrix (documents: 3, terms: 11)>>
Non-/sparse entries: 18/15
Sparsity           : 45%
Maximal term length: 13
Weighting          : term frequency (tf)

# Dense matrix representation (3 docs × first 6 terms)
print(as.matrix(dtm)[1:3, 1:6])
      Terms
Docs   and economy growing is strong the
  doc1   1       1       1  1      1   1
  doc2   1       1       0  1      0   1
  doc3   1       0       0  1      0   1

Key points:

  • Each row = one document
  • Each column = one unique term in the corpus
  • Values represent raw word counts
  • Matrix is sparse (mostly zeros) because documents use only a subset of vocabulary
  • Notice how common words like “the” and “is” appear in all documents
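
The sparsity figure reported by print(dtm) can be recomputed directly from the dense matrix. A minimal sketch, assuming the dtm object from the chunk above:

# Recompute sparsity: share of zero entries in the dense matrix
m <- as.matrix(dtm)
sum(m == 0) / length(m)   # 15 of 33 entries are zero, i.e. about 45%

With only 3 documents the matrix is small, but on a real corpus this fraction is typically well above 90%.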

B. TF-IDF Weighting and Preprocessing

library(ggplot2)
library(SnowballC)

# Start with tokenized data from Part A
# Step 1: Remove stop words
data_clean <- tokenized |>
  anti_join(stop_words, by = "word")

# Step 2: Apply stemming
data_stemmed <- data_clean |>
  mutate(word = wordStem(word, language = "english"))

# Step 3: Calculate TF-IDF
tfidf_data <- data_stemmed |>
  count(doc_id, word) |>
  bind_tf_idf(term = word, document = doc_id, n = n)

# Step 4: Compare raw counts vs TF-IDF weights
comparison <- tfidf_data |>
  arrange(desc(tf_idf)) |>
  head(10)

# Top 10 terms by TF-IDF weight
print(comparison)
  doc_id      word n        tf       idf    tf_idf
1   doc1      grow 1 0.3333333 1.0986123 0.3662041
2   doc1    strong 1 0.3333333 1.0986123 0.3662041
3   doc2  contract 1 0.3333333 1.0986123 0.3662041
4   doc2      weak 1 0.3333333 1.0986123 0.3662041
5   doc3    market 1 0.3333333 1.0986123 0.3662041
6   doc3 unpredict 1 0.3333333 1.0986123 0.3662041
7   doc3   volatil 1 0.3333333 1.0986123 0.3662041
8   doc1   economi 1 0.3333333 0.4054651 0.1351550
9   doc2   economi 1 0.3333333 0.4054651 0.1351550
# Step 5: Visualize the effect of TF-IDF
tfidf_data |>
  filter(word %in% c("economi", "strong", "weak", "market", "volatil")) |>
  ggplot(aes(x = word, y = tf_idf, fill = doc_id)) +
  geom_col(position = "dodge") +
  labs(title = "TF-IDF Weights for Key Economic Terms",
       y = "TF-IDF Score", x = "Term (stemmed)") +
  theme_minimal()

Key points:

  • Stop words removal eliminates non-informative terms (“the”, “is”)
  • Stemming reduces vocabulary size by grouping word variants (“economy”/“economies” → “economi”)
  • TF-IDF downweights common terms (“economi” appears in all docs → lower weight)
  • TF-IDF upweights distinctive terms (“strong” only in doc1 → higher weight)
  • The visualization shows how TF-IDF highlights document-specific terms
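
The values in the TF-IDF table can be reproduced by hand, which is a useful sanity check. A minimal sketch for “strong” in doc1 (after cleaning and stemming, doc1 contains three tokens: economi, strong, grow); note that bind_tf_idf() uses the natural logarithm:

# TF-IDF for "strong" in doc1, computed manually
tf  <- 1 / 3        # "strong" appears once among doc1's 3 cleaned tokens
idf <- log(3 / 1)   # natural log of (n documents / docs containing the term)
tf * idf            # ~0.366, matching the table above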

C. Sentiment Analysis with Multiple Lexicons

library(tidyr)
# Re-create the cleaned (unstemmed) tokens from Part A:
# unstemmed words match sentiment lexicons better
sentiment_data <- tokenized |>
  anti_join(stop_words, by = "word")

# Step 1: Apply Bing lexicon (binary positive/negative)
bing_sentiment <- get_sentiments("bing")
bing_results <- sentiment_data |>
  inner_join(bing_sentiment, by = "word") |>
  count(doc_id, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(bing_score = positive - negative)

# Step 2: Apply AFINN lexicon (graded -5 to +5 scale)
# Note: get_sentiments("afinn") downloads the lexicon via the textdata package
afinn_sentiment <- get_sentiments("afinn")
afinn_results <- sentiment_data |>
  inner_join(afinn_sentiment, by = "word") |>
  group_by(doc_id) |>
  summarise(afinn_score = sum(value))

# Step 3: Compare results across lexicons
comparison <- bing_results |>
  left_join(afinn_results, by = "doc_id") |>
  select(doc_id, bing_score, afinn_score)

# Sentiment scores across lexicons
print(comparison)
# A tibble: 3 × 3
  doc_id bing_score afinn_score
  <chr>       <int>       <dbl>
1 doc1            1           3
2 doc2           -1          -2
3 doc3           -2          NA
# Step 4: TF-IDF weighted sentiment (more sophisticated approach)
tfidf_weighted <- sentiment_data |>
  count(doc_id, word) |>
  bind_tf_idf(term = word, document = doc_id, n = n) |>
  inner_join(bing_sentiment, by = "word") |>
  mutate(sentiment_value = ifelse(sentiment == "positive", 1, -1),
         weighted_sent = tf_idf * sentiment_value) |>
  group_by(doc_id) |>
  summarise(tfidf_weighted_score = sum(weighted_sent))

Key points:

  • Different lexicons produce different results (binary vs. graded scales)
  • Simple sentiment counts ignore term importance (all words weighted equally)
  • TF-IDF weighting gives more influence to distinctive sentiment words
  • Lexicons have coverage limitations (many words won’t match any sentiment dictionary)
  • Context matters: “not strong” would be misclassified as positive by these methods
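
One partial remedy for the negation problem is to tokenize into bigrams and flag sentiment words preceded by a negator. A minimal sketch, assuming the documents data frame from Part A (the sample texts contain no negations, so the result here is empty; the point is the pattern):

# Flag sentiment words preceded by a negation term
negations <- c("not", "no", "never", "without")
documents |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(word1 %in% negations) |>
  inner_join(get_sentiments("bing"), by = c("word2" = "word"))

Any rows returned are sentiment words whose polarity should be flipped (or at least flagged) before aggregating scores.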

D. Topic Modeling with LDA

Discover latent topics in a collection of documents with Latent Dirichlet Allocation (LDA), here applied to a small set of sample Economics research-paper abstracts.

library(topicmodels)

# Sample abstracts
abstracts <- data.frame(
  paper_id = 1:6,
  text = c(
    "monetary policy interest rates central bank inflation economic growth",
    "labor market unemployment wages employment human capital",
    "fiscal policy government spending taxation budget deficit",
    "interest rates monetary policy inflation targeting central banking",
    "unemployment labor force participation wage inequality",
    "government debt fiscal stimulus public spending economic recovery"
  ),
  stringsAsFactors = FALSE
)

# Tokenize and create DTM
abstract_dtm <- abstracts |>
  unnest_tokens(word, text) |>
  count(paper_id, word) |>
  cast_dtm(document = paper_id, term = word, value = n)

# Run LDA with k=2 topics
lda_model <- LDA(abstract_dtm, k = 2, control = list(seed = 1234))

# Extract beta (word-topic probabilities)
topics_beta <- tidy(lda_model, matrix = "beta")

# Top terms per topic
top_terms <- topics_beta |>
  group_by(topic) |>
  slice_max(beta, n = 5) |>
  arrange(topic, desc(beta))

# Top terms per topic
print(top_terms)
# A tibble: 10 × 3
# Groups:   topic [2]
   topic term           beta
   <int> <chr>         <dbl>
 1     1 economic     0.0691
 2     1 rates        0.0615
 3     1 government   0.0595
 4     1 monetary     0.0587
 5     1 policy       0.0570
 6     2 policy       0.0763
 7     2 unemployment 0.0632
 8     2 labor        0.0565
 9     2 interest     0.0545
10     2 spending     0.0526
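
The beta table is easier to read as a plot. A minimal sketch using ggplot2 (loaded in Part B) and tidytext's reorder_within() helper:

# Visualize top terms per topic
top_terms |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(x = beta, y = term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "beta", y = NULL, title = "Top terms per LDA topic")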
# Extract gamma (document-topic probabilities)
topics_gamma <- tidy(lda_model, matrix = "gamma")

# Document-topic distributions
print(topics_gamma)
# A tibble: 12 × 3
   document topic gamma
   <chr>    <int> <dbl>
 1 1            1 0.506
 2 2            1 0.495
 3 3            1 0.500
 4 4            1 0.496
 5 5            1 0.493
 6 6            1 0.511
 7 1            2 0.494
 8 2            2 0.505
 9 3            2 0.500
10 4            2 0.504
11 5            2 0.507
12 6            2 0.489

Key points:

  • LDA discovers latent topics as probability distributions over words
  • Each document contains a mixture of topics (not just one topic)
  • Topic interpretation requires human judgment—algorithms find patterns but don’t “understand” meaning
  • Choosing \(k\) (number of topics) requires balancing coherence and coverage
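
A common way to probe the choice of \(k\) is to fit models for several candidate values and compare their perplexity (lower is better, though perplexity usually keeps falling as \(k\) grows, so look for an elbow). A minimal sketch using topicmodels::perplexity():

# Compare perplexity across candidate numbers of topics
ks <- 2:5
sapply(ks, function(k) {
  m <- LDA(abstract_dtm, k = k, control = list(seed = 1234))
  perplexity(m, abstract_dtm)
})

With only six short abstracts these numbers are not meaningful; on a real corpus you would also evaluate on held-out documents rather than reusing the training DTM.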

Discussion Questions:

  1. What are the two topics that emerged? Can you label them?
  2. Which documents belong primarily to which topic?
  3. What would happen if we used k=3 instead of k=2?

Part 2: Student Practice Questions

Conceptual Understanding

  1. Explain why document-term matrices are typically sparse. What does this sparsity imply about language use in documents?

  2. Describe the fundamental difference between stemming and lemmatization. When might you prefer one approach over the other for economic text analysis?

  3. Why does TF-IDF assign near-zero weight to words that appear in every document? Provide an economic example of such a word and explain why downweighting it improves analysis.

  4. What is the “distributional hypothesis” that underlies word embedding models like Word2Vec? How does this differ from the assumptions behind document-term matrices?

  5. In Latent Dirichlet Allocation (LDA), what do the \(\beta\) (beta) and \(\gamma\) (gamma) parameters represent? Explain their economic interpretation.

Applied Calculations

  1. Given these three documents:

    • Doc1: “inflation inflation rising prices”
    • Doc2: “inflation stable low”
    • Doc3: “unemployment rising jobs”

    Calculate the raw term frequency (TF) for “inflation” in Doc1 and Doc2.

  2. Using the same documents from Question 6, calculate the inverse document frequency (IDF) for “inflation” and “unemployment” assuming log base 10. Show your work.

  3. Compute the TF-IDF score for “rising” in Doc1 from Question 6. First calculate TF (using raw count), then IDF, then multiply them.

  4. A document contains 200 total words. The word “recession” appears 8 times in this document but only appears in 5 out of 10,000 documents in the corpus. Calculate its TF-IDF score using:

    • TF = raw count
    • IDF = \(\log_{10}(N/DF)\)

    Show all steps.
  5. After stemming with the Porter algorithm, which of these word groups would be merged into the same stem?

    • Group A: “economy”, “economies”, “economic”
    • Group B: “policy”, “policies”, “politician”
    • Group C: “growth”, “growing”, “grow”

    Explain your reasoning for each group.

R Code Interpretation

  1. What does this R code accomplish? Explain each line’s purpose:

    data |> 
      unnest_tokens(word, text) |>
      anti_join(stop_words) |>
      count(doc_id, word) |>
      cast_dtm(doc_id, word, n)
  2. The following code calculates sentiment scores. What is the critical limitation of this approach for economic texts?

    data |> 
      inner_join(get_sentiments("bing")) |>
      count(doc_id, sentiment)
  3. What problem does this preprocessing step solve?

    data$word = tolower(data$word)
  4. In the TF-IDF calculation bind_tf_idf(term = word, document = doc_id, n = n), what does the n variable represent in the input data frame?

  5. When using LDA(dtm, k = 5), what does changing the value of k control? What are the risks of setting k too high or too low?

Critical Thinking & Application

  1. You’re analyzing Federal Reserve meeting minutes to measure policy uncertainty. Why might simple bag-of-words approaches (DTM) fail to capture important nuances? Describe one specific limitation and propose how embeddings might address it.

  2. A researcher removes all numeric tokens (e.g., “2.5”, “3.8%”) during preprocessing before sentiment analysis of earnings reports. What valuable economic information might be lost? Justify your answer.

  3. When analyzing central bank communications, why might stop word removal be problematic for certain research questions? Provide a concrete example where keeping stop words could be important.

  4. Word embeddings trained on general news corpora might misrepresent economic terminology. Give one example of an economic term that has a specialized meaning different from everyday usage, and explain how this could distort analysis.

  5. You’re comparing sentiment in earnings calls before and during a recession. Why might raw sentiment scores be misleading without normalization? Propose one method to make scores comparable across time periods.

Footnotes

  1. File > New File > Quarto Document.↩︎