Create a new Quarto document (tutorial5.qmd) in a folder designated for this course.
For each question, include:
The question number and text
Your R code in a code chunk
Brief explanation of your approach (for conceptual questions)
Make sure the YAML header (the first lines of your .qmd document) looks approximately as follows:
---
title: Tutorial 5
format: html
author: Your Name And Student No.
---
Render your document to HTML to verify that all code executes correctly (click "Preview" in Positron).
Part 1: Teacher Demonstration
A. Building a Document-Term Matrix from Scratch
library(tidytext)
library(dplyr)
library(tm)

# Create three sample economic documents
documents <- data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c(
    "The economy is strong and growing",
    "The economy is weak and contracting",
    "The market is volatile and unpredictable"
  )
)

# Step 1: Tokenize using tidytext
tokenized <- documents |>
  unnest_tokens(output = word, input = text)

# Step 2: Count word occurrences per document
word_counts <- tokenized |>
  count(doc_id, word, sort = TRUE)

# Step 3: Convert to document-term matrix
dtm <- cast_dtm(word_counts, document = doc_id, term = word, value = n)

# Step 4: Inspect the matrix
print("Document-Term Matrix (sparse format):")
[1] "Document-Term Matrix (sparse format):"
print(dtm)
<<DocumentTermMatrix (documents: 3, terms: 11)>>
Non-/sparse entries: 18/15
Sparsity : 45%
Maximal term length: 13
Weighting : term frequency (tf)
cat("\nDense matrix representation (first 3 docs × first 6 terms):\n")
print(as.matrix(dtm)[, 1:6])
LDA discovers latent topics as probability distributions over words
Each document contains a mixture of topics (not just one topic)
Topic interpretation requires human judgment—algorithms find patterns but don’t “understand” meaning
Choosing \(k\) (number of topics) requires balancing coherence and coverage
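The points above can be sketched in code. The following is a minimal example using the topicmodels package on the three demonstration documents from Part 1.A, with k = 2 as in the discussion; the package choice and the random seed are assumptions, not part of the original demonstration, and your topic numbering may differ from run to run.

```r
library(topicmodels)
library(tidytext)
library(dplyr)

# Rebuild the document-term matrix from the Part 1.A demo documents
documents <- data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c("The economy is strong and growing",
           "The economy is weak and contracting",
           "The market is volatile and unpredictable")
)
dtm <- documents |>
  unnest_tokens(word, text) |>
  count(doc_id, word) |>
  cast_dtm(doc_id, word, n)

# Fit LDA with k = 2 topics; the seed is arbitrary but makes results reproducible
lda_fit <- LDA(dtm, k = 2, control = list(seed = 123))

# Beta: per-topic word probabilities (top 3 words per topic)
tidy(lda_fit, matrix = "beta") |>
  group_by(topic) |>
  slice_max(beta, n = 3)

# Gamma: per-document topic proportions (the "mixture of topics" above)
tidy(lda_fit, matrix = "gamma")
```

Note that labeling the resulting topics is still up to you: the model only returns word and topic distributions.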
Discussion Questions:
What are the two topics that emerged? Can you label them?
Which documents belong primarily to which topic?
What would happen if we used k=3 instead of k=2?
Part 2: Student Practice Questions
Conceptual Understanding
Explain why document-term matrices are typically sparse. What does this sparsity imply about language use in documents?
Describe the fundamental difference between stemming and lemmatization. When might you prefer one approach over the other for economic text analysis?
Why does TF-IDF assign near-zero weight to words that appear in every document? Provide an economic example of such a word and explain why downweighting it improves analysis.
What is the “distributional hypothesis” that underlies word embedding models like Word2Vec? How does this differ from the assumptions behind document-term matrices?
In Latent Dirichlet Allocation (LDA), what do the \(\beta\) (beta) and \(\gamma\) (gamma) parameters represent? Explain their economic interpretation.
Applied Calculations
Given these three documents:
Doc1: “inflation inflation rising prices”
Doc2: “inflation stable low”
Doc3: “unemployment rising jobs”
Calculate the raw term frequency (TF) for “inflation” in Doc1 and Doc2.
Using the same documents from Question 6, calculate the inverse document frequency (IDF) for “inflation” and “unemployment” assuming log base 10. Show your work.
Compute the TF-IDF score for “rising” in Doc1 from Question 6. First calculate TF (using raw count), then IDF, then multiply them.
A document contains 200 total words. The word “recession” appears 8 times in this document but only appears in 5 out of 10,000 documents in the corpus. Calculate its TF-IDF score using:
TF = raw count
IDF = \(\log_{10}(N/DF)\). Show all steps.
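As a way to check your hand calculations for Questions 6–8, the quantities can be computed directly in R. This is a sketch, not a prescribed solution; note that tidytext's bind_tf_idf() uses the natural log and a length-normalized TF, whereas these questions use raw counts and log base 10, so the sketch computes everything manually.

```r
library(dplyr)
library(tidytext)

docs <- data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c("inflation inflation rising prices",
           "inflation stable low",
           "unemployment rising jobs")
)

counts <- docs |>
  unnest_tokens(word, text) |>
  count(doc_id, word, name = "tf")   # raw term frequency, as in Questions 6-8

N <- n_distinct(counts$doc_id)       # number of documents in the corpus (3)

tfidf <- counts |>
  group_by(word) |>
  mutate(df = n_distinct(doc_id),    # document frequency
         idf = log10(N / df),        # log base 10, as the questions specify
         tf_idf = tf * idf) |>
  ungroup()

# e.g. "inflation": tf = 2 in doc1 and 1 in doc2; df = 2, so idf = log10(3/2)
filter(tfidf, word == "inflation")
```

The same data frame also contains the values needed for "rising" and "unemployment".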
After stemming with the Porter algorithm, which of these word groups would be merged into the same stem?
Group A: “economy”, “economies”, “economic”
Group B: “policy”, “policies”, “politician”
Group C: “growth”, “growing”, “grow”. Explain your reasoning for each group.
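After reasoning through the groups by hand, one way to check your answer empirically is with the SnowballC package's wordStem() function (assumed here; it implements Porter-family stemmers). Words in a group are merged only if they map to the same stem; exact stems can vary by stemmer variant, so treat the output as a check rather than the answer.

```r
library(SnowballC)

groups <- list(
  A = c("economy", "economies", "economic"),
  B = c("policy", "policies", "politician"),
  C = c("growth", "growing", "grow")
)

# Apply the stemmer to each group; identical stems mean the words merge
lapply(groups, wordStem)
```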
R Code Interpretation
What does this R code accomplish? Explain each line’s purpose:
data |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  count(doc_id, word) |>
  cast_dtm(doc_id, word, n)
The following code calculates sentiment scores. What is the critical limitation of this approach for economic texts?
data |>
  inner_join(get_sentiments("bing")) |>
  count(doc_id, sentiment)
What problem does this preprocessing step solve?
data$word <- tolower(data$word)
In the TF-IDF calculation bind_tf_idf(term = word, document = doc_id, n = n), what does the n variable represent in the input data frame?
When using LDA(dtm, k = 5), what does changing the value of k control? What are the risks of setting k too high or too low?
Critical Thinking & Application
You’re analyzing Federal Reserve meeting minutes to measure policy uncertainty. Why might simple bag-of-words approaches (DTM) fail to capture important nuances? Describe one specific limitation and propose how embeddings might address it.
A researcher removes all numeric tokens (e.g., “2.5”, “3.8%”) during preprocessing before sentiment analysis of earnings reports. What valuable economic information might be lost? Justify your answer.
When analyzing central bank communications, why might stop word removal be problematic for certain research questions? Provide a concrete example where keeping stop words could be important.
Word embeddings trained on general news corpora might misrepresent economic terminology. Give one example of an economic term that has a specialized meaning different from everyday usage, and explain how this could distort analysis.
You’re comparing sentiment in earnings calls before and during a recession. Why might raw sentiment scores be misleading without normalization? Propose one method to make scores comparable across time periods.