class: center, middle, inverse, title-slide .title[ # Topic Models with
] .author[ ### Jason Thomas ] .institute[ ### R Working Group ] .date[ ### March 7th, 2025 ] --- # Introduction * In this session, the second of a two-part series, we will explore the structure in a **corpus** (or collection of documents). * The "structure" is determined by the choice of **topic model**: + **Latent Dirichlet Allocation model** (LDA) + The corpus has a fixed number of *topics* (chosen *a priori* by the modeler) + Each topic has a probability of generating each term in the vocabulary (i.e., all terms appearing in the corpus) + Each document is made up of a mixture of topics (so its words are generated from those topics) --- # Take 2 Yikes, that was quite the word salad! Let's try again, but with pictures from the expert. * David M. Blei (Professor, Columbia University) has pioneered much of the work on LDA, and his [website](https://www.cs.columbia.edu/~blei/topicmodeling.html) includes useful resources for learning about many things, including topic models. + (on the stats-ier side of things) * **The following two slides (19 & 20) are taken from Prof. Blei's 2012 ICML presentation on topic models (obtained from his [website](https://www.cs.columbia.edu/~blei/talks/Blei_ICML_2012.pdf))** --- class: inverse, center, middle background-image: url(img/blei_slide19_ICML_2012.png) background-size: contain <p style="color:red; font-weight:bold; font-size:32px; position: absolute; top: 40px"> <a href="https://www.cs.columbia.edu/~blei/talks/Blei_ICML_2012.pdf">Source: David Blei's 2012 ICML presentation (slide 19)</a></p> --- class: inverse, center, middle background-image: url(img/blei_slide20_ICML_2012.png) background-size: contain <p style="color:red; font-weight:bold; font-size:32px; position: absolute; top: 40px"> <a href="https://www.cs.columbia.edu/~blei/talks/Blei_ICML_2012.pdf">Source: David Blei's 2012 ICML presentation (slide 20)</a></p> --- # More on Topic Models * **LDA** is Latent Diri-what-now?
+ Dirichlet (a German mathematician) had a probability distribution named after him + The Dirichlet distribution is useful for generating probabilities (values between 0 and 1) for multiple categories + (think probabilities of a multinomial distribution) * There are different types of topic models, some of which extend LDA (see Blei's presentation for detailed examples) + (e.g., correlated topics, dynamic topic models, supervised LDA) --- # Setup We will use a few packages to make life easier... ``` r # you can install the packages with the following command (only need to run once) install.packages(c("topicmodels", "tm", "SnowballC", "slam")) ``` and load them with ``` r library(topicmodels) library(tm) library(SnowballC) library(slam) ``` --- class: inverse, center, middle # Example Data & Prep --- # Example data * We are going to follow the example in the [vignette](https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf) from the R package [**topicmodels**](https://cran.r-project.org/package=topicmodels) + Abstracts from articles in the *Journal of Statistical Software* ``` r data(JSS_papers) ## load example dataset from topicmodels colnames(JSS_papers) ## [1] "title" "creator" "subject" "description" "publisher" ## [6] "contributor" "date" "type" "format" "identifier" ## [11] "source" "language" "relation" "coverage" "rights" ``` The "description" column contains the abstracts. We'll use LDA to find topics that generate the words in the abstracts.
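---

# (Dirichlet Pit Stop)

Since the Dirichlet distribution does the heavy lifting in LDA, here is a minimal base-R sketch of drawing a probability vector from it, using the standard trick of normalizing independent Gamma draws (the `rdirichlet1` helper below is our own, not from any package):

``` r
## one draw from Dirichlet(alpha): independent Gamma draws, normalized
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)  # each component is in (0, 1) and they sum to 1
}

set.seed(42)
rdirichlet1(c(1, 1, 1))  # probabilities for 3 categories
```

Every draw is a valid probability vector, which is exactly the form LDA needs for its per-document topic proportions and per-topic term proportions.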
--- # Example data (cont) <style type="text/css"> /* R code */ /*.foobar code.r { font-weight: bold; }*/ /* code without language names */ .wrap code[class="remark-code"] { /*display: block;*/ /*border: 1px solid red;*/ text-wrap: wrap; } </style> ``` r JSS_papers[1, "title"] ## title for 1st article ``` ``` ## $title ## [1] "A Diagnostic to Assess the Fit of a Variogram Model to Spatial Data" ``` ``` r JSS_papers[1, "creator"] ## author(s) of 1st article ``` ``` ## $creator ## [1] "Barry, Ronald" ``` .wrap[ ``` r JSS_papers[1, "description"] ## abstract for 1st article ``` ``` ## $description ## [1] "The fit of a variogram model to spatially-distributed data is often difficult to assess. A graphical diagnostic written in S-plus is introduced that allows the user to determine both the general quality of the fit of a variogram model, and to find specific pairs of locations that do not have measurements that are consonant with the fitted variogram. It can help identify nonstationarity, outliers, and poor variogram fit in general. Simulated data sets and a set of soil nitrogen concentration data are examined using this graphical diagnostic." ``` ] --- # Data Prep * We need to prepare our data by creating a document-term-matrix (as discussed in the previous session) using tools from the [**tm** package](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) * First we create a corpus... .wrap[ ``` r corpus <- Corpus(VectorSource(unlist(JSS_papers[, "description"]))) inspect(corpus[1]) ``` ``` ## <<SimpleCorpus>> ## Metadata: corpus specific: 1, document level (indexed): 0 ## Content: documents: 1 ## ## [1] The fit of a variogram model to spatially-distributed data is often difficult to assess. A graphical diagnostic written in S-plus is introduced that allows the user to determine both the general quality of the fit of a variogram model, and to find specific pairs of locations that do not have measurements that are consonant with the fitted variogram.
It can help identify nonstationarity, outliers, and poor variogram fit in general. Simulated data sets and a set of soil nitrogen concentration data are examined using this graphical diagnostic. ``` ] --- # Document-Term-Matrix Now we can convert the corpus into a DTM (with some preprocessing) ``` r jss_dtm <- DocumentTermMatrix(corpus, control=list(tolower = TRUE, stemming = TRUE, stopwords = TRUE, minWordLength = 3, removeNumbers = TRUE, removePunctuation = TRUE)) jss_dtm ``` ``` ## <<DocumentTermMatrix (documents: 361, terms: 3497)>> ## Non-/sparse entries: 19199/1243218 ## Sparsity : 98% ## Maximal term length: 36 ## Weighting : term frequency (tf) ``` --- # (DTM Pit Stop) * As you may have noticed, our DTM is pretty big and has lots of zeros! * A special structure is used to efficiently work with large, sparse matrices * **tm** provides useful tools for exploring your DTM + `nDocs(dtm)` -- number of documents (or rows) in your corpus/dtm + `nTerms(dtm)` -- number of terms (or columns) in your corpus/dtm + `inspect(dtm)` -- used for looking at slices (rows & columns) of your dtm --- # Inspecting DTMs (back to our regularly scheduled programming) ``` r nDocs(jss_dtm) ``` ``` ## [1] 361 ``` ``` r nTerms(jss_dtm) ``` ``` ## [1] 3497 ``` --- # Inspecting DTMs (cont) ``` r inspect(jss_dtm[1:5, 1:5]) ``` ``` ## <<DocumentTermMatrix (documents: 5, terms: 5)>> ## Non-/sparse entries: 0/25 ## Sparsity : 100% ## Maximal term length: 9 ## Weighting : term frequency (tf) ## Sample : ## Terms ## Docs “formula” abc abil abl absenc ## 1 0 0 0 0 0 ## 2 0 0 0 0 0 ## 3 0 0 0 0 0 ## 4 0 0 0 0 0 ## 5 0 0 0 0 0 ``` --- # TF-IDF We can screen out uninformative terms using each term's mean tf-idf (term frequency-inverse document frequency) ``` r term_tfidf <- tapply(jss_dtm$v/row_sums(jss_dtm)[jss_dtm$i], jss_dtm$j, mean) * log2(nDocs(jss_dtm)/col_sums(jss_dtm > 0)) length(term_tfidf) ``` ``` ## [1] 3497 ``` ``` r summary(term_tfidf) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01555 0.07614 0.10165 0.12944 0.14583 1.17184 ``` Keeping only the terms with a mean tf-idf of at least 0.1 drops roughly half the vocabulary: ``` r jss_dtm <- jss_dtm[, term_tfidf >= 0.1] dim(jss_dtm) ``` ``` ## [1] 361 1788 ``` --- class: inverse, center, middle # LDA Model --- # LDA intro There are different versions of the model available and different methods of parameter estimation. * LDA + VEM & Gibbs * Correlated Topic Model with Gibbs --- # LDA example ``` r # fit models m1_k30 <- LDA(jss_dtm, k = 30) m1_k30 ## S4 Object ``` ``` ## A LDA_VEM topic model with 30 topics. ``` ``` r slotNames(m1_k30) ## can also access with m1_k30@alpha ``` ``` ## [1] "alpha" "call" "Dim" "control" ## [5] "k" "terms" "documents" "beta" ## [9] "gamma" "wordassignments" "loglikelihood" "iter" ## [13] "logLiks" "n" ``` ``` r slot(m1_k30, "alpha") ## parameter for topic dist for docs ``` ``` ## [1] 0.008558141 ``` ``` r dim(slot(m1_k30, "beta")) ## term distribution for each topic ``` ``` ## [1] 30 1788 ``` --- # LDA example (cont) Accessing results ``` r m1_k30_results <- posterior(m1_k30) str(m1_k30_results) ``` ``` ## List of 2 ## $ terms : num [1:30, 1:1788] 5.32e-146 1.59e-196 3.47e-196 2.29e-196 9.83e-03 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:30] "1" "2" "3" "4" ... ## .. ..$ : chr [1:1788] "“formula”" "abc" "absorb" "abund" ... ## $ topics: num [1:361, 1:30] 0.000698 0.000925 0.000698 0.00076 0.000112 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:361] "1" "2" "3" "4" ... ## .. ..$ : chr [1:30] "1" "2" "3" "4" ...
``` ``` r names(m1_k30_results) ``` ``` ## [1] "terms" "topics" ``` --- # LDA example (cont) Pr(Term | Topic) -- term distribution over topics ``` r dim(posterior(m1_k30)$terms) ``` ``` ## [1] 30 1788 ``` ``` r posterior(m1_k30)$terms[1,1:10] ``` ``` ## “formula” abc absorb abund accept ## 5.319081e-146 7.178455e-146 7.098539e-146 2.722974e-197 9.891897e-145 ## acceptandreject access accommod accompani accomplish ## 4.355926e-146 1.349166e-144 8.276565e-146 7.603964e-197 7.084722e-146 ``` --- # LDA example (cont) Pr(Topic | Doc) -- topic distribution over documents ``` r dim(posterior(m1_k30)$topics) ``` ``` ## [1] 361 30 ``` ``` r posterior(m1_k30)$topics[1:3,] ``` ``` ## 1 2 3 4 5 6 ## 1 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 2 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 ## 3 0.0006982393 0.1638737094 0.0006982393 0.0006982393 0.0822859744 0.0006982393 ## 7 8 9 10 11 12 ## 1 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 2 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 ## 3 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 13 14 15 16 17 18 ## 1 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.9797510598 ## 2 0.7630307735 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 ## 3 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 19 20 21 22 23 24 ## 1 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 2 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.2110823774 ## 3 0.0006982393 0.7349878547 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 25 26 27 28 29 30 ## 1 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 0.0006982393 ## 2 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 0.0009245303 ## 3 0.0006982393 0.0006982393 0.0006982393 0.0006982393 
0.0006982393 0.0006982393 ``` --- # LDA example (cont) Most likely topic for each document ``` r topics(m1_k30, 1) ``` ``` ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ## 18 13 20 2 6 6 14 23 20 19 19 10 7 8 16 21 30 6 24 9 ## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ## 19 12 4 5 16 20 19 4 1 23 21 6 19 26 15 1 16 23 30 21 ## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ## 7 21 16 15 20 20 12 29 10 27 26 23 15 25 19 19 12 10 9 28 ## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 ## 21 6 24 21 13 29 26 14 13 5 13 25 13 28 8 5 11 17 6 30 ## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 ## 27 13 28 20 27 20 4 1 12 21 4 22 14 5 16 7 19 9 1 4 ## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 ## 4 8 19 20 20 21 15 13 1 10 29 11 27 21 18 1 13 21 18 30 ## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 ## 28 3 28 25 13 2 23 28 28 7 3 15 20 13 13 29 12 24 24 10 ## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 ## 26 26 27 29 8 30 16 6 22 15 23 13 26 4 21 17 12 30 6 10 ## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 ## 20 21 21 3 7 14 2 15 30 25 2 2 17 10 9 17 8 15 5 10 ## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 ## 21 9 21 8 13 9 5 4 22 22 19 20 9 3 24 1 23 5 15 17 ## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 ## 24 17 14 11 22 17 17 18 22 22 20 14 24 16 28 14 15 29 18 1 ## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 ## 23 7 11 24 9 26 15 3 4 24 18 8 8 22 2 11 5 7 28 21 ## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 ## 3 28 29 17 24 12 26 25 25 25 2 25 25 25 25 16 5 5 2 11 ## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 ## 11 11 12 26 5 4 19 19 4 26 22 26 11 16 27 15 25 12 9 21 ## 281 282 283 
284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 ## 18 9 16 14 16 20 18 21 11 24 20 3 12 2 3 13 26 18 19 15 ## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 ## 22 4 9 27 18 8 15 27 11 29 10 3 28 1 13 11 23 13 20 7 ## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 ## 16 6 9 15 21 17 14 14 8 2 15 1 29 3 16 29 22 8 30 5 ## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 ## 27 30 30 19 16 14 11 27 29 19 12 2 1 11 25 18 28 13 29 8 ## 361 ## 10 ``` --- # LDA example (cont) 5 most frequent terms for each topic ``` r terms(m1_k30, 5) ``` ``` ## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 ## [1,] "tree" "cluster" "command" "econometr" "formula" "www" ## [2,] "perl" "imag" "gui" "imag" "confid" "java" ## [3,] "diagram" "posit" "genet" "cell" "recurs" "cgi" ## [4,] "decis" "index" "document" "digit" "accuraci" "server" ## [5,] "see" "partit" "tcltk" "java" "cumul" "browser" ## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 ## [1,] "tour" "scale" "gene" "matric" "pattern" "socr" ## [2,] "calibr" "xlispstat" "initi" "distanc" "dose" "gee" ## [3,] "tsp" "beta" "microarray" "graph" "quantit" "qls" ## [4,] "access" "item" "mixtur" "bilinear" "accept" "conting" ## [5,] "molecular" "unit" "record" "biplot" "benchmark" "macro" ## Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 ## [1,] "survey" "copula" "correspond" "densiti" "ecolog" "threshold" ## [2,] "orient" "random" "scale" "mixtur" "popul" "sequenc" ## [3,] "weight" "librari" "biplot" "hypothes" "stochast" "posterior" ## [4,] "bayesx" "rubi" "aspect" "librari" "abund" "habitat" ## [5,] "logit" "polit" "speci" "critic" "boost" "variogram" ## Topic 19 Topic 20 Topic 21 Topic 22 Topic 23 Topic 24 ## [1,] "event" "lispstat" "item" "popul" "wavelet" "imput" ## [2,] "censor" "vista" "random" "captur" "polyk" "winbug" ## [3,] "econometr" "loglinear" "irt" "rcaptur" "tumor" "map" ## [4,] "heteroskedast" 
"control" "variat" "period" "wave" "miss" ## [5,] "mixlow" "guid" "rasch" "seriat" "field" "longitudin" ## Topic 25 Topic 26 Topic 27 Topic 28 Topic 29 Topic 30 ## [1,] "network" "subset" "control" "smooth" "curv" "trait" ## [2,] "ergm" "autoregress" "chart" "diffus" "roc" "haplotyp" ## [3,] "graph" "vector" "excel" "weight" "zeroinfl" "ordin" ## [4,] "sunflow" "forecast" "parallel" "kernel" "recurr" "region" ## [5,] "intern" "crossvalid" "criterion" "mine" "scale" "impur" ``` --- # LDA example (cont) 20 most frequent terms for Topic 23 ``` r terms(m1_k30, 20)[,"Topic 23"] ``` ``` ## [1] "wavelet" "polyk" "tumor" "wave" "field" ## [6] "fraction" "sacrific" "anim" "brownian" "cancer" ## [11] "failur" "ageadjust" "breast" "bootstrapbas" "fast" ## [16] "gompertz" "ident" "mixtur" "surviv" "motion" ``` --- # LDA example (cont) ``` r m2_k30 <- LDA(jss_dtm, k = 30, method = "Gibbs") ``` --- # Other resources * The **Text Mining with R** book/site has a few more [example applications of LDA](https://www.tidytextmining.com/topicmodeling)