Using stm to Investigate if Stemming is Appropriate


It is known that topic modeling does not benefit from stemming [ref]. In this post, I propose a workflow to investigate whether stemming is appropriate as a method of data reduction for your corpus.

  1. Take all the tokens and apply the stemming algorithm you would like to test.
  2. Construct a list of words that become identical under stemming.
  3. Fit a topic model to your original, unstemmed data.
  4. Predict the topic for each word in the groups created in step 2.

If grouped words are predicted to the same topic, then we assume that stemming would not make much of a difference. If the words are predicted to be in different topics, then we have a suspicion that the stemmed and unstemmed words have different uses, and stemming would be ill-advised.

First, we load the packages we will be using.

library(tidytext)
library(tidyverse)
library(stm)
library(hcandersenr)
library(SnowballC)

As a first test, we pick 3 English fairy tales by H.C. Andersen using the hcandersenr package. To create multiple “documents” for each fairy tale, we start by tokenizing to sentences and then give each sentence a unique identifier.

fairy_tales <- hcandersen_en %>%
  filter(book %in% c("The fir tree", "The tinder-box", "Thumbelina")) %>%
  unnest_tokens(token, text, token = "sentences") %>%
  group_by(book) %>%
  mutate(sentence = row_number()) %>%
  ungroup() %>%
  unite(document, book, sentence)

fairy_tales
## # A tibble: 501 x 2
##    document       token                                                         
##    <chr>          <chr>                                                         
##  1 The fir tree_1 "far down in the forest, where the warm sun and the fresh air…
##  2 The fir tree_2 "the sun shone, and the soft air fluttered its leaves, and th…
##  3 The fir tree_3 "sometimes the children would bring a large basket of raspber…
##  4 The fir tree_4 "which made it feel more unhappy than before."                
##  5 The fir tree_5 "and yet all this while the tree grew a notch or joint taller…
##  6 The fir tree_6 "still, as it grew, it complained."                           
##  7 The fir tree_7 "\"oh!"                                                       
##  8 The fir tree_8 "how i wish i were as tall as the other trees, then i would s…
##  9 The fir tree_9 "i should have the birds building their nests on my boughs, a…
## 10 The fir tree_… "the tree was so discontented, that it took no pleasure in th…
## # … with 491 more rows

Now we unnest the tokens into words and create a new variable containing the stemmed words.

fairy_tales_tokens <- fairy_tales %>%
  unnest_tokens(token, token) %>%
  mutate(token_stem = wordStem(token))

fairy_tales_tokens
## # A tibble: 10,577 x 3
##    document       token  token_stem
##    <chr>          <chr>  <chr>     
##  1 The fir tree_1 far    far       
##  2 The fir tree_1 down   down      
##  3 The fir tree_1 in     in        
##  4 The fir tree_1 the    the       
##  5 The fir tree_1 forest forest    
##  6 The fir tree_1 where  where     
##  7 The fir tree_1 the    the       
##  8 The fir tree_1 warm   warm      
##  9 The fir tree_1 sun    sun       
## 10 The fir tree_1 and    and       
## # … with 10,567 more rows
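
As an aside, wordStem() defaults to the Porter stemmer. If you want to test a different algorithm within this workflow, SnowballC ships several; the comparison below is my addition, not part of the original analysis.

# List the stemming algorithms SnowballC supports.
getStemLanguages()

# Apply the "english" (Porter2) stemmer instead of the default "porter".
wordStem(c("pretty", "little", "happy"), language = "english")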

We can take a look at all the times stemming yields a different token.

different <- fairy_tales_tokens %>%
  select(token, token_stem) %>%
  filter(token != token_stem) %>%
  unique()

different
## # A tibble: 759 x 2
##    token      token_stem
##    <chr>      <chr>     
##  1 resting    rest      
##  2 pretty     pretti    
##  3 little     littl     
##  4 was        wa        
##  5 happy      happi     
##  6 wished     wish      
##  7 its        it        
##  8 companions companion 
##  9 pines      pine      
## 10 firs       fir       
## # … with 749 more rows

In this example, we have 759 tokens that change under stemming. But since stemming can collapse multiple different tokens into one, we can count how many unstemmed tokens map to each stem.

different %>%
  count(token_stem, sort = TRUE)
## # A tibble: 672 x 2
##    token_stem     n
##    <chr>      <int>
##  1 seiz           4
##  2 leav           3
##  3 live           3
##  4 look           3
##  5 place          3
##  6 plai           3
##  7 pleas          3
##  8 sai            3
##  9 trembl         3
## 10 appear         2
## # … with 662 more rows

We can use the different data.frame to construct a list of words that would land in the same bucket after stemming.

# Group the unstemmed tokens by their stem, then append the stem itself
# so it is part of its own bucket.
stem_buckets <- split(different$token, different$token_stem) %>%
  imap(~ c(.x, .y))

stem_buckets[21:25]
## $anxiou
## [1] "anxious" "anxiou" 
## 
## $anyth
## [1] "anything" "anyth"   
## 
## $apart
## [1] "apartment" "apart"    
## 
## $appear
## [1] "appearance" "appeared"   "appear"    
## 
## $appl
## [1] "apples" "apple"  "appl"

Here we see that “anxiou” and “anxious” would look the same after stemming; likewise “apples”, “apple”, and “appl”. The main point of this exercise is to see whether the words in each of these groups end up in the same topic during topic modeling.

stm_model <- fairy_tales_tokens %>%
  count(document, token) %>%
  cast_sparse(document, token, n) %>%
  stm(K = 3, verbose = FALSE)

stm_model
## A topic model with 3 topics, 501 documents and a 1518 word dictionary.

In this case, I fit the model with 3 topics because I knew that would be the right number, since I picked the data. When doing this on your own data, you should fit multiple models with a varying number of topics to find the best one, as sketched below. For more information, please read Training, Evaluating, and Interpreting Topic Models by Julia Silge.
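
As a rough sketch of what that could look like (the K values and the object names sparse_tales and many_models are illustrative, not from the original post), you can fit a handful of candidate models with purrr and compare a diagnostic such as stm's exclusivity():

sparse_tales <- fairy_tales_tokens %>%
  count(document, token) %>%
  cast_sparse(document, token, n)

many_models <- tibble(K = c(3, 4, 6, 8)) %>%
  mutate(model = map(K, ~ stm(sparse_tales, K = .x, verbose = FALSE)),
         # Higher mean exclusivity suggests topics with more distinctive words.
         mean_exclusivity = map_dbl(model, ~ mean(exclusivity(.x))))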

Now that we have an stm model and a list of words, we can inspect the model object to check whether multiple words are put in the same topic. Below is a function that takes a vector of words and an stm model and returns TRUE if all the words are most strongly associated with the same topic and FALSE if they aren't.

stm_match <- function(x, model) {
  topics <- tidy(model) %>%
    # Keep only the words we are testing.
    filter(term %in% x) %>%
    # For each word, keep the topic where it has the highest beta.
    group_by(term) %>%
    top_n(1, beta) %>%
    ungroup() %>%
    # Count how many distinct topics the words ended up in.
    select(topic) %>%
    n_distinct()

  topics == 1
}

As an example, we pass in the words “believed” and “believ”:

stm_match(c("believed", "believ"), stm_model)
## [1] TRUE

We see that they did end up in the same topic. If we instead pass in “dog” and “happy”, they land in different topics.

stm_match(c("dog", "happy"), stm_model)
## [1] FALSE

All of this is not perfect; there is still some uncertainty. But it is a good first step to evaluate whether stemming is appropriate for your application. Next we apply stm_match() to all the groups in stem_buckets.

tested <- tibble(terms = stem_buckets,
                 stem = names(stem_buckets)) %>%
  mutate(match = map_lgl(terms, stm_match, stm_model))

tested
## # A tibble: 672 x 3
##    terms        stem      match
##    <named list> <chr>     <lgl>
##  1 <chr [2]>    a         FALSE
##  2 <chr [2]>    abl       TRUE 
##  3 <chr [2]>    abov      TRUE 
##  4 <chr [2]>    accompani TRUE 
##  5 <chr [2]>    ach       TRUE 
##  6 <chr [2]>    admir     TRUE 
##  7 <chr [2]>    adorn     TRUE 
##  8 <chr [2]>    ag        TRUE 
##  9 <chr [2]>    ala       TRUE 
## 10 <chr [2]>    alarm     TRUE 
## # … with 662 more rows

First, we’ll look at the distribution of TRUEs and FALSEs.

tested %>%  
  ggplot(aes(match)) +
  geom_bar()
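
If you prefer a table to a chart, the same information can be tabulated directly (my addition; output omitted):

tested %>%
  count(match)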

So it looks like most of the word groups were put into the same topic during modeling, which is a good sign. Note, however, that this category includes a lot of false positives: stm_match() also returns TRUE when only one of the words appears in the model and all the other words don't. For the pair “accompanied” and “accompani”, the word “accompanied” was present in one of the topics, but “accompani” never appeared in the original data and hence did not appear in any of the topics. In such a case, the TRUE value only says that the data doesn't provide enough evidence to indicate that stemming would be bad.

By looking at a sample of TRUE cases below, we see that a lot of them happen because the stemmed word isn't a word that is actually used, like “aliv”, “alon”, and “alwai”. On the other hand, “allowed” and “allow” are both real words AND they appeared in the same topic.

tested %>%
  filter(match) %>%
  slice(10:15) %>%
  pull(terms)
## [[1]]
## [1] "alighted" "alight"  
## 
## [[2]]
## [1] "alive" "aliv" 
## 
## [[3]]
## [1] "allowed" "allow"  
## 
## [[4]]
## [1] "alone" "alon" 
## 
## [[5]]
## [1] "already" "alreadi"
## 
## [[6]]
## [1] "always" "alwai"

Now let's turn to the FALSE cases. These cases contain no false positives, since both words have to appear in the original corpus to be put into different topics. They are still not going to be perfect, but they give another indication.

tested %>%
  filter(!match) %>%
  pull(terms) %>%
  head()
## [[1]]
## [1] "as" "a" 
## 
## [[2]]
## [1] "appearance" "appeared"   "appear"    
## 
## [[3]]
## [1] "backs" "back" 
## 
## [[4]]
## [1] "beginning" "begin"    
## 
## [[5]]
## [1] "beside"  "besides" "besid"  
## 
## [[6]]
## [1] "birds" "bird"

This is the list I would advise you to look over carefully. Check to make sure that you are okay with the number and kind of misgroupings you would introduce by applying stemming.
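
To make that review easier, you can flatten the FALSE buckets into one row per word with unnest() (again a small sketch of my own, using tidyr's cols argument):

tested %>%
  filter(!match) %>%
  select(stem, terms) %>%
  unnest(cols = terms)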

Session information


─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.6        
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-04-23                  

─ Packages ───────────────────────────────────────────────────────────────────
 ! package     * version date       lib source        
 P assertthat    0.2.1   2019-03-21 [?] CRAN (R 3.6.0)
 P backports     1.1.6   2020-04-05 [?] CRAN (R 3.6.0)
 P blogdown      0.18    2020-03-04 [?] CRAN (R 3.6.0)
 P bookdown      0.18    2020-03-05 [?] CRAN (R 3.6.0)
 P broom         0.5.5   2020-02-29 [?] CRAN (R 3.6.0)
 P cellranger    1.1.0   2016-07-27 [?] CRAN (R 3.6.0)
 P cli           2.0.2   2020-02-28 [?] CRAN (R 3.6.0)
 P clipr         0.7.0   2019-07-23 [?] CRAN (R 3.6.0)
 P colorspace    1.4-1   2019-03-18 [?] CRAN (R 3.6.0)
 P crayon        1.3.4   2017-09-16 [?] CRAN (R 3.6.0)
 P data.table    1.12.8  2019-12-09 [?] CRAN (R 3.6.0)
 P DBI           1.1.0   2019-12-15 [?] CRAN (R 3.6.0)
 P dbplyr        1.4.2   2019-06-17 [?] CRAN (R 3.6.0)
 P desc          1.2.0   2018-05-01 [?] CRAN (R 3.6.0)
 P details     * 0.2.1   2020-01-12 [?] CRAN (R 3.6.0)
 P digest        0.6.25  2020-02-23 [?] CRAN (R 3.6.0)
 P dplyr       * 0.8.5   2020-03-07 [?] CRAN (R 3.6.0)
 P ellipsis      0.3.0   2019-09-20 [?] CRAN (R 3.6.0)
 P evaluate      0.14    2019-05-28 [?] CRAN (R 3.6.0)
 P fansi         0.4.1   2020-01-08 [?] CRAN (R 3.6.0)
 P forcats     * 0.5.0   2020-03-01 [?] CRAN (R 3.6.0)
 P fs            1.4.1   2020-04-04 [?] CRAN (R 3.6.0)
 P generics      0.0.2   2018-11-29 [?] CRAN (R 3.6.0)
 P ggplot2     * 3.3.0   2020-03-05 [?] CRAN (R 3.6.0)
 P glue          1.4.0   2020-04-03 [?] CRAN (R 3.6.0)
 P gtable        0.3.0   2019-03-25 [?] CRAN (R 3.6.0)
 P haven         2.2.0   2019-11-08 [?] CRAN (R 3.6.0)
 P hcandersenr * 0.2.0   2019-01-19 [?] CRAN (R 3.6.0)
 P hms           0.5.3   2020-01-08 [?] CRAN (R 3.6.0)
 P htmltools     0.4.0   2019-10-04 [?] CRAN (R 3.6.0)
 P httr          1.4.1   2019-08-05 [?] CRAN (R 3.6.0)
 P janeaustenr   0.1.5   2017-06-10 [?] CRAN (R 3.6.0)
 P jsonlite      1.6.1   2020-02-02 [?] CRAN (R 3.6.0)
 P knitr       * 1.28    2020-02-06 [?] CRAN (R 3.6.0)
 P lattice       0.20-41 2020-04-02 [?] CRAN (R 3.6.0)
 P lifecycle     0.2.0   2020-03-06 [?] CRAN (R 3.6.0)
 P lubridate     1.7.8   2020-04-06 [?] CRAN (R 3.6.0)
 P magrittr      1.5     2014-11-22 [?] CRAN (R 3.6.0)
 P Matrix        1.2-18  2019-11-27 [?] CRAN (R 3.6.0)
 P modelr        0.1.6   2020-02-22 [?] CRAN (R 3.6.0)
 P munsell       0.5.0   2018-06-12 [?] CRAN (R 3.6.0)
 P nlme          3.1-145 2020-03-04 [?] CRAN (R 3.6.0)
 P pillar        1.4.3   2019-12-20 [?] CRAN (R 3.6.0)
 P pkgconfig     2.0.3   2019-09-22 [?] CRAN (R 3.6.0)
 P png           0.1-7   2013-12-03 [?] CRAN (R 3.6.0)
 P purrr       * 0.3.3   2019-10-18 [?] CRAN (R 3.6.0)
 P R6            2.4.1   2019-11-12 [?] CRAN (R 3.6.0)
 P Rcpp          1.0.4.6 2020-04-09 [?] CRAN (R 3.6.0)
 P readr       * 1.3.1   2018-12-21 [?] CRAN (R 3.6.0)
 P readxl        1.3.1   2019-03-13 [?] CRAN (R 3.6.0)
   renv          0.9.3   2020-02-10 [1] CRAN (R 3.6.0)
 P reprex        0.3.0   2019-05-16 [?] CRAN (R 3.6.0)
 P rlang         0.4.5   2020-03-01 [?] CRAN (R 3.6.0)
 P rmarkdown     2.1     2020-01-20 [?] CRAN (R 3.6.0)
 P rprojroot     1.3-2   2018-01-03 [?] CRAN (R 3.6.0)
 P rstudioapi    0.11    2020-02-07 [?] CRAN (R 3.6.0)
 P rvest         0.3.5   2019-11-08 [?] CRAN (R 3.6.0)
 P scales        1.1.0   2019-11-18 [?] CRAN (R 3.6.0)
 P sessioninfo   1.1.1   2018-11-05 [?] CRAN (R 3.6.0)
 P SnowballC   * 0.7.0   2020-04-01 [?] CRAN (R 3.6.2)
 P stm         * 1.3.5   2019-12-17 [?] CRAN (R 3.6.0)
 P stringi       1.4.6   2020-02-17 [?] CRAN (R 3.6.0)
 P stringr     * 1.4.0   2019-02-10 [?] CRAN (R 3.6.0)
 P tibble      * 3.0.1   2020-04-20 [?] CRAN (R 3.6.2)
 P tidyr       * 1.0.2   2020-01-24 [?] CRAN (R 3.6.0)
 P tidyselect    1.0.0   2020-01-27 [?] CRAN (R 3.6.0)
 P tidytext    * 0.2.3   2020-03-04 [?] CRAN (R 3.6.0)
 P tidyverse   * 1.3.0   2019-11-21 [?] CRAN (R 3.6.0)
 P tokenizers    0.2.1   2018-03-29 [?] CRAN (R 3.6.0)
 P vctrs         0.2.4   2020-03-10 [?] CRAN (R 3.6.0)
 P withr         2.1.2   2018-03-15 [?] CRAN (R 3.6.0)
 P xfun          0.13    2020-04-13 [?] CRAN (R 3.6.2)
 P xml2          1.3.0   2020-04-01 [?] CRAN (R 3.6.2)
 P yaml          2.2.1   2020-02-01 [?] CRAN (R 3.6.0)

[1] /Users/emilhvitfeldthansen/Desktop/blogv4/renv/library/R-3.6/x86_64-apple-darwin15.6.0
[2] /private/var/folders/m0/zmxymdmd7ps0q_tbhx0d_26w0000gn/T/Rtmp4uYTJt/renv-system-library

 P ── Loaded and on-disk path mismatch.