Tidy Text Summarization using TextRank

This code has been lightly revised to make sure it works as of 2018-12-19.

Text summarization

In the realm of text summarization there are two main paths:

  • extractive summarization
  • abstractive summarization

Extractive methods score words and sentences according to some metric and then use that information to summarize the text, usually by copying (extracting) the most informative parts of the text verbatim.

Abstractive methods aim to build a semantic representation of the text and then use natural language generation techniques to generate text describing the informative parts.

Extractive summarization is generally the simpler task, with a handful of algorithms that will do the scoring. Abstractive summarization methods, on the other hand, have gotten a boost in NLP with the advent of deep learning.

In this post I will focus on an example of an extractive summarization method called TextRank, which is based on the PageRank algorithm that is used by Google to rank websites by their importance.

TextRank Algorithm

The TextRank algorithm is a graph-based ranking algorithm. Such algorithms are best known from web search at Google, but they have many other applications. Graph-based ranking algorithms try to decide the importance of a vertex by taking into account information about the entire graph rather than only vertex-specific information. A typical piece of information would be the relationships (edges) between the vertices.
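To make the idea of graph-based ranking concrete, here is a minimal sketch of PageRank by power iteration in base R. The function name pagerank_sketch, the damping value of 0.85 and the toy graph are assumptions chosen for illustration; this is not the code that the textrank package or Google runs.

```r
# Minimal PageRank by power iteration (illustrative sketch only).
# `adj` is a symmetric adjacency matrix; entry [i, j] is the edge weight.
pagerank_sketch <- function(adj, damping = 0.85, tol = 1e-8, max_iter = 200) {
  n <- nrow(adj)
  # Column-normalise so each column describes how a vertex shares its score.
  col_totals <- colSums(adj)
  col_totals[col_totals == 0] <- 1  # isolated vertices: avoid division by zero
  transition <- sweep(adj, 2, col_totals, "/")
  score <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    updated <- (1 - damping) / n + damping * as.vector(transition %*% score)
    converged <- sum(abs(updated - score)) < tol
    score <- updated
    if (converged) break
  }
  # Rescale so the scores sum to 1 even if some vertices were isolated.
  score / sum(score)
}

# Toy graph: vertex 1 is connected to vertices 2, 3 and 4,
# which have no other connections.
adj <- matrix(0, 4, 4)
adj[1, 2:4] <- 1
adj[2:4, 1] <- 1
scores <- pagerank_sketch(adj)
round(scores, 3)
```

The well-connected vertex ends up with the highest score, even though no vertex carries any information of its own; the ranking comes entirely from the structure of the graph.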

In the NLP case we need to define what we want to use as vertices and edges. In our case we will be using sentences as the vertices and shared words as the connecting edges. So sentences with words that appear in many other sentences are seen as more important.
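One simple way to turn shared words into edge weights is Jaccard overlap: the number of words two sentences share divided by the number of distinct words across both. The sketch below uses made-up, pre-tokenized sentences. I believe the textrank package defaults to a similar Jaccard measure, but treat this as an illustration rather than its exact implementation.

```r
# Toy sentences already split into lower-case words (hypothetical data).
sentence_words <- list(
  s1 = c("fitbit", "launches", "tracker", "children"),
  s2 = c("fitbit", "tracker", "designed", "children"),
  s3 = c("contact", "editors")
)

# Jaccard overlap: shared words divided by all distinct words in the pair.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Fill a sentence-by-sentence matrix of edge weights (diagonal stays 0).
n <- length(sentence_words)
edge_weights <- matrix(0, n, n,
                       dimnames = list(names(sentence_words), names(sentence_words)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      edge_weights[i, j] <- jaccard(sentence_words[[i]], sentence_words[[j]])
    }
  }
}
edge_weights
```

Here s1 and s2 are strongly connected, while s3 shares no words with the others and ends up isolated in the graph.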

Data preparation

We start by loading the appropriate packages, which include tidyverse for general tasks, tidytext for text manipulation, textrank for the implementation of the TextRank algorithm, and finally rvest to scrape an article to use as an example. The GitHub repository for the textrank package can be found here.

library(tidyverse)
## Warning: package 'tibble' was built under R version 3.6.2
library(tidytext)
library(textrank)
library(rvest)
## Warning: package 'xml2' was built under R version 3.6.2

To showcase this method I have randomly selected an article (after EXTENSIVELY filtering out political and controversial ones) as our guinea pig. The main body is selected using html_nodes.

url <- "http://time.com/5196761/fitbit-ace-kids-fitness-tracker/"
article <- read_html(url) %>%
  html_nodes('div[class="padded"]') %>%
  html_text()

Next we load the article into a tibble (since tidytext requires the input to be a data.frame). We start by tokenizing into sentences, which is done by setting token = "sentences" in unnest_tokens. The tokenization is not always perfect using this tokenizer, but it has a low number of dependencies and is sufficient for this showcase. Lastly we add a sentence number column and switch the order of the columns (textrank_sentences prefers the columns in a certain order).

article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)
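To see why rule-based sentence splitting can go wrong, consider the following deliberately naive splitter in base R. The rule (split after ., ! or ? followed by whitespace) and the example text are assumptions for illustration; this is not the tokenizer that tidytext uses, which may handle such cases differently.

```r
# Naive rule (an assumption for illustration): a sentence ends at ., ! or ?
# followed by whitespace. This is not the tokenizer tidytext uses.
naive_sentences <- function(text) {
  strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
}

pieces <- naive_sentences(
  "Fitbit is launching a tracker. It costs $99.95 in the U.S. market."
)
pieces
```

The price survives because its period is not followed by whitespace, but the abbreviation "U.S." is mistaken for the end of a sentence, so two sentences come out as three pieces.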

Next we will tokenize again, but this time to get words. In doing this we will retain the sentence_id column in our data.

article_words <- article_sentences %>%
  unnest_tokens(word, sentence)

Now we have all the necessary input for the textrank_sentences function. However, we will go one step further and remove the stop words in article_words, since they would appear in most of the sentences and don’t really carry any information in themselves.

article_words <- article_words %>%
  anti_join(stop_words, by = "word")
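The effect of this step on the sentence graph can be sketched in base R with a tiny, hypothetical stop list (the real stop_words table is far larger) and two made-up sentences that have nothing to do with each other.

```r
# Hypothetical mini stop list; the real stop_words table is far larger.
mini_stop_words <- c("the", "is", "a", "for", "at", "us")

# Two unrelated toy sentences, already split into words.
s1 <- c("the", "tracker", "is", "designed", "for", "children")
s2 <- c("the", "deadline", "is", "a", "problem", "for", "us")

# Words the two sentences share before and after dropping stop words.
shared_before <- intersect(s1, s2)
shared_after <- intersect(setdiff(s1, mini_stop_words),
                          setdiff(s2, mini_stop_words))

shared_before  # only function words link these two sentences
shared_after   # nothing is shared once the stop words are gone
```

Without the filtering, the two sentences would be connected in the graph purely through function words, which would blur the ranking.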

Running TextRank

Running the TextRank algorithm is easy; the textrank_sentences function only requires 2 inputs:

  • A data.frame with sentences
  • A data.frame with tokens (in our case words) which are part of each sentence

So we are ready to run:

article_summary <- textrank_sentences(data = article_sentences, 
                                      terminology = article_words)

The output has its own printing method that displays the top 5 sentences:

article_summary
## Textrank on sentences, showing top 5 most important sentences found:
##   1. fitbit is launching a new fitness tracker designed for children called the fitbit ace, which will go on sale for $99.95 in the second quarter of this year.
##   2. fitbit says the tracker is designed for children eight years old and up.
##   3. sign up now                                                                                                                                                check the box if you do not wish to receive promotional offers via email from time.
##   4. the fitbit ace looks a lot like the company’s alta tracker, but with a few child-friendly tweaks.
##   5. like many of fitbit’s other products, the fitbit ace can automatically track steps, monitor active minutes, and remind kids to move when they’ve been still for too long.

Which in itself is pretty good.

Digging deeper

While the printing method is good, we can extract the information to do some further analysis. The information about the sentences is stored in sentences. It includes the information from article_sentences plus the calculated textrank score.

article_summary[["sentences"]]

Let’s begin by extracting the top 3 and bottom 3 sentences to see how they differ.

article_summary[["sentences"]] %>%
  arrange(desc(textrank)) %>% 
  slice(1:3) %>%
  pull(sentence)
## [1] "fitbit is launching a new fitness tracker designed for children called the fitbit ace, which will go on sale for $99.95 in the second quarter of this year."                                                                                   
## [2] "fitbit says the tracker is designed for children eight years old and up."                                                                                                                                                                      
## [3] "sign up now                                                                                                                                                check the box if you do not wish to receive promotional offers via email from time."

As expected, these are the same sentences as we saw earlier. However, the bottom sentences don’t include the word fitbit (probably a rather important word) and focus more on “other” things, like the reference to another product in the third sentence.

article_summary[["sentences"]] %>%
  arrange(textrank) %>% 
  slice(1:3) %>%
  pull(sentence)
## [1] "contact us at editors@time.com."                                                                                                                                                                                                                                                                                                     
## [2] "by signing up you are agreeing to our terms of use and privacy policy                                                                                                                                                                                                                                                     thank you!"
## [3] "the $39.99 nabi compete, meanwhile, is sold in pairs so that family members can work together to achieve movement milestones."

If we look at the article over time, it would be interesting to see where the important sentences appear.

article_summary[["sentences"]] %>%
  ggplot(aes(textrank_id, textrank, fill = textrank_id)) +
  geom_col() +
  theme_minimal() +
  scale_fill_viridis_c() +
  guides(fill = "none") +
  labs(x = "Sentence",
       y = "TextRank score",
       title = "4 Most informative sentences appear within first half of sentences",
       subtitle = 'In article "Fitbits Newest Fitness Tracker Is Just for Kids"',
       caption = "Source: http://time.com/5196761/fitbit-ace-kids-fitness-tracker/")

Working with books???

Summaries help cut down the reading when used on articles. Would the same approach work on books? Let’s see what happens when you exchange “sentence” in “article” with “chapter” in “book”. I’ll go to my old friend emma from the janeaustenr package. We will borrow some code from the Text Mining with R book to create the chapters. Remember that we want 1 chapter per row.

emma_chapters <- janeaustenr::emma %>%
  tibble(text = .) %>%
  mutate(chapter_id = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  filter(chapter_id > 0) %>%
  group_by(chapter_id) %>%
  summarise(text = paste(text, collapse = ' '))
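The chapter-building trick above rests on taking a cumulative sum over a logical vector: every heading line bumps the counter, so each line is labelled with the chapter it belongs to. Here is a base R sketch of the same idea on a few toy lines (the lines are a made-up stand-in for the book text, and grepl replaces str_detect).

```r
# Toy lines standing in for the lines of a book (hypothetical text).
book_lines <- c("EMMA", "",
                "CHAPTER I",
                "Emma Woodhouse, handsome, clever, and rich,",
                "had lived nearly twenty-one years.",
                "CHAPTER II",
                "Mr. Weston was a native of Highbury.")

# TRUE on each heading line; the running sum labels every line with its chapter.
is_heading <- grepl("^chapter [0-9ivxlc]", book_lines, ignore.case = TRUE)
chapter_id <- cumsum(is_heading)
chapter_id

# Drop the front matter (chapter 0) and paste each chapter into one string.
keep <- chapter_id > 0
chapters <- tapply(book_lines[keep], chapter_id[keep], paste, collapse = " ")
chapters
```

Lines before the first heading get the id 0, which is why the original code filters on chapter_id > 0 before collapsing.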

We proceed as before to find the words and remove the stop words.

emma_words <- emma_chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

We run the textrank_sentences function again. It should still be very quick, as the bottleneck of the algorithm is more with the number of vertices rather than their individual size.

emma_summary <- textrank_sentences(data = emma_chapters, 
                                   terminology = emma_words)
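A rough way to see why the number of vertices matters most: if every pair of vertices is a candidate edge (which I assume is the default behaviour here), the number of comparisons grows quadratically with the number of vertices, no matter how long each vertex is. The helper candidate_pairs below is a hypothetical name used only for this back-of-the-envelope sketch.

```r
# Number of vertex pairs to compare when every pair is a candidate edge.
candidate_pairs <- function(n_vertices) choose(n_vertices, 2)

# 55 matches the chapter axis used in the plot of this post;
# 500 is a hypothetical sentence count for a long article.
candidate_pairs(55)
candidate_pairs(500)
```

With 55 chapters there are only 1485 pairs to compare, whereas a hypothetical text with 500 sentences would already need 124750.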

We will be careful not to use the standard printing method, as it would print 5 whole chapters!!

Instead we will look at the bar chart again to see if the important chapters appear in any particular order.

emma_summary[["sentences"]] %>%
  ggplot(aes(textrank_id, textrank, fill = textrank_id)) +
  geom_col() +
  theme_minimal() +
  scale_fill_viridis_c(option = "inferno") +
  guides(fill = "none") +
  labs(x = "Chapter",
       y = "TextRank score",
       title = "Chapter importance in the novel Emma by Jane Austen") +
  scale_x_continuous(breaks = seq(from = 0, to = 55, by = 5))

That doesn’t appear to be the case in this particular text (which is probably good, since skipping a chapter would be discouraged in a book like Emma). However, it might prove helpful in non-chronological texts.

Session information


─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.6        
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-04-23                  

─ Packages ───────────────────────────────────────────────────────────────────
 ! package     * version date       lib source        
 P assertthat    0.2.1   2019-03-21 [?] CRAN (R 3.6.0)
 P backports     1.1.6   2020-04-05 [?] CRAN (R 3.6.0)
 P blogdown      0.18    2020-03-04 [?] CRAN (R 3.6.0)
 P bookdown      0.18    2020-03-05 [?] CRAN (R 3.6.0)
 P broom         0.5.5   2020-02-29 [?] CRAN (R 3.6.0)
 P cellranger    1.1.0   2016-07-27 [?] CRAN (R 3.6.0)
 P cli           2.0.2   2020-02-28 [?] CRAN (R 3.6.0)
 P clipr         0.7.0   2019-07-23 [?] CRAN (R 3.6.0)
 P colorspace    1.4-1   2019-03-18 [?] CRAN (R 3.6.0)
 P crayon        1.3.4   2017-09-16 [?] CRAN (R 3.6.0)
 P data.table    1.12.8  2019-12-09 [?] CRAN (R 3.6.0)
 P DBI           1.1.0   2019-12-15 [?] CRAN (R 3.6.0)
 P dbplyr        1.4.2   2019-06-17 [?] CRAN (R 3.6.0)
 P desc          1.2.0   2018-05-01 [?] CRAN (R 3.6.0)
 P details     * 0.2.1   2020-01-12 [?] CRAN (R 3.6.0)
 P digest        0.6.25  2020-02-23 [?] CRAN (R 3.6.0)
 P dplyr       * 0.8.5   2020-03-07 [?] CRAN (R 3.6.0)
 P ellipsis      0.3.0   2019-09-20 [?] CRAN (R 3.6.0)
 P evaluate      0.14    2019-05-28 [?] CRAN (R 3.6.0)
 P fansi         0.4.1   2020-01-08 [?] CRAN (R 3.6.0)
 P forcats     * 0.5.0   2020-03-01 [?] CRAN (R 3.6.0)
 P fs            1.4.1   2020-04-04 [?] CRAN (R 3.6.0)
 P generics      0.0.2   2018-11-29 [?] CRAN (R 3.6.0)
 P ggplot2     * 3.3.0   2020-03-05 [?] CRAN (R 3.6.0)
 P glue          1.4.0   2020-04-03 [?] CRAN (R 3.6.0)
 P gtable        0.3.0   2019-03-25 [?] CRAN (R 3.6.0)
 P haven         2.2.0   2019-11-08 [?] CRAN (R 3.6.0)
 P hms           0.5.3   2020-01-08 [?] CRAN (R 3.6.0)
 P htmltools     0.4.0   2019-10-04 [?] CRAN (R 3.6.0)
 P httr          1.4.1   2019-08-05 [?] CRAN (R 3.6.0)
 P igraph        1.2.5   2020-03-19 [?] CRAN (R 3.6.0)
 P janeaustenr   0.1.5   2017-06-10 [?] CRAN (R 3.6.0)
 P jsonlite      1.6.1   2020-02-02 [?] CRAN (R 3.6.0)
 P knitr       * 1.28    2020-02-06 [?] CRAN (R 3.6.0)
 P lattice       0.20-41 2020-04-02 [?] CRAN (R 3.6.0)
 P lifecycle     0.2.0   2020-03-06 [?] CRAN (R 3.6.0)
 P lubridate     1.7.8   2020-04-06 [?] CRAN (R 3.6.0)
 P magrittr      1.5     2014-11-22 [?] CRAN (R 3.6.0)
 P Matrix        1.2-18  2019-11-27 [?] CRAN (R 3.6.0)
 P modelr        0.1.6   2020-02-22 [?] CRAN (R 3.6.0)
 P munsell       0.5.0   2018-06-12 [?] CRAN (R 3.6.0)
 P nlme          3.1-145 2020-03-04 [?] CRAN (R 3.6.0)
 P pillar        1.4.3   2019-12-20 [?] CRAN (R 3.6.0)
 P pkgconfig     2.0.3   2019-09-22 [?] CRAN (R 3.6.0)
 P png           0.1-7   2013-12-03 [?] CRAN (R 3.6.0)
 P purrr       * 0.3.3   2019-10-18 [?] CRAN (R 3.6.0)
 P R6            2.4.1   2019-11-12 [?] CRAN (R 3.6.0)
 P Rcpp          1.0.4.6 2020-04-09 [?] CRAN (R 3.6.0)
 P readr       * 1.3.1   2018-12-21 [?] CRAN (R 3.6.0)
 P readxl        1.3.1   2019-03-13 [?] CRAN (R 3.6.0)
   renv          0.9.3   2020-02-10 [1] CRAN (R 3.6.0)
 P reprex        0.3.0   2019-05-16 [?] CRAN (R 3.6.0)
 P rlang         0.4.5   2020-03-01 [?] CRAN (R 3.6.0)
 P rmarkdown     2.1     2020-01-20 [?] CRAN (R 3.6.0)
 P rprojroot     1.3-2   2018-01-03 [?] CRAN (R 3.6.0)
 P rstudioapi    0.11    2020-02-07 [?] CRAN (R 3.6.0)
 P rvest       * 0.3.5   2019-11-08 [?] CRAN (R 3.6.0)
 P scales        1.1.0   2019-11-18 [?] CRAN (R 3.6.0)
 P sessioninfo   1.1.1   2018-11-05 [?] CRAN (R 3.6.0)
 P SnowballC     0.7.0   2020-04-01 [?] CRAN (R 3.6.2)
 P stringi       1.4.6   2020-02-17 [?] CRAN (R 3.6.0)
 P stringr     * 1.4.0   2019-02-10 [?] CRAN (R 3.6.0)
 P textrank    * 0.3.0   2019-01-17 [?] CRAN (R 3.6.0)
 P tibble      * 3.0.1   2020-04-20 [?] CRAN (R 3.6.2)
 P tidyr       * 1.0.2   2020-01-24 [?] CRAN (R 3.6.0)
 P tidyselect    1.0.0   2020-01-27 [?] CRAN (R 3.6.0)
 P tidytext    * 0.2.3   2020-03-04 [?] CRAN (R 3.6.0)
 P tidyverse   * 1.3.0   2019-11-21 [?] CRAN (R 3.6.0)
 P tokenizers    0.2.1   2018-03-29 [?] CRAN (R 3.6.0)
 P vctrs         0.2.4   2020-03-10 [?] CRAN (R 3.6.0)
 P withr         2.1.2   2018-03-15 [?] CRAN (R 3.6.0)
 P xfun          0.13    2020-04-13 [?] CRAN (R 3.6.2)
 P xml2        * 1.3.0   2020-04-01 [?] CRAN (R 3.6.2)
 P yaml          2.2.1   2020-02-01 [?] CRAN (R 3.6.0)

[1] /Users/emilhvitfeldthansen/Desktop/blogv4/renv/library/R-3.6/x86_64-apple-darwin15.6.0
[2] /private/var/folders/m0/zmxymdmd7ps0q_tbhx0d_26w0000gn/T/RtmpxWontu/renv-system-library

 P ── Loaded and on-disk path mismatch.