This code has been lightly revised to make sure it works as of 2018-12-16.
In this post I will highlight a couple of the new features in the latest update of my package ggpage.
First we load the packages we need:
tidyverse for general tidy tools,
ggpage for visualization, and finally
rtweet and rvest for data collection.
library(tidyverse)
library(ggpage)
library(rtweet)
library(rvest)
The package includes 2 main functions,
ggpage_build and ggpage_plot, which transform the data into the right format and plot it, respectively. The functions are split so that additional transformations can be applied to the tokenized data before it is plotted.
The package includes an example data set of the text Tinderbox by H.C. Andersen:
tinderbox %>% head()
## # A tibble: 6 x 2
##   text                                                         book
##   <chr>                                                        <chr>
## 1 "A soldier came marching along the high road: \"Left, right… The tinder-…
## 2 had his knapsack on his back, and a sword at his side; he h… The tinder-…
## 3 and was now returning home. As he walked on, he met a very … The tinder-…
## 4 witch in the road. Her under-lip hung quite down on her bre… The tinder-…
## 5 "and said, \"Good evening, soldier; you have a very fine sw… The tinder-…
## 6 knapsack, and you are a real soldier; so you shall have as … The tinder-…
This data set can be used directly with ggpage_build and ggpage_plot:
ggpage_build(tinderbox) %>% ggpage_plot()
ggpage_build expects the column containing the text to be named “text”, which it is in the tinderbox object. This visualization gets exciting when you start combining it with other tools. We can show where the word “tinderbox” appears and add page numbers at the same time.
ggpage_build(tinderbox) %>%
  mutate(tinderbox = word == "tinderbox") %>%
  ggpage_plot(mapping = aes(fill = tinderbox), page.number = "top-left")
And we see that the word tinderbox appears only 3 times, in the middle of the story.
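A count like this can also be checked directly from the flag column instead of being read off the plot. The following is a minimal sketch in base R. It uses a made-up toy token table as a stand-in for the tokenized data, and assumes only that there is a word column, as in the mutate() step above.

```r
# Hypothetical stand-in for the tokenized data; only a `word` column is assumed.
tokens <- data.frame(
  word = c("the", "soldier", "took", "the", "tinderbox",
           "and", "the", "tinderbox", "sparked"),
  stringsAsFactors = FALSE
)

# Flag the word of interest, mirroring the mutate() step above.
tokens$tinderbox <- tokens$word == "tinderbox"

# Number of occurrences and where they fall in the token stream.
n_hits <- sum(tokens$tinderbox)
hit_positions <- which(tokens$tinderbox)
```

Because the flag is just another column, the same idea should carry over to any other word or condition you want to highlight.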
We can also use this to showcase a number of tweets. For this we will use the
rtweet package. We will load in 100 tweets that contain the hashtag #rstats.
## whatever name you assigned to your created app
appname <- "********"
## api key (example below is not a real key)
key <- "**********"
## api secret (example below is not a real key)
secret <- "********"
## create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)
rstats_tweets <- rtweet::search_tweets("#rstats") %>%
  mutate(status_id = as.numeric(as.factor(status_id)))
Since each tweet is too long to fit on a single line, we will use the
nest_paragraphs function to wrap the text within each tweet. By passing the tweet id to
page.col we get one tweet per page. Additionally we can indicate whether each tweet is a retweet by coloring the paper blue if it is and green if it isn’t. Lastly we highlight where “rstats” is used.
rstats_tweets %>%
  select(status_id, text) %>%
  nest_paragraphs(text) %>%
  ggpage_build(page.col = "status_id", lpp = 4, ncol = 6) %>%
  mutate(rstats = word == "rstats") %>%
  ggpage_plot(mapping = aes(fill = rstats), paper.show = TRUE,
              paper.color = ifelse(rstats_tweets$is_retweet, "lightblue", "lightgreen")) +
  scale_fill_manual(values = c("grey60", "black")) +
  labs(title = "100 #rstats tweets",
       subtitle = "blue = retweet, green = original")
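To give a feel for what wrapping the text within each tweet means, here is a rough sketch in base R using strwrap. It is only an illustration of the idea and not how nest_paragraphs is implemented. The toy tweets and the width of 30 are made up for the example.

```r
# Hypothetical toy tweets; the ids and texts are invented for illustration.
tweets <- data.frame(
  status_id = c(1, 2),
  text = c(
    "Just published a new post about visualizing text with pages #rstats",
    "RT: a short note on wrapping long lines before plotting them"
  ),
  stringsAsFactors = FALSE
)

# Wrap each tweet into short lines (the width of 30 is an arbitrary choice).
wrapped <- lapply(tweets$text, strwrap, width = 30)

# One row per wrapped line, keeping track of which tweet it came from.
long <- data.frame(
  status_id = rep(tweets$status_id, lengths(wrapped)),
  text = unlist(wrapped),
  stringsAsFactors = FALSE
)
```

Keeping the id next to every wrapped line is what makes it possible to group the lines by tweet afterwards, which is the role page.col plays above.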
Next we will look at the Convention on the Rights of the Child, which we will scrape with rvest.
url <- "http://www.ohchr.org/EN/ProfessionalInterest/Pages/CRC.aspx"
rights_text <- read_html(url) %>%
  html_nodes('div[class="boxtext"]') %>%
  html_text() %>%
  str_split("\n") %>%
  unlist() %>%
  str_wrap() %>%
  str_split("\n") %>%
  unlist() %>%
  data.frame(text = ., stringsAsFactors = FALSE)
In this case we will remove the vertical space between the pages to have it appear as one long paper, like the website.
case_when comes in very handy here when we want to highlight multiple different words.
For the “United Nations” highlight it was necessary to check that the words “united” and “nations” only appeared in pairs.
rights_text %>%
  ggpage_build(wtl = FALSE, y_space_pages = 0, ncol = 7) %>%
  mutate(highlight = case_when(word %in% c("child", "children") ~ "child",
                               word %in% c("right", "rights") ~ "rights",
                               word %in% c("united", "nations") ~ "United Nations",
                               TRUE ~ "other")) %>%
  ggpage_plot(mapping = aes(fill = highlight)) +
  scale_fill_manual(values = c("darkgreen", "grey", "darkblue", "darkred")) +
  labs(title = "Word highlights in the 'Convention on the Rights of the Child'",
       fill = NULL)
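The pair check for “united” and “nations” is not shown in the code above. One way it could be done is sketched below in base R. The helper appears_only_in_pairs is a hypothetical name introduced here, and the token vectors are toy examples, not the real Convention text.

```r
# Hypothetical helper: TRUE if every `first` is directly followed by `second`
# and every `second` is directly preceded by `first`.
appears_only_in_pairs <- function(words, first, second) {
  n <- length(words)
  if (n == 0) return(TRUE)

  # Neighbouring tokens, padded with NA at the ends of the vector.
  next_word <- c(words[-1], NA_character_)
  prev_word <- c(NA_character_, words[-n])

  first_ok  <- words != first  | (!is.na(next_word) & next_word == second)
  second_ok <- words != second | (!is.na(prev_word) & prev_word == first)

  all(first_ok & second_ok)
}

# Toy token vectors, invented for illustration.
paired   <- c("the", "united", "nations", "and", "the", "child")
unpaired <- c("all", "nations", "united", "for", "the", "child")

paired_result   <- appears_only_in_pairs(paired, "united", "nations")
unpaired_result <- appears_only_in_pairs(unpaired, "united", "nations")
```

If a check like this returns TRUE on the real tokens, matching the two words separately with %in% gives the same highlight as matching the phrase.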
These are just a couple of different ways to use this package. I look forward to seeing what you guys can come up with.