Textrecipes series: Pretrained Word Embedding
This is the fifth blog post in the textrecipes series where I go over the various text preprocessing workflows you can do with textrecipes. This post will be showcasing how to use pretrained word embeddings.
library(tidymodels) library(tidyverse) library(textrecipes) library(textdata) theme_set(theme_minimal())
Exploring the data ⛏
The following section is unchanged from the last post, if you are already familiar with this section then skip to the modeling section.
reviews <- read_csv("Womens Clothing E-Commerce Reviews.csv") ## Warning: Missing column names filled in: 'X1' 
We start by a quick
glimpse() of the data.
glimpse(reviews) ## Rows: 23,486 ## Columns: 11 ## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1… ## $ `Clothing ID` <dbl> 767, 1080, 1077, 1049, 847, 1080, 858, 858,… ## $ Age <dbl> 33, 34, 60, 50, 47, 49, 39, 39, 24, 34, 53,… ## $ Title <chr> NA, NA, "Some major design flaws", "My favo… ## $ `Review Text` <chr> "Absolutely wonderful - silky and sexy and … ## $ Rating <dbl> 4, 5, 3, 5, 5, 2, 5, 4, 5, 5, 3, 5, 5, 5, 3… ## $ `Recommended IND` <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1… ## $ `Positive Feedback Count` <dbl> 0, 4, 0, 0, 6, 4, 1, 4, 0, 0, 14, 2, 2, 0, … ## $ `Division Name` <chr> "Initmates", "General", "General", "General… ## $ `Department Name` <chr> "Intimate", "Dresses", "Dresses", "Bottoms"… ## $ `Class Name` <chr> "Intimates", "Dresses", "Dresses", "Pants",…
We have a good split between text variables, numeric and categorical values.
Let us also take a look at the distribution of the
reviews %>% ggplot(aes(Rating)) + geom_bar()
Which is quite right-skewed.
Let us collapse the ratings into 2 groups, 5 and less-then-5.
Before we go on will I remove the row number variable
X1 and clean the column names with the janitor package to remove cases and spaces.
reviews <- reviews %>% select(-X1) %>% janitor::clean_names() %>% mutate(rating = factor(rating == 5, c(TRUE, FALSE), c("5", "<5")))
Before we do some exploratory analysis we will split the data into training and testing datasets. We do this to avoid learning this about the data that would only be available in the testing data set.
set.seed(1234) reviews_split <- initial_split(reviews) reviews_train <- training(reviews_split)
Our main objective is to predict the rating based on the text review.
This will naturally exclude variables such as
Recommended IND and
Positive Feedback Count as that information is unlikely to be known before the rating is given.
We will mostly be using the text variables (
Review Text and
Title) but I am going to take a look at some of the other variables before we move on in case they would be easy inclusions.
The ratings were pretty highly right-skewed and even when we collapsed them there are still more 5s.
reviews_train %>% ggplot(aes(rating)) + geom_bar()
Since we have the age let us take a look to make sure it has a reasonable range.
reviews_train %>% ggplot(aes(age)) + geom_bar()
Nothing too out of the ordinary, we have some young people and old people but nothing weird. Out of curiosity let us take a look at that one age that is above the pack.
reviews_train %>% count(age, sort = TRUE) ## # A tibble: 77 x 2 ## age n ## <dbl> <int> ## 1 39 987 ## 2 35 674 ## 3 36 623 ## 4 34 595 ## 5 38 590 ## 6 37 574 ## 7 41 570 ## 8 33 541 ## 9 46 533 ## 10 32 489 ## # … with 67 more rows
Since we have the clothing id, then I want to know if any of the reviews apply to the same articles of clothing.
reviews_train %>% count(clothing_id, sort = TRUE) ## # A tibble: 1,074 x 2 ## clothing_id n ## <dbl> <int> ## 1 1078 780 ## 2 862 615 ## 3 1094 553 ## 4 1081 438 ## 5 872 409 ## 6 829 403 ## 7 1110 359 ## 8 868 312 ## 9 895 293 ## 10 936 266 ## # … with 1,064 more rows
So out of 17615 we have 1074 clothing articles. Let us see how the reviews are split between the variables
reviews_train %>% count(clothing_id, sort = TRUE) %>% mutate(x = row_number()) %>% ggplot(aes(x, n)) + geom_point()
And we see quite a fast drop-off.
I’m trying to create a fairly simple model so I won’t be including much information.
We will restrict ourselves to only use the two text fields and the age of the customer. We are doing this so we can compare results with previous results.
Before we go on, let us take a look at the pre-trained word embeddings.
embedding_glove6b() function gives us access to the 6B tokens glove embedding from the Stanford NLP Group.
You will get a prompt the first time you use this function. This is expected and done to make sure that the user agrees to the license and agreements of any given resource.
I have specified
dimensions = 100 to get word vectors with 100 dimensions.
glove6b <- textdata::embedding_glove6b(dimensions = 100) glove6b ## # A tibble: 400,000 x 101 ## token d1 d2 d3 d4 d5 d6 d7 d8 d9 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 "the" -0.0382 -0.245 0.728 -0.400 0.0832 0.0440 -0.391 0.334 -0.575 ## 2 "," -0.108 0.111 0.598 -0.544 0.674 0.107 0.0389 0.355 0.0635 ## 3 "." -0.340 0.209 0.463 -0.648 -0.384 0.0380 0.171 0.160 0.466 ## 4 "of" -0.153 -0.243 0.898 0.170 0.535 0.488 -0.588 -0.180 -1.36 ## 5 "to" -0.190 0.0500 0.191 -0.0492 -0.0897 0.210 -0.550 0.0984 -0.201 ## 6 "and" -0.0720 0.231 0.0237 -0.506 0.339 0.196 -0.329 0.184 -0.181 ## 7 "in" 0.0857 -0.222 0.166 0.134 0.382 0.354 0.0129 0.225 -0.438 ## 8 "a" -0.271 0.0440 -0.0203 -0.174 0.644 0.712 0.355 0.471 -0.296 ## 9 "\"" -0.305 -0.236 0.176 -0.729 -0.283 -0.256 0.266 0.0253 -0.0748 ## 10 "'s" 0.589 -0.202 0.735 -0.683 -0.197 -0.180 -0.392 0.342 -0.606 ## # … with 399,990 more rows, and 91 more variables: d10 <dbl>, d11 <dbl>, ## # d12 <dbl>, d13 <dbl>, d14 <dbl>, d15 <dbl>, d16 <dbl>, d17 <dbl>, ## # d18 <dbl>, d19 <dbl>, d20 <dbl>, d21 <dbl>, d22 <dbl>, d23 <dbl>, ## # d24 <dbl>, d25 <dbl>, d26 <dbl>, d27 <dbl>, d28 <dbl>, d29 <dbl>, ## # d30 <dbl>, d31 <dbl>, d32 <dbl>, d33 <dbl>, d34 <dbl>, d35 <dbl>, ## # d36 <dbl>, d37 <dbl>, d38 <dbl>, d39 <dbl>, d40 <dbl>, d41 <dbl>, ## # d42 <dbl>, d43 <dbl>, d44 <dbl>, d45 <dbl>, d46 <dbl>, d47 <dbl>, ## # d48 <dbl>, d49 <dbl>, d50 <dbl>, d51 <dbl>, d52 <dbl>, d53 <dbl>, ## # d54 <dbl>, d55 <dbl>, d56 <dbl>, d57 <dbl>, d58 <dbl>, d59 <dbl>, ## # d60 <dbl>, d61 <dbl>, d62 <dbl>, d63 <dbl>, d64 <dbl>, d65 <dbl>, ## # d66 <dbl>, d67 <dbl>, d68 <dbl>, d69 <dbl>, d70 <dbl>, d71 <dbl>, ## # d72 <dbl>, d73 <dbl>, d74 <dbl>, d75 <dbl>, d76 <dbl>, d77 <dbl>, ## # d78 <dbl>, d79 <dbl>, d80 <dbl>, d81 <dbl>, d82 <dbl>, d83 <dbl>, ## # d84 <dbl>, d85 <dbl>, d86 <dbl>, d87 <dbl>, d88 <dbl>, d89 <dbl>, ## # d90 <dbl>, d91 <dbl>, d92 <dbl>, d93 <dbl>, d94 <dbl>, d95 <dbl>, ## # d96 <dbl>, d97 <dbl>, d98 <dbl>, d99 <dbl>, d100 <dbl>
The format of these word vectors is perfectly tailored to work with textrecipes. The first column has the tokens, and the remaining numerical columns are the word vectors.
we need to specify a tokenizer that closely matches the tokenizer that was used in the pre-trained word embedding. Otherwise, will you get mismatches between words.
I was not able to find the same tokenizer used in this case. but I found that the default
strip_punct = FALSE gives very similar results.
we can pass arguments to the underlying tokenizer in
step_tokenize() by passing a named list to the
options = argument.
We will be using the default method of aggregating the vectors within each observation which is to sum them together. This can be changed using the
aggregation = argument.
rec_spec <- recipe(rating ~ age + title + review_text, data = reviews_train) %>% step_tokenize(title, review_text, options = list(strip_punct = FALSE)) %>% step_tokenmerge(title, review_text, prefix = "text") %>% step_word_embeddings(text, embeddings = glove6b)
We are using
step_tokenmerge() to combine the tokens created in
review_text into one list of tokens. There aren’t that many tokens in
title alone for it to warrant treating it as a separate list of tokens.
Next, we specify a lasso model.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")
I have specified
penalty = tune() because I want to use tune to find the best value of the penalty by doing hyperparameter tuning.
We set up a parameter grid using
param_grid <- grid_regular(penalty(), levels = 50)
searching over 50 levels might seem like a lot. But remember that glmnet can calculate them all at once. This is because it relies on its warms starts for speed and it is often faster to fit a whole path than compute a single fit.
We will also set up some bootstraps of the data so we can evaluate the performance multiple times for each level.
reviews_boot <- bootstraps(reviews_train, times = 10)
the last thing we need to use is to create a workflow object to combine the preprocessing step with the model. This is important because we want the preprocessing steps to happen in the bootstraps.
wf_fh <- workflow() %>% add_recipe(rec_spec) %>% add_model(lasso_spec)
now we are ready to perform the parameter tuning.
set.seed(42) lasso_grid <- tune_grid( wf_fh, resamples = reviews_boot, grid = param_grid )
Once we have finished parameter tuning we can use the
autoplot() function on the tuning results to get a nice chart showing the performance for different values of the penalty.
lasso_grid %>% autoplot()
It appears that the best value for the penalty for this workflow is on the low end.
It is worth noting that this data coming out of recipes is dense since we are using word vectors. We will only be having 100 predictors in this model (since we choose
dimensions = 100). This is an order of magnitude less than the last time where we had 1024 sparse predictors.
This more dense data representation also allows us to use models that we normally can’t use when doing count-based preprocessing since some models don’t handle sparseness that well.
Similarly, can we use the
show_best() function from tune to show to the best performing hyperparameter.
lasso_grid %>% show_best("roc_auc") ## # A tibble: 5 x 7 ## penalty .metric .estimator mean n std_err .config ## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 0.000133 roc_auc binary 0.820 10 0.00103 Model31 ## 2 0.000212 roc_auc binary 0.820 10 0.00104 Model32 ## 3 0.0000829 roc_auc binary 0.820 10 0.00101 Model30 ## 4 0.0000518 roc_auc binary 0.820 10 0.000999 Model29 ## 5 0.0000324 roc_auc binary 0.820 10 0.00100 Model28
We will use the
select_best() function to extract the best performing penalty and finalize the workflow with that value of penalty.
wf_fh_final <- wf_fh %>% finalize_workflow(parameters = select_best(lasso_grid, "roc_auc"))
Now we can run
last_fit() on our training/testing split to fit our final model.
final_res <- last_fit(wf_fh_final, reviews_split)
With our final model can we create a ROC curve of our final model.
final_res %>% collect_predictions() %>% roc_curve(rating, .pred_5) %>% autoplot()
This example is mostly a showcase of how to use pre-trained word embeddings. We are not getting as good performance as we did with feature hashing of simple term frequencies from the last post.
We can improve the performance in a couple of ways. The first way is to use bigger embeddings. We can increase the number of dimensions we are using. This embedding comes with as many as 300 dimensions. We can also use an embedding with a larger vocabulary. I used the smallest pre-trained glove embedding, but bigger ones are available too.
Another way to get improved performance is to use word vectors that more closely matched the target domain you are working in. Many pre-trained word vectors and trained on text for the general web, Wikipedia and, news articles. The more removed your text is from these kinds of text, the less likely it is that the embedding will be helpful.
I will showcase in a later post how you can train your own word embedding.
If you want to know more about word embeddings, Julia Silge and I have a chapter in our upcoming book Supervised Machine Learning for Text Analysis in R which goes more in dept on Word Embeddings.