Textrecipes series: TF-IDF
This is the third blog post in the textrecipes series where I go over the various text preprocessing workflows you can do with textrecipes. This post will be showcasing how to perform term frequency-inverse document frequency (Tf-IDF for short).
The packages used in the post shouldn’t come to any surprise if you have been following the series. tidymodels for modeling, tidyverse for EDA, textrecipes for text preprocessing, vip for visualizing variable importance, and doParallel to parallelize the hyperparameter tuning.
library(tidymodels) library(tidyverse) library(textrecipes) library(vip) library(doParallel) theme_set(theme_minimal())
Exploring the data ⛏
David goes into a lot of detail explaining what he is doing and I highly recommend to watch this one if you are interested in using text in regression.
Fortunately he didn’t use tidymodels so this post will bring a little something new.
Our goal for this post is to build a model that predicts the score (denotes
points) a particular wine has.
wine_ratings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv") ## Warning: Missing column names filled in: 'X1' 
We load in the data with
read_csv() and immediately use
glimpse() to get an idea of the data we have to work with
glimpse(wine_ratings) ## Rows: 129,971 ## Columns: 14 ## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1… ## $ country <chr> "Italy", "Portugal", "US", "US", "US", "Spain",… ## $ description <chr> "Aromas include tropical fruit, broom, brimston… ## $ designation <chr> "Vulkà Bianco", "Avidagos", NA, "Reserve Late H… ## $ points <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87,… ## $ price <dbl> NA, 15, 14, 13, 65, 15, 16, 24, 12, 27, 19, 30,… ## $ province <chr> "Sicily & Sardinia", "Douro", "Oregon", "Michig… ## $ region_1 <chr> "Etna", NA, "Willamette Valley", "Lake Michigan… ## $ region_2 <chr> NA, NA, "Willamette Valley", NA, "Willamette Va… ## $ taster_name <chr> "Kerin O’Keefe", "Roger Voss", "Paul Gregutt", … ## $ taster_twitter_handle <chr> "@kerinokeefe", "@vossroger", "@paulgwine ", NA… ## $ title <chr> "Nicosia 2013 Vulkà Bianco (Etna)", "Quinta do… ## $ variety <chr> "White Blend", "Portuguese Red", "Pinot Gris", … ## $ winery <chr> "Nicosia", "Quinta dos Avidagos", "Rainstorm", …
This dataset barely contains any numeric variables. The only numeric is the price. As with many prices in data, it is a good idea to log transform them since they are highly skewed
wine_ratings %>% mutate(price_log = log(price)) %>% pivot_longer(c(price, price_log)) %>% ggplot(aes(value)) + geom_histogram(bins = 50) + facet_wrap(~name, scales = "free_x") ## Warning: Removed 17992 rows containing non-finite values (stat_bin).
Since most of the data most likely will be factors let us take a look at the cardinality of each variable
map_int(wine_ratings, n_distinct) ## X1 country description ## 129971 44 119955 ## designation points price ## 37980 21 391 ## province region_1 region_2 ## 426 1230 18 ## taster_name taster_twitter_handle title ## 20 16 118840 ## variety winery ## 708 16757
But wait! the number of unique descriptions is not the same as the number of rows. This seems very odd since they will be multiple sentences long and the likelihood of two different people writing the exact same description is very low.
let us take a selection of duplicated descriptions and see if anything stands out.
wine_ratings %>% filter(duplicated(description)) %>% slice(1:3) %>% pull(description) ##  "This is weighty, creamy and medium to full in body. It has plenty of lime and pear flavors, plus slight brown sugar and vanilla notes." ##  "There's a touch of toasted almond at the start, but then this Grillo revs up in the glass to deliver notes of citrus, stone fruit, crushed stone and lemon tart. The mouthfeel is crisp and simple." ##  "Lightly herbal strawberry and raspberry aromas are authentic and fresh. On the palate, this is light and juicy, with snappy, lean flavors of red fruit and dry spice. The finish is dry and oaky."
as we feared these are pretty specific and would be unlikely to be duplications at random. We will assume that this problem is a scraping error and remove the duplicate entries. Additionally, some of the points values are missing, since this is our target variable will I remove those data points as well.
If you are working on a real project you shouldn’t simply delete observations like that. Both of these errors smell a little bit like bad scraping so your first course of action should be testing your data pipeline for errors.
I found out about the issue with duplicate descriptions when I was browsing through other people’s analyses of the dataset.
Before we do any more analysis, let us remove the troublesome observations.
wine_ratings <- wine_ratings %>% filter(!duplicated(description), !is.na(price)) wine_ratings ## # A tibble: 111,567 x 14 ## X1 country description designation points price province region_1 region_2 ## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1 Portug… This is ri… Avidagos 87 15 Douro <NA> <NA> ## 2 2 US Tart and s… <NA> 87 14 Oregon Willame… Willame… ## 3 3 US Pineapple … Reserve La… 87 13 Michigan Lake Mi… <NA> ## 4 4 US Much like … Vintner's … 87 65 Oregon Willame… Willame… ## 5 5 Spain Blackberry… Ars In Vit… 87 15 Norther… Navarra <NA> ## 6 6 Italy Here's a b… Belsito 87 16 Sicily … Vittoria <NA> ## 7 7 France This dry a… <NA> 87 24 Alsace Alsace <NA> ## 8 8 Germany Savory dri… Shine 87 12 Rheinhe… <NA> <NA> ## 9 9 France This has g… Les Natures 87 27 Alsace Alsace <NA> ## 10 10 US Soft, supp… Mountain C… 87 19 Califor… Napa Va… Napa ## # … with 111,557 more rows, and 5 more variables: taster_name <chr>, ## # taster_twitter_handle <chr>, title <chr>, variety <chr>, winery <chr>
Countries look like it would be important to include, doing a little bar chart reveals a high imbalance in where the wines are coming from. We will definitely need to weed out some of the low count countries
wine_ratings %>% count(country, sort = TRUE) %>% ggplot(aes(n, fct_reorder(country, n), label = country)) + geom_col() + geom_text(hjust = 0, nudge_x = 1000) ## Warning: Removed 1 rows containing missing values (geom_text).
This dataset is restricted to review of wine that scored 80 points or more,
wine_ratings %>% ggplot(aes(points)) + geom_bar()
It looks like the 80 wasn’t as hard cutoff, and the points even look bell-shaped.
I’ll be using
variety as well in the final analysis.
We start by doing a simple training test split of the data using the yardstick package.
set.seed(1234) wine_split <- initial_split(wine_ratings) wine_training <- training(wine_split) wine_testing <- training(wine_split)
Next will we use recipes and textrecipes to specify the preprocessing of the data.
step_log() to take the logarithm of
step_uknowm() to turn missing values in factors into levels with name “unknown”
step_other() to lump together factor levels that don’t take up more the 1% of the counts.
step_dummy() to dummify the factor variables
step_tokenize() to turn the descriptions into tokens
step_stopwords() to remove stop words from the tokens (ALWAYS manually verify your stop word list)
step_tokenfilter() to limit the number of tokens we will use when calculating tf-idf. We will only keep tokens if they appear more then 100 times and of those will be at most take the 2000 most frequent tokens.
step_tfidf() to calculate the term frequency-inverse document frequency of the tokens.
step_normalize() to normalize all the predictors to have a standard deviation of one and a mean of zero. We need to do this because it’s important for lasso regularization.
rec_spec <- recipe(points ~ description + price + country + variety + taster_name, data = wine_training) %>% step_log(price) %>% step_unknown(country, variety, taster_name) %>% step_other(country, variety, threshold = 0.01) %>% step_dummy(country, variety, taster_name) %>% step_tokenize(description) %>% step_stopwords(description) %>% step_tokenfilter(description, min_times = 100, max_tokens = 2000) %>% step_tfidf(description) %>% step_normalize(all_predictors())
We will use lasso regression and we will use the “glmnet” engine.
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet")
I have specified
penalty = tune() because I want to use tune to find the best value of the penalty by doing hyperparameter tuning.
We set up a parameter grid using
param_grid <- grid_regular(penalty(), levels = 50)
searching over 50 levels might seem like a lot. But remember that glmnet is able to calculate them all at once. This is because it relies on its warms starts for speed and it is often faster to fit a whole path than compute a single fit.
We will also set up some bootstraps of the data so we can evaluate the performance multiple times for each level.
wine_boot <- bootstraps(wine_training, times = 10)
the last thing we need to use is to create a workflow object to combine the preprocessing step with the model. This is important because we want the preprocessing steps to happen in the bootstraps.
lasso_wf <- workflow() %>% add_recipe(rec_spec) %>% add_model(lasso_spec)
now we are ready to perform the parameter tuning. We will be using
doParallel to speed up the calculations by using multiple cores.
doParallel::registerDoParallel() set.seed(42) lasso_grid <- tune_grid( lasso_wf, resamples = wine_boot, grid = param_grid ) lasso_grid ## # Bootstrap sampling ## # A tibble: 10 x 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [83.7K/30.8K]> Bootstrap01 <tibble [100 × 4]> <tibble [1 × 1]> ## 2 <split [83.7K/30.8K]> Bootstrap02 <tibble [100 × 4]> <tibble [1 × 1]> ## 3 <split [83.7K/30.8K]> Bootstrap03 <tibble [100 × 4]> <tibble [1 × 1]> ## 4 <split [83.7K/30.6K]> Bootstrap04 <tibble [100 × 4]> <tibble [1 × 1]> ## 5 <split [83.7K/30.8K]> Bootstrap05 <tibble [100 × 4]> <tibble [1 × 1]> ## 6 <split [83.7K/30.7K]> Bootstrap06 <tibble [100 × 4]> <tibble [1 × 1]> ## 7 <split [83.7K/30.8K]> Bootstrap07 <tibble [100 × 4]> <tibble [1 × 1]> ## 8 <split [83.7K/30.8K]> Bootstrap08 <tibble [100 × 4]> <tibble [1 × 1]> ## 9 <split [83.7K/30.9K]> Bootstrap09 <tibble [100 × 4]> <tibble [1 × 1]> ## 10 <split [83.7K/30.8K]> Bootstrap10 <tibble [100 × 4]> <tibble [1 × 1]>
Now that the grid search has finished we can look at the best performing models with
show_best(lasso_grid, metric = "rmse") ## # A tibble: 5 x 6 ## penalty .metric .estimator mean n std_err ## <dbl> <chr> <chr> <dbl> <int> <dbl> ## 1 0.00356 rmse standard 1.64 10 0.00222 ## 2 0.00222 rmse standard 1.64 10 0.00220 ## 3 0.00139 rmse standard 1.64 10 0.00219 ## 4 0.00569 rmse standard 1.64 10 0.00227 ## 5 0.000869 rmse standard 1.64 10 0.00218
We are quite satisfied with these results!
select_best() to extract the best performing one
best_penalty <- select_best(lasso_grid, metric = "rmse")
and we will use that value of penalty in our final workflow object
final_wf <- finalize_workflow( lasso_wf, best_penalty )
Now all there is to do is to fit the workflow on the real training dataset.
final_lasso <- final_wf %>% fit(data = wine_training)
And then, finally, let’s return to our test data. The tune package has a function last_fit() which is nice for situations when you have tuned and finalized a model or workflow and want to fit it one last time on your training data and evaluate it on your testing data. You only have to pass this function your finalized model/workflow and your split.
Finally can we return to our testing dataset. We can use the
last_fit() function to apply our finalized workflow to the testing dataset and see what performance we are getting.
last_fit(final_lasso, wine_split) %>% collect_metrics() ## ! Resample1: model (predictions): There are new levels in a factor: NA ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 1.65 ## 2 rsq standard 0.721
Look at best and worst-performing reviews
A good way of looking at how well your model is performing is to look at the observations it got right and which it got wrong.
wine_eval <- wine_training %>% bind_cols( predict(final_lasso, new_data = wine_training) )
First let us plot the observed vs the predicted values, we have added a little bit of horizontal noise to prevent overplotting too much. A red line what been added at the middle
wine_eval %>% ggplot(aes(points, .pred)) + geom_jitter(height = 0, width = 0.2, alpha = 0.1) + geom_abline(color = "firebrick", slope = 1, intercept = 0)
The main takeaway from this chart is that the model is overestimating the points given to low reviews and underestimates the points given to high reviews.
Below is the the the reviews that the model underestimates for the lowest-rated reviews
remember that the model didn’t only use text input to determine its predictions.
wine_eval %>% filter(points == 83) %>% arrange(.pred) %>% slice(1:5) %>% pull(description) ##  "Thin, awkward and excessively herbal, this sour wine seems barely ripe." ##  "Soft, sugary sweet and simple, with cherry jam and chocolate flavors." ##  "A bit harsh and bitter, with some unripe, green flavors alongside blackberries. Finishes dry and astringent." ##  "A very simple Chardonnay with sour apple flavor." ##  "Weedy and vegetal despite some decent raspberry, cherry and spice flavors. Okay for a big party where nobody's fussy."
All of these reviews are fairly short, and might not give as much weight as the other variables.
overestimated bad reviews
wine_eval %>% filter(points == 83) %>% arrange(desc(.pred)) %>% slice(1:5) %>% pull(description) ##  "Cherry and sautéed morels provide an intriguing introduction to this deeply hued wine, but there are also hints of something funkier, like wet newspaper. On the palate, the pleasing black cherry returns, as does anise, some bitter elements and chewy tannins." ##  "With refreshingly tart blackberry fruit, this Zweigelt has some concentration and displays a good balance between light body, refreshing acidity and clean, lively fruit notes. A great party wine." ##  "Pipe tobacco, raspberry reduction and a sweet sagebrush character promise an interesting experience. Once sipped, the overwhelmingly sweet cherry, cough-syrup flavor pushes this wine beyond a dry, and the typical Cabernet Franc herbs that emerge in the midpalate don't overwhelm its cloying nature. Good for dessert, by itself." ##  "Lifted floral and stone fruit aromatics add intensity to peachy, grape flavors on the palate of this wine. It's rich and luscious, with a bright burst of tangerine on the finish." ##  "A confected wine that shows polished wood and mint flavors but little fruit. It puts on a great show, but without depth."
underestimated good reviews
wine_eval %>% filter(points == 95) %>% arrange(.pred) %>% slice(1:5) %>% pull(description) ##  "Dark and racy, with a Northern Rhône-style aroma of black pepper, cedar and black currants, but there's nothing Rhône-like about the flavor. It's distinctly California, with soft, sweetly ripe tannins bursting with blackberry jam, black currants, anise, vanilla and sweet oak flavors. The wine is fully dry. Absolutely first-rate Syrah." ##  "Rued Ranch consistently is the winery's top bottling. The 2011, like its predecessors, is modest in alcohol but vast in citrus, tropical fruit, honeysuckle and mineral flavors, and enriched with smoky oak and buttery notes. It has a delicate mouthfeel that makes it a joy to savor." ##  "Apple and passion fruit are the aromatic top notes, hovering above a nose of honey and musk. Notes of honey, caramel and maple syrup seem to unite on the palate, guided and framed by the sharpest, most luminous acidity. Extremes are at work here, revealing both utter sweetness and utter acidity. Dazzling all of the senses, this wine leaves a rich aftertaste." ##  "This sweet, luscious, dessert-style wine keeps enough acidity to contrast and compensate for the sugar. There's nothing flabby here, but how sweet it is! Caramel apples, molasses, maple syrup and honey coalesce around concentrated fruits, ranging from citrus through tropical. Orange and peach, pineapple and papaya are all to be found. A half bottle is rich enough to satisfy four tasters." ##  "Hailing from the town of Suvereto in the southern Maremma, this sophomore Cabernet-Merlot blend from the same owner as Bellavista and Contadi Castaldi in Franciacorta is flat-out awesome. One whiff of the bouquet says it all: earth, currant, blackberry and coffee. The palate is equally sensational—a magic carpet ride of plum fruit, pure oak and solid but forgiving tannins. Drink and enjoy any time through 2006."
overrated good reviews
wine_eval %>% filter(points == 95) %>% arrange(desc(.pred)) %>% slice(1:5) %>% pull(description) ##  "This is a deeply rich, very impressive wine from a great white-wine vintage. It's concentrated with the essence of pure Chardonnay. It has tension along with a full texture. Age it for at least 10 years." ##  "Produced from vines mainly planted in the 1970s, this is a concentrated dark wine. Rich and ripe, it has powerful tannins mingled with beautifully perfumed fruits. The wine is dense and packed with floral fruitiness as well as fine, sweet tannins. Drink from 2025." ##  "Powerful and complex, this dark wine comes from vines at the heart of appellation, rich and dense. It offers the potential of great fruit, rich acidity and considerable structure, and will develop giving richness and long aging. Drink from 2023." ##  "The grandest of all white Burgundy vineyards, Montrachet has produced a powerful, great wine, with impressive ripe, dense fruit. There is still elegance, a sense of proportion, but the opulence of the Chardonnay here is impossible to resist." ##  "A solidly structured wine, it is both elegant and intensely powerful. It is disclosing its fruitiness slowly, at the moment dense and concentrated. A restrained edge provides complexity and balance. It will need aging, so don't drink before 2022."