Custom Profiler in R

This blogpost is going to describe how to write a customizable profiling function. If you are not familiar with profiling read the Profiling section of Advanced R to familiarize yourself, I’ll wait.

Welcome back!

Packages

While these packages aren’t strictly needed since most of what we are doing is happening in base R, am I still loading in tidyverse to do some easier string manipulations and plotting.

library(tidyverse)

Profiling basics

You have properly used the profvis package. It is an amazing package and I use it on a daily basis. However, the amount of information you get can be overwhelming at times depending on your profiling goals.

Lets propose in this scenario that we take in some data, scale and center it, apply PCA while only keeping the components that explain 90% of the variance and lastly apply CLARA clustering and return the classification.

The code to do that is contained in the following chunk.

pca_threshold <- function(x, threshold) {
  data_pca <- prcomp(x, scale. = TRUE)
  total_var <- sum(data_pca$sdev ^ 2)
  num_comp <- which.max(cumsum(data_pca$sdev ^ 2 / total_var) >= threshold)
  data_pca$x[, seq_len(num_comp)]
}

pca_kmeans <- function(x, threshold = 0.9, centers = 2) {
  data_matrix <- as.matrix(x)
  data_pca <- pca_threshold(data_matrix, threshold = threshold)
  data_kmeans <- cluster::clara(data_pca, k = centers)
  data_kmeans$cluster
}

Now we create some data, and run profvis on it

large_data <- diamonds %>%
  select_if(is.numeric) %>%
  sample_n(100000, replace = TRUE)
profvis::profvis({
  pca_kmeans(large_data)
})

And we get the following information back.

The Problem

It is very infomrative, but it is also giving a LOT of information. Lets propose we want to know the percentage of the computation time is used to do the PCA calculations. In the profvis framework you would need to do the calculation manually. All while waiting for the html widget to load.

The Idea

profvis uses the Rprof function internally to inspect what is happening. By using Rprof directly we can extract the profile and calculate out out output/metrix.

The base profiling steps are

tmp <- tempfile()
Rprof(tmp)
##################
# Code goes here #
##################
Rprof(NULL)
profile <- readLines(tmp)

This chunk will set up a temporary file, start the profiler and set it to write to the temporary file, stop the profiler and read the result from the profiler.

Trying it with our code we get

tmp <- tempfile()
Rprof(tmp)
x <- pca_kmeans(large_data)
Rprof(NULL)
profile <- readLines(tmp)

head(profile)
## [1] "sample.interval=20000"                                                                                                            
## [2] "\"aperm.default\" \"aperm\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [3] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "          
## [4] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "          
## [5] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "          
## [6] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "

Lets see what these lines mean. first we notice that the first line is just denoting the sample interval, so we can ignore that for now. Lets look at the next line

## [1] "\"aperm.default\" \"aperm\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "

This is a snapshot of the “call-stack”, and it reads inside-out. So we have that aperm.default is called inside aperm which is called inside apply which is called inside scale.default and so on and so forth all the way up to pca_kmeans.

Now that we know how Rprof works, we can write some code that checks whether “pca_threshold” appear in the call-stack and then find the percentage.

The Solution

We can now create a function that will calculate the percentage of the time is being spend in a certain function.

prof_procentage <- function(expr, pattern) {
  tmp <- tempfile()
  Rprof(tmp)
  expr
  Rprof(NULL)
  profile <- readLines(tmp)[-1]
  
  mean(grepl(pattern, profile))
}

This function can now easily be used on our calculation.

prof_procentage(
  x <- pca_kmeans(large_data),
  pattern = "pca_threshold"
)
## [1] 0.6666667

And this is how to create a custom profiler. Simple modify the last line in the skeleton function prof_procentage to change its behavior.

the Extensions

The sky’s the limit! you are only limited by your regex abilities. You can also change the output. In the last example I returned a numeric of the percentage, But we can also have the output be a plot

prof_procentage_plot <- function(expr, pattern) {
  tmp <- tempfile()
  Rprof(tmp)
  expr
  Rprof(NULL)
  profile <- readLines(tmp)[-1]
  
  data.frame(x = grepl(pattern, profile)) %>%
    ggplot(aes(x)) +
    geom_bar()
}

prof_procentage_plot(
  x <- pca_kmeans(large_data),
  pattern = "pca_threshold"
)

The follow-up

After my initial announcement of this post I got a helpful tweet from Hadley Wickham about the profvis::parse_rprof(). In essence it will help you parse the file you write with Rprof to help you get to your answer faster and safer.

So some output like

## [1] "sample.interval=20000"                                                                                                            
## [2] "\"aperm.default\" \"aperm\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "
## [3] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "          
## [4] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "          
## [5] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "          
## [6] "\"is.na\" \"FUN\" \"apply\" \"scale.default\" \"scale\" \"prcomp.default\" \"prcomp\" \"pca_threshold\" \"pca_kmeans\" "

Will be transformed to a nice data.frame with profvis::parse_rprof()

##    time depth          label filenum linenum memalloc meminc filename
## 1     1     9  aperm.default      NA      NA        0      0     <NA>
## 2     1     8          aperm      NA      NA        0      0     <NA>
## 3     1     7          apply      NA      NA        0      0     <NA>
## 4     1     6  scale.default      NA      NA        0      0     <NA>
## 5     1     5          scale      NA      NA        0      0     <NA>
## 6     1     4 prcomp.default      NA      NA        0      0     <NA>
## 7     1     3         prcomp      NA      NA        0      0     <NA>
## 8     1     2  pca_threshold      NA      NA        0      0     <NA>
## 9     1     1     pca_kmeans      NA      NA        0      0     <NA>
## 10    2     9          is.na      NA      NA        0      0     <NA>
## 11    2     8            FUN      NA      NA        0      0     <NA>
## 12    2     7          apply      NA      NA        0      0     <NA>
## 13    2     6  scale.default      NA      NA        0      0     <NA>
## 14    2     5          scale      NA      NA        0      0     <NA>
## 15    2     4 prcomp.default      NA      NA        0      0     <NA>
## 16    2     3         prcomp      NA      NA        0      0     <NA>
## 17    2     2  pca_threshold      NA      NA        0      0     <NA>
## 18    2     1     pca_kmeans      NA      NA        0      0     <NA>
## 19    3     9          is.na      NA      NA        0      0     <NA>
## 20    3     8            FUN      NA      NA        0      0     <NA>
## 21    3     7          apply      NA      NA        0      0     <NA>
## 22    3     6  scale.default      NA      NA        0      0     <NA>
## 23    3     5          scale      NA      NA        0      0     <NA>
## 24    3     4 prcomp.default      NA      NA        0      0     <NA>
## 25    3     3         prcomp      NA      NA        0      0     <NA>
## 26    3     2  pca_threshold      NA      NA        0      0     <NA>
## 27    3     1     pca_kmeans      NA      NA        0      0     <NA>
## 28    4     9          is.na      NA      NA        0      0     <NA>
## 29    4     8            FUN      NA      NA        0      0     <NA>
## 30    4     7          apply      NA      NA        0      0     <NA>
## 31    4     6  scale.default      NA      NA        0      0     <NA>
## 32    4     5          scale      NA      NA        0      0     <NA>
## 33    4     4 prcomp.default      NA      NA        0      0     <NA>
## 34    4     3         prcomp      NA      NA        0      0     <NA>
## 35    4     2  pca_threshold      NA      NA        0      0     <NA>
## 36    4     1     pca_kmeans      NA      NA        0      0     <NA>
## 37    5     9          is.na      NA      NA        0      0     <NA>
## 38    5     8            FUN      NA      NA        0      0     <NA>
## 39    5     7          apply      NA      NA        0      0     <NA>
## 40    5     6  scale.default      NA      NA        0      0     <NA>
## 41    5     5          scale      NA      NA        0      0     <NA>
## 42    5     4 prcomp.default      NA      NA        0      0     <NA>
## 43    5     3         prcomp      NA      NA        0      0     <NA>
## 44    5     2  pca_threshold      NA      NA        0      0     <NA>
## 45    5     1     pca_kmeans      NA      NA        0      0     <NA>
comments powered by Disqus