Day 6: skimr

Welcome back for the 6th day of the #packagecalendar. When you get a new dataset, it is important to do some exploratory data analysis. One of the first tools I reach for is the skimr package by rOpenSci.

The package is available from CRAN and can be installed with

install.packages("skimr")

To illustrate the package, we will use a festive dataset: the Billboard Top 100 Christmas Carol Dataset.

billboards <- read.csv("~/Downloads/christmas_billboard_data.csv")
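
A note on reproducing the output in this post: the text columns show up as factors further down, which is what read.csv() produced by default before R 4.0.0. On newer versions of R the same call yields character columns unless factors are requested explicitly. Here is a minimal sketch of the difference, using a hypothetical three-row excerpt (csv_text, with values taken from the rows printed in this post) rather than the real file:

```r
# Hypothetical three-row excerpt standing in for the real CSV file.
csv_text <- "song,performer,week_position
RUN RUDOLPH RUN,Chuck Berry,83
JINGLE BELL ROCK,Bobby Helms,57
WHITE CHRISTMAS,Bing Crosby,86"

# With the defaults of R >= 4.0.0, text columns are read as character vectors.
as_default <- read.csv(text = csv_text)

# Explicitly request factors, matching the column types shown in this post.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_default$song)
class(as_factor$song)
```

If you are on R 4.0.0 or later and want your results to line up with the ones shown here, adding stringsAsFactors = TRUE to the read.csv() call above should do it.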

Traditionally, I would have simply printed the data.frame (only the first six rows are shown here):

billboards
##                                                  url     weekid week_position
## 1 http://www.billboard.com/charts/hot-100/1958-12-13 12/13/1958            83
## 2 http://www.billboard.com/charts/hot-100/1958-12-20 12/20/1958            57
## 3 http://www.billboard.com/charts/hot-100/1958-12-20 12/20/1958            73
## 4 http://www.billboard.com/charts/hot-100/1958-12-20 12/20/1958            86
## 5 http://www.billboard.com/charts/hot-100/1958-12-27 12/27/1958            44
## 6 http://www.billboard.com/charts/hot-100/1958-12-27 12/27/1958            66
##               song    performer                      songid instance
## 1  RUN RUDOLPH RUN  Chuck Berry  Run Rudolph RunChuck Berry        1
## 2 JINGLE BELL ROCK  Bobby Helms Jingle Bell RockBobby Helms        1
## 3  RUN RUDOLPH RUN  Chuck Berry  Run Rudolph RunChuck Berry        1
## 4  WHITE CHRISTMAS  Bing Crosby  White ChristmasBing Crosby        1
## 5  GREEN CHRI$TMA$ Stan Freberg Green Chri$tma$Stan Freberg        1
## 6  WHITE CHRISTMAS  Bing Crosby  White ChristmasBing Crosby        1
##   previous_week_position peak_position weeks_on_chart year month day
## 1                     NA            69              3 1958    12  13
## 2                     NA            29             19 1958    12  20
## 3                     83            69              3 1958    12  20
## 4                     NA            12             13 1958    12  20
## 5                     NA            44              2 1958    12  27
## 6                     86            12             13 1958    12  27

Alternatively, I would have used summary():

summary(billboards)
##                                                  url             weekid   
##  http://www.billboard.com/charts/hot-100/1960-12-17: 10   12/17/1960: 10  
##  http://www.billboard.com/charts/hot-100/1962-12-22:  9   12/22/1962:  9  
##  http://www.billboard.com/charts/hot-100/1960-12-24:  8   1/6/1962  :  8  
##  http://www.billboard.com/charts/hot-100/1961-12-23:  8   1/7/2017  :  8  
##  http://www.billboard.com/charts/hot-100/1961-12-30:  8   12/15/1962:  8  
##  http://www.billboard.com/charts/hot-100/1962-01-06:  8   12/23/1961:  8  
##  (Other)                                           :336   (Other)   :336  
##  week_position                                            song    
##  Min.   :  7.0   JINGLE BELL ROCK                           : 28  
##  1st Qu.: 38.5   ALL I WANT FOR CHRISTMAS IS YOU            : 20  
##  Median : 58.0   ROCKIN' AROUND THE CHRISTMAS TREE          : 19  
##  Mean   : 57.2   THE CHIPMUNK SONG (CHRISTMAS DON'T BE LATE): 16  
##  3rd Qu.: 78.0   WHITE CHRISTMAS                            : 16  
##  Max.   :100.0   MISTLETOE                                  : 14  
##                  (Other)                                    :274  
##                            performer  
##  Bobby Helms                    : 20  
##  Mariah Carey                   : 20  
##  Brenda Lee                     : 19  
##  Bing Crosby                    : 16  
##  David Seville And The Chipmunks: 16  
##  Goo Goo Dolls                  : 13  
##  (Other)                        :283  
##                                                                         songid   
##  Jingle Bell RockBobby Helms                                               : 20  
##  All I Want For Christmas Is YouMariah Carey                               : 19  
##  Rockin' Around The Christmas TreeBrenda Lee                               : 19  
##  The Chipmunk Song (Christmas Don't Be Late)David Seville And The Chipmunks: 16  
##  White ChristmasBing Crosby                                                : 14  
##  Better DaysGoo Goo Dolls                                                  : 13  
##  (Other)                                                                   :286  
##     instance     previous_week_position peak_position    weeks_on_chart  
##  Min.   :1.000   Min.   :  7.00         Min.   :  7.00   Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.: 35.00         1st Qu.: 14.00   1st Qu.: 5.000  
##  Median :1.000   Median : 57.00         Median : 34.00   Median : 8.000  
##  Mean   :1.587   Mean   : 55.52         Mean   : 37.53   Mean   : 9.646  
##  3rd Qu.:1.000   3rd Qu.: 75.00         3rd Qu.: 53.50   3rd Qu.:15.000  
##  Max.   :6.000   Max.   :100.00         Max.   :100.00   Max.   :20.000  
##                  NA's   :108                                             
##       year          month             day       
##  Min.   :1958   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:1962   1st Qu.: 1.000   1st Qu.: 9.00  
##  Median :1975   Median :12.000   Median :17.00  
##  Mean   :1982   Mean   : 7.755   Mean   :16.67  
##  3rd Qu.:2005   3rd Qu.:12.000   3rd Qu.:24.00  
##  Max.   :2017   Max.   :12.000   Max.   :31.00  
## 

Or I would have used the structure function, str():

str(billboards)
## 'data.frame':    387 obs. of  13 variables:
##  $ url                   : Factor w/ 206 levels "http://www.billboard.com/charts/hot-100/1958-12-13",..: 1 2 2 2 3 3 3 3 4 4 ...
##  $ weekid                : Factor w/ 206 levels "1/1/1994","1/1/2000",..: 111 137 137 137 167 167 167 167 54 54 ...
##  $ week_position         : int  83 57 73 86 44 66 69 35 45 53 ...
##  $ song                  : Factor w/ 70 levels "A GREAT BIG SLED",..: 45 31 45 69 21 69 45 31 31 21 ...
##  $ performer             : Factor w/ 69 levels "98 Degrees","Aly & AJ",..: 18 9 18 6 56 6 18 9 9 56 ...
##  $ songid                : Factor w/ 78 levels "A Great Big SledThe Killers Featuring Toni Halliday",..: 52 33 52 76 23 76 52 33 33 23 ...
##  $ instance              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ previous_week_position: int  NA NA 83 NA NA 86 73 57 35 44 ...
##  $ peak_position         : int  69 29 69 12 44 12 69 29 29 44 ...
##  $ weeks_on_chart        : int  3 19 3 13 2 13 3 19 19 2 ...
##  $ year                  : int  1958 1958 1958 1958 1958 1958 1958 1958 1959 1959 ...
##  $ month                 : int  12 12 12 12 12 12 12 12 1 1 ...
##  $ day                   : int  13 20 20 20 27 27 27 27 3 3 ...

While they all give good information, each falls short in one way or another, and I tend to go back and forth between them. Enter skimr: its main function, skim(), gives a compact overview of the data, complete with top-level statistics and per-column statistics grouped by column type.

library(skimr)
skim(billboards)
Table 1: Data summary

Name               billboards
Number of rows     387
Number of columns  13
_______________________
Column type frequency:
factor             5
numeric            8
________________________
Group variables    None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
url                    0              1  FALSE         206  htt: 10, htt: 9, htt: 8, htt: 8
weekid                 0              1  FALSE         206  12/: 10, 12/: 9, 1/6: 8, 1/7: 8
song                   0              1  FALSE          70  JIN: 28, ALL: 20, ROC: 19, THE: 16
performer              0              1  FALSE          69  Bob: 20, Mar: 20, Bre: 19, Bin: 16
songid                 0              1  FALSE          78  Jin: 20, All: 19, Roc: 19, The: 16

Variable type: numeric

skim_variable           n_missing  complete_rate     mean     sd    p0     p25   p50     p75  p100  hist
week_position                   0           1.00    57.20  25.40     7    38.5    58    78.0   100  ▅▆▇▇▇
instance                        0           1.00     1.59   1.25     1     1.0     1     1.0     6  ▇▁▁▁▁
previous_week_position        108           0.72    55.52  25.15     7    35.0    57    75.0   100  ▅▆▇▇▆
peak_position                   0           1.00    37.53  24.76     7    14.0    34    53.5   100  ▇▆▃▂▂
weeks_on_chart                  0           1.00     9.65   6.14     1     5.0     8    15.0    20  ▆▇▃▃▆
year                            0           1.00  1982.06  21.12  1958  1962.0  1975  2005.0  2017  ▇▃▁▂▅
month                           0           1.00     7.75   5.33     1     1.0    12    12.0    12  ▅▁▁▁▇
day                             0           1.00    16.67   8.90     1     9.0    17    24.0    31  ▇▇▆▇▇
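
Because the summary is organised by column type, it can be handy to pull out just one of those tables. The following is a minimal sketch, assuming skimr version 2 or later and using the built-in iris data (not the Billboard file) so that it runs anywhere; yank() extracts the sub-table for a single column type:

```r
library(skimr)

# iris ships with R: four numeric columns and one factor (Species).
skimmed <- skim(iris)

# Keep only the summary rows for the numeric columns.
numeric_part <- yank(skimmed, "numeric")

numeric_part
```

The same call with "factor" instead of "numeric" would return the factor table.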

In addition, skim() returns a data-frame-like skim object that can be manipulated with dplyr verbs. This makes it simple to drop every variable that has no missing values, so that only the incomplete ones remain.

library(dplyr)
skim(billboards) %>%
  filter(complete_rate < 1)
Table 2: Data summary

Name               billboards
Number of rows     387
Number of columns  13
_______________________
Column type frequency:
numeric            1
________________________
Group variables    None

Variable type: numeric

skim_variable           n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
previous_week_position        108           0.72  55.52  25.15   7   35   57   75   100  ▅▆▇▇▆
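
Other dplyr verbs can be chained in the same way. As a rough sketch, again on a built-in dataset rather than the Billboard file, the pipeline below lists the incomplete columns of airquality (Ozone and Solar.R contain missing values), most incomplete first, keeping only the columns relevant to missingness. The name missing_overview is just an illustrative choice:

```r
library(skimr)
library(dplyr)

# airquality ships with R; Ozone and Solar.R contain missing values.
missing_overview <- skim(airquality) %>%
  filter(n_missing > 0) %>%          # keep only the incomplete columns
  arrange(desc(n_missing)) %>%       # most incomplete first
  select(skim_variable, n_missing, complete_rate)

missing_overview
```

Note that once the type-specific statistics are dropped with select(), the result is best treated as an ordinary data frame rather than a full skim summary.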

Hopefully, we will get to revisit this wonderfully merry dataset in the future.