A data.table and dplyr tour




Introduction

About

  • This document has been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems. It has been written for my own self-teaching and may contain errors or imprecisions. Corrections and suggestions are welcome.
  • Resources for data.table can be found on the data.table wiki, in the data.table vignettes, and in the package documentation.
  • Reference documents for dplyr include the dplyr cheat sheet, the dplyr vignettes, and the package documentation.
  • For the sake of readability, console outputs are shown as comment lines prefixed with ##. Also, the R code used in this document is independently available and can be easily reproduced.


data.table and dplyr

data.table and dplyr are two R packages that both aim at an easier and more efficient manipulation of data frames. But while they share a lot of functionalities, their philosophies are quite different. Below is a quick overview of the main differences (from my basic user’s perspective).

Syntax:
- The general data.table syntax is as follows: DT[i, j, by, ...], which means: “Take DT, subset rows using i, then calculate j, grouped by by”, with possible extra options .... It allows several operations to be combined in a single, concise, and consistent expression.
- The syntax of dplyr is based on five main functions (filter(), arrange(), select(), mutate(), summarise()) and group_by() + their scoped variants (suffixed with _all, _at, or _if) + a bunch of helper functions. It is a ‘do one thing at a time’ approach, chaining together functions dedicated to a specific task.
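To make the contrast concrete, here is a minimal sketch (using the built-in mtcars data set, which is not part of the examples below) of the same “filter, then summarise by group” operation in both styles:

```r
library(data.table)
library(dplyr)

dt <- as.data.table(mtcars)

## data.table: a single DT[i, j, by] expression
dt[mpg > 20, .(mean_hp = mean(hp)), by = cyl]

## dplyr: one dedicated verb per step, chained together
mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))
```

Both calls return one row per cyl value; the data.table version packs the filtering, aggregation, and grouping into one bracket.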

Ecosystem:
The data.table package has no dependencies, whereas dplyr is part of the tidyverse. So, for example, while data.table includes functions to read, write, or reshape data, dplyr delegates these tasks to companion packages like readr or tidyr.
On the other hand, data.table is focused on the processing of local in-memory data, but dplyr offers a database backend.
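A quick sketch of this division of labour, reading and writing a temporary CSV file (the readr equivalent is shown commented out to avoid a hard dependency):

```r
library(data.table)

tmp <- tempfile(fileext = ".csv")

## data.table ships its own fast writer and reader
fwrite(data.table(x = 1:3, y = letters[1:3]), tmp)
dt <- fread(tmp)
dt

## the tidyverse equivalent lives in the companion package readr:
## df <- readr::read_csv(tmp)
```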

Memory management and performance:
In data.table, objects can be manipulated ‘by reference’ (using the set*() functions or the := symbol). This means the data are modified in place rather than copied, minimizing RAM requirements. The behaviour of dplyr is similar to that of base R.
Memory management, parallelism, and shrewd optimization give data.table the advantage in terms of performance.
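The copy vs in-place distinction can be checked with data.table::address() (a sketch; the printed addresses themselves will vary between runs):

```r
library(data.table)
library(dplyr)

dt <- data.table(x = 1:5)
addr_before <- address(dt)
dt[, x := x * 2L]                    # updated by reference, no copy
identical(addr_before, address(dt))  # TRUE: still the same object

df <- tibble(x = 1:5)
df2 <- df %>% mutate(x = x * 2L)     # returns a modified copy
identical(df$x, 1:5)                 # TRUE: df itself is unchanged
```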



Create example data

library(data.table)
set.seed(1L)

## Create a data table
DT = data.table(V1 = c(1L,2L),  # recycling
                V2 = 1:9,
                V3 = round(rnorm(3),2),
                V4 = LETTERS[1:3])

class(DT)
DT
## [1] "data.table" "data.frame"
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  1  3 -0.84  C
## 4:  2  4 -0.63  A
## 5:  1  5  0.18  B
## 6:  2  6 -0.84  C
## 7:  1  7 -0.63  A
## 8:  2  8  0.18  B
## 9:  1  9 -0.84  C
library(dplyr)
set.seed(1L)

## Create a data frame (tibble)
DF = tibble(V1 = rep(c(1L,2L), 5)[-10],
            V2 = 1:9,
            V3 = rep(round(rnorm(3),2), 3),
            V4 = rep(LETTERS[1:3], 3))

class(DF)
DF
## [1] "tbl_df"     "tbl"        "data.frame"
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     1     3 -0.84 C    
## 4     2     4 -0.63 A    
## 5     1     5  0.18 B    
## 6     2     6 -0.84 C    
## 7     1     7 -0.63 A    
## 8     2     8  0.18 B    
## 9     1     9 -0.84 C

Below, the data.table code uses DT and the dplyr code uses DF. Also, the dplyr code uses the %>% operator: a basic knowledge of the magrittr syntax is assumed.
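As a refresher on that syntax, x %>% f(y) is equivalent to f(x, y), so a pipeline reads left to right instead of inside out. A minimal sketch using the starwars data set bundled with dplyr:

```r
library(dplyr)

## nested form
summarise(filter(starwars, species == "Human"),
          avg = mean(height, na.rm = TRUE))

## equivalent piped form
starwars %>%
  filter(species == "Human") %>%
  summarise(avg = mean(height, na.rm = TRUE))
```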




Basic operations

Filter rows

Filter rows using indices

DT[3:4,]
DT[3:4] # same
##    V1 V2    V3 V4
## 1:  1  3 -0.84  C
## 2:  2  4 -0.63  A
##    V1 V2    V3 V4
## 1:  1  3 -0.84  C
## 2:  2  4 -0.63  A
DF[3:4,]
DF %>% slice(3:4) # same
## # A tibble: 2 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3 -0.84 C    
## 2     2     4 -0.63 A    
## # A tibble: 2 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3 -0.84 C    
## 2     2     4 -0.63 A

Discard rows using negative indices

DT[!3:7,]
DT[-(3:7)] # same
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  2  8  0.18  B
## 4:  1  9 -0.84  C
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  2  8  0.18  B
## 4:  1  9 -0.84  C
DF[-(3:7),]
DF %>% slice(-(3:7)) # same
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     2     8  0.18 B    
## 4     1     9 -0.84 C    
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     2     8  0.18 B    
## 4     1     9 -0.84 C

Filter rows using a logical expression

DT[V2 > 5]
DT[V4 %chin% c("A","C")] # fast %in% for character
##    V1 V2    V3 V4
## 1:  2  6 -0.84  C
## 2:  1  7 -0.63  A
## 3:  2  8  0.18  B
## 4:  1  9 -0.84  C
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  1  3 -0.84  C
## 3:  2  4 -0.63  A
## 4:  2  6 -0.84  C
## 5:  1  7 -0.63  A
## 6:  1  9 -0.84  C
DF %>% filter(V2 > 5)
DF %>% filter(V4 %in% c("A","C"))
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     6 -0.84 C    
## 2     1     7 -0.63 A    
## 3     2     8  0.18 B    
## 4     1     9 -0.84 C    
## # A tibble: 6 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     1     3 -0.84 C    
## 3     2     4 -0.63 A    
## 4     2     6 -0.84 C    
## 5     1     7 -0.63 A    
## 6     1     9 -0.84 C

Filter rows using multiple conditions

DT[V1 == 1 & V4 == "A"]
# any logical criteria can be used
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  1  7 -0.63  A
DF %>% filter(V1 == 1, V4 == "A")
# any logical criteria can be used
## # A tibble: 2 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     1     7 -0.63 A

Filter unique rows

unique(DT)
unique(DT, by = c("V1","V4")) # returns all cols
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  1  3 -0.84  C
## 4:  2  4 -0.63  A
## 5:  1  5  0.18  B
## 6:  2  6 -0.84  C
## 7:  1  7 -0.63  A
## 8:  2  8  0.18  B
## 9:  1  9 -0.84  C
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  1  3 -0.84  C
## 4:  2  4 -0.63  A
## 5:  1  5  0.18  B
## 6:  2  6 -0.84  C
distinct(DF) # distinct_all(DF)
distinct_at(DF, vars(V1, V4)) # returns selected cols
# see also ?distinct_if
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     1     3 -0.84 C    
## 4     2     4 -0.63 A    
## 5     1     5  0.18 B    
## 6     2     6 -0.84 C    
## 7     1     7 -0.63 A    
## 8     2     8  0.18 B    
## 9     1     9 -0.84 C    
## # A tibble: 6 x 2
##      V1 V4   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     1 C    
## 4     2 A    
## 5     1 B    
## 6     2 C

Discard rows with missing values

na.omit(DT, cols = 1:4)  # fast S3 method with cols argument
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  1  3 -0.84  C
## 4:  2  4 -0.63  A
## 5:  1  5  0.18  B
## 6:  2  6 -0.84  C
## 7:  1  7 -0.63  A
## 8:  2  8  0.18  B
## 9:  1  9 -0.84  C
tidyr::drop_na(DF, names(DF))
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     1     3 -0.84 C    
## 4     2     4 -0.63 A    
## 5     1     5  0.18 B    
## 6     2     6 -0.84 C    
## 7     1     7 -0.63 A    
## 8     2     8  0.18 B    
## 9     1     9 -0.84 C

Other filters

In addition to the main filter() function, dplyr also offers the filter_all/at/if variants as well as three helper functions to filter rows. With data.table, we can simply use a custom expression in DT[i].
DT[sample(.N, 3)] # .N = nb of rows in DT
DT[sample(.N, .N/2)]
DT[frankv(-V1, ties.method = "dense") < 2]
##    V1 V2    V3 V4
## 1:  1  9 -0.84  C
## 2:  2  6 -0.84  C
## 3:  1  5  0.18  B
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  2  8  0.18  B
## 4:  1  5  0.18  B
##    V1 V2    V3 V4
## 1:  2  2  0.18  B
## 2:  2  4 -0.63  A
## 3:  2  6 -0.84  C
## 4:  2  8  0.18  B
DF %>% sample_n(3)  # n random rows
DF %>% sample_frac(0.5) # fraction of random rows
DF %>% top_n(1, V1) # top n entries (includes ties)
## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     4 -0.63 A    
## 2     1     7 -0.63 A    
## 3     1     9 -0.84 C    
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     7 -0.63 A    
## 2     2     8  0.18 B    
## 3     1     3 -0.84 C    
## 4     1     5  0.18 B    
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     2  0.18 B    
## 2     2     4 -0.63 A    
## 3     2     6 -0.84 C    
## 4     2     8  0.18 B
On the other hand, data.table also provides convenience functions to filter rows based on a regular expression or to find values lying in one (or several) interval(s).
DT[V4 %like% "^B"]
DT[V2 %between% c(3, 5)]
DT[data.table::between(V2, 3, 5, incbounds = FALSE)]
DT[V2 %inrange% list(-1:1, 1:3)] # see also ?inrange
##    V1 V2   V3 V4
## 1:  2  2 0.18  B
## 2:  1  5 0.18  B
## 3:  2  8 0.18  B
##    V1 V2    V3 V4
## 1:  1  3 -0.84  C
## 2:  2  4 -0.63  A
## 3:  1  5  0.18  B
##    V1 V2    V3 V4
## 1:  2  4 -0.63  A
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  2  2  0.18  B
## 3:  1  3 -0.84  C
DF %>% filter(grepl("^B", V4))
DF %>% filter(dplyr::between(V2, 3, 5))
DF %>% filter(V2 > 3 & V2 < 5)
DF %>% filter(V2 >= -1:1 & V2 <= 1:3)
## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     2  0.18 B    
## 2     1     5  0.18 B    
## 3     2     8  0.18 B    
## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3 -0.84 C    
## 2     2     4 -0.63 A    
## 3     1     5  0.18 B    
## # A tibble: 1 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     4 -0.63 A    
## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     1     3 -0.84 C

Below, we will see that data.table has two optimized mechanisms to filter rows efficiently (keys and indices).
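As a small preview of the keyed approach (a sketch on a toy table; the details are covered in the indexing section):

```r
library(data.table)

dt <- data.table(id = c("b", "a", "c"), x = 1:3)
setkey(dt, id)   # sorts dt by id, by reference, and sets the key
key(dt)          # "id"
dt["a"]          # binary search on the key; same rows as dt[id == "a"]
```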



Sort rows

Sort rows by column

DT[order(V3)]  # see also setorder
##    V1 V2    V3 V4
## 1:  1  3 -0.84  C
## 2:  2  6 -0.84  C
## 3:  1  9 -0.84  C
## 4:  1  1 -0.63  A
## 5:  2  4 -0.63  A
## 6:  1  7 -0.63  A
## 7:  2  2  0.18  B
## 8:  1  5  0.18  B
## 9:  2  8  0.18  B
DF %>% arrange(V3)
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3 -0.84 C    
## 2     2     6 -0.84 C    
## 3     1     9 -0.84 C    
## 4     1     1 -0.63 A    
## 5     2     4 -0.63 A    
## 6     1     7 -0.63 A    
## 7     2     2  0.18 B    
## 8     1     5  0.18 B    
## 9     2     8  0.18 B

Sort rows in decreasing order

DT[order(-V3)]
##    V1 V2    V3 V4
## 1:  2  2  0.18  B
## 2:  1  5  0.18  B
## 3:  2  8  0.18  B
## 4:  1  1 -0.63  A
## 5:  2  4 -0.63  A
## 6:  1  7 -0.63  A
## 7:  1  3 -0.84  C
## 8:  2  6 -0.84  C
## 9:  1  9 -0.84  C
DF %>% arrange(desc(V3))
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     2  0.18 B    
## 2     1     5  0.18 B    
## 3     2     8  0.18 B    
## 4     1     1 -0.63 A    
## 5     2     4 -0.63 A    
## 6     1     7 -0.63 A    
## 7     1     3 -0.84 C    
## 8     2     6 -0.84 C    
## 9     1     9 -0.84 C

Sort rows based on several columns

DT[order(V1, -V2)]
##    V1 V2    V3 V4
## 1:  1  9 -0.84  C
## 2:  1  7 -0.63  A
## 3:  1  5  0.18  B
## 4:  1  3 -0.84  C
## 5:  1  1 -0.63  A
## 6:  2  8  0.18  B
## 7:  2  6 -0.84  C
## 8:  2  4 -0.63  A
## 9:  2  2  0.18  B
DF %>% arrange(V1, desc(V2))
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     9 -0.84 C    
## 2     1     7 -0.63 A    
## 3     1     5  0.18 B    
## 4     1     3 -0.84 C    
## 5     1     1 -0.63 A    
## 6     2     8  0.18 B    
## 7     2     6 -0.84 C    
## 8     2     4 -0.63 A    
## 9     2     2  0.18 B


Select columns

Select one column using column name

DT[, list(V2)] # returns a data.table
DT[, .(V2)]    # returns a data.table
# . is an alias for list
DT[, "V2"]     # returns a data.table
DT[, V2]       # returns a vector
DT[["V2"]]     # returns a vector
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
## [1] 1 2 3 4 5 6 7 8 9
## [1] 1 2 3 4 5 6 7 8 9
DF %>% select(V2) # returns a tibble
DF %>% pull(V2)   # returns a vector
DF[, "V2"]        # returns a tibble
DF[["V2"]]        # returns a vector
## # A tibble: 9 x 1
##      V2
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9
## [1] 1 2 3 4 5 6 7 8 9
## # A tibble: 9 x 1
##      V2
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9
## [1] 1 2 3 4 5 6 7 8 9

Select several columns

DT[, .(V2, V3, V4)]
DT[, list(V2, V3, V4)]
DT[, V2:V4] # select columns between V2 and V4
##    V2    V3 V4
## 1:  1 -0.63  A
## 2:  2  0.18  B
## 3:  3 -0.84  C
## 4:  4 -0.63  A
## 5:  5  0.18  B
## 6:  6 -0.84  C
## 7:  7 -0.63  A
## 8:  8  0.18  B
## 9:  9 -0.84  C
##    V2    V3 V4
## 1:  1 -0.63  A
## 2:  2  0.18  B
## 3:  3 -0.84  C
## 4:  4 -0.63  A
## 5:  5  0.18  B
## 6:  6 -0.84  C
## 7:  7 -0.63  A
## 8:  8  0.18  B
## 9:  9 -0.84  C
##    V2    V3 V4
## 1:  1 -0.63  A
## 2:  2  0.18  B
## 3:  3 -0.84  C
## 4:  4 -0.63  A
## 5:  5  0.18  B
## 6:  6 -0.84  C
## 7:  7 -0.63  A
## 8:  8  0.18  B
## 9:  9 -0.84  C
DF %>% select(V2, V3, V4)
DF %>% select(V2:V4) # select columns between V2 and V4
## # A tibble: 9 x 3
##      V2    V3 V4   
##   <int> <dbl> <chr>
## 1     1 -0.63 A    
## 2     2  0.18 B    
## 3     3 -0.84 C    
## 4     4 -0.63 A    
## 5     5  0.18 B    
## 6     6 -0.84 C    
## 7     7 -0.63 A    
## 8     8  0.18 B    
## 9     9 -0.84 C    
## # A tibble: 9 x 3
##      V2    V3 V4   
##   <int> <dbl> <chr>
## 1     1 -0.63 A    
## 2     2  0.18 B    
## 3     3 -0.84 C    
## 4     4 -0.63 A    
## 5     5  0.18 B    
## 6     6 -0.84 C    
## 7     7 -0.63 A    
## 8     8  0.18 B    
## 9     9 -0.84 C

Exclude columns

DT[, !c("V2", "V3")]
##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C
DF %>% select(-V2, -V3)
## # A tibble: 9 x 2
##      V1 V4   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     1 C    
## 4     2 A    
## 5     1 B    
## 6     2 C    
## 7     1 A    
## 8     2 B    
## 9     1 C

Select/Exclude columns using a character vector

cols = c("V2", "V3")
DT[, ..cols] # .. prefix means 'one-level up'
DT[, !..cols] # or DT[, -..cols]
##    V2    V3
## 1:  1 -0.63
## 2:  2  0.18
## 3:  3 -0.84
## 4:  4 -0.63
## 5:  5  0.18
## 6:  6 -0.84
## 7:  7 -0.63
## 8:  8  0.18
## 9:  9 -0.84
##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C
cols = c("V2", "V3")
DF %>% select(!!cols) # unquoting
DF %>% select(-!!cols)
## # A tibble: 9 x 2
##      V2    V3
##   <int> <dbl>
## 1     1 -0.63
## 2     2  0.18
## 3     3 -0.84
## 4     4 -0.63
## 5     5  0.18
## 6     6 -0.84
## 7     7 -0.63
## 8     8  0.18
## 9     9 -0.84
## # A tibble: 9 x 2
##      V1 V4   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     1 C    
## 4     2 A    
## 5     1 B    
## 6     2 C    
## 7     1 A    
## 8     2 B    
## 9     1 C

Other selections

As for row filtering, dplyr includes helper functions to select columns. With data.table, a possible solution is to first retrieve the column names (e.g. using a regular expression), then select these columns. Another way (using patterns()) is presented in a later section.
cols = grep("V",   names(DT))
cols = grep("3$",  names(DT))
cols = union("V4", names(DT))
cols = grep(".2",  names(DT))
cols = grep("^V1|X$",  names(DT))
cols = grep("^(?!V2)", names(DT), perl = TRUE)
DT[, ..cols] # each line above overwrites cols; only the last value is used here
##    V1    V3 V4
## 1:  1 -0.63  A
## 2:  2  0.18  B
## 3:  1 -0.84  C
## 4:  2 -0.63  A
## 5:  1  0.18  B
## 6:  2 -0.84  C
## 7:  1 -0.63  A
## 8:  2  0.18  B
## 9:  1 -0.84  C
DF %>% select(contains("V"))
DF %>% select(ends_with("3"))
DF %>% select(V4, everything()) # reorder columns
DF %>% select(matches(".2"))
DF %>% select(num_range("V", 1:2))
DF %>% select(one_of(c("V1", "X")))
## Warning: Unknown columns: `X`
DF %>% select(-starts_with("V2"))
# remove columns by prefixing the helper with "-"
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     2     2  0.18 B    
## 3     1     3 -0.84 C    
## 4     2     4 -0.63 A    
## 5     1     5  0.18 B    
## 6     2     6 -0.84 C    
## 7     1     7 -0.63 A    
## 8     2     8  0.18 B    
## 9     1     9 -0.84 C    
## # A tibble: 9 x 1
##      V3
##   <dbl>
## 1 -0.63
## 2  0.18
## 3 -0.84
## 4 -0.63
## 5  0.18
## 6 -0.84
## 7 -0.63
## 8  0.18
## 9 -0.84
## # A tibble: 9 x 4
##   V4       V1    V2    V3
##   <chr> <int> <int> <dbl>
## 1 A         1     1 -0.63
## 2 B         2     2  0.18
## 3 C         1     3 -0.84
## 4 A         2     4 -0.63
## 5 B         1     5  0.18
## 6 C         2     6 -0.84
## 7 A         1     7 -0.63
## 8 B         2     8  0.18
## 9 C         1     9 -0.84
## # A tibble: 9 x 1
##      V2
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9
## # A tibble: 9 x 2
##      V1    V2
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     1     3
## 4     2     4
## 5     1     5
## 6     2     6
## 7     1     7
## 8     2     8
## 9     1     9
## # A tibble: 9 x 1
##      V1
##   <int>
## 1     1
## 2     2
## 3     1
## 4     2
## 5     1
## 6     2
## 7     1
## 8     2
## 9     1
## # A tibble: 9 x 3
##      V1    V3 V4   
##   <int> <dbl> <chr>
## 1     1 -0.63 A    
## 2     2  0.18 B    
## 3     1 -0.84 C    
## 4     2 -0.63 A    
## 5     1  0.18 B    
## 6     2 -0.84 C    
## 7     1 -0.63 A    
## 8     2  0.18 B    
## 9     1 -0.84 C


Summarise data

Summary functions take vectors as input and return a single value (e.g. min(), mean(), var(), …).

Summarise one column

DT[, sum(V1)]    # returns a vector
DT[, .(sum(V1))] # returns a data.table
DT[, .(sumV1 = sum(V1))] # returns a data.table
## [1] 13
##    V1
## 1: 13
##    sumV1
## 1:    13
DF %>% summarise(sum(V1)) # returns a tibble
DF %>% summarise(sumV1 = sum(V1)) # returns a tibble
## # A tibble: 1 x 1
##   `sum(V1)`
##       <int>
## 1        13
## # A tibble: 1 x 1
##   sumV1
##   <int>
## 1    13

Summarise several columns

DT[, .(sum(V1), sd(V3))]
##    V1        V2
## 1: 13 0.4664494
DF %>% summarise(sum(V1), sd(V3))
## # A tibble: 1 x 2
##   `sum(V1)` `sd(V3)`
##       <int>    <dbl>
## 1        13    0.466

Summarise several columns and assign column names

DT[, .(sumv1 = sum(V1), sdv3 = sd(V3))]
##    sumv1      sdv3
## 1:    13 0.4664494
DF %>% summarise(sumv1 = sum(V1), sdv3 = sd(V3))
## # A tibble: 1 x 2
##   sumv1  sdv3
##   <int> <dbl>
## 1    13 0.466

Summarise a subset of rows

DT[1:4, sum(V1)]
## [1] 6
DF %>% slice(1:4) %>% summarise(sum(V1))
## # A tibble: 1 x 1
##   `sum(V1)`
##       <int>
## 1         6
dplyr helper functions for summarise() (or summarize()) include first(), last(), n(), nth(), and n_distinct(). The data.table package also includes first(), last(), and uniqueN().
DT[, data.table::first(V3)]
DT[, data.table::last(V3)]
DT[5, V3]
DT[, uniqueN(V4)]
uniqueN(DT)
## [1] -0.63
## [1] -0.84
## [1] 0.18
## [1] 3
## [1] 9
DF %>% summarise(dplyr::first(V3))
DF %>% summarise(dplyr::last(V3))
DF %>% summarise(nth(V3, 5))
DF %>% summarise(n_distinct(V4))
n_distinct(DF)
## # A tibble: 1 x 1
##   `dplyr::first(V3)`
##                <dbl>
## 1              -0.63
## # A tibble: 1 x 1
##   `dplyr::last(V3)`
##               <dbl>
## 1             -0.84
## # A tibble: 1 x 1
##   `nth(V3, 5)`
##          <dbl>
## 1         0.18
## # A tibble: 1 x 1
##   `n_distinct(V4)`
##              <int>
## 1                3
## [1] 9


Add/update/delete columns

In the following commands, with data.table, columns are modified by reference using the column assignment symbol := (no copy performed) and the results are returned invisibly. With dplyr, we have to assign the results.

Modify a column

DT[, V1 := V1^2]
DT
##    V1 V2    V3 V4
## 1:  1  1 -0.63  A
## 2:  4  2  0.18  B
## 3:  1  3 -0.84  C
## 4:  4  4 -0.63  A
## 5:  1  5  0.18  B
## 6:  4  6 -0.84  C
## 7:  1  7 -0.63  A
## 8:  4  8  0.18  B
## 9:  1  9 -0.84  C
DF = DF %>% mutate(V1 = V1^2)
DF
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <dbl> <int> <dbl> <chr>
## 1     1     1 -0.63 A    
## 2     4     2  0.18 B    
## 3     1     3 -0.84 C    
## 4     4     4 -0.63 A    
## 5     1     5  0.18 B    
## 6     4     6 -0.84 C    
## 7     1     7 -0.63 A    
## 8     4     8  0.18 B    
## 9     1     9 -0.84 C

Add one column

DT[, v5 := log(V1)][] # adding [] prints the result
##    V1 V2    V3 V4       v5
## 1:  1  1 -0.63  A 0.000000
## 2:  4  2  0.18  B 1.386294
## 3:  1  3 -0.84  C 0.000000
## 4:  4  4 -0.63  A 1.386294
## 5:  1  5  0.18  B 0.000000
## 6:  4  6 -0.84  C 1.386294
## 7:  1  7 -0.63  A 0.000000
## 8:  4  8  0.18  B 1.386294
## 9:  1  9 -0.84  C 0.000000
DF = DF %>% mutate(v5 = log(V1))

Add several columns

DT[, ':='(v6 = sqrt(V1), v7 = "X")] # functional form
DT[, c("v6","v7") := .(sqrt(V1), "X")] # same
DF = DF %>% mutate(v6 = sqrt(V1), v7 = "X")
## recycling

Create one column and remove the others

DT[, .(v8 = V3 + 1)]
##      v8
## 1: 0.37
## 2: 1.18
## 3: 0.16
## 4: 0.37
## 5: 1.18
## 6: 0.16
## 7: 0.37
## 8: 1.18
## 9: 0.16
DF %>% transmute(v8 = V3 + 1)
## # A tibble: 9 x 1
##      v8
##   <dbl>
## 1  0.37
## 2  1.18
## 3  0.16
## 4  0.37
## 5  1.18
## 6  0.16
## 7  0.37
## 8  1.18
## 9  0.16

Remove one column

DT[, v5 := NULL] 
DF = select(DF, -v5)

Remove several columns

DT[, c("v6","v7") := NULL]
DF = select(DF, -v6, -v7)

Remove columns using a vector of colnames

cols = c("V3")
DT[, (cols) := NULL] # ! not DT[, cols := NULL]
cols = c("V3")
DF = select(DF, -one_of(cols))

Replace values for rows matching a condition

DT[V2 < 4, V2 := 0L]
DT
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C
DF = DF %>% mutate(V2 = base::replace(V2, V2 < 4, 0L))
DF
## # A tibble: 9 x 3
##      V1    V2 V4   
##   <dbl> <int> <chr>
## 1     1     0 A    
## 2     4     0 B    
## 3     1     0 C    
## 4     4     4 A    
## 5     1     5 B    
## 6     4     6 C    
## 7     1     7 A    
## 8     4     8 B    
## 9     1     9 C


by

The dplyr::group_by() function and the corresponding by and keyby statements in data.table make it possible to manipulate each group of observations and combine the results. The sole difference between by and keyby is that keyby orders the results and creates a key that allows faster subsetting (cf. the indexing and keys section). Below, we arbitrarily use one or the other.
Again, the _all/_at/_if variants of group_by() are available in dplyr but not presented here.

By group

DT[, .(sumV2 = sum(V2)), by = V4]
# DT[, .(sumV2 = sum(V2)), by = "V4"]
##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15
DF %>% group_by(V4) %>% summarise(sumV2 = sum(V2))
## # A tibble: 3 x 2
##   V4    sumV2
##   <chr> <int>
## 1 A        11
## 2 B        13
## 3 C        15

By several groups

DT[, .(sumV2 = sum(V2)), keyby = .(V4, V1)]
##    V4 V1 sumV2
## 1:  A  1     7
## 2:  A  4     4
## 3:  B  1     5
## 4:  B  4     8
## 5:  C  1     9
## 6:  C  4     6
DF %>% group_by(V4, V1) %>% summarise(sumV2 = sum(V2))
## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1 sumV2
##   <chr> <dbl> <int>
## 1 A         1     7
## 2 A         4     4
## 3 B         1     5
## 4 B         4     8
## 5 C         1     9
## 6 C         4     6

Calling a function in by

DT[, .(sumV1 = sum(V1)), by = tolower(V4)]
##    tolower sumV1
## 1:       a     6
## 2:       b     9
## 3:       c     6
DF %>% group_by(tolower(V4)) %>% summarise(sumV1 = sum(V1))
## # A tibble: 3 x 2
##   `tolower(V4)` sumV1
##   <chr>         <dbl>
## 1 a                 6
## 2 b                 9
## 3 c                 6

Assigning a column name in by

DT[, .(sumV1 = sum(V1)), keyby = .(abc = tolower(V4))]
##    abc sumV1
## 1:   a     6
## 2:   b     9
## 3:   c     6
DF %>% group_by(abc = tolower(V4)) %>% summarise(sumV1 = sum(V1))
## # A tibble: 3 x 2
##   abc   sumV1
##   <chr> <dbl>
## 1 a         6
## 2 b         9
## 3 c         6

Using a condition in by

DT[, sum(V1), keyby = V4 == "A"]
##       V4 V1
## 1: FALSE 15
## 2:  TRUE  6
DF %>% group_by(V4 == "A") %>% summarise(sum(V1))
## # A tibble: 2 x 2
##   `V4 == "A"` `sum(V1)`
##   <lgl>           <dbl>
## 1 FALSE              15
## 2 TRUE                6

By on a subset of rows

DT[1:5, .(sumV1 = sum(V1)), by = V4]
## complete DT[i, j, by] expression!
##    V4 sumV1
## 1:  A     5
## 2:  B     5
## 3:  C     1
DF %>% slice(1:5) %>% group_by(V4) %>% summarise(sumV1 = sum(V1))
## # A tibble: 3 x 2
##   V4    sumV1
##   <chr> <dbl>
## 1 A         5
## 2 B         5
## 3 C         1

Count number of observations for each group

DT[, .N, by = V4]
##    V4 N
## 1:  A 3
## 2:  B 3
## 3:  C 3
DF %>% group_by(V4) %>% tally()
DF %>% count(V4) # same
DF %>% group_by(V4) %>% summarise(n())
DF %>% group_by(V4) %>% group_size() # returns a vector
## # A tibble: 3 x 2
##   V4        n
##   <chr> <int>
## 1 A         3
## 2 B         3
## 3 C         3
## # A tibble: 3 x 2
##   V4        n
##   <chr> <int>
## 1 A         3
## 2 B         3
## 3 C         3
## # A tibble: 3 x 2
##   V4    `n()`
##   <chr> <int>
## 1 A         3
## 2 B         3
## 3 C         3
## [1] 3 3 3

Add a column with number of observations for each group

DT[, n := .N, by = V1][]
DT[, n := NULL] # rm column for consistency
##    V1 V2 V4 n
## 1:  1  0  A 5
## 2:  4  0  B 4
## 3:  1  0  C 5
## 4:  4  4  A 4
## 5:  1  5  B 5
## 6:  4  6  C 4
## 7:  1  7  A 5
## 8:  4  8  B 4
## 9:  1  9  C 5
DF %>% group_by(V1) %>% add_tally()
DF %>% add_count(V1)
## # A tibble: 9 x 4
## # Groups:   V1 [2]
##      V1    V2 V4        n
##   <dbl> <int> <chr> <int>
## 1     1     0 A         5
## 2     4     0 B         4
## 3     1     0 C         5
## 4     4     4 A         4
## 5     1     5 B         5
## 6     4     6 C         4
## 7     1     7 A         5
## 8     4     8 B         4
## 9     1     9 C         5
## # A tibble: 9 x 4
##      V1    V2 V4        n
##   <dbl> <int> <chr> <int>
## 1     1     0 A         5
## 2     4     0 B         4
## 3     1     0 C         5
## 4     4     4 A         4
## 5     1     5 B         5
## 6     4     6 C         4
## 7     1     7 A         5
## 8     4     8 B         4
## 9     1     9 C         5

Retrieve the first/last/nth observation for each group

DT[, data.table::first(V2), by = V4]
DT[, data.table::last(V2), by = V4]
DT[, V2[2], by = V4]
##    V4 V1
## 1:  A  0
## 2:  B  0
## 3:  C  0
##    V4 V1
## 1:  A  7
## 2:  B  8
## 3:  C  9
##    V4 V1
## 1:  A  4
## 2:  B  5
## 3:  C  6
DF %>% group_by(V4) %>% summarise(dplyr::first(V2))
DF %>% group_by(V4) %>% summarise(dplyr::last(V2))
DF %>% group_by(V4) %>% summarise(dplyr::nth(V2,2))
## # A tibble: 3 x 2
##   V4    `dplyr::first(V2)`
##   <chr>              <int>
## 1 A                      0
## 2 B                      0
## 3 C                      0
## # A tibble: 3 x 2
##   V4    `dplyr::last(V2)`
##   <chr>             <int>
## 1 A                     7
## 2 B                     8
## 3 C                     9
## # A tibble: 3 x 2
##   V4    `dplyr::nth(V2, 2)`
##   <chr>               <int>
## 1 A                       4
## 2 B                       5
## 3 C                       6



Going further

Advanced columns manipulation

To further manipulate columns, dplyr includes nine functions: the _all, _at, and _if versions of summarise(), mutate(), and transmute().
With data.table, we use .SD, which is a data.table containing the Subset of Data for each group, excluding the column(s) used in by. So, DT[, .SD] is DT itself, and in the expression DT[, .SD, by = V4], .SD contains all the DT columns (except V4) for each value of V4 (see DT[, print(.SD), by = V4]). .SDcols selects the columns included in .SD.
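The print(.SD) idiom mentioned above makes the grouping mechanics visible; here is a minimal sketch on a toy table (not the DT used elsewhere in this document):

```r
library(data.table)

dt <- data.table(V1 = 1:4, V4 = c("A", "B", "A", "B"))

## .SD holds all columns except the grouping column(s);
## here it is printed once per value of V4
dt[, print(.SD), by = V4]
```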

Summarise all the columns

DT[, lapply(.SD, max)]
##    V1 V2 V4
## 1:  4  9  C
DF %>% summarise_all(max)
## # A tibble: 1 x 3
##      V1    V2 V4   
##   <dbl> <dbl> <chr>
## 1     4     9 C

Summarise several columns

DT[, lapply(.SD, mean), .SDcols = c("V1", "V2")]
DT[, lapply(.SD[,.(V1,V2)], mean)] # same
##          V1       V2
## 1: 2.333333 4.333333
##          V1       V2
## 1: 2.333333 4.333333
DF %>% summarise_at(c("V1", "V2"), mean)
## # A tibble: 1 x 2
##      V1    V2
##   <dbl> <dbl>
## 1  2.33  4.33

Summarise several columns by group

DT[, lapply(.SD, mean), by = V4, .SDcols = c("V1", "V2")]
## using patterns (regex)
DT[, lapply(.SD, mean), by = V4, .SDcols = patterns("V1|V2")]
##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000
##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000
DF %>% group_by(V4) %>% summarise_at(c("V1","V2"), mean)
## using select helpers
DF %>% group_by(V4) %>% summarise_at(vars(one_of("V1", "V2")), mean)
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         2  3.67
## 2 B         3  4.33
## 3 C         2  5   
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         2  3.67
## 2 B         3  4.33
## 3 C         2  5

Summarise with more than one function by group

DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = V4]
##    V4 V1 V2 V1       V2
## 1:  A  6 11  2 3.666667
## 2:  B  9 13  3 4.333333
## 3:  C  6 15  2 5.000000
DF %>% group_by(V4) %>% summarise_all(list(sum, mean)) # columns named automatically
## # A tibble: 3 x 5
##   V4    V1_fn1 V2_fn1 V1_fn2 V2_fn2
##   <chr>  <dbl>  <int>  <dbl>  <dbl>
## 1 A          6     11      2   3.67
## 2 B          9     13      3   4.33
## 3 C          6     15      2   5

Summarise using a condition

cols = names(DT)[sapply(DT, is.numeric)]
DT[, lapply(.SD, mean), .SDcols = cols]
##          V1       V2
## 1: 2.333333 4.333333
DF %>% summarise_if(is.numeric, mean)
## # A tibble: 1 x 2
##      V1    V2
##   <dbl> <dbl>
## 1  2.33  4.33

Modify all the columns

DT[, lapply(.SD, rev)]
##    V1 V2 V4
## 1:  1  9  C
## 2:  4  8  B
## 3:  1  7  A
## 4:  4  6  C
## 5:  1  5  B
## 6:  4  4  A
## 7:  1  0  C
## 8:  4  0  B
## 9:  1  0  A
DF %>% mutate_all(rev)
#DF %>% transmute_all(rev)
## # A tibble: 9 x 3
##      V1    V2 V4   
##   <dbl> <int> <chr>
## 1     1     9 C    
## 2     4     8 B    
## 3     1     7 A    
## 4     4     6 C    
## 5     1     5 B    
## 6     4     4 A    
## 7     1     0 C    
## 8     4     0 B    
## 9     1     0 A

Modify several columns (dropping the others)

DT[, lapply(.SD, sqrt), .SDcols = V1:V2]
DT[, lapply(.SD, exp), .SDcols = !"V4"]
##    V1       V2
## 1:  1 0.000000
## 2:  2 0.000000
## 3:  1 0.000000
## 4:  2 2.000000
## 5:  1 2.236068
## 6:  2 2.449490
## 7:  1 2.645751
## 8:  2 2.828427
## 9:  1 3.000000
##           V1         V2
## 1:  2.718282    1.00000
## 2: 54.598150    1.00000
## 3:  2.718282    1.00000
## 4: 54.598150   54.59815
## 5:  2.718282  148.41316
## 6: 54.598150  403.42879
## 7:  2.718282 1096.63316
## 8: 54.598150 2980.95799
## 9:  2.718282 8103.08393
DF %>% transmute_at(c("V1", "V2"), sqrt)
DF %>% transmute_at(vars(-V4), exp)
## # A tibble: 9 x 2
##      V1    V2
##   <dbl> <dbl>
## 1     1  0   
## 2     2  0   
## 3     1  0   
## 4     2  2   
## 5     1  2.24
## 6     2  2.45
## 7     1  2.65
## 8     2  2.83
## 9     1  3   
## # A tibble: 9 x 2
##      V1     V2
##   <dbl>  <dbl>
## 1  2.72    1  
## 2 54.6     1  
## 3  2.72    1  
## 4 54.6    54.6
## 5  2.72  148. 
## 6 54.6   403. 
## 7  2.72 1097. 
## 8 54.6  2981. 
## 9  2.72 8103.

Modify several columns (keeping the others)

DT[, c("V1", "V2") := lapply(.SD, "+", 1L), .SDcols = c("V1", "V2")]
cols = setdiff(names(DT), "V4")
DT[, (cols) := lapply(.SD, "-", 1L), .SDcols = cols]
DF = DF %>% mutate_at(c("V1", "V2"), "+", 1L)
DF = DF %>% mutate_at(vars(-V4), "-", 1L)

Modify columns using a condition (dropping the others)

cols = names(DT)[sapply(DT, is.numeric)]
DT[, .SD - 1, .SDcols = cols]
##    V1 V2
## 1:  0 -1
## 2:  3 -1
## 3:  0 -1
## 4:  3  3
## 5:  0  4
## 6:  3  5
## 7:  0  6
## 8:  3  7
## 9:  0  8
DF %>% transmute_if(is.numeric, list(~'-'(., 1L)))
## # A tibble: 9 x 2
##      V1    V2
##   <dbl> <int>
## 1     0    -1
## 2     3    -1
## 3     0    -1
## 4     3     3
## 5     0     4
## 6     3     5
## 7     0     6
## 8     3     7
## 9     0     8

Modify columns using a condition (keeping the others)

DT[, (cols) := lapply(.SD, as.integer), .SDcols = cols]
DF = DF %>% mutate_if(is.numeric, as.integer)

The use of DT[,j] is very flexible, allowing complex expressions to be passed in a straightforward way, or several expressions with multiple outputs to be combined.

Use a complex expression

DT[, .(V1[1:2], "X"), by = V4]
##    V4 V1 V2
## 1:  A  1  X
## 2:  A  4  X
## 3:  B  4  X
## 4:  B  1  X
## 5:  C  1  X
## 6:  C  4  X
DF %>% group_by(V4) %>% slice(1:2) %>% transmute(V1 = V1, V2 = "X")
## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1 V2   
##   <chr> <int> <chr>
## 1 A         1 X    
## 2 A         4 X    
## 3 B         4 X    
## 4 B         1 X    
## 5 C         1 X    
## 6 C         4 X

Use multiple expressions (with DT[,{j}])

DT[,{print(V1) #  comments here!
     print(summary(V1))
     x = V1 + sum(V2)
     .(A = 1:.N, B = x) # last list returned as a data.table
    }]
## [1] 1 4 1 4 1 4 1 4 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.333   4.000   4.000 
##    A  B
## 1: 1 40
## 2: 2 43
## 3: 3 40
## 4: 4 43
## 5: 5 40
## 6: 6 43
## 7: 7 40
## 8: 8 43
## 9: 9 40
## 


Chain expressions

The dplyr workflow relies on the magrittr pipe operator (%>%). The magrittr package can also be used with data.table objects, but data.table comes with its own chaining system: DT[ ... ][ ... ][ ... ].

Expression chaining using %>%

DT[, .(V1sum = sum(V1)), by = V4] %>% .[order(-V1sum)]
##    V4 V1sum
## 1:  B     9
## 2:  A     6
## 3:  C     6
DF %>% group_by(V4) %>% summarise(V1sum = sum(V1)) %>% arrange(desc(V1sum))
## # A tibble: 3 x 2
##   V4    V1sum
##   <chr> <int>
## 1 B         9
## 2 A         6
## 3 C         6


Indexing and Keys

Row subsetting in dplyr relies on the filter() and slice() functions, as shown in the first section. With data.table, in addition to the above-mentioned approach, two systems are available to make row filtering and join operations more convenient and blazingly fast (~170x speed-up): keys (primary ordered index) and indices (automatic secondary indexing).

The main differences between keys and indices are:
- When using keys, data are physically reordered in memory. When using indices, the order is stored as an attribute.
- Only one key is possible but several indices can coexist.
- Keys are defined explicitly. Indices can be created manually, but they are also created (and stored) on the fly when using == or %in%.
- Indices are used with the on argument. It is optional when using keys, but recommended (and used below) for better readability.
Note: in the following code, we set both a key and an index to demonstrate their usage, but internally, indices are not used when a key already exists for the same column(s).
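Keys and indices can be inspected at any time with key(), indices(), and haskey(). A small standalone sketch (using a throwaway table, kdt, so as not to touch the DT used in the rest of this document):

```r
library(data.table)

kdt = data.table(V4 = c("B", "A", "C"), V1 = 1:3)
setkey(kdt, V4)   # physically reorders kdt by V4
setindex(kdt, V1) # stores the V1 order as an attribute

key(kdt)     # "V4"
indices(kdt) # "V1"
haskey(kdt)  # TRUE
kdt$V4       # "A" "B" "C" -- rows were reordered by setkey()
```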

Set the key/index

setkey(DT,V4)
setindex(DT, V4)
DF = DF %>% arrange(V4) # ordered just for consistency

Select the matching rows

DT["A", on = "V4"]
DT[c("A","C"), on = .(V4)] # same as on = "V4"
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  1  0  C
## 5:  4  6  C
## 6:  1  9  C
filter(DF, V4 == "A")
filter(DF, V4 %in% c("A", "C")) %>% arrange(V4)
## # A tibble: 3 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A    
## # A tibble: 6 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A    
## 4     1     0 C    
## 5     4     6 C    
## 6     1     9 C

Select the first matching row

DT["B", on = "V4", mult = "first"]
DT[c("B", "C"), on = "V4", mult = "first"]
##    V1 V2 V4
## 1:  4  0  B
##    V1 V2 V4
## 1:  4  0  B
## 2:  1  0  C
DF %>% filter(V4 == "B") %>% slice(1)
# ?
## # A tibble: 1 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     4     0 B

Select the last matching row

DT["A", on = "V4", mult = "last"]
##    V1 V2 V4
## 1:  1  7  A
DF %>% filter(V4 == "A") %>% slice(n())
## # A tibble: 1 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     7 A

Nomatch argument

DT[c("A","D"), on = "V4", nomatch = NA] # (default) returns a row with "D" even if not found
DT[c("A","D"), on = "V4", nomatch = 0] # no rows for unmatched values
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4: NA NA  D
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
DF %>% filter(V4 %in% c("A", "D"))
## # A tibble: 3 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A

Apply a function on the matching rows

DT[c("A","C"), sum(V1), on = "V4"]
## [1] 12
DF  %>% filter(V4 %in% c("A", "C")) %>% summarise(sum(V1))
## # A tibble: 1 x 1
##   `sum(V1)`
##       <int>
## 1        12

Modify values for matching rows

DT["A", V1 := 0, on = "V4"]
DT
##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  6  C
## 9:  1  9  C
DF = DF %>% mutate(V1 = base::replace(V1, V4 == "A", 0L)) %>% arrange(V4)
DF
## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     0 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     4     0 B    
## 5     1     5 B    
## 6     4     8 B    
## 7     1     0 C    
## 8     4     6 C    
## 9     1     9 C

Use keys in by

DT[!"B", sum(V1), on = "V4", by = .EACHI]
DT[V4 != "B", sum(V1), by = V4] # same
##    V4 V1
## 1:  A  0
## 2:  C  6
##    V4 V1
## 1:  A  0
## 2:  C  6
DF %>% filter(V4 != "B") %>% group_by(V4) %>% summarise(sum(V1))
## # A tibble: 2 x 2
##   V4    `sum(V1)`
##   <chr>     <int>
## 1 A             0
## 2 C             6

Set keys/indices for multiple columns

setkey(DT, V4, V1) # or setkeyv(DT, c("V4", "V1"))
setindex(DT, V4, V1) # setindexv(DT, c("V4", "V1"))
#

Subset using multiple keys/indices

DT[.("C", 1), on = .(V4,V1)]
DT[.(c("B", "C"), 1), on = .(V4, V1)]
DT[.(c("B", "C"), 1), on = .(V4, V1), which = TRUE] # using which = TRUE only returns the matching rows indices
##    V1 V2 V4
## 1:  1  0  C
## 2:  1  9  C
##    V1 V2 V4
## 1:  1  5  B
## 2:  1  0  C
## 3:  1  9  C
## [1] 4 7 8
DF  %>% filter(V1 == 1, V4 == "C")
DF  %>% filter(V1 == 1, V4 %in% c("B", "C"))
# ?
## # A tibble: 2 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 C    
## 2     1     9 C    
## # A tibble: 3 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     5 B    
## 2     1     0 C    
## 3     1     9 C

Remove keys/indices

setkey(DT, NULL)
setindex(DT, NULL)
# 


set*() modifications

In data.table, set*() functions modify objects by reference, making these operations fast and memory-efficient. If this behaviour is not desired, copy() can be used to work on a deep copy instead. The corresponding dplyr expressions will be less memory-efficient.
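To make the by-reference semantics concrete, here is a minimal standalone sketch (dt0, dt1, dt2 are throwaway names, unrelated to the DT/DF objects used elsewhere in this document):

```r
library(data.table)

dt0 = data.table(a = 1:3)
dt1 = dt0       # no copy: both names point to the same object
dt2 = copy(dt0) # explicit deep copy

dt1[, a := a * 10L] # := modifies dt0 as well, since dt1 IS dt0
dt0$a               # 10 20 30
dt2$a               # 1 2 3 -- the copy is unaffected
```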

Replace values

set(DT, i = 1L, j = 2L, value = 3L)
DF[1, 2] = 3L

Reorder rows

setorder(DT, V4, -V1)
setorderv(DT, c("V4", "V1"), c(1, -1))
DF = DF %>% arrange(V4, desc(V1))

Modify colnames

setnames(DT, old = "V2", new = "v2")
setnames(DT, old = -(c(1, 3)), new = "V2")
DF = DF %>% rename(v2 = V2)
DF = DF %>% rename(V2 = v2) # reset upper

Reorder columns

setcolorder(DT, c("V4", "V1", "V2"))
DF = DF %>% select(V4, V1, V2)

Convert data

?setDT # data.frame or list to data.table
## starting httpd help server ... done
?setDF # data.table to data.frame
?setattr # modify attributes
# 


Advanced use of by

Select first/last/… row by group

DT[, .SD[1], by = V4]
DT[, .SD[c(1, .N)], by = V4]
DT[, tail(.SD, 2), by = V4]
##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  4  6
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  7
## 3:  B  4  0
## 4:  B  1  5
## 5:  C  4  6
## 6:  C  1  9
##    V4 V1 V2
## 1:  A  0  4
## 2:  A  0  7
## 3:  B  4  8
## 4:  B  1  5
## 5:  C  1  0
## 6:  C  1  9
DF %>% group_by(V4) %>% slice(1)
DF %>% group_by(V4) %>% slice(1, n())
DF %>% group_by(V4) %>% group_map(~ tail(.x, 2))
## # A tibble: 3 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 B         4     0
## 3 C         4     6
## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 A         0     7
## 3 B         4     0
## 4 B         1     5
## 5 C         4     6
## 6 C         1     9
## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     4
## 2 A         0     7
## 3 B         4     8
## 4 B         1     5
## 5 C         1     0
## 6 C         1     9

Select rows using a nested query

DT[, .SD[which.min(V2)], by = V4]
##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  1  0
DF %>% group_by(V4) %>% arrange(V2) %>% slice(1)
## # A tibble: 3 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 B         4     0
## 3 C         1     0

Add a group counter column

DT[, Grp := .GRP, by = .(V4, V1)][]
DT[, Grp := NULL] # delete for consistency
##    V4 V1 V2 Grp
## 1:  A  0  3   1
## 2:  A  0  4   1
## 3:  A  0  7   1
## 4:  B  4  0   2
## 5:  B  4  8   2
## 6:  B  1  5   3
## 7:  C  4  6   4
## 8:  C  1  0   5
## 9:  C  1  9   5
DF %>% mutate(Grp = group_indices(., V4, V1))
## # A tibble: 9 x 4
##   V4       V1    V2   Grp
##   <chr> <int> <int> <int>
## 1 A         0     3     1
## 2 A         0     4     1
## 3 A         0     7     1
## 4 B         4     0     3
## 5 B         4     8     3
## 6 B         1     5     2
## 7 C         4     6     5
## 8 C         1     0     4
## 9 C         1     9     4

Get row number of first (and last) observation by group

DT[, .I, by = V4] # returns a data.table
DT[, .I[1], by = V4]
DT[, .I[c(1, .N)], by = V4]
##    V4 I
## 1:  A 1
## 2:  A 2
## 3:  A 3
## 4:  B 4
## 5:  B 5
## 6:  B 6
## 7:  C 7
## 8:  C 8
## 9:  C 9
##    V4 V1
## 1:  A  1
## 2:  B  4
## 3:  C  7
##    V4 V1
## 1:  A  1
## 2:  A  3
## 3:  B  4
## 4:  B  6
## 5:  C  7
## 6:  C  9
DF %>% group_by(V4) %>% group_data() %>% tidyr::unnest(.rows)
# DF %>% group_by(V4) %>% group_rows() # returns a list
#
#
## # A tibble: 9 x 2
##   V4    .rows
##   <chr> <int>
## 1 A         1
## 2 A         2
## 3 A         3
## 4 B         4
## 5 B         5
## 6 B         6
## 7 C         7
## 8 C         8
## 9 C         9

Handle list-columns by group

DT[, .(.(V1)), by = V4]  # returns V1 as a list
DT[, .(.(.SD)), by = V4] # subsets of the data
##    V4    V1
## 1:  A 0,0,0
## 2:  B 4,4,1
## 3:  C 4,1,1
##    V4           V1
## 1:  A <data.table>
## 2:  B <data.table>
## 3:  C <data.table>
DF %>% group_by(V4) %>% summarise(list(V1))
DF %>% group_by(V4) %>% group_nest()
## # A tibble: 3 x 2
##   V4    `list(V1)`
##   <chr> <list>    
## 1 A     <int [3]> 
## 2 B     <int [3]> 
## 3 C     <int [3]> 
## # A tibble: 3 x 2
##   V4    data            
##   <chr> <list>          
## 1 A     <tibble [3 x 2]>
## 2 B     <tibble [3 x 2]>
## 3 C     <tibble [3 x 2]>

Grouping sets (multiple by at once)

rollup(DT, .(SumV2 = sum(V2)), by = c("V1","V4")) 
rollup(DT, .(SumV2 = sum(V2), .N), by = c("V1","V4"), id = TRUE)
cube(  DT, .(SumV2 = sum(V2), .N), by = c("V1","V4"), id = TRUE)
groupingsets(DT,
             .(SumV2 = sum(V2), .N), by = c("V1","V4"), 
             sets = list("V1", c("V1", "V4")),
             id = TRUE)
##    V1   V4 SumV2
## 1:  0    A    14
## 2:  4    B     8
## 3:  1    B     5
## 4:  4    C     6
## 5:  1    C     9
## 6:  0 <NA>    14
## 7:  4 <NA>    14
## 8:  1 <NA>    14
## 9: NA <NA>    42
##    grouping V1   V4 SumV2 N
## 1:        0  0    A    14 3
## 2:        0  4    B     8 2
## 3:        0  1    B     5 1
## 4:        0  4    C     6 1
## 5:        0  1    C     9 2
## 6:        1  0 <NA>    14 3
## 7:        1  4 <NA>    14 3
## 8:        1  1 <NA>    14 3
## 9:        3 NA <NA>    42 9
##     grouping V1   V4 SumV2 N
##  1:        0  0    A    14 3
##  2:        0  4    B     8 2
##  3:        0  1    B     5 1
##  4:        0  4    C     6 1
##  5:        0  1    C     9 2
##  6:        1  0 <NA>    14 3
##  7:        1  4 <NA>    14 3
##  8:        1  1 <NA>    14 3
##  9:        2 NA    A    14 3
## 10:        2 NA    B    13 3
## 11:        2 NA    C    15 3
## 12:        3 NA <NA>    42 9
##    grouping V1   V4 SumV2 N
## 1:        1  0 <NA>    14 3
## 2:        1  4 <NA>    14 3
## 3:        1  1 <NA>    14 3
## 4:        0  0    A    14 3
## 5:        0  4    B     8 2
## 6:        0  1    B     5 1
## 7:        0  4    C     6 1
## 8:        0  1    C     9 2
# NA



Miscellaneous

Read / Write data

fread() and fwrite() are among the most powerful functions of data.table. Not only are they incredibly fast (see benchmarks), they are also extremely robust. The few commands below only scratch the surface and there are a lot of awesome features. For example, fread() accepts http and https URLs directly, as well as the output of operating system commands such as sed and awk. Make sure to check the docs.
Here again, fread() and fwrite() are very versatile and handle various file formats, while dplyr delegates file reading and writing to the readr package, which provides several format-specific functions (csv, tsv, delim, …).
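As a small illustration of that versatility, fread() can also parse a literal character string (any input containing \n is treated as data rather than as a filename). The URL and cmd lines below are hypothetical and shown commented out:

```r
library(data.table)

fread("a,b\n1,2\n3,4") # literal input parsed directly
#    a b
# 1: 1 2
# 2: 3 4

# hypothetical examples of other accepted inputs (not run):
# fread("https://example.com/data.csv")   # http/https URL
# fread(cmd = "grep -v '^#' DT.csv")      # output of a shell command
```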

Write data to a csv file

fwrite(DT, "DT.csv")
readr::write_csv(DF, "DF.csv")

Write data to a tab-delimited file

fwrite(DT, "DT.txt", sep = "\t")
readr::write_delim(DF, "DF.txt", delim = "\t")

Write list-column data to a csv file

fwrite(setDT(list(0, list(1:5))), "DT2.csv")
# NA

Read a csv / tab-delimited file

fread("DT.csv")
# fread("DT.csv", verbose = TRUE) # full details
fread("DT.txt", sep = "\t")
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  6
## 8:  C  1  0
## 9:  C  1  9
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  6
## 8:  C  1  0
## 9:  C  1  9
readr::read_csv("DF.csv")
## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )
readr::read_delim("DF.txt", delim = "\t")
## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )
## # A tibble: 9 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 4 B         4     0
## 5 B         4     8
## 6 B         1     5
## 7 C         4     6
## 8 C         1     0
## 9 C         1     9
## # A tibble: 9 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 4 B         4     0
## 5 B         4     8
## 6 B         1     5
## 7 C         4     6
## 8 C         1     0
## 9 C         1     9

Read a csv file selecting / dropping columns

fread("DT.csv", select = c("V1", "V4"))
fread("DT.csv", drop = "V4")
##    V1 V4
## 1:  0  A
## 2:  0  A
## 3:  0  A
## 4:  4  B
## 5:  4  B
## 6:  1  B
## 7:  4  C
## 8:  1  C
## 9:  1  C
##    V1 V2
## 1:  0  3
## 2:  0  4
## 3:  0  7
## 4:  4  0
## 5:  4  8
## 6:  1  5
## 7:  4  6
## 8:  1  0
## 9:  1  9
# NA

Read and rbind several files

rbindlist(lapply(c("DT.csv", "DT.csv"), fread))
#c("DT.csv", "DT.csv") %>% lapply(fread) %>% rbindlist
##     V4 V1 V2
##  1:  A  0  3
##  2:  A  0  4
##  3:  A  0  7
##  4:  B  4  0
##  5:  B  4  8
##  6:  B  1  5
##  7:  C  4  6
##  8:  C  1  0
##  9:  C  1  9
## 10:  A  0  3
## 11:  A  0  4
## 12:  A  0  7
## 13:  B  4  0
## 14:  B  4  8
## 15:  B  1  5
## 16:  C  4  6
## 17:  C  1  0
## 18:  C  1  9
c("DF.csv", "DF.csv") %>% purrr::map_dfr(readr::read_csv)
## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )
## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )
## # A tibble: 18 x 3
##    V4       V1    V2
##    <chr> <dbl> <dbl>
##  1 A         0     3
##  2 A         0     4
##  3 A         0     7
##  4 B         4     0
##  5 B         4     8
##  6 B         1     5
##  7 C         4     6
##  8 C         1     0
##  9 C         1     9
## 10 A         0     3
## 11 A         0     4
## 12 A         0     7
## 13 B         4     0
## 14 B         4     8
## 15 B         1     5
## 16 C         4     6
## 17 C         1     0
## 18 C         1     9


Reshape data

See here and here for more details.

Melt data (from wide to long)

melt(DT, id.vars = "V4")
mDT = melt(DT, 
           id.vars       = "V4",
           measure.vars  = c("V1", "V2"),
           variable.name = "Variable",
           value.name    = "Value")
##     V4 variable value
##  1:  A       V1     0
##  2:  A       V1     0
##  3:  A       V1     0
##  4:  B       V1     4
##  5:  B       V1     4
##  6:  B       V1     1
##  7:  C       V1     4
##  8:  C       V1     1
##  9:  C       V1     1
## 10:  A       V2     3
## 11:  A       V2     4
## 12:  A       V2     7
## 13:  B       V2     0
## 14:  B       V2     8
## 15:  B       V2     5
## 16:  C       V2     6
## 17:  C       V2     0
## 18:  C       V2     9
DF %>% tidyr::gather(variable, value, -V4)
mDF = DF %>% tidyr::gather(key = Variable, value = Value, -V4)
# tidyr >= 1.0: DF %>% tidyr::pivot_longer(-V4, names_to = "Variable", values_to = "Value")
## # A tibble: 18 x 3
##    V4    variable value
##    <chr> <chr>    <int>
##  1 A     V1           0
##  2 A     V1           0
##  3 A     V1           0
##  4 B     V1           4
##  5 B     V1           4
##  6 B     V1           1
##  7 C     V1           4
##  8 C     V1           1
##  9 C     V1           1
## 10 A     V2           3
## 11 A     V2           4
## 12 A     V2           7
## 13 B     V2           0
## 14 B     V2           8
## 15 B     V2           5
## 16 C     V2           6
## 17 C     V2           0
## 18 C     V2           9

Cast data (from long to wide)

dcast(mDT, V4 ~ Variable) # aggregate by count
## Using 'Value' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'
dcast(mDT, V4 ~ Variable, fun.aggregate = sum)
## Using 'Value' as value column. Use 'value.var' to override
dcast(mDT, V4 ~ Value > 5)
## Using 'Value' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'
# see ?dcast: multiple values / fun.aggregate
##    V4 V1 V2
## 1:  A  3  3
## 2:  B  3  3
## 3:  C  3  3
##    V4 V1 V2
## 1:  A  0 14
## 2:  B  9 13
## 3:  C  6 15
##    V4 FALSE TRUE
## 1:  A     5    1
## 2:  B     5    1
## 3:  C     4    2
tidyr::spread(count(mDF, V4, Variable), Variable, n, fill = 0)
# tidyr >= 1.0: tidyr::pivot_wider(count(mDF, V4, Variable), names_from = Variable, values_from = n)
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         3     3
## 2 B         3     3
## 3 C         3     3

Split

split(DT, by = "V4") # S3 method
## $A
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 
## $B
##    V4 V1 V2
## 1:  B  4  0
## 2:  B  4  8
## 3:  B  1  5
## 
## $C
##    V4 V1 V2
## 1:  C  4  6
## 2:  C  1  0
## 3:  C  1  9
group_split(DF, V4)
## [[1]]
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 
## [[2]]
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 B         4     0
## 2 B         4     8
## 3 B         1     5
## 
## [[3]]
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 C         4     6
## 2 C         1     0
## 3 C         1     9

Split and transpose a vector/column

vec = c("A:a","B:b","C:c")
tstrsplit(vec, split = ":", keep = 2L) # works on vector
setDT(tstrsplit(vec, split = ":"))[]
## [[1]]
## [1] "a" "b" "c"
## 
##    V1 V2
## 1:  A  a
## 2:  B  b
## 3:  C  c
vec = c("A:a","B:b","C:c")
# vector not handled
tibble(vec) %>% tidyr::separate(vec, c("V1", "V2"))
## # A tibble: 3 x 2
##   V1    V2   
##   <chr> <chr>
## 1 A     a    
## 2 B     b    
## 3 C     c


Other

Check package installation

# test.data.table()
# There are more lines of test code in data.table than there is code!
# NA

List data.tables/tibbles

tables()
##    NAME NROW NCOL MB              COLS KEY
## 1:   DT    9    3  0          V4,V1,V2    
## 2:  mDT   18    3  0 V4,Variable,Value    
## Total: 0MB
# ?

Get/Set number of threads when parallelized

getDTthreads() # setDTthreads()
## [1] 2
# NA

Lead/Lag

shift(1:10, n = 1,   fill = NA, type = "lag")
shift(1:10, n = 1:2, fill = NA, type = "lag") # multiple
shift(1:10, n = 1,   fill = NA, type = "lead")
##  [1] NA  1  2  3  4  5  6  7  8  9
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
## 
##  [1]  2  3  4  5  6  7  8  9 10 NA
lag(1:10, n = 1, default = NA)
# NA
lead(1:10, n = 1, default = NA)
##  [1] NA  1  2  3  4  5  6  7  8  9
##  [1]  2  3  4  5  6  7  8  9 10 NA

Generate run-length ids

rleid(rep(c("a", "b", "a"), each = 3)) # see also ?rleidv
rleid(rep(c("a", "b", "a"), each = 3), prefix = "G")
## [1] 1 1 1 2 2 2 3 3 3
## [1] "G1" "G1" "G1" "G2" "G2" "G2" "G3" "G3" "G3"
# NA

Vectorised ifelse statements

# NA
x = 1:10
case_when(
  x %% 6 == 0 ~ "fizz buzz",
  x %% 2 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
##  [1] "1"         "fizz"      "buzz"      "fizz"      "5"        
##  [6] "fizz buzz" "7"         "fizz"      "buzz"      "fizz"
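As an aside: newer data.table releases also offer vectorised alternatives — fifelse() (>= 1.12.2) and fcase() (>= 1.12.4) — which mirror dplyr::if_else() and dplyr::case_when(). A hedged sketch, assuming one of those versions is installed:

```r
library(data.table)

x = 1:10
fifelse(x %% 2 == 0, "fizz", "-") # fast vectorised if/else

fcase(
  x %% 6 == 0,          "fizz buzz",
  x %% 2 == 0,          "fizz",
  x %% 3 == 0,          "buzz",
  rep(TRUE, length(x)), as.character(x) # catch-all condition
)
```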

Rolling functions

# todo
#
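Pending that section, a minimal sketch: data.table ships rolling aggregates such as frollmean() and frollsum() (>= 1.12.0); dplyr has no built-in rolling functions, so companion packages like zoo or slider are commonly used (shown commented, as typical-usage assumptions):

```r
library(data.table)

frollmean(1:5, n = 3)                   # right-aligned window: NA NA 2 3 4
frollmean(1:5, n = 3, align = "center") # NA 2 3 4 NA
frollsum(1:5, n = 2)                    # NA 3 5 7 9

# dplyr side -- no rolling verbs; commonly used companions (not run):
# zoo::rollmean(1:5, k = 3)
# slider::slide_dbl(1:5, mean, .before = 2, .complete = TRUE)
```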



Join/Bind data sets

Join

Joining data in data.table works like the fast subsetting approach described above. It can be performed using keys, using the ad hoc on argument, or using the merge.data.table method. For the sake of completeness, the three methods are presented below. As previously mentioned, the on and by (in merge) arguments are optional with keyed data.tables, but recommended to make the code more explicit.
In the examples below, the x, y, and z data.tables are also used with dplyr.

ix = seq(2L, 8L, 2L); iy = ix - 1L
x = data.table(Id = LETTERS[c(1,2,3,3)], X1 = iy, XY = paste0("x", ix), key = "Id")
y = data.table(Id = LETTERS[c(1,2,2,4)], Y1 = iy, XY = paste0("y", iy), key = "Id")
x; y
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
## 3:  C  5 x6
## 4:  C  7 x8
##    Id Y1 XY
## 1:  A  1 y1
## 2:  B  3 y3
## 3:  B  5 y5
## 4:  D  7 y7

Join matching rows from y to x

y[x, on = "Id"] 
merge(x, y, all.x = TRUE, by = "Id")
y[x] # requires keys
##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8
left_join(x, y, by = "Id")
##   Id X1 XY.x Y1 XY.y
## 1  A  1   x2  1   y1
## 2  B  3   x4  3   y3
## 3  B  3   x4  5   y5
## 4  C  5   x6 NA <NA>
## 5  C  7   x8 NA <NA>

Join matching rows from x to y

x[y, on = "Id"]
merge(x, y, all.y = TRUE, by = "Id")
x[y] # requires keys
##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
right_join(x, y, by = "Id")
##   Id X1 XY.x Y1 XY.y
## 1  A  1   x2  1   y1
## 2  B  3   x4  3   y3
## 3  B  3   x4  5   y5
## 4  D NA <NA>  7   y7

Join matching rows from both x and y

x[y, on = "Id", nomatch = 0]
merge(x, y)
x[y, nomatch = 0] # requires keys
##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5
inner_join(x, y, by = "Id")
##   Id X1 XY.x Y1 XY.y
## 1  A  1   x2  1   y1
## 2  B  3   x4  3   y3
## 3  B  3   x4  5   y5

Join keeping all the rows

merge(x, y, all = TRUE, by = "Id")
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
## 6:  D NA <NA>  7   y7
full_join(x, y, by = "Id")
##   Id X1 XY.x Y1 XY.y
## 1  A  1   x2  1   y1
## 2  B  3   x4  3   y3
## 3  B  3   x4  5   y5
## 4  C  5   x6 NA <NA>
## 5  C  7   x8 NA <NA>
## 6  D NA <NA>  7   y7

Return rows from x matching y

unique(x[y$Id, on = "Id", nomatch = 0])
unique(x[y$Id, nomatch = 0]) # requires keys
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
semi_join(x, y, by = "Id")
##   Id X1 XY
## 1  A  1 x2
## 2  B  3 x4

Return rows from x not matching y

x[!y, on = "Id"]
x[!y] # requires keys
##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8
##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8
anti_join(x, y, by = "Id")
##   Id X1 XY
## 1  C  5 x6
## 2  C  7 x8


More joins

data.table excels at joining data, and offers additional functions and features.

Select columns while joining

x[y, .(Id, X1, i.XY)]   # i. prefix refers to cols in y
x[y, .(Id, x.XY, i.XY)] # x. prefix refers to cols in x
##    Id X1 i.XY
## 1:  A  1   y1
## 2:  B  3   y3
## 3:  B  3   y5
## 4:  D NA   y7
##    Id x.XY i.XY
## 1:  A   x2   y1
## 2:  B   x4   y3
## 3:  B   x4   y5
## 4:  D <NA>   y7
right_join(select(x, Id, X1), select(y, Id, XY), by = "Id")
right_join(select(x, Id, XY), select(y, Id, XY), by = "Id")
##   Id X1 XY
## 1  A  1 y1
## 2  B  3 y3
## 3  B  3 y5
## 4  D NA y7
##   Id XY.x XY.y
## 1  A   x2   y1
## 2  B   x4   y3
## 3  B   x4   y5
## 4  D <NA>   y7

Aggregate columns while joining

y[x, .(X1Y1 = sum(Y1) * X1), by = .EACHI]
##    Id X1Y1
## 1:  A    1
## 2:  B   24
## 3:  C   NA
## 4:  C   NA
y %>% group_by(Id) %>% summarise(SumY1 = sum(Y1)) %>% 
  right_join(x) %>% mutate(X1Y1 = SumY1 * X1) %>% select(Id, X1Y1)
## Joining, by = "Id"
## # A tibble: 4 x 2
##   Id     X1Y1
##   <chr> <int>
## 1 A         1
## 2 B        24
## 3 C        NA
## 4 C        NA

Update columns while joining

y[x, SqX1 := i.X1^2]
y[, SqX1 := x[.BY, X1^2, on = "Id"], by = Id] # more memory-efficient
y[, SqX1 := NULL] # rm column for consistency
x %>% select(Id, X1) %>% mutate(SqX1 = X1^2) %>% 
  right_join(y, by = "Id") %>% select(names(y), SqX1)
##   Id Y1 XY SqX1
## 1  A  1 y1    1
## 2  B  3 y3    9
## 3  B  5 y5    9
## 4  D  7 y7   NA

Add a list column with rows from y matching x (nest join)

x[, y := .(.(y[.BY, on = "Id"])), by = Id]
x[, y := NULL] # rm column for consistency
x %>% nest_join(y, by = "Id")
##   Id X1 XY            y
## 1  A  1 x2        1, y1
## 2  B  3 x4 3, 5, y3, y5
## 3  C  5 x6             
## 4  C  7 x8

Update columns while joining (using vectors of colnames)

cols  = c("NewXY", "NewX1")
icols = paste0("i.", c("XY", "X1"))
y[x, (cols) := mget(icols)]
y[, (cols) := NULL] # rm columns for consistency
# ?

Join passing columns to match in the on argument

z = data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x
##    ID Z1 Z2
## 1:  C  5 z5
## 2:  C  6 z6
## 3:  C  7 z7
## 4:  C  8 z8
## 5:  C  9 z9
##    Id X1 XY X2
## 1:  A  1 x2 x1
## 2:  B  3 x4 x3
## 3:  C  5 x6 x5
## 4:  C  7 x8 x7
x[z, on = "X1==Z1"]
x[z, on = .(X1==Z1)] # same
x[z, on = .(Id==ID, X1==Z1)] # using two columns
##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x6   x5 z5
## 2:  C  6 <NA> <NA> z6
## 3:  C  7   x8   x7 z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9
right_join(x, z, by = c("X1" = "Z1"))
right_join(x, z, by = c("Id" = "ID", "X1" = "Z1"))
##     Id X1   XY   X2 ID Z2
## 1    C  5   x6   x5  C z5
## 2 <NA>  6 <NA> <NA>  C z6
## 3    C  7   x8   x7  C z7
## 4 <NA>  8 <NA> <NA>  C z8
## 5 <NA>  9 <NA> <NA>  C z9
##   Id X1   XY   X2 Z2
## 1  C  5   x6   x5 z5
## 2  C  6 <NA> <NA> z6
## 3  C  7   x8   x7 z7
## 4  C  8 <NA> <NA> z8
## 5  C  9 <NA> <NA> z9

Non-equi joins

x[z, on = .(Id==ID, X1<=Z1)]
x[z, on = .(Id==ID, X1>Z1)]
x[z, on = .(X1<Z1), allow.cartesian = TRUE] # allows 'numerous' matching values
##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x6 x5 z7
## 4:  C  7 x8 x7 z7
## 5:  C  8 x6 x5 z8
## 6:  C  8 x8 x7 z8
## 7:  C  9 x6 x5 z9
## 8:  C  9 x8 x7 z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x8   x7 z5
## 2:  C  6   x8   x7 z6
## 3:  C  7 <NA> <NA> z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9
##     Id X1 XY X2 ID Z2
##  1:  A  5 x2 x1  C z5
##  2:  B  5 x4 x3  C z5
##  3:  A  6 x2 x1  C z6
##  4:  B  6 x4 x3  C z6
##  5:  C  6 x6 x5  C z6
##  6:  A  7 x2 x1  C z7
##  7:  B  7 x4 x3  C z7
##  8:  C  7 x6 x5  C z7
##  9:  A  8 x2 x1  C z8
## 10:  B  8 x4 x3  C z8
## 11:  C  8 x6 x5  C z8
## 12:  C  8 x8 x7  C z8
## 13:  A  9 x2 x1  C z9
## 14:  B  9 x4 x3  C z9
## 15:  C  9 x6 x5  C z9
## 16:  C  9 x8 x7  C z9
# NA

Rolling joins/subsets (performed on the last join column)

# Nearest
x[z, on = .(Id==ID, X1==Z1), roll = "nearest"] 
## below, simplified examples with ad hoc subsets on a keyed data.table
setkey(x, Id, X1)
x[.("C", 5:9), roll = "nearest"]
##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x8 x7 z7
## 4:  C  8 x8 x7 z8
## 5:  C  9 x8 x7 z9
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
# NA
# Last Observation Carried Forward
x[.("C", 5:9), roll = Inf]
x[.("C", 5:9), roll = 0.5]  # bounded
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, TRUE)]  # default
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, FALSE)] # ends not rolled 
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x6   x5
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
# NA
# Next Observation Carried Backward
x[.("C", 5:9), roll = -Inf]
x[.("C", 5:9), roll = -0.5] # bounded
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, FALSE)]
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, TRUE)]  # roll both ends
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x8 x7
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
# NA

Cross join (CJ ~ expand.grid)

CJ(c(2,1,1), 3:2)
CJ(c(2,1,1), 3:2, sorted = FALSE, unique = TRUE)
##    V1 V2
## 1:  1  2
## 2:  1  2
## 3:  1  3
## 4:  1  3
## 5:  2  2
## 6:  2  3
##    V1 V2
## 1:  2  3
## 2:  2  2
## 3:  1  3
## 4:  1  2
# base::expand.grid(c(2,1,1), 3:2)
# NA

Overlap join

It is worth mentioning foverlaps() from data.table, which performs 'overlap joins', i.e. matches rows whose interval columns overlap. It is a very powerful function, but a bit out of the scope of this document. See these slides for more details.
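Still, to give a flavour of what foverlaps() does, below is a minimal sketch. The two interval tables (events and windows) are made up for this example; they are not the x/z tables used above.

```r
library(data.table)
events  = data.table(start = c(1, 5),  end = c(4, 9))
windows = data.table(start = c(3, 10), end = c(6, 12))
setkey(windows, start, end)              # the 'y' table must be keyed on its interval columns
foverlaps(events, windows, type = "any") # each row of 'events' matched to the windows it overlaps
```

Both rows of events overlap the [3, 6] window, so each is paired with it; the [10, 12] window overlaps nothing and does not appear in the result.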



Bind

x = data.table(1:3)
y = data.table(4:6)
z = data.table(7:9, 0L)

Bind rows

rbind(x, y)
rbind(x, z, fill = TRUE)
##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
##    V1 V2
## 1:  1 NA
## 2:  2 NA
## 3:  3 NA
## 4:  7  0
## 5:  8  0
## 6:  9  0
bind_rows(x, y)
bind_rows(x, z) # always fills
##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
##    V1 V2
## 1:  1 NA
## 2:  2 NA
## 3:  3 NA
## 4:  7  0
## 5:  8  0
## 6:  9  0

Bind rows using a list

rbindlist(list(x, y), idcol = TRUE)
##    .id V1
## 1:   1  1
## 2:   1  2
## 3:   1  3
## 4:   2  4
## 5:   2  5
## 6:   2  6
bind_rows(list(x, y), .id = "id")
##    id V1
## 1:  1  1
## 2:  1  2
## 3:  1  3
## 4:  2  4
## 5:  2  5
## 6:  2  6

Bind columns

base::cbind(x, y)
##    V1 V1
## 1:  1  4
## 2:  2  5
## 3:  3  6
bind_cols(x, y)
##    V1 V11
## 1:  1   4
## 2:  2   5
## 3:  3   6


Set operations

x = data.table(c(1,2,2,3,3))
y = data.table(c(2,2,3,4,4))

Intersection

fintersect(x, y)
fintersect(x, y, all = TRUE)
##    V1
## 1:  2
## 2:  3
##    V1
## 1:  2
## 2:  2
## 3:  3
dplyr::intersect(x, y)
# no all option
##    V1
## 1:  2
## 2:  3

Difference

fsetdiff(x, y)
fsetdiff(x, y, all = TRUE)
##    V1
## 1:  1
##    V1
## 1:  1
## 2:  3
dplyr::setdiff(x, y)
# no all option
##    V1
## 1:  1

Union

funion(x, y)
funion(x, y, all = TRUE)
##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
##     V1
##  1:  1
##  2:  2
##  3:  2
##  4:  3
##  5:  3
##  6:  2
##  7:  2
##  8:  3
##  9:  4
## 10:  4
dplyr::union(x, y)
union_all(x, y)
##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
##     V1
##  1:  1
##  2:  2
##  3:  2
##  4:  3
##  5:  3
##  6:  2
##  7:  2
##  8:  3
##  9:  4
## 10:  4

Equality

fsetequal(x, x[order(-V1),])
all.equal(x, x) # S3 method
## [1] TRUE
## [1] TRUE
setequal(x, x[order(-V1),])
all_equal(x, x)
## [1] TRUE
## [1] TRUE



Summary

This article presented the most important features of data.table and dplyr, two packages that have become essential tools for data manipulation in R.
Many features remain uncovered; in particular, the data.table functions dedicated to time series and the dplyr vectorized functions have not been discussed. But 'done is better than perfect'…
Hopefully this comparison is not too biased, though I must admit that my preference goes to data.table. So, I hope this post will encourage some readers to give it a try!


sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] dplyr_0.8.0.1     knitr_1.21        data.table_1.12.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       magrittr_1.5     hms_0.4.2        tidyselect_0.2.5
##  [5] R6_2.3.0         rlang_0.3.1      fansi_0.4.0      stringr_1.3.1   
##  [9] tools_3.3.2      xfun_0.4.11      utf8_1.1.4       cli_1.0.1       
## [13] htmltools_0.3.6  yaml_2.2.0       digest_0.6.18    assertthat_0.2.0
## [17] tibble_2.0.1     crayon_1.3.4     bookdown_0.9     readr_1.2.1     
## [21] tidyr_0.8.0      purrr_0.2.5      glue_1.3.0       evaluate_0.12   
## [25] rmarkdown_1.11   blogdown_0.10    stringi_1.2.4    pillar_1.3.1    
## [29] pkgconfig_2.0.2