Written by Atrebas
on March 3, 2019

A data.table and dplyr tour

Introduction

About

This document has been been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems. It has been written for my own self-teaching and may contain errors or imprecisions. Corrections and suggestions are welcome.
Resources for data.table can be found on the data.table wiki, in the data.table vignettes, and in the package documentation.
Reference documents for dplyr include the dplyr cheat sheet, the dplyr vignettes, and the package documentation.
For the sake of readability, the console outputs are hidden by default. Click on the button below to show or hide the outputs. Also, the R code used in this document is independently available and can be easily reproduced.

Update

2020-07-26: Update with data.table_1.13.0 (fcase(), functions in .SDcols) and dplyr_1.0.0 (across(), relocate(), …). ‘Superseded’ dplyr code has been commented and will probably be removed in the future.
2019-06-22: Coding style improved (thanks to @tylerburleigh for pointing that out). Some data.table expressions with by reordered and indented to better highlight the similarity with dplyr. Minor comments and improvements.

data.table and dplyr

data.table and dplyr are two R packages that both aim at an easier and more efficient manipulation of data frames. But while they share a lot of functionalities, their philosophies are quite different. Below is a quick overview of the main differences (from my basic user’s perspective).

Syntax:
- The general data.table syntax is as follows: DT[i, j, by, ...] which means: “Take DT, subset rows using i, then calculate j, grouped by by” with possible extra options .... It allows to combine several operations in a very concise and consistent expression.
- The syntax of dplyr is based on key verbs corresponding to the most common operations: filter(), arrange(), select(), mutate(), summarise(), … These functions can be combine with group_by() to aggregate data ‘by group’ and with a bunch of helper functions. It is a ‘do one thing at a time’ approach, chaining together functions dedicated to a specific task.

Ecosystem:
The data.table package has no dependency whereas dplyr is part of the tidyverse. So, for example, while data.table includes functions to read, write, or reshape data, dplyr delegates these tasks to companion packages like readr or tidyr.
On the other hand, data.table is focused on the processing of local in-memory data, but dplyr offers a database backend.

Memory management and performance:
In data.table, objects can be manipulated ‘by reference’ (using the set*() functions or with the := symbol). It means that the data will be modified but not copied, minimizing the RAM requirements. The behaviour of dplyr is similar to the one of base R.
Memory management, parallelism, and shrewd optimization give data.table the advantage in terms of performance.

Create example data

library(data.table)
set.seed(1L)

## Create a data table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                V2 = 1:9,
                V3 = c(0.5, 1.0, 1.5),
                V4 = rep(LETTERS[1:3], 3))

class(DT)
DT

## [1] "data.table" "data.frame"
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C

library(dplyr)
set.seed(1L)

## Create a data frame (tibble)
DF <- tibble(V1 = rep(c(1L, 2L), 5)[-10],
             V2 = 1:9,
             V3 = rep(c(0.5, 1.0, 1.5), 3),
             V4 = rep(LETTERS[1:3], 3))

class(DF)
DF

## [1] "tbl_df"     "tbl"        "data.frame"
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     1     3   1.5 C    
## 4     2     4   0.5 A    
## 5     1     5   1   B    
## 6     2     6   1.5 C    
## 7     1     7   0.5 A    
## 8     2     8   1   B    
## 9     1     9   1.5 C

Below, the data.table code uses DT and the dplyr code uses DF. Also, the dplyr code uses the %>% operator: a basic knowledge of the magrittr syntax is assumed.

Basic operations

Filter rows

Filter rows using indices

DT[3:4,]
DT[3:4] # same

##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A

DF[3:4,]
slice(DF, 3:4) # same

## # A tibble: 2 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3   1.5 C    
## 2     2     4   0.5 A    
## # A tibble: 2 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3   1.5 C    
## 2     2     4   0.5 A

Discard rows using negative indices

DT[!3:7,]
DT[-(3:7)] # same

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C

DF[-(3:7),]
slice(DF, -(3:7)) # same

## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     2     8   1   B    
## 4     1     9   1.5 C    
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     2     8   1   B    
## 4     1     9   1.5 C

Filter rows using a logical expression

DT[V2 > 5]
DT[V4 %chin% c("A", "C")] # fast %in% for character

##    V1 V2  V3 V4
## 1:  2  6 1.5  C
## 2:  1  7 0.5  A
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  1  3 1.5  C
## 3:  2  4 0.5  A
## 4:  2  6 1.5  C
## 5:  1  7 0.5  A
## 6:  1  9 1.5  C

filter(DF, V2 > 5)
filter(DF, V4 %in% c("A", "C"))

## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     6   1.5 C    
## 2     1     7   0.5 A    
## 3     2     8   1   B    
## 4     1     9   1.5 C    
## # A tibble: 6 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     1     3   1.5 C    
## 3     2     4   0.5 A    
## 4     2     6   1.5 C    
## 5     1     7   0.5 A    
## 6     1     9   1.5 C

Filter rows using multiple conditions

DT[V1 == 1 & V4 == "A"]
# any logical criteria can be used

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  1  7 0.5  A

filter(DF, V1 == 1, V4 == "A")
# any logical criteria can be used

## # A tibble: 2 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     1     7   0.5 A

Filter unique rows

unique(DT)
unique(DT, by = c("V1", "V4")) # returns all cols

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C

distinct(DF) # distinct_all(DF)
distinct(DF, across(c(V1, V4))) # returns selected cols
# distinct_at(DF, vars(V1, V4))

## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     1     3   1.5 C    
## 4     2     4   0.5 A    
## 5     1     5   1   B    
## 6     2     6   1.5 C    
## 7     1     7   0.5 A    
## 8     2     8   1   B    
## 9     1     9   1.5 C    
## # A tibble: 6 x 2
##      V1 V4   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     1 C    
## 4     2 A    
## 5     1 B    
## 6     2 C

Discard rows with missing values

na.omit(DT, cols = 1:4)  # fast S3 method with cols argument

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C

tidyr::drop_na(DF, names(DF))

## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     1     3   1.5 C    
## 4     2     4   0.5 A    
## 5     1     5   1   B    
## 6     2     6   1.5 C    
## 7     1     7   0.5 A    
## 8     2     8   1   B    
## 9     1     9   1.5 C

Other filters

In addition to the main filter() function, dplyr also offers slice_*() helpers. With data.table, we can simply use a custom expression in DT[i].

DT[sample(.N, 3)] # .N = nb of rows in DT
DT[sample(.N, .N / 2)]
DT[frankv(-V1, ties.method = "dense") < 2]

##    V1 V2  V3 V4
## 1:  1  9 1.5  C
## 2:  2  4 0.5  A
## 3:  1  7 0.5  A
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  5 1.0  B
## 4:  1  3 1.5  C
##    V1 V2  V3 V4
## 1:  2  2 1.0  B
## 2:  2  4 0.5  A
## 3:  2  6 1.5  C
## 4:  2  8 1.0  B

slice_sample(DF, n = 3)      # n random rows
# sample_n(DF, 3)
slice_sample(DF, prop = 0.5) # fraction of rows
# sample_frac(DF, 0.5)
DF %>% slice_max(V1, n = 1)
# top_n(DF, 1, V1)

## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     2   1   B    
## 2     1     3   1.5 C    
## 3     2     8   1   B    
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     1     5   1   B    
## 3     2     8   1   B    
## 4     2     2   1   B    
## # A tibble: 4 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     2   1   B    
## 2     2     4   0.5 A    
## 3     2     6   1.5 C    
## 4     2     8   1   B

On the other hand, data.table also provides convenience functions to filter rows based on a regular expression or to find values lying in one (or several) interval(s).

DT[V4 %like% "^B"]
DT[V2 %between% c(3, 5)]
DT[data.table::between(V2, 3, 5, incbounds = FALSE)]
DT[V2 %inrange% list(-1:1, 1:3)] # see also ?inrange

##    V1 V2 V3 V4
## 1:  2  2  1  B
## 2:  1  5  1  B
## 3:  2  8  1  B
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
## 3:  1  5 1.0  B
##    V1 V2  V3 V4
## 1:  2  4 0.5  A
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C

filter(DF, grepl("^B", V4))
filter(DF, dplyr::between(V2, 3, 5))
filter(DF, V2 > 3 & V2 < 5)
filter(DF, V2 >= -1:1 & V2 <= 1:3)

## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     2     1 B    
## 2     1     5     1 B    
## 3     2     8     1 B    
## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3   1.5 C    
## 2     2     4   0.5 A    
## 3     1     5   1   B    
## # A tibble: 1 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     2     4   0.5 A    
## # A tibble: 3 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     1     3   1.5 C

Below, we will see that data.table has two optimized mechanisms to filter rows efficiently (keys and indices).

Sort rows

Sort rows by column

DT[order(V3)]  # see also setorder

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  4 0.5  A
## 3:  1  7 0.5  A
## 4:  2  2 1.0  B
## 5:  1  5 1.0  B
## 6:  2  8 1.0  B
## 7:  1  3 1.5  C
## 8:  2  6 1.5  C
## 9:  1  9 1.5  C

arrange(DF, V3)

## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     4   0.5 A    
## 3     1     7   0.5 A    
## 4     2     2   1   B    
## 5     1     5   1   B    
## 6     2     8   1   B    
## 7     1     3   1.5 C    
## 8     2     6   1.5 C    
## 9     1     9   1.5 C

Sort rows in decreasing order

DT[order(-V3)]

##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  6 1.5  C
## 3:  1  9 1.5  C
## 4:  2  2 1.0  B
## 5:  1  5 1.0  B
## 6:  2  8 1.0  B
## 7:  1  1 0.5  A
## 8:  2  4 0.5  A
## 9:  1  7 0.5  A

arrange(DF, desc(V3))

## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     3   1.5 C    
## 2     2     6   1.5 C    
## 3     1     9   1.5 C    
## 4     2     2   1   B    
## 5     1     5   1   B    
## 6     2     8   1   B    
## 7     1     1   0.5 A    
## 8     2     4   0.5 A    
## 9     1     7   0.5 A

Sort rows based on several columns

DT[order(V1, -V2)]

##    V1 V2  V3 V4
## 1:  1  9 1.5  C
## 2:  1  7 0.5  A
## 3:  1  5 1.0  B
## 4:  1  3 1.5  C
## 5:  1  1 0.5  A
## 6:  2  8 1.0  B
## 7:  2  6 1.5  C
## 8:  2  4 0.5  A
## 9:  2  2 1.0  B

arrange(DF, V1, desc(V2))

## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     9   1.5 C    
## 2     1     7   0.5 A    
## 3     1     5   1   B    
## 4     1     3   1.5 C    
## 5     1     1   0.5 A    
## 6     2     8   1   B    
## 7     2     6   1.5 C    
## 8     2     4   0.5 A    
## 9     2     2   1   B

Select columns

Select one column using an index (not recommended)

DT[[3]] # returns a vector
DT[, 3]  # returns a data.table

## [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5
##     V3
## 1: 0.5
## 2: 1.0
## 3: 1.5
## 4: 0.5
## 5: 1.0
## 6: 1.5
## 7: 0.5
## 8: 1.0
## 9: 1.5

DF[[3]] # returns a vector
DF[3]   # returns a tibble

## [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5
## # A tibble: 9 x 1
##      V3
##   <dbl>
## 1   0.5
## 2   1  
## 3   1.5
## 4   0.5
## 5   1  
## 6   1.5
## 7   0.5
## 8   1  
## 9   1.5

Select one column using column name

DT[, list(V2)] # returns a data.table
DT[, .(V2)]    # returns a data.table
# . is an alias for list
DT[, "V2"]     # returns a data.table
DT[, V2]       # returns a vector
DT[["V2"]]     # returns a vector

##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
## [1] 1 2 3 4 5 6 7 8 9
## [1] 1 2 3 4 5 6 7 8 9

select(DF, V2) # returns a tibble
pull(DF, V2, name = V4)   # returns a (named) vector
DF[, "V2"]        # returns a tibble
DF[["V2"]]        # returns a vector

## # A tibble: 9 x 1
##      V2
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9
## A B C A B C A B C 
## 1 2 3 4 5 6 7 8 9 
## # A tibble: 9 x 1
##      V2
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9
## [1] 1 2 3 4 5 6 7 8 9

Select several columns

DT[, .(V2, V3, V4)]
DT[, list(V2, V3, V4)]
DT[, V2:V4] # select columns between V2 and V4

##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C

select(DF, V2, V3, V4)
select(DF, V2:V4) # select columns between V2 and V4

## # A tibble: 9 x 3
##      V2    V3 V4   
##   <int> <dbl> <chr>
## 1     1   0.5 A    
## 2     2   1   B    
## 3     3   1.5 C    
## 4     4   0.5 A    
## 5     5   1   B    
## 6     6   1.5 C    
## 7     7   0.5 A    
## 8     8   1   B    
## 9     9   1.5 C    
## # A tibble: 9 x 3
##      V2    V3 V4   
##   <int> <dbl> <chr>
## 1     1   0.5 A    
## 2     2   1   B    
## 3     3   1.5 C    
## 4     4   0.5 A    
## 5     5   1   B    
## 6     6   1.5 C    
## 7     7   0.5 A    
## 8     8   1   B    
## 9     9   1.5 C

Exclude columns

DT[, !c("V2", "V3")]

##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C

select(DF, -V2, -V3)

## # A tibble: 9 x 2
##      V1 V4   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     1 C    
## 4     2 A    
## 5     1 B    
## 6     2 C    
## 7     1 A    
## 8     2 B    
## 9     1 C

Select/Exclude columns using a character vector

cols <- c("V2", "V3")
DT[, ..cols] # .. prefix means 'one-level up'
DT[, !..cols] # or DT[, -..cols]

##    V2  V3
## 1:  1 0.5
## 2:  2 1.0
## 3:  3 1.5
## 4:  4 0.5
## 5:  5 1.0
## 6:  6 1.5
## 7:  7 0.5
## 8:  8 1.0
## 9:  9 1.5
##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C

cols <- c("V2", "V3")
select(DF, !!cols) # unquoting
select(DF, -!!cols)

## # A tibble: 9 x 2
##      V2    V3
##   <int> <dbl>
## 1     1   0.5
## 2     2   1  
## 3     3   1.5
## 4     4   0.5
## 5     5   1  
## 6     6   1.5
## 7     7   0.5
## 8     8   1  
## 9     9   1.5
## # A tibble: 9 x 2
##      V1 V4   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     1 C    
## 4     2 A    
## 5     1 B    
## 6     2 C    
## 7     1 A    
## 8     2 B    
## 9     1 C

Other selections

As for row filtering, dplyr includes helper functions to select column. With data.table, a possible solution is to first retrieve the column names (e.g. using a regular expression), then select these columns. Another way (using patterns()) is presented in a next section.

cols <- paste0("V", 1:2)
cols <- union("V4", names(DT))
cols <- grep("V",   names(DT))
cols <- grep("3$",  names(DT))
cols <- grep(".2",  names(DT))
cols <- grep("^V1|X$",  names(DT))
cols <- grep("^(?!V2)", names(DT), perl = TRUE)
DT[, ..cols]

##    V1  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  1 1.5  C
## 4:  2 0.5  A
## 5:  1 1.0  B
## 6:  2 1.5  C
## 7:  1 0.5  A
## 8:  2 1.0  B
## 9:  1 1.5  C

select(DF, num_range("V", 1:2))
select(DF, V4, everything()) # reorder columns
select(DF, contains("V"))
select(DF, ends_with("3"))
select(DF, matches(".2"))
select(DF, one_of(c("V1", "X")))

## Warning: Unknown columns: `X`

select(DF, -starts_with("V2"))
# remove variables using "-" prior to function

## # A tibble: 9 x 2
##      V1    V2
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     1     3
## 4     2     4
## 5     1     5
## 6     2     6
## 7     1     7
## 8     2     8
## 9     1     9
## # A tibble: 9 x 4
##   V4       V1    V2    V3
##   <chr> <int> <int> <dbl>
## 1 A         1     1   0.5
## 2 B         2     2   1  
## 3 C         1     3   1.5
## 4 A         2     4   0.5
## 5 B         1     5   1  
## 6 C         2     6   1.5
## 7 A         1     7   0.5
## 8 B         2     8   1  
## 9 C         1     9   1.5
## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <int> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     2     2   1   B    
## 3     1     3   1.5 C    
## 4     2     4   0.5 A    
## 5     1     5   1   B    
## 6     2     6   1.5 C    
## 7     1     7   0.5 A    
## 8     2     8   1   B    
## 9     1     9   1.5 C    
## # A tibble: 9 x 1
##      V3
##   <dbl>
## 1   0.5
## 2   1  
## 3   1.5
## 4   0.5
## 5   1  
## 6   1.5
## 7   0.5
## 8   1  
## 9   1.5
## # A tibble: 9 x 1
##      V2
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## 6     6
## 7     7
## 8     8
## 9     9
## # A tibble: 9 x 1
##      V1
##   <int>
## 1     1
## 2     2
## 3     1
## 4     2
## 5     1
## 6     2
## 7     1
## 8     2
## 9     1
## # A tibble: 9 x 3
##      V1    V3 V4   
##   <int> <dbl> <chr>
## 1     1   0.5 A    
## 2     2   1   B    
## 3     1   1.5 C    
## 4     2   0.5 A    
## 5     1   1   B    
## 6     2   1.5 C    
## 7     1   0.5 A    
## 8     2   1   B    
## 9     1   1.5 C

Summarise data

Summarise one column

DT[, sum(V1)]    # returns a vector
DT[, .(sum(V1))] # returns a data.table
DT[, .(sumV1 = sum(V1))] # returns a data.table

## [1] 13
##    V1
## 1: 13
##    sumV1
## 1:    13

summarise(DF, sum(V1)) # returns a tibble
summarise(DF, sumV1 = sum(V1)) # returns a tibble

## # A tibble: 1 x 1
##   `sum(V1)`
##       <int>
## 1        13
## # A tibble: 1 x 1
##   sumV1
##   <int>
## 1    13

Summarise several columns

DT[, .(sum(V1), sd(V3))]

##    V1        V2
## 1: 13 0.4330127

summarise(DF, sum(V1), sd(V3))

## # A tibble: 1 x 2
##   `sum(V1)` `sd(V3)`
##       <int>    <dbl>
## 1        13    0.433

Summarise several columns and assign column names

DT[, .(sumv1 = sum(V1),
       sdv3  = sd(V3))]

##    sumv1      sdv3
## 1:    13 0.4330127

DF %>%
  summarise(sumv1 = sum(V1),
            sdv3  = sd(V3))

## # A tibble: 1 x 2
##   sumv1  sdv3
##   <int> <dbl>
## 1    13 0.433

Summarise a subset of rows

DT[1:4, sum(V1)]

## [1] 6

DF %>%
  slice(1:4) %>%
  summarise(sum(V1))

## # A tibble: 1 x 1
##   `sum(V1)`
##       <int>
## 1         6

dplyr helper functions for summarise() (or summarize()) include first(), last(), n(), nth(), and n_distinct(). The data.table package also include first(), last(), and uniqueN().

DT[, data.table::first(V3)]
DT[, data.table::last(V3)]
DT[5, V3]
DT[, uniqueN(V4)]
uniqueN(DT)

## [1] 0.5
## [1] 1.5
## [1] 1
## [1] 3
## [1] 9

summarise(DF, dplyr::first(V3))
summarise(DF, dplyr::last(V3))
summarise(DF, nth(V3, 5))
summarise(DF, n_distinct(V4))
n_distinct(DF)

## # A tibble: 1 x 1
##   `dplyr::first(V3)`
##                <dbl>
## 1                0.5
## # A tibble: 1 x 1
##   `dplyr::last(V3)`
##               <dbl>
## 1               1.5
## # A tibble: 1 x 1
##   `nth(V3, 5)`
##          <dbl>
## 1            1
## # A tibble: 1 x 1
##   `n_distinct(V4)`
##              <int>
## 1                3
## [1] 9

Add/update/delete columns

In the following commands, with data.table, columns are modified by reference using the column assignment symbol := (no copy performed) and the results are returned invisibly. With dplyr, we have to assign the results.

Modify a column

DT[, V1 := V1^2]
DT

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  4  2 1.0  B
## 3:  1  3 1.5  C
## 4:  4  4 0.5  A
## 5:  1  5 1.0  B
## 6:  4  6 1.5  C
## 7:  1  7 0.5  A
## 8:  4  8 1.0  B
## 9:  1  9 1.5  C

DF <- DF %>% mutate(V1 = V1^2)
DF

## # A tibble: 9 x 4
##      V1    V2    V3 V4   
##   <dbl> <int> <dbl> <chr>
## 1     1     1   0.5 A    
## 2     4     2   1   B    
## 3     1     3   1.5 C    
## 4     4     4   0.5 A    
## 5     1     5   1   B    
## 6     4     6   1.5 C    
## 7     1     7   0.5 A    
## 8     4     8   1   B    
## 9     1     9   1.5 C

Add one column

DT[, v5 := log(V1)][] # adding [] prints the result

##    V1 V2  V3 V4       v5
## 1:  1  1 0.5  A 0.000000
## 2:  4  2 1.0  B 1.386294
## 3:  1  3 1.5  C 0.000000
## 4:  4  4 0.5  A 1.386294
## 5:  1  5 1.0  B 0.000000
## 6:  4  6 1.5  C 1.386294
## 7:  1  7 0.5  A 0.000000
## 8:  4  8 1.0  B 1.386294
## 9:  1  9 1.5  C 0.000000

DF <- mutate(DF, v5 = log(V1))
# see ?mutate for options (.keep, .before, ...)

Add several columns

DT[, c("v6", "v7") := .(sqrt(V1), "X")]

DT[, ':='(v6 = sqrt(V1),
          v7 = "X")]     # same, functional form

DF <- mutate(DF, v6 = sqrt(V1), v7 = "X")

Create one column and remove the others

DT[, .(v8 = V3 + 1)]

##     v8
## 1: 1.5
## 2: 2.0
## 3: 2.5
## 4: 1.5
## 5: 2.0
## 6: 2.5
## 7: 1.5
## 8: 2.0
## 9: 2.5

transmute(DF, v8 = V3 + 1)

## # A tibble: 9 x 1
##      v8
##   <dbl>
## 1   1.5
## 2   2  
## 3   2.5
## 4   1.5
## 5   2  
## 6   2.5
## 7   1.5
## 8   2  
## 9   2.5

Remove one column

DT[, v5 := NULL]

DF <- select(DF, -v5)

Remove several columns

DT[, c("v6", "v7") := NULL]

DF <- select(DF, -v6, -v7)

Remove columns using a vector of colnames

cols <- c("V3")
DT[, (cols) := NULL] # ! not DT[, cols := NULL]

cols <- c("V3")
DF <- select(DF, -one_of(cols))

Replace values for rows matching a condition

DT[V2 < 4, V2 := 0L]
DT

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C

DF <- mutate(DF, V2 = base::replace(V2, V2 < 4, 0L))
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <dbl> <int> <chr>
## 1     1     0 A    
## 2     4     0 B    
## 3     1     0 C    
## 4     4     4 A    
## 5     1     5 B    
## 6     4     6 C    
## 7     1     7 A    
## 8     4     8 B    
## 9     1     9 C

by

The dplyr::group_by() function and the corresponding by and keyby statements in data.table allow to run manipulate each group of observations and combine the results. The sole difference between by and keyby is that keyby orders the results and creates a key that will allow faster subsetting (cf. the indexing and keys section). Below, we arbitrary use one or the other.
group_by() takes an existing tibble and converts it into a grouped tibble where operations will always be performed “by group”. Using ungroup() removes grouping. With data.table, by is always used on the fly.
Note that it is possible to reorder the arguments in data.table: DT[i, j, by] <=> DT[i, by, j]. This is done below to better highlight the similarity with dplyr.

By group

# one-liner:
DT[, .(sumV2 = sum(V2)), by = "V4"]
# reordered and indented:
DT[, by = V4,
     .(sumV2 = sum(V2))]
#

##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15
##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15

DF %>%
  group_by(V4) %>%
  summarise(sumV2 = sum(V2))

## `summarise()` ungrouping output (override with `.groups` argument)

# ungrouped output

## # A tibble: 3 x 2
##   V4    sumV2
##   <chr> <int>
## 1 A        11
## 2 B        13
## 3 C        15

By several groups

DT[, keyby = .(V4, V1),
     .(sumV2 = sum(V2))]

##    V4 V1 sumV2
## 1:  A  1     7
## 2:  A  4     4
## 3:  B  1     5
## 4:  B  4     8
## 5:  C  1     9
## 6:  C  4     6

DF %>%
  group_by(V4, V1) %>%
  summarise(sumV2 = sum(V2))

## `summarise()` regrouping output by 'V4' (override with `.groups` argument)

# output grouped by V4

## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1 sumV2
##   <chr> <dbl> <int>
## 1 A         1     7
## 2 A         4     4
## 3 B         1     5
## 4 B         4     8
## 5 C         1     9
## 6 C         4     6

Calling function in by

DT[, by = tolower(V4),
     .(sumV1 = sum(V1))]

##    tolower sumV1
## 1:       a     6
## 2:       b     9
## 3:       c     6

DF %>%
  group_by(tolower(V4)) %>%
  summarise(sumV1 = sum(V1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   `tolower(V4)` sumV1
##   <chr>         <dbl>
## 1 a                 6
## 2 b                 9
## 3 c                 6

Assigning column name in by

DT[, keyby = .(abc = tolower(V4)),
     .(sumV1 = sum(V1))]

##    abc sumV1
## 1:   a     6
## 2:   b     9
## 3:   c     6

DF %>%
  group_by(abc = tolower(V4)) %>%
  summarise(sumV1 = sum(V1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   abc   sumV1
##   <chr> <dbl>
## 1 a         6
## 2 b         9
## 3 c         6

Using a condition in by

DT[, keyby = V4 == "A",
     sum(V1)]

##       V4 V1
## 1: FALSE 15
## 2:  TRUE  6

DF %>%
  group_by(V4 == "A") %>%
  summarise(sum(V1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   `V4 == "A"` `sum(V1)`
##   <lgl>           <dbl>
## 1 FALSE              15
## 2 TRUE                6

By on a subset of rows

DT[1:5,                # i
   .(sumV1 = sum(V1)), # j
   by = V4]            # by
## complete DT[i, j, by] expression!

##    V4 sumV1
## 1:  A     5
## 2:  B     5
## 3:  C     1

DF %>%
  slice(1:5) %>%
  group_by(V4) %>%
  summarise(sumV1 = sum(V1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   V4    sumV1
##   <chr> <dbl>
## 1 A         5
## 2 B         5
## 3 C         1

Count number of observations for each group

DT[, .N, by = V4]

##    V4 N
## 1:  A 3
## 2:  B 3
## 3:  C 3

count(DF, V4)
DF %>%
  group_by(V4) %>%
  tally()
DF %>%
  group_by(V4) %>%
  summarise(n())

## `summarise()` ungrouping output (override with `.groups` argument)

DF %>%
  group_by(V4) %>%
  group_size() # returns a vector

## # A tibble: 3 x 2
##   V4        n
##   <chr> <int>
## 1 A         3
## 2 B         3
## 3 C         3
## # A tibble: 3 x 2
##   V4        n
##   <chr> <int>
## 1 A         3
## 2 B         3
## 3 C         3
## # A tibble: 3 x 2
##   V4    `n()`
##   <chr> <int>
## 1 A         3
## 2 B         3
## 3 C         3
## [1] 3 3 3

Add a column with number of observations for each group

DT[, n := .N, by = V1][]
DT[, n := NULL] # rm column for consistency

##    V1 V2 V4 n
## 1:  1  0  A 5
## 2:  4  0  B 4
## 3:  1  0  C 5
## 4:  4  4  A 4
## 5:  1  5  B 5
## 6:  4  6  C 4
## 7:  1  7  A 5
## 8:  4  8  B 4
## 9:  1  9  C 5

add_count(DF, V1)
DF %>%
  group_by(V1) %>%
  add_tally()

## # A tibble: 9 x 4
##      V1    V2 V4        n
##   <dbl> <int> <chr> <int>
## 1     1     0 A         5
## 2     4     0 B         4
## 3     1     0 C         5
## 4     4     4 A         4
## 5     1     5 B         5
## 6     4     6 C         4
## 7     1     7 A         5
## 8     4     8 B         4
## 9     1     9 C         5
## # A tibble: 9 x 4
## # Groups:   V1 [2]
##      V1    V2 V4        n
##   <dbl> <int> <chr> <int>
## 1     1     0 A         5
## 2     4     0 B         4
## 3     1     0 C         5
## 4     4     4 A         4
## 5     1     5 B         5
## 6     4     6 C         4
## 7     1     7 A         5
## 8     4     8 B         4
## 9     1     9 C         5

Retrieve the first/last/nth observation for each group

DT[, data.table::first(V2), by = V4]
DT[, data.table::last(V2), by = V4]
DT[, V2[2], by = V4]

##    V4 V1
## 1:  A  0
## 2:  B  0
## 3:  C  0
##    V4 V1
## 1:  A  7
## 2:  B  8
## 3:  C  9
##    V4 V1
## 1:  A  4
## 2:  B  5
## 3:  C  6

DF %>%
  group_by(V4) %>%
  summarise(dplyr::first(V2))

## `summarise()` ungrouping output (override with `.groups` argument)

DF %>%
  group_by(V4) %>%
  summarise(dplyr::last(V2))

## `summarise()` ungrouping output (override with `.groups` argument)

DF %>%
  group_by(V4) %>%
  summarise(dplyr::nth(V2, 2))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   V4    `dplyr::first(V2)`
##   <chr>              <int>
## 1 A                      0
## 2 B                      0
## 3 C                      0
## # A tibble: 3 x 2
##   V4    `dplyr::last(V2)`
##   <chr>             <int>
## 1 A                     7
## 2 B                     8
## 3 C                     9
## # A tibble: 3 x 2
##   V4    `dplyr::nth(V2, 2)`
##   <chr>               <int>
## 1 A                       4
## 2 B                       5
## 3 C                       6

Going further

Advanced columns manipulation

To manipulate multiple columns, dplyr_1.0.0 has introduced the across() function, superseding the _all, _at, and _if versions of summarise(), mutate(), and transmute().
With data.table, we use .SD, which is a data.table containing the Subset of Data for each group, excluding the column(s) used in by. So, DT[, .SD] is DT itself and in the expression DT[, .SD, by = V4], .SD contains all the DT columns (except V4) for each values in V4 (see DT[, print(.SD), by = V4]). .SDcols allows to select the columns included in .SD.

Summarise all the columns

DT[, lapply(.SD, max)]

##    V1 V2 V4
## 1:  4  9  C

DF %>% summarise(across(everything(), max))
# summarise_all(DF, max)

## # A tibble: 1 x 3
##      V1    V2 V4   
##   <dbl> <int> <chr>
## 1     4     9 C

Summarise several columns

DT[, lapply(.SD, mean),
     .SDcols = c("V1", "V2")]

##          V1       V2
## 1: 2.333333 4.333333

DF %>% summarise(across(c(V1, V2), mean))
# summarise_at(DF, c("V1", "V2"), mean)

## # A tibble: 1 x 2
##      V1    V2
##   <dbl> <dbl>
## 1  2.33  4.33

Summarise several columns by group

DT[, by = V4,
     lapply(.SD, mean),
     .SDcols = c("V1", "V2")]
## using patterns (regex)
DT[, by = V4,
     lapply(.SD, mean),
     .SDcols = patterns("V1|V2|Z0")]

##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000
##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000

DF %>%
  group_by(V4) %>%
  summarise(across(c(V1, V2), mean))

## `summarise()` ungrouping output (override with `.groups` argument)

# DF %>%
#   group_by(V4) %>%
#   summarise_at(c("V1", "V2"), mean)
DF %>%
  group_by(V4) %>%
  summarise(across(any_of(c("V1", "V2", "Z0")), mean))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         2  3.67
## 2 B         3  4.33
## 3 C         2  5   
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         2  3.67
## 2 B         3  4.33
## 3 C         2  5

Summarise with more than one function by group

DT[, by = V4,
     c(lapply(.SD, sum),
       lapply(.SD, mean))]

##    V4 V1 V2 V1       V2
## 1:  A  6 11  2 3.666667
## 2:  B  9 13  3 4.333333
## 3:  C  6 15  2 5.000000

DF %>%
  group_by(V4) %>%
  summarise(across(everything(),
                   list(sum = sum, mean = mean)))

## `summarise()` ungrouping output (override with `.groups` argument)

# columns named automatically
# DF %>%
#   group_by(V4) %>%
#   summarise_all(list(sum, mean))

## # A tibble: 3 x 5
##   V4    V1_sum V1_mean V2_sum V2_mean
##   <chr>  <dbl>   <dbl>  <int>   <dbl>
## 1 A          6       2     11    3.67
## 2 B          9       3     13    4.33
## 3 C          6       2     15    5

Summarise using a condition

DT[, lapply(.SD, mean),
     .SDcols = is.numeric]
foo <- function(x) {is.numeric(x) && mean(x) > 3}
DT[, lapply(.SD, mean),
   .SDcols = foo]

##          V1       V2
## 1: 2.333333 4.333333
##          V2
## 1: 4.333333

DF %>%
  summarise(across(where(is.numeric),
                   mean))
# summarise_if(DF, is.numeric, mean)
DF %>% summarise(across(where(~ is.numeric(.x) && mean(.x) > 3), 
                        mean))

## # A tibble: 1 x 2
##      V1    V2
##   <dbl> <dbl>
## 1  2.33  4.33
## # A tibble: 1 x 1
##      V2
##   <dbl>
## 1  4.33

Modify all the columns

DT[, lapply(.SD, rev)]

##    V1 V2 V4
## 1:  1  9  C
## 2:  4  8  B
## 3:  1  7  A
## 4:  4  6  C
## 5:  1  5  B
## 6:  4  4  A
## 7:  1  0  C
## 8:  4  0  B
## 9:  1  0  A

DF %>%
  mutate(across(everything(),
                rev))
# mutate_all(DF, rev)
# transmute_all(DF, rev)

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <dbl> <int> <chr>
## 1     1     9 C    
## 2     4     8 B    
## 3     1     7 A    
## 4     4     6 C    
## 5     1     5 B    
## 6     4     4 A    
## 7     1     0 C    
## 8     4     0 B    
## 9     1     0 A

Modify several columns (dropping the others)

DT[, lapply(.SD, sqrt),
     .SDcols = V1:V2]
DT[, lapply(.SD, exp),
     .SDcols = !"V4"]

##    V1       V2
## 1:  1 0.000000
## 2:  2 0.000000
## 3:  1 0.000000
## 4:  2 2.000000
## 5:  1 2.236068
## 6:  2 2.449490
## 7:  1 2.645751
## 8:  2 2.828427
## 9:  1 3.000000
##           V1         V2
## 1:  2.718282    1.00000
## 2: 54.598150    1.00000
## 3:  2.718282    1.00000
## 4: 54.598150   54.59815
## 5:  2.718282  148.41316
## 6: 54.598150  403.42879
## 7:  2.718282 1096.63316
## 8: 54.598150 2980.95799
## 9:  2.718282 8103.08393

DF %>% transmute(across(c(V1, V2),
                 sqrt))
# transmute_at(DF, c("V1", "V2"), sqrt)
DF %>% transmute(across(-any_of("V4"),
                 exp))
# transmute_at(DF, vars(-V4), exp)

## # A tibble: 9 x 2
##      V1    V2
##   <dbl> <dbl>
## 1     1  0   
## 2     2  0   
## 3     1  0   
## 4     2  2   
## 5     1  2.24
## 6     2  2.45
## 7     1  2.65
## 8     2  2.83
## 9     1  3   
## # A tibble: 9 x 2
##      V1     V2
##   <dbl>  <dbl>
## 1  2.72    1  
## 2 54.6     1  
## 3  2.72    1  
## 4 54.6    54.6
## 5  2.72  148. 
## 6 54.6   403. 
## 7  2.72 1097. 
## 8 54.6  2981. 
## 9  2.72 8103.

Modify several columns (keeping the others)

DT[, c("V1", "V2") := lapply(.SD, sqrt),
     .SDcols = c("V1", "V2")]

cols <- setdiff(names(DT), "V4")
DT[, (cols) := lapply(.SD, "^", 2L),
     .SDcols = cols]
DT

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C

DF <- DF %>%
  mutate(across(all_of(c("V1", "V2")), sqrt))
# DF <- mutate_at(DF, c("V1", "V2"), sqrt)
DF <- DF %>%
  mutate(across(-any_of("V4"),
                 ~ "^"(.x, 2L)))
DF
# DF <- mutate_at(DF, vars(-V4), "^", 2L)

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <dbl> <dbl> <chr>
## 1     1  0    A    
## 2     4  0    B    
## 3     1  0    C    
## 4     4  4    A    
## 5     1  5.   B    
## 6     4  6.00 C    
## 7     1  7.   A    
## 8     4  8.   B    
## 9     1  9    C

Modify columns using a condition (dropping the others)

DT[, .SD - 1,
     .SDcols = is.numeric]
DT

##    V1 V2
## 1:  0 -1
## 2:  3 -1
## 3:  0 -1
## 4:  3  3
## 5:  0  4
## 6:  3  5
## 7:  0  6
## 8:  3  7
## 9:  0  8
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C

DF %>%
  transmute(across(where(is.numeric),
                   ~ '-'(., 1L)))
DF
# transmute_if(DF, is.numeric, list(~ '-'(., 1L)))

## # A tibble: 9 x 2
##      V1    V2
##   <dbl> <dbl>
## 1     0 -1   
## 2     3 -1   
## 3     0 -1   
## 4     3  3   
## 5     0  4.  
## 6     3  5.00
## 7     0  6.  
## 8     3  7.  
## 9     0  8   
## # A tibble: 9 x 3
##      V1    V2 V4   
##   <dbl> <dbl> <chr>
## 1     1  0    A    
## 2     4  0    B    
## 3     1  0    C    
## 4     4  4    A    
## 5     1  5.   B    
## 6     4  6.00 C    
## 7     1  7.   A    
## 8     4  8.   B    
## 9     1  9    C

Modify columns using a condition (keeping the others)

DT[, (cols) := lapply(.SD, as.integer),
     .SDcols = is.numeric]

DF <- DF %>%
  mutate(across(where(is.numeric),
         as.integer))
# DF <- mutate_if(DF, is.numeric, as.integer)

The use of DT[,j] is very flexible, allowing to pass complex expressions in a straightforward way, or combine expressions with multiple outputs.

Use a complex expression

DT[, by = V4,
     .(V1[1:2], "X")]

##    V4 V1 V2
## 1:  A  1  X
## 2:  A  4  X
## 3:  B  4  X
## 4:  B  1  X
## 5:  C  1  X
## 6:  C  4  X

DF %>%
  group_by(V4) %>%
  slice(1:2) %>%
  transmute(V1 = V1,
            V2 = "X")

## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1 V2   
##   <chr> <int> <chr>
## 1 A         1 X    
## 2 A         4 X    
## 3 B         4 X    
## 4 B         1 X    
## 5 C         1 X    
## 6 C         4 X

Use multiple expressions (with `DT[,{j}]`)

DT[, {print(V1) #  comments here!
      print(summary(V1))
      x <- V1 + sum(V2)
     .(A = 1:.N, B = x) # last list returned as a data.table
     }]

## [1] 1 4 1 4 1 4 1 4 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.333   4.000   4.000 
##    A  B
## 1: 1 39
## 2: 2 42
## 3: 3 39
## 4: 4 42
## 5: 5 39
## 6: 6 42
## 7: 7 39
## 8: 8 42
## 9: 9 39

Chain expressions

The dplyr workflow relies on the magrittr pipe operator (%>%). The magrittr package can also be used with data.table objects, but data.table comes with its own chaining system: DT[ ... ][ ... ][ ... ].

Expression chaining using `DT[][]` (recommended)

DT[, by = V4, 
     .(V1sum = sum(V1)) ][
     V1sum > 5]

##    V4 V1sum
## 1:  A     6
## 2:  B     9
## 3:  C     6

DF %>%
  group_by(V4) %>%
  summarise(V1sum = sum(V1)) %>%
  filter(V1sum > 5)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   V4    V1sum
##   <chr> <int>
## 1 A         6
## 2 B         9
## 3 C         6

Expression chaining using `%>%`

DT[, by = V4, 
     .(V1sum = sum(V1))] %>%
  .[order(-V1sum)]

##    V4 V1sum
## 1:  B     9
## 2:  A     6
## 3:  C     6

DF %>%
  group_by(V4) %>%
  summarise(V1sum = sum(V1)) %>%
  arrange(desc(V1sum))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   V4    V1sum
##   <chr> <int>
## 1 B         9
## 2 A         6
## 3 C         6

Indexing and Keys

Row subsetting in dplyr relies on the filter() and slice() functions, as shown in the first section. With data.table, in addition to the above-mentioned approach, two systems are available to make row filtering and join operations more convenient and blazingly fast (~170x speed-up): keys (primary ordered index) and indices (automatic secondary indexing).

The main differences between keys and indices are:
- When using keys, data are physically reordered in memory. When using indices, the order is stored as an attribute.
- Only one key is possible but several indices can coexist.
- Keys are defined explicitly. Indices can be created manually but are also created on-the-fly (and stored when using == or %in%).
- Indices are used with the on argument. It is optional when using keys, but recommended (and used below) for better readability.
Note: in the following code, we set both a key and an index to demonstrate their usage, but internally, indices are not used when a key already exists for the same columns(s).

Set the key/index

setkey(DT, V4)
setindex(DT, V4)
DT

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  5  C
## 9:  1  9  C

DF <- arrange(DF, V4) # ordered just for consistency
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A    
## 4     4     0 B    
## 5     1     5 B    
## 6     4     8 B    
## 7     1     0 C    
## 8     4     5 C    
## 9     1     9 C

Select the matching rows

DT["A", on = "V4"]
DT[c("A", "C"), on = .(V4)] # same as on = "V4"

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  1  0  C
## 5:  4  5  C
## 6:  1  9  C

filter(DF, V4 == "A")
filter(DF, V4 %in% c("A", "C"))

## # A tibble: 3 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A    
## # A tibble: 6 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A    
## 4     1     0 C    
## 5     4     5 C    
## 6     1     9 C

Select the first matching row

DT["B", on = "V4", mult = "first"]
DT[c("B", "C"), on = "V4", mult = "first"]

##    V1 V2 V4
## 1:  4  0  B
##    V1 V2 V4
## 1:  4  0  B
## 2:  1  0  C

DF %>%
  filter(V4 == "B") %>%
  slice(1)
# ?

## # A tibble: 1 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     4     0 B

Select the last matching row

DT["A", on = "V4", mult = "last"]

##    V1 V2 V4
## 1:  1  7  A

DF %>%
  filter(V4 == "A") %>%
  slice(n())

## # A tibble: 1 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     7 A

Nomatch argument

# (default) returns a row with "D" even if not found
DT[c("A", "D"), on = "V4", nomatch = NA]
# no rows for unmatched values
DT[c("A", "D"), on = "V4", nomatch = 0]

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4: NA NA  D
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A

#
filter(DF, V4 %in% c("A", "D"))

## # A tibble: 3 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 A    
## 2     4     4 A    
## 3     1     7 A

Apply a function on the matching rows

DT[c("A", "C"), sum(V1), on = "V4"]

## [1] 12

DF %>%
  filter(V4 %in% c("A", "C")) %>%
  summarise(sum(V1))

## # A tibble: 1 x 1
##   `sum(V1)`
##       <int>
## 1        12

Modify values for matching rows

DT["A", V1 := 0, on = "V4"]
DT

##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  5  C
## 9:  1  9  C

DF <- DF %>%
  mutate(V1 = base::replace(V1, V4 == "A", 0L)) %>%
  arrange(V4)
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     0 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     4     0 B    
## 5     1     5 B    
## 6     4     8 B    
## 7     1     0 C    
## 8     4     5 C    
## 9     1     9 C

Use keys in by

DT[!"B", sum(V1), on = "V4", by = .EACHI]
DT[V4 != "B",
   by = V4,
   sum(V1)]   # same

##    V4 V1
## 1:  A  0
## 2:  C  6
##    V4 V1
## 1:  A  0
## 2:  C  6

DF %>%
  filter(V4 != "B") %>%
  group_by(V4) %>%
  summarise(sum(V1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   V4    `sum(V1)`
##   <chr>     <int>
## 1 A             0
## 2 C             6

Set keys/indices for multiple columns

setkey(DT, V4, V1) # or setkeyv(DT, c("V4", "V1"))
setindex(DT, V4, V1) # setindexv(DT, c("V4", "V1"))
DT

##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C

DF <- arrange(DF, V4, V1) # ordered just for consistency
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     0 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     1     5 B    
## 5     4     0 B    
## 6     4     8 B    
## 7     1     0 C    
## 8     1     9 C    
## 9     4     5 C

Subset using multiple keys/indices

DT[.("C", 1), on = .(V4, V1)]
DT[.(c("B", "C"), 1), on = .(V4, V1)]
# using which = TRUE only returns the matching rows indices
DT[.(c("B", "C"), 1), on = .(V4, V1), which = TRUE]

##    V1 V2 V4
## 1:  1  0  C
## 2:  1  9  C
##    V1 V2 V4
## 1:  1  5  B
## 2:  1  0  C
## 3:  1  9  C
## [1] 4 7 8

filter(DF, V1 == 1, V4 == "C")
filter(DF, V1 == 1, V4 %in% c("B", "C"))
# ?

## # A tibble: 2 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     0 C    
## 2     1     9 C    
## # A tibble: 3 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     1     5 B    
## 2     1     0 C    
## 3     1     9 C

Remove keys/indices

setkey(DT, NULL)
setindex(DT, NULL)
DT

##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C

# 
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     0 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     1     5 B    
## 5     4     0 B    
## 6     4     8 B    
## 7     1     0 C    
## 8     1     9 C    
## 9     4     5 C

`set*()` modifications

In data.table, set*() functions modify objects by reference, making these operations fast and memory-efficient. In case this is not a desired behaviour, users can use copy(). The corresponding expressions in dplyr will be less memory-efficient.

Replace values

set(DT, i = 1L, j = 2L, value = 3L)
DT

##    V1 V2 V4
## 1:  0  3  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C

DF[1, 2] <- 3L
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     3 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     1     5 B    
## 5     4     0 B    
## 6     4     8 B    
## 7     1     0 C    
## 8     1     9 C    
## 9     4     5 C

Reorder rows

setorder(DT, V4, -V1)
setorderv(DT, c("V4", "V1"), c(1, -1))

DF <- arrange(DF, V4, desc(V1))

Modify colnames

setnames(DT, old = "V2", new = "v2")
setnames(DT, old = -(c(1, 3)), new = "V2")
DT

##    V1 V2 V4
## 1:  0  3  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  4  8  B
## 6:  1  5  B
## 7:  4  5  C
## 8:  1  0  C
## 9:  1  9  C

DF <- rename(DF, v2 = V2)
DF <- rename(DF, V2 = v2) # reset upper
DF

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     3 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     4     0 B    
## 5     4     8 B    
## 6     1     5 B    
## 7     4     5 C    
## 8     1     0 C    
## 9     1     9 C

Reorder columns

setcolorder(DT, c("V4", "V1", "V2"))
DT

##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9

DF <- select(DF, V4, V1, V2)
DF

## # A tibble: 9 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 4 B         4     0
## 5 B         4     8
## 6 B         1     5
## 7 C         4     5
## 8 C         1     0
## 9 C         1     9

Relocate columns

# coming soon (#4358)

DF %>%
  relocate(V4, .after = V2)

## # A tibble: 9 x 3
##      V1    V2 V4   
##   <int> <int> <chr>
## 1     0     3 A    
## 2     0     4 A    
## 3     0     7 A    
## 4     4     0 B    
## 5     4     8 B    
## 6     1     5 B    
## 7     4     5 C    
## 8     1     0 C    
## 9     1     9 C

Convert data

# ?setDT # data.frame or list to data.table
# ?setDF # data.table to data.frame
# ?setattr # modify attributes

Advanced use of by

Context dependent expressions (cur_*()) have been introduced in dplyr 1.0.0, reflecting data.table aliases .I, .GRP, …

Select first/last/… row by group

DT[, .SD[1], by = V4]
DT[, .SD[c(1, .N)], by = V4]
DT[, tail(.SD, 2), by = V4]

##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  4  5
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  7
## 3:  B  4  0
## 4:  B  1  5
## 5:  C  4  5
## 6:  C  1  9
##    V4 V1 V2
## 1:  A  0  4
## 2:  A  0  7
## 3:  B  4  8
## 4:  B  1  5
## 5:  C  1  0
## 6:  C  1  9

DF %>%
  group_by(V4) %>%
  slice(1)
DF %>%
  group_by(V4) %>%
  slice(1, n())
DF %>%
  group_by(V4) %>%
  group_map(~ tail(.x, 2))

## # A tibble: 3 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 B         4     0
## 3 C         4     5
## # A tibble: 6 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 A         0     7
## 3 B         4     0
## 4 B         1     5
## 5 C         4     5
## 6 C         1     9
## [[1]]
## # A tibble: 2 x 2
##      V1    V2
##   <int> <int>
## 1     0     4
## 2     0     7
## 
## [[2]]
## # A tibble: 2 x 2
##      V1    V2
##   <int> <int>
## 1     4     8
## 2     1     5
## 
## [[3]]
## # A tibble: 2 x 2
##      V1    V2
##   <int> <int>
## 1     1     0
## 2     1     9

Select rows using a nested query

DT[, .SD[which.min(V2)], by = V4]

##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  1  0

DF %>%
  group_by(V4) %>%
  arrange(V2) %>%
  slice(1)

## # A tibble: 3 x 3
## # Groups:   V4 [3]
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 B         4     0
## 3 C         1     0

Add a group counter column

DT[, Grp := .GRP, by = .(V4, V1)][]
DT[, Grp := NULL] # delete for consistency

##    V4 V1 V2 Grp
## 1:  A  0  3   1
## 2:  A  0  4   1
## 3:  A  0  7   1
## 4:  B  4  0   2
## 5:  B  4  8   2
## 6:  B  1  5   3
## 7:  C  4  5   4
## 8:  C  1  0   5
## 9:  C  1  9   5

DF %>%
  group_by(V4, V1) %>%
  mutate(Grp = cur_group_id())
# DF %>% mutate(Grp = group_indices(., V4, V1))

## # A tibble: 9 x 4
## # Groups:   V4, V1 [5]
##   V4       V1    V2   Grp
##   <chr> <int> <int> <int>
## 1 A         0     3     1
## 2 A         0     4     1
## 3 A         0     7     1
## 4 B         4     0     3
## 5 B         4     8     3
## 6 B         1     5     2
## 7 C         4     5     5
## 8 C         1     0     4
## 9 C         1     9     4

Get row number of first (and last) observation by group

DT[, .I, by = V4] # returns a data.table
DT[, .I[1], by = V4]
DT[, .I[c(1, .N)], by = V4]

##    V4 I
## 1:  A 1
## 2:  A 2
## 3:  A 3
## 4:  B 4
## 5:  B 5
## 6:  B 6
## 7:  C 7
## 8:  C 8
## 9:  C 9
##    V4 V1
## 1:  A  1
## 2:  B  4
## 3:  C  7
##    V4 V1
## 1:  A  1
## 2:  A  3
## 3:  B  4
## 4:  B  6
## 5:  C  7
## 6:  C  9

DF %>%
  group_by(V4) %>%
  mutate(cur_group_rows())
#DF %>%
#  group_by(V4) %>%
#  group_data() %>%
#  tidyr::unnest(.rows)
DF %>%
  group_by(V4) %>%
  summarize(cur_group_rows()[1])

## `summarise()` ungrouping output (override with `.groups` argument)

DF %>%
  group_by(V4) %>%
  summarize(cur_group_rows()[c(1, n())])

## `summarise()` regrouping output by 'V4' (override with `.groups` argument)

# DF %>% group_by(V4) %>% group_rows() # returns a list

## # A tibble: 9 x 4
## # Groups:   V4 [3]
##   V4       V1    V2 `cur_group_rows()`
##   <chr> <int> <int>              <int>
## 1 A         0     3                  1
## 2 A         0     4                  2
## 3 A         0     7                  3
## 4 B         4     0                  4
## 5 B         4     8                  5
## 6 B         1     5                  6
## 7 C         4     5                  7
## 8 C         1     0                  8
## 9 C         1     9                  9
## # A tibble: 3 x 2
##   V4    `cur_group_rows()[1]`
##   <chr>                 <int>
## 1 A                         1
## 2 B                         4
## 3 C                         7
## # A tibble: 6 x 2
## # Groups:   V4 [3]
##   V4    `cur_group_rows()[c(1, n())]`
##   <chr>                         <int>
## 1 A                                 1
## 2 A                                 3
## 3 B                                 4
## 4 B                                 6
## 5 C                                 7
## 6 C                                 9

Handle list-columns by group

DT[, .(.(V1)),  by = V4]  # return V1 as a list
DT[, .(.(.SD)), by = V4] # subsets of the data

##    V4    V1
## 1:  A 0,0,0
## 2:  B 4,4,1
## 3:  C 4,1,1
##    V4                V1
## 1:  A <data.table[3x2]>
## 2:  B <data.table[3x2]>
## 3:  C <data.table[3x2]>

DF %>%
  group_by(V4) %>%
  summarise(list(V1))

## `summarise()` ungrouping output (override with `.groups` argument)

DF %>%
  group_by(V4) %>%
  group_nest()

## # A tibble: 3 x 2
##   V4    `list(V1)`
##   <chr> <list>    
## 1 A     <int [3]> 
## 2 B     <int [3]> 
## 3 C     <int [3]> 
## # A tibble: 3 x 2
##   V4                  data
##   <chr> <list<tbl_df[,2]>>
## 1 A                [3 × 2]
## 2 B                [3 × 2]
## 3 C                [3 × 2]

Grouping sets (multiple by at once)

rollup(DT,
       .(SumV2 = sum(V2)),
       by = c("V1", "V4"))

rollup(DT,
       .(SumV2 = sum(V2), .N),
       by = c("V1", "V4"),
       id = TRUE)

cube(DT,
     .(SumV2 = sum(V2), .N),
     by = c("V1", "V4"),
     id = TRUE)

groupingsets(DT,
             .(SumV2 = sum(V2), .N),
             by   = c("V1", "V4"),
             sets = list("V1", c("V1", "V4")),
             id   = TRUE)

##    V1   V4 SumV2
## 1:  0    A    14
## 2:  4    B     8
## 3:  1    B     5
## 4:  4    C     5
## 5:  1    C     9
## 6:  0 <NA>    14
## 7:  4 <NA>    13
## 8:  1 <NA>    14
## 9: NA <NA>    41
##    grouping V1   V4 SumV2 N
## 1:        0  0    A    14 3
## 2:        0  4    B     8 2
## 3:        0  1    B     5 1
## 4:        0  4    C     5 1
## 5:        0  1    C     9 2
## 6:        1  0 <NA>    14 3
## 7:        1  4 <NA>    13 3
## 8:        1  1 <NA>    14 3
## 9:        3 NA <NA>    41 9
##     grouping V1   V4 SumV2 N
##  1:        0  0    A    14 3
##  2:        0  4    B     8 2
##  3:        0  1    B     5 1
##  4:        0  4    C     5 1
##  5:        0  1    C     9 2
##  6:        1  0 <NA>    14 3
##  7:        1  4 <NA>    13 3
##  8:        1  1 <NA>    14 3
##  9:        2 NA    A    14 3
## 10:        2 NA    B    13 3
## 11:        2 NA    C    14 3
## 12:        3 NA <NA>    41 9
##    grouping V1   V4 SumV2 N
## 1:        1  0 <NA>    14 3
## 2:        1  4 <NA>    13 3
## 3:        1  1 <NA>    14 3
## 4:        0  0    A    14 3
## 5:        0  4    B     8 2
## 6:        0  1    B     5 1
## 7:        0  4    C     5 1
## 8:        0  1    C     9 2

Miscellaneous

Read / Write data

fread() and fwrite() are among the most powerful functions of data.table. They are not only incredibly fast (see benchmarks), they are also extremely robust. The few commands below only scratch the surface and there are a lot of awesome features. For example, fread() accepts http and https URLs directly as well as operating system commands such as sed and awk output. Make sure to check the docs.
Here again, fread() and fwrite() are very versatile and allow to handle different file formats while dplyr delegates file reading and writing to the readr package with several specific functions (csv, tsv, delim, …).

Write data to a csv file

fwrite(DT, "DT.csv")

readr::write_csv(DF, "DF.csv")
# see also vroom

Write data to a tab-delimited file

fwrite(DT, "DT.txt", sep = "\t")

readr::write_delim(DF, "DF.txt", delim = "\t")

Write list-column data to a csv file

fwrite(setDT(list(0, list(1:5))), "DT2.csv")

Read a csv / tab-delimited file

fread("DT.csv")
# fread("DT.csv", verbose = TRUE) # full details
fread("DT.txt", sep = "\t")

##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9

readr::read_csv("DF.csv")

## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )

readr::read_delim("DF.txt", delim = "\t")

## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )

## # A tibble: 9 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 4 B         4     0
## 5 B         4     8
## 6 B         1     5
## 7 C         4     5
## 8 C         1     0
## 9 C         1     9
## # A tibble: 9 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 4 B         4     0
## 5 B         4     8
## 6 B         1     5
## 7 C         4     5
## 8 C         1     0
## 9 C         1     9

Read a csv file selecting / droping columns

fread("DT.csv", select = c("V1", "V4"))
fread("DT.csv", drop = "V4")

##    V1 V4
## 1:  0  A
## 2:  0  A
## 3:  0  A
## 4:  4  B
## 5:  4  B
## 6:  1  B
## 7:  4  C
## 8:  1  C
## 9:  1  C
##    V1 V2
## 1:  0  3
## 2:  0  4
## 3:  0  7
## 4:  4  0
## 5:  4  8
## 6:  1  5
## 7:  4  5
## 8:  1  0
## 9:  1  9

# NA

Read and rbind several files

rbindlist(lapply(c("DT.csv", "DT.csv"), fread))
# c("DT.csv", "DT.csv") %>% lapply(fread) %>% rbindlist

##     V4 V1 V2
##  1:  A  0  3
##  2:  A  0  4
##  3:  A  0  7
##  4:  B  4  0
##  5:  B  4  8
##  6:  B  1  5
##  7:  C  4  5
##  8:  C  1  0
##  9:  C  1  9
## 10:  A  0  3
## 11:  A  0  4
## 12:  A  0  7
## 13:  B  4  0
## 14:  B  4  8
## 15:  B  1  5
## 16:  C  4  5
## 17:  C  1  0
## 18:  C  1  9

c("DF.csv", "DF.csv") %>%
  purrr::map_dfr(readr::read_csv)

## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )
## Parsed with column specification:
## cols(
##   V4 = col_character(),
##   V1 = col_double(),
##   V2 = col_double()
## )

## # A tibble: 18 x 3
##    V4       V1    V2
##    <chr> <dbl> <dbl>
##  1 A         0     3
##  2 A         0     4
##  3 A         0     7
##  4 B         4     0
##  5 B         4     8
##  6 B         1     5
##  7 C         4     5
##  8 C         1     0
##  9 C         1     9
## 10 A         0     3
## 11 A         0     4
## 12 A         0     7
## 13 B         4     0
## 14 B         4     8
## 15 B         1     5
## 16 C         4     5
## 17 C         1     0
## 18 C         1     9

Reshape data

This part is still a bit clunky. I need to find better examples. See here and here for more details.

Melt data (from wide to long)

melt(DT, id.vars = "V4")
mDT <- melt(DT,
            id.vars       = "V4",
            measure.vars  = c("V1", "V2"),
            variable.name = "Variable",
            value.name    = "Value")

##     V4 variable value
##  1:  A       V1     0
##  2:  A       V1     0
##  3:  A       V1     0
##  4:  B       V1     4
##  5:  B       V1     4
##  6:  B       V1     1
##  7:  C       V1     4
##  8:  C       V1     1
##  9:  C       V1     1
## 10:  A       V2     3
## 11:  A       V2     4
## 12:  A       V2     7
## 13:  B       V2     0
## 14:  B       V2     8
## 15:  B       V2     5
## 16:  C       V2     5
## 17:  C       V2     0
## 18:  C       V2     9

tidyr::gather(DF, variable, value, -V4)
mDF <- tidyr::gather(DF,
                     key = Variable,
                     value = Value,
                     -V4)
# pivot_longer todo

## # A tibble: 18 x 3
##    V4    variable value
##    <chr> <chr>    <int>
##  1 A     V1           0
##  2 A     V1           0
##  3 A     V1           0
##  4 B     V1           4
##  5 B     V1           4
##  6 B     V1           1
##  7 C     V1           4
##  8 C     V1           1
##  9 C     V1           1
## 10 A     V2           3
## 11 A     V2           4
## 12 A     V2           7
## 13 B     V2           0
## 14 B     V2           8
## 15 B     V2           5
## 16 C     V2           5
## 17 C     V2           0
## 18 C     V2           9

Cast data (from long to wide)

dcast(mDT, V4 ~ Variable) # aggregate by count

## Using 'Value' as value column. Use 'value.var' to override

## Aggregate function missing, defaulting to 'length'

dcast(mDT, V4 ~ Variable, fun.aggregate = sum)

## Using 'Value' as value column. Use 'value.var' to override

dcast(mDT, V4 ~ Value > 5)

## Using 'Value' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'

# see ?dcast: multiple values / fun.aggregate

##    V4 V1 V2
## 1:  A  3  3
## 2:  B  3  3
## 3:  C  3  3
##    V4 V1 V2
## 1:  A  0 14
## 2:  B  9 13
## 3:  C  6 14
##    V4 FALSE TRUE
## 1:  A     5    1
## 2:  B     5    1
## 3:  C     5    1

tidyr::spread(data  = count(mDF, V4, Variable),
              key   = Variable,
              value = n,
              fill  = 0)
# pivot_wider todo

## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <dbl> <dbl>
## 1 A         3     3
## 2 B         3     3
## 3 C         3     3

Split

split(DT, by = "V4") # S3 method

## $A
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 
## $B
##    V4 V1 V2
## 1:  B  4  0
## 2:  B  4  8
## 3:  B  1  5
## 
## $C
##    V4 V1 V2
## 1:  C  4  5
## 2:  C  1  0
## 3:  C  1  9

group_split(DF, V4)

## <list_of<
##   tbl_df<
##     V4: character
##     V1: integer
##     V2: integer
##   >
## >[3]>
## [[1]]
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 A         0     3
## 2 A         0     4
## 3 A         0     7
## 
## [[2]]
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 B         4     0
## 2 B         4     8
## 3 B         1     5
## 
## [[3]]
## # A tibble: 3 x 3
##   V4       V1    V2
##   <chr> <int> <int>
## 1 C         4     5
## 2 C         1     0
## 3 C         1     9

Split and transpose a vector/column

vec <- c("A:a", "B:b", "C:c")
tstrsplit(vec, split = ":", keep = 2L) # works on vector
setDT(tstrsplit(vec, split = ":"))[]

## [[1]]
## [1] "a" "b" "c"
## 
##    V1 V2
## 1:  A  a
## 2:  B  b
## 3:  C  c

vec <- c("A:a", "B:b", "C:c")
# vector not handled
tidyr::separate(tibble(vec), vec, c("V1", "V2"))

## # A tibble: 3 x 2
##   V1    V2   
##   <chr> <chr>
## 1 A     a    
## 2 B     b    
## 3 C     c

Other

Check package installation

# test.data.table()
# There's more lines of test code in data.table than there is code!

List data.tables/tibbles

tables()

##    NAME NROW NCOL MB              COLS KEY
## 1:   DT    9    3  0          V4,V1,V2    
## 2:  mDT   18    3  0 V4,Variable,Value    
## Total: 0MB

Get/Set number of threads when parallelized

getDTthreads() # setDTthreads()

## [1] 2

Lead/Lag

shift(1:10, n = 1,   fill = NA, type = "lag")
shift(1:10, n = 1:2, fill = NA, type = "lag") # multiple
shift(1:10, n = 1,   fill = NA, type = "lead")

##  [1] NA  1  2  3  4  5  6  7  8  9
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
## 
##  [1]  2  3  4  5  6  7  8  9 10 NA

lag(1:10, n = 1, default = NA)
purrr::map(1:2, ~lag(1:10, n = .x))
lead(1:10, n = 1, default = NA)

##  [1] NA  1  2  3  4  5  6  7  8  9
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
## 
##  [1]  2  3  4  5  6  7  8  9 10 NA

Generate run-length ids

rleid(rep(c("a", "b", "a"), each = 3)) # see also ?rleidv
rleid(rep(c("a", "b", "a"), each = 3), prefix = "G")

## [1] 1 1 1 2 2 2 3 3 3
## [1] "G1" "G1" "G1" "G2" "G2" "G2" "G3" "G3" "G3"

Fast version of `ifelse()`

x <- c(-3:3, NA)
fifelse(test = x < 0,
        yes  = "neg",
        no   = "pos",
        na   = "NA")

## [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"

x <- c(-3:3, NA)
if_else(condition = x < 0,
        true      = "neg",
        false     = "pos",
        missing   = "NA")

## [1] "neg" "neg" "neg" "pos" "pos" "pos" "pos" "NA"

Vectorised `ifelse` statements

x <- 1:10
fcase(
  x %% 6 == 0, "fizz buzz",
  x %% 2 == 0, "fizz",
  x %% 3 == 0, "buzz",
  default = NA_character_
)

##  [1] NA          "fizz"      "buzz"      "fizz"      NA          "fizz buzz"
##  [7] NA          "fizz"      "buzz"      "fizz"

x <- 1:10
case_when(
  x %% 6 == 0 ~ "fizz buzz",
  x %% 2 == 0 ~ "fizz",
  x %% 3 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)

##  [1] "1"         "fizz"      "buzz"      "fizz"      "5"         "fizz buzz"
##  [7] "7"         "fizz"      "buzz"      "fizz"

Rolling functions

# todo

Join/Bind data sets

Join

Joining data in data.table works like the fast subsetting approach described above. It can be performed using keys, using the ad hoc on argument, or using the merge.data.table method. For the sake of completeness, the three methods are presented below. As previously mentioned, the on and by (in merge) arguments are optional with keyed data.tables, but recommended to make the code more explicit.
In the examples below, the x, y, and z data.tables are also used with dplyr.

x <- data.table(Id  = c("A", "B", "C", "C"),
                X1  = c(1L, 3L, 5L, 7L),
                XY  = c("x2", "x4", "x6", "x8"),
                key = "Id")

y <- data.table(Id  = c("A", "B", "B", "D"),
                Y1  = c(1L, 3L, 5L, 7L),
                XY  = c("y1", "y3", "y5", "y7"),
                key = "Id")

Join matching rows from y to x

y[x, on = "Id"]
merge(x, y, all.x = TRUE, by = "Id")
y[x] # requires keys

##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8

left_join(x, y, by = "Id")

##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>

Join matching rows from x to y

x[y, on = "Id"]
merge(x, y, all.y = TRUE, by = "Id")
x[y] # requires keys

##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7

right_join(x, y, by = "Id")

##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7

Join matching rows from both x and y

x[y, on = "Id", nomatch = 0]
merge(x, y)
x[y, nomatch = 0] # requires keys

##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5

inner_join(x, y, by = "Id")

##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5

Join keeping all the rows

merge(x, y, all = TRUE, by = "Id")

##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
## 6:  D NA <NA>  7   y7

full_join(x, y, by = "Id")

##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
## 6:  D NA <NA>  7   y7

Return rows from x matching y

unique(x[y$Id, on = "Id", nomatch = 0])
unique(x[y$Id, nomatch = 0]) # requires keys

##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4

semi_join(x, y, by = "Id")

##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4

Return rows from x not matching y

x[!y, on = "Id"]
x[!y] # requires keys

##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8
##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8

anti_join(x, y, by = "Id")

##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8

More joins

data.table excels at joining data, and offers additional functions and features.

Select columns while joining

x[y, .(Id, X1, i.XY)]   # i. prefix refers to cols in y
x[y, .(Id, x.XY, i.XY)] # x. prefix refers to cols in x

##    Id X1 i.XY
## 1:  A  1   y1
## 2:  B  3   y3
## 3:  B  3   y5
## 4:  D NA   y7
##    Id x.XY i.XY
## 1:  A   x2   y1
## 2:  B   x4   y3
## 3:  B   x4   y5
## 4:  D <NA>   y7

right_join(select(x, Id, X1),
           select(y, Id, XY),
           by = "Id")
right_join(select(x, Id, XY),
           select(y, Id, XY),
           by = "Id")

##    Id X1 XY
## 1:  A  1 y1
## 2:  B  3 y3
## 3:  B  3 y5
## 4:  D NA y7
##    Id XY.x XY.y
## 1:  A   x2   y1
## 2:  B   x4   y3
## 3:  B   x4   y5
## 4:  D <NA>   y7

Aggregate columns while joining

y[x, .(X1Y1 = sum(Y1) * X1), by = .EACHI]

##    Id X1Y1
## 1:  A    1
## 2:  B   24
## 3:  C   NA
## 4:  C   NA

y %>%
 group_by(Id) %>%
 summarise(SumY1 = sum(Y1)) %>%
 right_join(x) %>%
 mutate(X1Y1 = SumY1 * X1) %>%
 select(Id, X1Y1)

## `summarise()` ungrouping output (override with `.groups` argument)

## Joining, by = "Id"

## # A tibble: 4 x 2
##   Id     X1Y1
##   <chr> <int>
## 1 A         1
## 2 B        24
## 3 C        NA
## 4 C        NA

Update columns while joining

y[x, SqX1 := i.X1^2]
y[, SqX1 := x[.BY, X1^2, on = "Id"], by = Id] # more memory-efficient
y[, SqX1 := NULL] # rm column for consistency

x %>%
 select(Id, X1) %>%
 mutate(SqX1 = X1^2) %>%
 right_join(y, by = "Id") %>%
 select(names(y), SqX1)

##    Id Y1 XY SqX1
## 1:  A  1 y1    1
## 2:  B  3 y3    9
## 3:  B  5 y5    9
## 4:  D  7 y7   NA

Adds a list column with rows from y matching x (nest-join)

x[, y := .(.(y[.BY, on = "Id"])), by = Id]
x[, y := NULL] # rm column for consistency

nest_join(x, y, by = "Id")

##    Id X1 XY             y
## 1:  A  1 x2 <tbl_df[1x2]>
## 2:  B  3 x4 <tbl_df[2x2]>
## 3:  C  5 x6 <tbl_df[0x2]>
## 4:  C  7 x8 <tbl_df[0x2]>

Update columns while joining (using vectors of colnames)

cols  <- c("NewXY", "NewX1")
icols <- paste0("i.", c("XY", "X1"))

y[x, (cols) := mget(icols)]

y[, (cols) := NULL] # rm columns for consistency

# ?

Join passing columns to match in the `on` argument

z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x

##    ID Z1 Z2
## 1:  C  5 z5
## 2:  C  6 z6
## 3:  C  7 z7
## 4:  C  8 z8
## 5:  C  9 z9
##    Id X1 XY X2
## 1:  A  1 x2 x1
## 2:  B  3 x4 x3
## 3:  C  5 x6 x5
## 4:  C  7 x8 x7

x[z, on = "X1 == Z1"]
x[z, on = .(X1 == Z1)] # same
x[z, on = .(Id == ID, X1 == Z1)] # using two columns

##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x6   x5 z5
## 2:  C  6 <NA> <NA> z6
## 3:  C  7   x8   x7 z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9

right_join(x, z, by = c("X1" = "Z1"))
right_join(x, z, by = c("Id" = "ID", "X1" = "Z1"))

##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2:    C  7   x8   x7  C z7
## 3: <NA>  6 <NA> <NA>  C z6
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x6   x5 z5
## 2:  C  7   x8   x7 z7
## 3:  C  6 <NA> <NA> z6
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9

Non-equi joins

x[z, on = .(Id == ID, X1 <= Z1)]
x[z, on = .(Id == ID, X1 > Z1)]
x[z, on = .(X1 < Z1), allow.cartesian = TRUE] # allows 'numerous' matching values

##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x6 x5 z7
## 4:  C  7 x8 x7 z7
## 5:  C  8 x6 x5 z8
## 6:  C  8 x8 x7 z8
## 7:  C  9 x6 x5 z9
## 8:  C  9 x8 x7 z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x8   x7 z5
## 2:  C  6   x8   x7 z6
## 3:  C  7 <NA> <NA> z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9
##     Id X1 XY X2 ID Z2
##  1:  A  5 x2 x1  C z5
##  2:  B  5 x4 x3  C z5
##  3:  A  6 x2 x1  C z6
##  4:  B  6 x4 x3  C z6
##  5:  C  6 x6 x5  C z6
##  6:  A  7 x2 x1  C z7
##  7:  B  7 x4 x3  C z7
##  8:  C  7 x6 x5  C z7
##  9:  A  8 x2 x1  C z8
## 10:  B  8 x4 x3  C z8
## 11:  C  8 x6 x5  C z8
## 12:  C  8 x8 x7  C z8
## 13:  A  9 x2 x1  C z9
## 14:  B  9 x4 x3  C z9
## 15:  C  9 x6 x5  C z9
## 16:  C  9 x8 x7  C z9

Rolling joins/subsets (performed on the last numeric column)

# Nearest
x[z, on = .(Id == ID, X1 == Z1), roll = "nearest"]
## below, simplified examples with ad hoc subsets on a keyed data.table
setkey(x, Id, X1)
x[.("C", 5:9), roll = "nearest"]

##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x8 x7 z7
## 4:  C  8 x8 x7 z8
## 5:  C  9 x8 x7 z9
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7

# Last Observation Carried Forward
x[.("C", 5:9), roll = Inf]
x[.("C", 5:9), roll = 0.5]  # bounded
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, TRUE)]  # default
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, FALSE)] # ends not rolled

##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x6   x5
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>

# Next Observation Carried Backward
x[.("C", 5:9), roll = -Inf]
x[.("C", 5:9), roll = -0.5] # bounded
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, FALSE)]
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, TRUE)]  # roll both ends

##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x8 x7
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7

Cross join (`CJ` ~ `expand.grid`)

CJ(c(2, 1, 1), 3:2)
CJ(c(2, 1, 1), 3:2, sorted = FALSE, unique = TRUE)

##    V1 V2
## 1:  1  2
## 2:  1  2
## 3:  1  3
## 4:  1  3
## 5:  2  2
## 6:  2  3
##    V1 V2
## 1:  2  3
## 2:  2  2
## 3:  1  3
## 4:  1  2

# base::expand.grid(c(2, 1, 1), 3:2)
#

Overlap join

It is important to mention foverlaps() from data.table that allows to perform ‘overlap joins’. This is a very powerful function, but a bit out of scope for this document. See these slides for more details.

Bind

x <- data.table(1:3)
y <- data.table(4:6)
z <- data.table(7:9, 0L)

Bind rows

rbind(x, y)
rbind(x, z, fill = TRUE)

##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
##    V1 V2
## 1:  1 NA
## 2:  2 NA
## 3:  3 NA
## 4:  7  0
## 5:  8  0
## 6:  9  0

bind_rows(x, y)
bind_rows(x, z) # always fills

##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
##    V1 V2
## 1:  1 NA
## 2:  2 NA
## 3:  3 NA
## 4:  7  0
## 5:  8  0
## 6:  9  0

Bind rows using a list

rbindlist(list(x, y), idcol = TRUE)

##    .id V1
## 1:   1  1
## 2:   1  2
## 3:   1  3
## 4:   2  4
## 5:   2  5
## 6:   2  6

bind_rows(list(x, y), .id = "id")

##    id V1
## 1:  1  1
## 2:  1  2
## 3:  1  3
## 4:  2  4
## 5:  2  5
## 6:  2  6

Bind columns

base::cbind(x, y)

##    V1 V1
## 1:  1  4
## 2:  2  5
## 3:  3  6

bind_cols(x, y)

## New names:
## * V1 -> V1...1
## * V1 -> V1...2

##    V1...1 V1...2
## 1:      1      4
## 2:      2      5
## 3:      3      6

Set operations

x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))

Intersection

fintersect(x, y)
fintersect(x, y, all = TRUE)

##    V1
## 1:  2
## 2:  3
##    V1
## 1:  2
## 2:  2
## 3:  3

dplyr::intersect(x, y)
# no all option

##    V1
## 1:  2
## 2:  3

Difference

fsetdiff(x, y)
fsetdiff(x, y, all = TRUE)

##    V1
## 1:  1
##    V1
## 1:  1
## 2:  3

dplyr::setdiff(x, y)
# no all option

##    V1
## 1:  1

Union

funion(x, y)
funion(x, y, all = TRUE)

##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
##     V1
##  1:  1
##  2:  2
##  3:  2
##  4:  3
##  5:  3
##  6:  2
##  7:  2
##  8:  3
##  9:  4
## 10:  4

dplyr::union(x, y)
union_all(x, y)

##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
##     V1
##  1:  1
##  2:  2
##  3:  2
##  4:  3
##  5:  3
##  6:  2
##  7:  2
##  8:  3
##  9:  4
## 10:  4

Equality

fsetequal(x, x[order(-V1),])
all.equal(x, x) # S3 method

## [1] TRUE
## [1] TRUE

setequal(x, x[order(-V1),])
all_equal(x, x)

## [1] TRUE
## [1] TRUE

Summary

This article presented the most important features of both data.table and dplyr, two packages that are now essential tools for data manipulation in R.
There are still a lot of features not covered in this document, in particular, data.table functions to deal with time-series or dplyr vectorized functions have not been discussed, but ‘done is better than perfect’…
Hopefully, this comparison is not too biased, but I must admit that my preference is for data.table. So, I hope this post will encourage some readers to give it a try!

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.0.0       data.table_1.13.0 knitr_1.26       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3       magrittr_1.5     hms_0.5.2        tidyselect_1.1.0
##  [5] R6_2.4.1         rlang_0.4.6      fansi_0.4.0      stringr_1.4.0   
##  [9] tools_3.6.3      xfun_0.11        utf8_1.1.4       cli_2.0.0       
## [13] ellipsis_0.3.0   htmltools_0.4.0  assertthat_0.2.1 yaml_2.2.0      
## [17] digest_0.6.23    tibble_2.1.3     lifecycle_0.2.0  crayon_1.3.4    
## [21] bookdown_0.16    readr_1.3.1      tidyr_1.0.0      purrr_0.3.3     
## [25] vctrs_0.3.1      glue_1.4.1       evaluate_0.14    rmarkdown_2.0   
## [29] blogdown_0.17    stringi_1.4.3    compiler_3.6.3   pillar_1.4.3    
## [33] generics_0.0.2   pkgconfig_2.0.3

← → Top

A data.table and dplyr tour

Introduction

About

Update

data.table and dplyr

Create example data

Basic operations

Filter rows

Filter rows using indices

Discard rows using negative indices

Filter rows using a logical expression

Filter rows using multiple conditions

Filter unique rows

Discard rows with missing values

Other filters

Sort rows

Sort rows by column

Sort rows in decreasing order

Sort rows based on several columns

Select columns

Select one column using an index (not recommended)

Select one column using column name

Select several columns

Exclude columns

Select/Exclude columns using a character vector

Other selections

Summarise data

Summarise one column

Summarise several columns

Summarise several columns and assign column names

Summarise a subset of rows

Add/update/delete columns

Modify a column

Add one column

Add several columns

Create one column and remove the others

Remove one column

Remove several columns

Remove columns using a vector of colnames

Replace values for rows matching a condition

by

By group

By several groups

Calling function in by

Assigning column name in by

Using a condition in by

By on a subset of rows

Count number of observations for each group

Add a column with number of observations for each group

Retrieve the first/last/nth observation for each group

Going further

Advanced columns manipulation

Summarise all the columns

Summarise several columns

Summarise several columns by group

Summarise with more than one function by group

Summarise using a condition

Modify all the columns

Modify several columns (dropping the others)

Modify several columns (keeping the others)

Modify columns using a condition (dropping the others)

Modify columns using a condition (keeping the others)

Use a complex expression

Use multiple expressions (with DT[,{j}])

Chain expressions

Expression chaining using DT[][] (recommended)

Expression chaining using %>%

Indexing and Keys

Set the key/index

Select the matching rows

Select the first matching row

Select the last matching row

Nomatch argument

Apply a function on the matching rows

Modify values for matching rows

Use keys in by

Set keys/indices for multiple columns

Subset using multiple keys/indices

Remove keys/indices

set*() modifications

Use multiple expressions (with `DT[,{j}]`)

Expression chaining using `DT[][]` (recommended)

Expression chaining using `%>%`

`set*()` modifications

Fast version of `ifelse()`

Vectorised `ifelse` statements

Join passing columns to match in the `on` argument

Cross join (`CJ` ~ `expand.grid`)