on
A data.table and dplyr tour
Introduction
About
- This document has been been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems. It has been written for my own self-teaching and may contain errors or imprecisions. Corrections and suggestions are welcome.
- Resources for data.table can be found on the data.table
wiki, in the data.table
vignettes,
and in the package documentation.
- Reference documents for dplyr include the dplyr
cheat sheet,
the dplyr vignettes,
and the package documentation.
- For the sake of readability, the console outputs are hidden by default. Click on the button below to show or hide the outputs. Also, the R code used in this document is independently available and can be easily reproduced.
Update
- 2020-07-26: Update with data.table_1.13.0 (
fcase()
, functions in.SDcols
) and dplyr_1.0.0 (across()
,relocate()
, …). ‘Superseded’ dplyr code has been commented and will probably be removed in the future.
- 2019-06-22: Coding style improved (thanks to @tylerburleigh for pointing that out). Some data.table expressions with
by
reordered and indented to better highlight the similarity with dplyr. Minor comments and improvements.
data.table and dplyr
data.table and dplyr are two R packages that both aim at an easier and more efficient manipulation of data frames. But while they share a lot of functionalities, their philosophies are quite different. Below is a quick overview of the main differences (from my basic user’s perspective).
Syntax:
- The general data.table syntax is as follows: DT[i, j, by, ...]
which means:
“Take DT, subset rows using i
, then calculate j
, grouped by by
” with possible extra options ...
.
It allows to combine several operations in a very concise and consistent expression.
- The syntax of dplyr is based on key verbs corresponding to the most common operations:
filter()
, arrange()
, select()
, mutate()
, summarise()
, …
These functions can be combine with group_by()
to aggregate data ‘by group’ and with a bunch of helper functions.
It is a ‘do one thing at a time’ approach, chaining together functions dedicated to a specific task.
Ecosystem:
The data.table package has no dependency whereas dplyr is part of the tidyverse. So, for example, while data.table includes functions to read, write, or reshape data, dplyr delegates these tasks to companion packages like readr or tidyr.
On the other hand, data.table is focused on the processing of local in-memory data, but dplyr offers a database backend.
Memory management and performance:
In data.table, objects can be manipulated ‘by reference’ (using the set*()
functions or with the :=
symbol). It means that the data will be modified but not copied, minimizing the RAM requirements. The behaviour of dplyr is similar to the one of base R.
Memory management, parallelism, and shrewd optimization give data.table the advantage in terms of performance.
Create example data
|
|
Below, the data.table code uses DT
and the dplyr code uses DF
.
Also, the dplyr code uses the %>%
operator: a basic knowledge of the magrittr syntax is assumed.
Basic operations
Filter rows
Filter rows using indices
|
|
Discard rows using negative indices
|
|
Filter rows using a logical expression
|
|
Filter rows using multiple conditions
|
|
Filter unique rows
|
|
Discard rows with missing values
|
|
Other filters
In addition to the mainfilter()
function, dplyr also offers slice_*()
helpers. With data.table, we can simply use a custom expression in DT[i]
.
|
|
|
|
Below, we will see that data.table has two optimized mechanisms to filter rows efficiently (keys and indices).
Sort rows
Sort rows by column
|
|
Sort rows in decreasing order
|
|
Sort rows based on several columns
|
|
Select columns
Select one column using an index (not recommended)
|
|
Select one column using column name
|
|
Select several columns
|
|
Exclude columns
|
|
Select/Exclude columns using a character vector
|
|
Other selections
As for row filtering, dplyr includes helper functions to select column. With data.table, a possible solution is to first retrieve the column names (e.g. using a regular expression), then select these columns. Another way (usingpatterns()
) is presented in a next section.
|
|
Summarise data
Summarise one column
|
|
Summarise several columns
|
|
Summarise several columns and assign column names
|
|
Summarise a subset of rows
|
|
summarise()
(or summarize()
) include first()
, last()
, n()
, nth()
, and n_distinct()
. The data.table package also include first()
, last()
, and uniqueN()
.
|
|
Add/update/delete columns
In the following commands, with data.table, columns are modified by reference using the column assignment symbol :=
(no copy performed) and the results are returned invisibly. With dplyr, we have to assign the results.
Modify a column
|
|
Add one column
|
|
Add several columns
|
|
Create one column and remove the others
|
|
Remove one column
|
|
Remove several columns
|
|
Remove columns using a vector of colnames
|
|
Replace values for rows matching a condition
|
|
by
The dplyr::group_by()
function and the corresponding by
and keyby
statements in data.table
allow to run manipulate each group of observations and combine the results.
The sole difference between by
and keyby
is that keyby
orders the results and
creates a key
that will allow faster subsetting (cf. the indexing and keys section).
Below, we arbitrary use one or the other.
group_by()
takes an existing tibble and converts it into a grouped
tibble where operations will always be performed “by group”.
Using ungroup()
removes grouping. With data.table, by
is always used on the fly.
Note that it is possible to reorder the arguments in data.table: DT[i, j, by]
<=> DT[i, by, j]
.
This is done below to better highlight the similarity with dplyr.
By group
|
|
By several groups
|
|
Calling function in by
|
|
Assigning column name in by
|
|
Using a condition in by
|
|
By on a subset of rows
|
|
Count number of observations for each group
|
|
Add a column with number of observations for each group
|
|
Retrieve the first/last/nth observation for each group
|
|
Going further
Advanced columns manipulation
To manipulate multiple columns, dplyr_1.0.0 has introduced the across()
function,
superseding the _all
, _at
, and _if
versions of summarise()
, mutate()
, and transmute()
.
With data.table, we use .SD
, which is a data.table
containing the Subset of Data for each group,
excluding the column(s) used in by
. So, DT[, .SD]
is DT
itself and in the expression DT[, .SD, by = V4]
,
.SD
contains all the DT columns (except V4) for each values in V4 (see DT[, print(.SD), by = V4]
).
.SDcols
allows to select the columns included in .SD
.
Summarise all the columns
|
|
Summarise several columns
|
|
Summarise several columns by group
|
|
Summarise with more than one function by group
|
|
Summarise using a condition
|
|
Modify all the columns
|
|
Modify several columns (dropping the others)
|
|
Modify several columns (keeping the others)
|
|
Modify columns using a condition (dropping the others)
|
|
Modify columns using a condition (keeping the others)
|
|
The use of DT[,j]
is very flexible, allowing to pass complex expressions in a straightforward way,
or combine expressions with multiple outputs.
Use a complex expression
|
|
Use multiple expressions (with DT[,{j}]
)
|
|
Chain expressions
The dplyr workflow relies on the magrittr pipe operator (%>%
).
The magrittr package can also be used with data.table
objects,
but data.table comes with its own chaining system: DT[ ... ][ ... ][ ... ]
.
Expression chaining using DT[][]
(recommended)
|
|
Expression chaining using %>%
|
|
Indexing and Keys
Row subsetting in dplyr relies on the filter()
and slice()
functions, as shown in the first section.
With data.table, in addition to the above-mentioned approach, two systems are available
to make row filtering and join operations more convenient and blazingly fast (~170x speed-up):
keys (primary ordered index) and indices (automatic secondary indexing).
The main differences between keys and indices are:
- When using keys, data are physically reordered in memory. When using indices, the order is stored as an attribute.
- Only one key is possible but several indices can coexist.
- Keys are defined explicitly. Indices can be created manually but are also created on-the-fly (and stored when using ==
or %in%
).
- Indices are used with the on
argument. It is optional when using keys, but recommended (and used below) for better readability.
Note: in the following code, we set both a key and an index to demonstrate their usage, but internally, indices are not used when a key already exists for the same columns(s).
Set the key/index
|
|
Select the matching rows
|
|
Select the first matching row
|
|
Select the last matching row
|
|
Nomatch argument
|
|
Apply a function on the matching rows
|
|
Modify values for matching rows
|
|
Use keys in by
|
|
Set keys/indices for multiple columns
|
|
Subset using multiple keys/indices
|
|
Remove keys/indices
|
|
set*()
modifications
In data.table, set*()
functions modify objects by reference, making these operations fast and memory-efficient.
In case this is not a desired behaviour, users can use copy()
.
The corresponding expressions in dplyr will be less memory-efficient.
Replace values
|
|
Reorder rows
|
|
Modify colnames
|
|
Reorder columns
|
|
Relocate columns
|
|
Convert data
|
|
Advanced use of by
Context dependent expressions (cur_*()
) have been introduced in dplyr 1.0.0, reflecting data.table aliases .I
, .GRP
, …
Select first/last/… row by group
|
|
Select rows using a nested query
|
|
Add a group counter column
|
|
Get row number of first (and last) observation by group
|
|
Handle list-columns by group
|
|
Grouping sets (multiple by at once)
|
|
Miscellaneous
Read / Write data
fread()
and fwrite()
are among the most powerful functions of data.table.
They are not only incredibly fast (see benchmarks),
they are also extremely robust.
The few commands below only scratch the surface and there are a lot of awesome features.
For example, fread()
accepts http
and https
URLs directly as well as operating system commands such as sed
and awk
output.
Make sure to check the docs.
Here again, fread()
and fwrite()
are very versatile and allow to handle different file formats while dplyr delegates file reading and writing to the readr package with several specific functions (csv, tsv, delim, …).
Write data to a csv file
|
|
Write data to a tab-delimited file
|
|
Write list-column data to a csv file
|
|
Read a csv / tab-delimited file
|
|
Read a csv file selecting / droping columns
|
|
Read and rbind several files
|
|
Reshape data
This part is still a bit clunky. I need to find better examples. See here and here for more details.
Melt data (from wide to long)
|
|
Cast data (from long to wide)
|
|
Split
|
|
Split and transpose a vector/column
|
|
Other
Check package installation
|
|
List data.tables/tibbles
|
|
Get/Set number of threads when parallelized
|
|
Lead/Lag
|
|
Generate run-length ids
|
|
Fast version of ifelse()
|
|
Vectorised ifelse
statements
|
|
Rolling functions
|
|
Join/Bind data sets
Join
Joining data in data.table works like the fast subsetting approach described above. It can be performed using keys, using the ad hoc on
argument, or using the merge.data.table
method. For the sake of completeness, the three methods are presented below. As previously mentioned, the on
and by
(in merge
) arguments are optional with keyed data.tables, but recommended to make the code more explicit.
In the examples below, the x
, y
, and z
data.tables are also used with dplyr.
x <- data.table(Id = c("A", "B", "C", "C"),
X1 = c(1L, 3L, 5L, 7L),
XY = c("x2", "x4", "x6", "x8"),
key = "Id")
y <- data.table(Id = c("A", "B", "B", "D"),
Y1 = c(1L, 3L, 5L, 7L),
XY = c("y1", "y3", "y5", "y7"),
key = "Id")
Join matching rows from y to x
|
|
Join matching rows from x to y
|
|
Join matching rows from both x and y
|
|
Join keeping all the rows
|
|
Return rows from x matching y
|
|
Return rows from x not matching y
|
|
More joins
data.table excels at joining data, and offers additional functions and features.
Select columns while joining
|
|
Aggregate columns while joining
|
|
Update columns while joining
|
|
Adds a list column with rows from y matching x (nest-join)
|
|
Update columns while joining (using vectors of colnames)
|
|
Join passing columns to match in the on
argument
z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x
## ID Z1 Z2
## 1: C 5 z5
## 2: C 6 z6
## 3: C 7 z7
## 4: C 8 z8
## 5: C 9 z9
## Id X1 XY X2
## 1: A 1 x2 x1
## 2: B 3 x4 x3
## 3: C 5 x6 x5
## 4: C 7 x8 x7
|
|
Non-equi joins
|
|
Rolling joins/subsets (performed on the last numeric column)
|
|
|
|
|
|
Cross join (CJ
~ expand.grid
)
|
|
Overlap join
It is important to mention foverlaps()
from data.table that allows to perform ‘overlap joins’. This is a very powerful function, but a bit out of scope for this document. See these slides for more details.
Bind
x <- data.table(1:3)
y <- data.table(4:6)
z <- data.table(7:9, 0L)
Bind rows
|
|
Bind rows using a list
|
|
Bind columns
|
|
Set operations
x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
Intersection
|
|
Difference
|
|
Union
|
|
Equality
|
|
Summary
This article presented the most important features of both data.table and dplyr, two packages that are now essential tools for data manipulation in R.
There are still a lot of features not covered in this document, in particular, data.table functions to deal with time-series or dplyr vectorized functions have not been discussed, but ‘done is better than perfect’…
Hopefully, this comparison is not too biased, but I must admit that my preference is for data.table. So, I hope this post will encourage some readers to give it a try!
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.0 data.table_1.13.0 knitr_1.26
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 magrittr_1.5 hms_0.5.2 tidyselect_1.1.0
## [5] R6_2.4.1 rlang_0.4.6 fansi_0.4.0 stringr_1.4.0
## [9] tools_3.6.3 xfun_0.11 utf8_1.1.4 cli_2.0.0
## [13] ellipsis_0.3.0 htmltools_0.4.0 assertthat_0.2.1 yaml_2.2.0
## [17] digest_0.6.23 tibble_2.1.3 lifecycle_0.2.0 crayon_1.3.4
## [21] bookdown_0.16 readr_1.3.1 tidyr_1.0.0 purrr_0.3.3
## [25] vctrs_0.3.1 glue_1.4.1 evaluate_0.14 rmarkdown_2.0
## [29] blogdown_0.17 stringi_1.4.3 compiler_3.6.3 pillar_1.4.3
## [33] generics_0.0.2 pkgconfig_2.0.3