Written by
Atrebas
on
on
A data.table and pandas small stroll
|
Introduction
About
- Last year, I published a post
showing how to use the data.table and dplyr R packages to execute data munging tasks.
- I have learned some Python lately and thought it would be a could exercice to “translate” the commands using pandas.
- I am less familiar with Python and pandas than I am with R and data.table, and presenting the two librairies together in a logical way is a bit challenging. So, rather than going into convoluted explanations,
I choose the easy (and time-saving) solution: only the code is presented, no further explanation is given - hence the title.
- That’s all it is. Just a programming exercice, not a A-vs-B post.
It is incomplete and not fully polished but I publish it in case it can be useful for someone and to better keep track of the files.
- The same R code is used below and in my “data.table and dplyr tour”. The Python code, available here, is structured and labeled in the same way.
Update
- 2020-08-10: The python code has been improved and completed. Thanks a lot to Samuel Oranyeli for his careful and insightful feedback.
Create example data
|
|
Below, the data.table code uses DT
and the pandas code uses DF
.
Basic operations
Filter rows
Filter rows using indices
|
|
Discard rows using negative indices
|
|
Filter rows using a logical expression
|
|
Filter rows using multiple conditions
|
|
Filter unique rows
|
|
Discard rows with missing values
|
|
Other filters
|
|
|
|
Sort rows
Sort rows by column
|
|
Sort rows in decreasing order
|
|
Sort rows based on several columns
|
|
Select columns
Select one column using an index (not recommended)
|
|
Select one column using column name
|
|
Select several columns
|
|
Exclude columns
|
|
Select/Exclude columns using a character vector
|
|
Other selections
|
|
Summarise data
Summarise one column
|
|
Summarise several columns
|
|
Summarise several columns and assign column names
|
|
Summarise a subset of rows
|
|
|
|
Add/update/delete columns
Modify a column
|
|
Add one column
|
|
Add several columns
|
|
Create one column and remove the others
|
|
Remove one column
|
|
Remove several columns
|
|
Remove columns using a vector of colnames
|
|
Replace values for rows matching a condition
|
|
by
By group
|
|
By several groups
|
|
Calling function in by
|
|
Assigning column name in by
|
|
Using a condition in by
|
|
By on a subset of rows
|
|
Count number of observations for each group
|
|
Add a column with number of observations for each group
|
|
Retrieve the first/last/nth observation for each group
|
|
Going further
Advanced columns manipulation
Summarise all the columns
|
|
Summarise several columns
|
|
Summarise several columns by group
|
|
Summarise with more than one function by group
|
|
Summarise using a condition
|
|
Modify all the columns
|
|
Modify several columns (dropping the others)
|
|
Modify several columns (keeping the others)
|
|
Modify columns using a condition (dropping the others)
|
|
Modify columns using a condition (keeping the others)
|
|
Use a complex expression
|
|
Use multiple expressions (with DT[,{j}]
)
|
|
Chain expressions
Expression chaining using DT[][]
(recommended)
|
|
Indexing and Keys
Set the key/index
|
|
Select the matching rows
|
|
Select the first matching row
|
|
Select the last matching row
|
|
Nomatch argument
|
|
Apply a function on the matching rows
|
|
Modify values for matching rows
|
|
Use keys in by
|
|
Set keys/indices for multiple columns
|
|
Subset using multiple keys/indices
|
|
Remove keys/indices
|
|
set*()
modifications
Replace values
|
|
Reorder rows
|
|
Modify colnames
|
|
Reorder columns
|
|
Convert data
|
|
Advanced use of by
Select first/last/… row by group
|
|
Select rows using a nested query
|
|
Add a group counter column
|
|
Get row number of first (and last) observation by group
|
|
Handle list-columns by group
|
|
Grouping sets (multiple by at once)
|
|
Miscellaneous
Read / Write data
Write data to a csv file
|
|
Write data to a tab-delimited file
|
|
Write list-column data to a csv file
|
|
Read a csv / tab-delimited file
|
|
Read a csv file selecting / droping columns
|
|
Read and rbind several files
|
|
Reshape data
Melt data (from wide to long)
|
|
Cast data (from long to wide)
|
|
Split
|
|
Split and transpose a vector/column
|
|
Other
Check package installation
|
|
List data.tables/tibbles
|
|
Get/Set number of threads when parallelized
|
|
Lead/Lag
|
|
Generate run-length ids
|
|
Vectorised ifelse
statements
|
|
Rolling functions
|
|
Join/Bind data sets
Join
|
|
Join matching rows from y to x
|
|
Join matching rows from x to y
|
|
Join matching rows from both x and y
|
|
Join keeping all the rows
|
|
Return rows from x matching y
|
|
Return rows from x not matching y
|
|
More joins
Select columns while joining
|
|
Aggregate columns while joining
|
|
Update columns while joining
|
|
Adds a list column with rows from y matching x (nest-join)
|
|
Update columns while joining (using vectors of colnames)
|
|
Join passing columns to match in the on
argument
z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x
## ID Z1 Z2
## 1: C 5 z5
## 2: C 6 z6
## 3: C 7 z7
## 4: C 8 z8
## 5: C 9 z9
## Id X1 XY X2
## 1: A 1 x2 x1
## 2: B 3 x4 x3
## 3: C 5 x6 x5
## 4: C 7 x8 x7
|
|
Non-equi joins
|
|
Rolling joins/subsets (performed on the last numeric column)
|
|
|
|
|
|
Cross join (CJ
~ expand.grid
)
|
|
Bind
|
|
Bind rows
|
|
Bind rows using a list
|
|
Bind columns
|
|
Set operations
x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
Intersection
|
|
Difference
|
|
Union
|
|
Equality
|
|
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /home/xu/anaconda3/lib/libmkl_rt.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.13.0 knitr_1.26
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 bookdown_0.16 lattice_0.20-41 digest_0.6.23
## [5] rappdirs_0.3.1 grid_3.6.3 jsonlite_1.6 magrittr_1.5
## [9] evaluate_0.14 blogdown_0.17 rlang_0.4.6 stringi_1.4.3
## [13] Matrix_1.2-18 reticulate_1.16 rmarkdown_2.0 tools_3.6.3
## [17] stringr_1.4.0 xfun_0.11 yaml_2.2.0 compiler_3.6.3
## [21] htmltools_0.4.0