Written by
        Atrebas
        
on
		on
A data.table and pandas small stroll
 
 | 
 
 | 
Introduction
About
- Last year, I published a post
showing how to use the data.table and dplyr R packages to execute data munging tasks.
 - I have learned some Python lately and thought it would be a could exercice to “translate” the commands using pandas.
 - I am less familiar with Python and pandas than I am with R and data.table, and presenting the two librairies together in a logical way is a bit challenging. So, rather than going into convoluted explanations,
I choose the easy (and time-saving) solution: only the code is presented, no further explanation is given - hence the title.
 - That’s all it is. Just a programming exercice, not a A-vs-B post.
It is incomplete and not fully polished but I publish it in case it can be useful for someone and to better keep track of the files.
 - The same R code is used below and in my “data.table and dplyr tour”. The Python code, available here, is structured and labeled in the same way.
 
Update
- 2020-08-10: The python code has been improved and completed. Thanks a lot to Samuel Oranyeli for his careful and insightful feedback.
 
Create example data
 | 
 | 
Below, the data.table code uses DT and the pandas code uses DF.
Basic operations
Filter rows
Filter rows using indices
 | 
 | 
Discard rows using negative indices
 | 
 | 
Filter rows using a logical expression
 | 
 | 
Filter rows using multiple conditions
 | 
 | 
Filter unique rows
 | 
 | 
Discard rows with missing values
 | 
 | 
Other filters
 | 
 | 
 | 
 | 
Sort rows
Sort rows by column
 | 
 | 
Sort rows in decreasing order
 | 
 | 
Sort rows based on several columns
 | 
 | 
Select columns
Select one column using an index (not recommended)
 | 
 | 
Select one column using column name
 | 
 | 
Select several columns
 | 
 | 
Exclude columns
 | 
 | 
Select/Exclude columns using a character vector
 | 
 | 
Other selections
 | 
 | 
Summarise data
Summarise one column
 | 
 | 
Summarise several columns
 | 
 | 
Summarise several columns and assign column names
 | 
 | 
Summarise a subset of rows
 | 
 | 
 | 
 | 
Add/update/delete columns
Modify a column
 | 
 | 
Add one column
 | 
 | 
Add several columns
 | 
 | 
Create one column and remove the others
 | 
 | 
Remove one column
 | 
 | 
Remove several columns
 | 
 | 
Remove columns using a vector of colnames
 | 
 | 
Replace values for rows matching a condition
 | 
 | 
by
By group
 | 
 | 
By several groups
 | 
 | 
Calling function in by
 | 
 | 
Assigning column name in by
 | 
 | 
Using a condition in by
 | 
 | 
By on a subset of rows
 | 
 | 
Count number of observations for each group
 | 
 | 
Add a column with number of observations for each group
 | 
 | 
Retrieve the first/last/nth observation for each group
 | 
 | 
Going further
Advanced columns manipulation
Summarise all the columns
 | 
 | 
Summarise several columns
 | 
 | 
Summarise several columns by group
 | 
 | 
Summarise with more than one function by group
 | 
 | 
Summarise using a condition
 | 
 | 
Modify all the columns
 | 
 | 
Modify several columns (dropping the others)
 | 
 | 
Modify several columns (keeping the others)
 | 
 | 
Modify columns using a condition (dropping the others)
 | 
 | 
Modify columns using a condition (keeping the others)
 | 
 | 
Use a complex expression
 | 
 | 
Use multiple expressions (with DT[,{j}])
 | 
 | 
Chain expressions
Expression chaining using DT[][] (recommended)
 | 
 | 
Indexing and Keys
Set the key/index
 | 
 | 
Select the matching rows
 | 
 | 
Select the first matching row
 | 
 | 
Select the last matching row
 | 
 | 
Nomatch argument
 | 
 | 
Apply a function on the matching rows
 | 
 | 
Modify values for matching rows
 | 
 | 
Use keys in by
 | 
 | 
Set keys/indices for multiple columns
 | 
 | 
Subset using multiple keys/indices
 | 
 | 
Remove keys/indices
 | 
 | 
set*() modifications
Replace values
 | 
 | 
Reorder rows
 | 
 | 
Modify colnames
 | 
 | 
Reorder columns
 | 
 | 
Convert data
 | 
 | 
Advanced use of by
Select first/last/… row by group
 | 
 | 
Select rows using a nested query
 | 
 | 
Add a group counter column
 | 
 | 
Get row number of first (and last) observation by group
 | 
 | 
Handle list-columns by group
 | 
 | 
Grouping sets (multiple by at once)
 | 
 | 
Miscellaneous
Read / Write data
Write data to a csv file
 | 
 | 
Write data to a tab-delimited file
 | 
 | 
Write list-column data to a csv file
 | 
 | 
Read a csv / tab-delimited file
 | 
 | 
Read a csv file selecting / droping columns
 | 
 | 
Read and rbind several files
 | 
 | 
Reshape data
Melt data (from wide to long)
 | 
 | 
Cast data (from long to wide)
 | 
 | 
Split
 | 
 | 
Split and transpose a vector/column
 | 
 | 
Other
Check package installation
 | 
 | 
List data.tables/tibbles
 | 
 | 
Get/Set number of threads when parallelized
 | 
 | 
Lead/Lag
 | 
 | 
Generate run-length ids
 | 
 | 
Vectorised ifelse statements
 | 
 | 
Rolling functions
 | 
 | 
Join/Bind data sets
Join
 | 
 | 
Join matching rows from y to x
 | 
 | 
Join matching rows from x to y
 | 
 | 
Join matching rows from both x and y
 | 
 | 
Join keeping all the rows
 | 
 | 
Return rows from x matching y
 | 
 | 
Return rows from x not matching y
 | 
 | 
More joins
Select columns while joining
 | 
 | 
Aggregate columns while joining
 | 
 | 
Update columns while joining
 | 
 | 
Adds a list column with rows from y matching x (nest-join)
 | 
 | 
Update columns while joining (using vectors of colnames)
 | 
 | 
Join passing columns to match in the on argument
z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x
##    ID Z1 Z2
## 1:  C  5 z5
## 2:  C  6 z6
## 3:  C  7 z7
## 4:  C  8 z8
## 5:  C  9 z9
##    Id X1 XY X2
## 1:  A  1 x2 x1
## 2:  B  3 x4 x3
## 3:  C  5 x6 x5
## 4:  C  7 x8 x7
 | 
 | 
Non-equi joins
 | 
 | 
Rolling joins/subsets (performed on the last numeric column)
 | 
 | 
 | 
 | 
 | 
 | 
Cross join (CJ ~ expand.grid)
 | 
 | 
Bind
 | 
 | 
Bind rows
 | 
 | 
Bind rows using a list
 | 
 | 
Bind columns
 | 
 | 
Set operations
x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
Intersection
 | 
 | 
Difference
 | 
 | 
Union
 | 
 | 
Equality
 | 
 | 
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /home/xu/anaconda3/lib/libmkl_rt.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.13.0 knitr_1.26       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3      bookdown_0.16   lattice_0.20-41 digest_0.6.23  
##  [5] rappdirs_0.3.1  grid_3.6.3      jsonlite_1.6    magrittr_1.5   
##  [9] evaluate_0.14   blogdown_0.17   rlang_0.4.6     stringi_1.4.3  
## [13] Matrix_1.2-18   reticulate_1.16 rmarkdown_2.0   tools_3.6.3    
## [17] stringr_1.4.0   xfun_0.11       yaml_2.2.0      compiler_3.6.3 
## [21] htmltools_0.4.0