Written by Atrebas
on June 14, 2020

A data.table and pandas small stroll

Introduction

About

Last year, I published a post showing how to use the data.table and dplyr R packages to execute data munging tasks.
I have learned some Python lately and thought it would be a could exercice to “translate” the commands using pandas.
I am less familiar with Python and pandas than I am with R and data.table, and presenting the two librairies together in a logical way is a bit challenging. So, rather than going into convoluted explanations, I choose the easy (and time-saving) solution: only the code is presented, no further explanation is given - hence the title.
That’s all it is. Just a programming exercice, not a A-vs-B post. It is incomplete and not fully polished but I publish it in case it can be useful for someone and to better keep track of the files.
The same R code is used below and in my “data.table and dplyr tour”. The Python code, available here, is structured and labeled in the same way.

Update

2020-08-10: The python code has been improved and completed. Thanks a lot to Samuel Oranyeli for his careful and insightful feedback.

Create example data

library(data.table)
set.seed(1L)

## Create a data table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                V2 = 1:9,
                V3 = c(0.5, 1.0, 1.5),
                V4 = rep(LETTERS[1:3], 3))

class(DT)
DT

## [1] "data.table" "data.frame"
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C

import pandas as pd
import numpy as np

## Create a pandas DataFrame
DF = pd.DataFrame(
  {"V1" : [1, 2, 1, 2, 1, 2, 1, 2, 1],
   "V2" : [1, 2, 3, 4, 5, 6, 7, 8, 9], 
   "V3" : [0.5, 1.0, 1.5] * 3, 
   "V4" : ['A', 'B', 'C'] * 3}) 
type(DF)

## <class 'pandas.core.frame.DataFrame'>

DF

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

Below, the data.table code uses DT and the pandas code uses DF.

Basic operations

Filter rows

Filter rows using indices

DT[3:4,]
DT[3:4] # same

##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A

DF.iloc[2:4]

##    V1  V2   V3 V4
## 2   1   3  1.5  C
## 3   2   4  0.5  A

Discard rows using negative indices

DT[!3:7,]
DT[-(3:7)] # same

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C

DF[~DF.index.isin(range(2, 7))]

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 7   2   8  1.0  B
## 8   1   9  1.5  C

Filter rows using a logical expression

DT[V2 > 5]
DT[V4 %chin% c("A", "C")] # fast %in% for character

##    V1 V2  V3 V4
## 1:  2  6 1.5  C
## 2:  1  7 0.5  A
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  1  3 1.5  C
## 3:  2  4 0.5  A
## 4:  2  6 1.5  C
## 5:  1  7 0.5  A
## 6:  1  9 1.5  C

DF[DF.V2 > 5]

##    V1  V2   V3 V4
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

DF.query('V2>5')

##    V1  V2   V3 V4
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

DF[DF.V4.isin(['A', 'C'])]

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 8   1   9  1.5  C

Filter rows using multiple conditions

DT[V1 == 1 & V4 == "A"]
# any logical criteria can be used

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  1  7 0.5  A

DF.loc[(DF.V1 == 1) & (DF.V4 == 'A')]
# DF.query("V1 == 1 and V4 == 'A'")
# any logical criteria can be used

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 6   1   7  0.5  A

Filter unique rows

unique(DT)
unique(DT, by = c("V1", "V4")) # returns all cols

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C

DF.drop_duplicates()

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

DF.drop_duplicates(subset =['V1', 'V4'])

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C

Discard rows with missing values

na.omit(DT, cols = 1:4)  # fast S3 method with cols argument

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C

DF.dropna(subset = DF.columns)
# dropna has more options

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

DF.dropna(axis = 0, how = 'any', thresh = None, subset = None, inplace = False)

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

Other filters

DT[sample(.N, 3)] # .N = nb of rows in DT
DT[sample(.N, .N / 2)]
DT[frankv(-V1, ties.method = "dense") < 2]

##    V1 V2  V3 V4
## 1:  1  9 1.5  C
## 2:  2  4 0.5  A
## 3:  1  7 0.5  A
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  5 1.0  B
## 4:  1  3 1.5  C
##    V1 V2  V3 V4
## 1:  2  2 1.0  B
## 2:  2  4 0.5  A
## 3:  2  6 1.5  C
## 4:  2  8 1.0  B

DF.sample(n = 3)

##    V1  V2   V3 V4
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 1   2   2  1.0  B

DF.sample(frac = 0.5)
# DF[DF['V1'].nlargest(1)] # no keep = 'all'

##    V1  V2   V3 V4
## 7   2   8  1.0  B
## 0   1   1  0.5  A
## 6   1   7  0.5  A
## 2   1   3  1.5  C

DT[V4 %like% "^B"]
DT[V2 %between% c(3, 5)]
DT[data.table::between(V2, 3, 5, incbounds = FALSE)]
DT[V2 %inrange% list(-1:1, 1:3)] # see also ?inrange

##    V1 V2 V3 V4
## 1:  2  2  1  B
## 2:  1  5  1  B
## 3:  2  8  1  B
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
## 3:  1  5 1.0  B
##    V1 V2  V3 V4
## 1:  2  4 0.5  A
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C

DF[DF.V4.str.startswith('B')] # use tuples when multiple

##    V1  V2   V3 V4
## 1   2   2  1.0  B
## 4   1   5  1.0  B
## 7   2   8  1.0  B

DF[DF['V2'].between(3, 5)]

##    V1  V2   V3 V4
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B

DF[DF['V2'].between(3, 5, inclusive = False)]
#DF[DF.V2 >= list(range(-1, 2)) & DF.V2 <= list(range(1,4))]

##    V1  V2   V3 V4
## 3   2   4  0.5  A

Sort rows

Sort rows by column

DT[order(V3)]  # see also setorder

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  4 0.5  A
## 3:  1  7 0.5  A
## 4:  2  2 1.0  B
## 5:  1  5 1.0  B
## 6:  2  8 1.0  B
## 7:  1  3 1.5  C
## 8:  2  6 1.5  C
## 9:  1  9 1.5  C

DF.sort_values('V3')

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 3   2   4  0.5  A
## 6   1   7  0.5  A
## 1   2   2  1.0  B
## 4   1   5  1.0  B
## 7   2   8  1.0  B
## 2   1   3  1.5  C
## 5   2   6  1.5  C
## 8   1   9  1.5  C

Sort rows in decreasing order

DT[order(-V3)]

##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  6 1.5  C
## 3:  1  9 1.5  C
## 4:  2  2 1.0  B
## 5:  1  5 1.0  B
## 6:  2  8 1.0  B
## 7:  1  1 0.5  A
## 8:  2  4 0.5  A
## 9:  1  7 0.5  A

DF.sort_values('V3', ascending = False)

##    V1  V2   V3 V4
## 2   1   3  1.5  C
## 5   2   6  1.5  C
## 8   1   9  1.5  C
## 1   2   2  1.0  B
## 4   1   5  1.0  B
## 7   2   8  1.0  B
## 0   1   1  0.5  A
## 3   2   4  0.5  A
## 6   1   7  0.5  A

Sort rows based on several columns

DT[order(V1, -V2)]

##    V1 V2  V3 V4
## 1:  1  9 1.5  C
## 2:  1  7 0.5  A
## 3:  1  5 1.0  B
## 4:  1  3 1.5  C
## 5:  1  1 0.5  A
## 6:  2  8 1.0  B
## 7:  2  6 1.5  C
## 8:  2  4 0.5  A
## 9:  2  2 1.0  B

DF.sort_values(['V1', 'V2'], ascending = [True, False])

##    V1  V2   V3 V4
## 8   1   9  1.5  C
## 6   1   7  0.5  A
## 4   1   5  1.0  B
## 2   1   3  1.5  C
## 0   1   1  0.5  A
## 7   2   8  1.0  B
## 5   2   6  1.5  C
## 3   2   4  0.5  A
## 1   2   2  1.0  B

Select columns

Select one column using an index (not recommended)

DT[[3]] # returns a vector
DT[, 3]  # returns a data.table

## [1] 0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5
##     V3
## 1: 0.5
## 2: 1.0
## 3: 1.5
## 4: 0.5
## 5: 1.0
## 6: 1.5
## 7: 0.5
## 8: 1.0
## 9: 1.5

DF.iloc[:, 2]   # returns a pandas Series

## 0    0.5
## 1    1.0
## 2    1.5
## 3    0.5
## 4    1.0
## 5    1.5
## 6    0.5
## 7    1.0
## 8    1.5
## Name: V3, dtype: float64

DF.iloc[:, [2]] # convert to DataFrame

##     V3
## 0  0.5
## 1  1.0
## 2  1.5
## 3  0.5
## 4  1.0
## 5  1.5
## 6  0.5
## 7  1.0
## 8  1.5

Select one column using column name

DT[, list(V2)] # returns a data.table
DT[, .(V2)]    # returns a data.table
# . is an alias for list
DT[, "V2"]     # returns a data.table
DT[, V2]       # returns a vector
DT[["V2"]]     # returns a vector

##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
## [1] 1 2 3 4 5 6 7 8 9
## [1] 1 2 3 4 5 6 7 8 9

DF.V2           # returns a pandas Series

## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## 6    7
## 7    8
## 8    9
## Name: V2, dtype: int64

DF['V2']        # returns a pandas Series

## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## 6    7
## 7    8
## 8    9
## Name: V2, dtype: int64

DF.loc[:, 'V2'] # returns a pandas Series

## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## 6    7
## 7    8
## 8    9
## Name: V2, dtype: int64

DF[['V2']]      # returns a DataFrame

##    V2
## 0   1
## 1   2
## 2   3
## 3   4
## 4   5
## 5   6
## 6   7
## 7   8
## 8   9

Select several columns

DT[, .(V2, V3, V4)]
DT[, list(V2, V3, V4)]
DT[, V2:V4] # select columns between V2 and V4

##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C

DF[['V2', 'V3', 'V4']]

##    V2   V3 V4
## 0   1  0.5  A
## 1   2  1.0  B
## 2   3  1.5  C
## 3   4  0.5  A
## 4   5  1.0  B
## 5   6  1.5  C
## 6   7  0.5  A
## 7   8  1.0  B
## 8   9  1.5  C

DF.loc[:, 'V2':'V4'] # select columns between V2 and V4

##    V2   V3 V4
## 0   1  0.5  A
## 1   2  1.0  B
## 2   3  1.5  C
## 3   4  0.5  A
## 4   5  1.0  B
## 5   6  1.5  C
## 6   7  0.5  A
## 7   8  1.0  B
## 8   9  1.5  C

Exclude columns

DT[, !c("V2", "V3")]

##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C

DF.drop(columns = ['V2', 'V3'])

##    V1 V4
## 0   1  A
## 1   2  B
## 2   1  C
## 3   2  A
## 4   1  B
## 5   2  C
## 6   1  A
## 7   2  B
## 8   1  C

Select/Exclude columns using a character vector

cols <- c("V2", "V3")
DT[, ..cols] # .. prefix means 'one-level up'
DT[, !..cols] # or DT[, -..cols]

##    V2  V3
## 1:  1 0.5
## 2:  2 1.0
## 3:  3 1.5
## 4:  4 0.5
## 5:  5 1.0
## 6:  6 1.5
## 7:  7 0.5
## 8:  8 1.0
## 9:  9 1.5
##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C

cols = ['V2', 'V3']
DF.loc[:, cols]

##    V2   V3
## 0   1  0.5
## 1   2  1.0
## 2   3  1.5
## 3   4  0.5
## 4   5  1.0
## 5   6  1.5
## 6   7  0.5
## 7   8  1.0
## 8   9  1.5

DF.loc[:, DF.columns.difference(cols)]

##    V1 V4
## 0   1  A
## 1   2  B
## 2   1  C
## 3   2  A
## 4   1  B
## 5   2  C
## 6   1  A
## 7   2  B
## 8   1  C

Other selections

cols <- paste0("V", 1:2)
cols <- union("V4", names(DT))
cols <- grep("V",   names(DT))
cols <- grep("3$",  names(DT))
cols <- grep(".2",  names(DT))
cols <- grep("^V1|X$",  names(DT))
cols <- grep("^(?!V2)", names(DT), perl = TRUE)
DT[, ..cols]

##    V1  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  1 1.5  C
## 4:  2 0.5  A
## 5:  1 1.0  B
## 6:  2 1.5  C
## 7:  1 0.5  A
## 8:  2 1.0  B
## 9:  1 1.5  C

cols = ['V' + str(x) for x in [1, 2]] # and DF.loc
# ?
DF.filter(regex = 'V')

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

DF.filter(regex = '3$')

##     V3
## 0  0.5
## 1  1.0
## 2  1.5
## 3  0.5
## 4  1.0
## 5  1.5
## 6  0.5
## 7  1.0
## 8  1.5

DF.filter(regex = '.2')

##    V2
## 0   1
## 1   2
## 2   3
## 3   4
## 4   5
## 5   6
## 6   7
## 7   8
## 8   9

DF.filter(regex = '^V1|X$')

##    V1
## 0   1
## 1   2
## 2   1
## 3   2
## 4   1
## 5   2
## 6   1
## 7   2
## 8   1

DF.filter(regex = '^(?!V2)')

##    V1   V3 V4
## 0   1  0.5  A
## 1   2  1.0  B
## 2   1  1.5  C
## 3   2  0.5  A
## 4   1  1.0  B
## 5   2  1.5  C
## 6   1  0.5  A
## 7   2  1.0  B
## 8   1  1.5  C

Summarise data

Summarise one column

DT[, sum(V1)]    # returns a vector
DT[, .(sum(V1))] # returns a data.table
DT[, .(sumV1 = sum(V1))] # returns a data.table

## [1] 13
##    V1
## 1: 13
##    sumV1
## 1:    13

DF.V1.sum()   # returns a numpy array

## 13

DF[['V1']].sum()   # returns a pandas series

## V1    13
## dtype: int64

DF.loc[:, ['V1']].agg(['sum']).set_axis(['sumV1'], axis = 1)

##      sumV1
## sum     13

Summarise several columns

DT[, .(sum(V1), sd(V3))]

##    V1        V2
## 1: 13 0.4330127

DF.agg({'V1': 'sum', 'V3': 'std'}) # series

## V1    13.000000
## V3     0.433013
## dtype: float64

Summarise several columns and assign column names

DT[, .(sumv1 = sum(V1),
       sdv3  = sd(V3))]

##    sumv1      sdv3
## 1:    13 0.4330127

DF.agg(sumV1 = ('V1','sum'), sdv3 = ('V3','std'))
#res = list(DF.agg({'V1': 'sum', 'V3': 'std'}))
#pd.DataFrame([res], columns=['sumV1','sdv3'])

##          V1        V3
## sumV1  13.0       NaN
## sdv3    NaN  0.433013

Summarise a subset of rows

DT[1:4, sum(V1)]

## [1] 6

DF.loc[:3, 'V1'].sum()

## 6

DT[, data.table::first(V3)]
DT[, data.table::last(V3)]
DT[5, V3]
DT[, uniqueN(V4)]
uniqueN(DT)

## [1] 0.5
## [1] 1.5
## [1] 1
## [1] 3
## [1] 9

DF.loc[0,'V3']

## 0.5

DF.at[DF.index[-1], 'V3']
# at method better than loc with scalars

## 1.5

DF.V3.iloc[4]

## 1.0

DF['V4'].nunique()

## 3

len(DF.drop_duplicates())

## 9

Add/update/delete columns

Modify a column

DT[, V1 := V1^2]
DT

##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  4  2 1.0  B
## 3:  1  3 1.5  C
## 4:  4  4 0.5  A
## 5:  1  5 1.0  B
## 6:  4  6 1.5  C
## 7:  1  7 0.5  A
## 8:  4  8 1.0  B
## 9:  1  9 1.5  C

DF.loc[:, 'V1'] = DF.loc[:, 'V1'] ** 2
#DF['V1'] = DF['V1']**2
DF.eval('V1 = V1**2', inplace = True)
DF

##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1  16   2  1.0  B
## 2   1   3  1.5  C
## 3  16   4  0.5  A
## 4   1   5  1.0  B
## 5  16   6  1.5  C
## 6   1   7  0.5  A
## 7  16   8  1.0  B
## 8   1   9  1.5  C

Add one column

DT[, v5 := log(V1)][] # adding [] prints the result

##    V1 V2  V3 V4       v5
## 1:  1  1 0.5  A 0.000000
## 2:  4  2 1.0  B 1.386294
## 3:  1  3 1.5  C 0.000000
## 4:  4  4 0.5  A 1.386294
## 5:  1  5 1.0  B 0.000000
## 6:  4  6 1.5  C 1.386294
## 7:  1  7 0.5  A 0.000000
## 8:  4  8 1.0  B 1.386294
## 9:  1  9 1.5  C 0.000000

DF = DF.assign(v5 = np.log(DF.V1))

Add several columns

DT[, c("v6", "v7") := .(sqrt(V1), "X")]

DT[, ':='(v6 = sqrt(V1),
          v7 = "X")]     # same, functional form

DF = DF.assign(v6 = np.sqrt(DF.V1), v7 = 'X')

Create one column and remove the others

DT[, .(v8 = V3 + 1)]

##     v8
## 1: 1.5
## 2: 2.0
## 3: 2.5
## 4: 1.5
## 5: 2.0
## 6: 2.5
## 7: 1.5
## 8: 2.0
## 9: 2.5

DF.loc[:, ['V3']].add(1).rename(columns = {'V3':'V8'})
# pd.DataFrame({'v8' : DF.V3 + 1})

##     V8
## 0  1.5
## 1  2.0
## 2  2.5
## 3  1.5
## 4  2.0
## 5  2.5
## 6  1.5
## 7  2.0
## 8  2.5

Remove one column

DT[, v5 := NULL]

del DF['v5'] # DF = DF.drop('v5', 1)

Remove several columns

DT[, c("v6", "v7") := NULL]

DF = DF.drop(['v6', 'v7'], 1)

Remove columns using a vector of colnames

cols <- c("V3")
DT[, (cols) := NULL] # ! not DT[, cols := NULL]

cols = 'V3'
del DF[cols]

Replace values for rows matching a condition

DT[V2 < 4, V2 := 0L]
DT

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C

DF.loc[DF.loc[:, 'V2'] < 4, 'V2'] = 0

by

By group

# one-liner:
DT[, .(sumV2 = sum(V2)), by = "V4"]
# reordered and indented:
DT[, by = V4,
     .(sumV2 = sum(V2))]
#

##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15
##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15

(DF.groupby(['V4'], as_index = False)
  .agg(sumV2 = ('V2', 'sum')))

##   V4  sumV2
## 0  A     11
## 1  B     13
## 2  C     15

By several groups

DT[, keyby = .(V4, V1),
     .(sumV2 = sum(V2))]

##    V4 V1 sumV2
## 1:  A  1     7
## 2:  A  4     4
## 3:  B  1     5
## 4:  B  4     8
## 5:  C  1     9
## 6:  C  4     6

(DF.groupby(['V4', 'V1'], as_index = False)
  .agg(sumV2 = ('V2', 'sum')))

##   V4  V1  sumV2
## 0  A   1      7
## 1  A  16      4
## 2  B   1      5
## 3  B  16      8
## 4  C   1      9
## 5  C  16      6

Calling function in by

DT[, by = tolower(V4),
     .(sumV1 = sum(V1))]

##    tolower sumV1
## 1:       a     6
## 2:       b     9
## 3:       c     6

(DF.assign(V4 = DF.V4.str.lower())
  .groupby(['V4'], as_index = False)
  .agg(sumV1=('V1', 'sum')))

##   V4  sumV1
## 0  a     18
## 1  b     33
## 2  c     18

Assigning column name in by

DT[, keyby = .(abc = tolower(V4)),
     .(sumV1 = sum(V1))]

##    abc sumV1
## 1:   a     6
## 2:   b     9
## 3:   c     6

(DF.assign(abc = lambda x: x.V4.str.lower())
  .groupby(['abc'], as_index = False)
  .agg(sumV1 = ('V1', 'sum')))

##   abc  sumV1
## 0   a     18
## 1   b     33
## 2   c     18

Using a condition in by

DT[, keyby = V4 == "A",
     sum(V1)]

##       V4 V1
## 1: FALSE 15
## 2:  TRUE  6

DF.groupby(DF.V4 == 'A').V1.sum()

## V4
## False    51
## True     18
## Name: V1, dtype: int64

By on a subset of rows

DT[1:5,                # i
   .(sumV1 = sum(V1)), # j
   by = V4]            # by
## complete DT[i, j, by] expression!

##    V4 sumV1
## 1:  A     5
## 2:  B     5
## 3:  C     1

DF.iloc[:5].groupby('V4').agg(sumV1 = ('V1','sum'))

##     sumV1
## V4       
## A      17
## B      17
## C       1

Count number of observations for each group

DT[, .N, by = V4]

##    V4 N
## 1:  A 3
## 2:  B 3
## 3:  C 3

DF.groupby('V4').size()

## V4
## A    3
## B    3
## C    3
## dtype: int64

Add a column with number of observations for each group

DT[, n := .N, by = V1][]
DT[, n := NULL] # rm column for consistency

##    V1 V2 V4 n
## 1:  1  0  A 5
## 2:  4  0  B 4
## 3:  1  0  C 5
## 4:  4  4  A 4
## 5:  1  5  B 5
## 6:  4  6  C 4
## 7:  1  7  A 5
## 8:  4  8  B 4
## 9:  1  9  C 5

DF.assign(n = lambda x: x.groupby('V1').V4.transform('count'))

##    V1  V2 V4  n
## 0   1   0  A  5
## 1  16   0  B  4
## 2   1   0  C  5
## 3  16   4  A  4
## 4   1   5  B  5
## 5  16   6  C  4
## 6   1   7  A  5
## 7  16   8  B  4
## 8   1   9  C  5

Retrieve the first/last/nth observation for each group

DT[, data.table::first(V2), by = V4]
DT[, data.table::last(V2), by = V4]
DT[, V2[2], by = V4]

##    V4 V1
## 1:  A  0
## 2:  B  0
## 3:  C  0
##    V4 V1
## 1:  A  7
## 2:  B  8
## 3:  C  9
##    V4 V1
## 1:  A  4
## 2:  B  5
## 3:  C  6

DF.groupby('V4')['V2'].first()

## V4
## A    0
## B    0
## C    0
## Name: V2, dtype: int64

DF.groupby('V4')['V2'].last()

## V4
## A    7
## B    8
## C    9
## Name: V2, dtype: int64

DF.groupby('V4')['V2'].nth(1) # 0-based

## V4
## A    4
## B    5
## C    6
## Name: V2, dtype: int64

Going further

Advanced columns manipulation

Summarise all the columns

DT[, lapply(.SD, max)]

##    V1 V2 V4
## 1:  4  9  C

DF.agg(['max'])

##      V1  V2 V4
## max  16   9  C

Summarise several columns

DT[, lapply(.SD, mean),
     .SDcols = c("V1", "V2")]

##          V1       V2
## 1: 2.333333 4.333333

DF.loc[:, ['V1', 'V2']].agg(['mean'])

##             V1        V2
## mean  7.666667  4.333333

Summarise several columns by group

DT[, by = V4,
     lapply(.SD, mean),
     .SDcols = c("V1", "V2")]
## using patterns (regex)
DT[, by = V4,
     lapply(.SD, mean),
     .SDcols = patterns("V1|V2|Z0")]

##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000
##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000

(DF.groupby('V4')
  .agg(**{f'{col}_mean':(col, 'mean') for col in ['V1', 'V2']}))
## using regex

##     V1_mean   V2_mean
## V4                   
## A         6  3.666667
## B        11  4.333333
## C         6  5.000000

(DF.filter(regex=('V1|V2|V4'))
   .groupby('V4')
   .apply(np.mean, axis = 0)
   .reset_index())

##   V4    V1        V2
## 0  A   6.0  3.666667
## 1  B  11.0  4.333333
## 2  C   6.0  5.000000

Summarise with more than one function by group

DT[, by = V4,
     c(lapply(.SD, sum),
       lapply(.SD, mean))]

##    V4 V1 V2 V1       V2
## 1:  A  6 11  2 3.666667
## 2:  B  9 13  3 4.333333
## 3:  C  6 15  2 5.000000

DF.groupby('V4').agg(['sum', 'mean'])

##     V1       V2          
##    sum mean sum      mean
## V4                       
## A   18    6  11  3.666667
## B   33   11  13  4.333333
## C   18    6  15  5.000000

Summarise using a condition

DT[, lapply(.SD, mean),
     .SDcols = is.numeric]
foo <- function(x) {is.numeric(x) && mean(x) > 3}
DT[, lapply(.SD, mean),
   .SDcols = foo]

##          V1       V2
## 1: 2.333333 4.333333
##          V2
## 1: 4.333333

DF.select_dtypes(include = 'number').agg(['mean'])

##             V1        V2
## mean  7.666667  4.333333

Modify all the columns

DT[, lapply(.SD, rev)]

##    V1 V2 V4
## 1:  1  9  C
## 2:  4  8  B
## 3:  1  7  A
## 4:  4  6  C
## 5:  1  5  B
## 6:  4  4  A
## 7:  1  0  C
## 8:  4  0  B
## 9:  1  0  A

DF.iloc[::-1]
#DF.apply(lambda x: x[::-1], axis = 0)

##    V1  V2 V4
## 8   1   9  C
## 7  16   8  B
## 6   1   7  A
## 5  16   6  C
## 4   1   5  B
## 3  16   4  A
## 2   1   0  C
## 1  16   0  B
## 0   1   0  A

Modify several columns (dropping the others)

DT[, lapply(.SD, sqrt),
     .SDcols = V1:V2]
DT[, lapply(.SD, exp),
     .SDcols = !"V4"]

##    V1       V2
## 1:  1 0.000000
## 2:  2 0.000000
## 3:  1 0.000000
## 4:  2 2.000000
## 5:  1 2.236068
## 6:  2 2.449490
## 7:  1 2.645751
## 8:  2 2.828427
## 9:  1 3.000000
##           V1         V2
## 1:  2.718282    1.00000
## 2: 54.598150    1.00000
## 3:  2.718282    1.00000
## 4: 54.598150   54.59815
## 5:  2.718282  148.41316
## 6: 54.598150  403.42879
## 7:  2.718282 1096.63316
## 8: 54.598150 2980.95799
## 9:  2.718282 8103.08393

DF.filter(['V1', 'V2']).agg(np.sqrt)

##     V1        V2
## 0  1.0  0.000000
## 1  4.0  0.000000
## 2  1.0  0.000000
## 3  4.0  2.000000
## 4  1.0  2.236068
## 5  4.0  2.449490
## 6  1.0  2.645751
## 7  4.0  2.828427
## 8  1.0  3.000000

DF.filter(regex = '[^V4]').agg(np.exp)

##              V1           V2
## 0  2.718282e+00     1.000000
## 1  8.886111e+06     1.000000
## 2  2.718282e+00     1.000000
## 3  8.886111e+06    54.598150
## 4  2.718282e+00   148.413159
## 5  8.886111e+06   403.428793
## 6  2.718282e+00  1096.633158
## 7  8.886111e+06  2980.957987
## 8  2.718282e+00  8103.083928

Modify several columns (keeping the others)

DT[, c("V1", "V2") := lapply(.SD, sqrt),
     .SDcols = c("V1", "V2")]

cols <- setdiff(names(DT), "V4")
DT[, (cols) := lapply(.SD, "^", 2L),
     .SDcols = cols]
DT

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C

DF.loc[:, 'V1':'V2'] = DF.filter(['V1', 'V2']).agg(np.sqrt)
cols = DF.columns.difference(['V4'])
DF.loc[:, cols] = DF.filter(cols).agg(lambda x: pow(x,2))
DF

##      V1   V2 V4
## 0   1.0  0.0  A
## 1  16.0  0.0  B
## 2   1.0  0.0  C
## 3  16.0  4.0  A
## 4   1.0  5.0  B
## 5  16.0  6.0  C
## 6   1.0  7.0  A
## 7  16.0  8.0  B
## 8   1.0  9.0  C

Modify columns using a condition (dropping the others)

DT[, .SD - 1,
     .SDcols = is.numeric]
DT

##    V1 V2
## 1:  0 -1
## 2:  3 -1
## 3:  0 -1
## 4:  3  3
## 5:  0  4
## 6:  3  5
## 7:  0  6
## 8:  3  7
## 9:  0  8
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C

DF.select_dtypes("number").sub(1)
#cols = DF.select_dtypes(include = [np.number]).columns
#DF.loc[:, cols].apply(lambda x: x - 1)

##      V1   V2
## 0   0.0 -1.0
## 1  15.0 -1.0
## 2   0.0 -1.0
## 3  15.0  3.0
## 4   0.0  4.0
## 5  15.0  5.0
## 6   0.0  6.0
## 7  15.0  7.0
## 8   0.0  8.0

Modify columns using a condition (keeping the others)

DT[, (cols) := lapply(.SD, as.integer),
     .SDcols = is.numeric]

DF.loc[:, cols] = DF.loc[:, cols].astype(int)
DF

##    V1  V2 V4
## 0   1   0  A
## 1  16   0  B
## 2   1   0  C
## 3  16   4  A
## 4   1   5  B
## 5  16   5  C
## 6   1   7  A
## 7  16   8  B
## 8   1   9  C

Use a complex expression

DT[, by = V4,
     .(V1[1:2], "X")]

##    V4 V1 V2
## 1:  A  1  X
## 2:  A  4  X
## 3:  B  4  X
## 4:  B  1  X
## 5:  C  1  X
## 6:  C  4  X

(DF.groupby('V4')
  .head(2)
  .assign(V2 = 'X')
  .sort_values('V4'))

##    V1 V2 V4
## 0   1  X  A
## 3  16  X  A
## 1  16  X  B
## 4   1  X  B
## 2   1  X  C
## 5  16  X  C

Use multiple expressions (with `DT[,{j}]`)

DT[, {print(V1) #  comments here!
      print(summary(V1))
      x <- V1 + sum(V2)
     .(A = 1:.N, B = x) # last list returned as a data.table
     }]

## [1] 1 4 1 4 1 4 1 4 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.333   4.000   4.000 
##    A  B
## 1: 1 39
## 2: 2 42
## 3: 3 39
## 4: 4 42
## 5: 5 39
## 6: 6 42
## 7: 7 39
## 8: 8 42
## 9: 9 39

pass

Chain expressions

Expression chaining using `DT[][]` (recommended)

DT[, by = V4, 
     .(V1sum = sum(V1)) ][
     V1sum > 5]

##    V4 V1sum
## 1:  A     6
## 2:  B     9
## 3:  C     6

(DF.groupby(['V4'], as_index=False)
  .agg(V1sum = ('V1',np.sum))
  .query('V1sum > 5'))

##   V4  V1sum
## 0  A     18
## 1  B     33
## 2  C     18

Indexing and Keys

Set the key/index

setkey(DT, V4)
setindex(DT, V4)
DT

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  5  C
## 9:  1  9  C

DF.set_index('V4', drop = False, inplace = True)
DF.sort_index(inplace = True)
DF

##     V1  V2 V4
## V4           
## A    1   0  A
## A   16   4  A
## A    1   7  A
## B   16   0  B
## B    1   5  B
## B   16   8  B
## C    1   0  C
## C   16   5  C
## C    1   9  C

Select the matching rows

DT["A", on = "V4"]
DT[c("A", "C"), on = .(V4)] # same as on = "V4"

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  1  0  C
## 5:  4  5  C
## 6:  1  9  C

DF.loc['A']

##     V1  V2 V4
## V4           
## A    1   0  A
## A   16   4  A
## A    1   7  A

DF.loc[['A', 'C']]

##     V1  V2 V4
## V4           
## A    1   0  A
## A   16   4  A
## A    1   7  A
## C    1   0  C
## C   16   5  C
## C    1   9  C

Select the first matching row

DT["B", on = "V4", mult = "first"]
DT[c("B", "C"), on = "V4", mult = "first"]

##    V1 V2 V4
## 1:  4  0  B
##    V1 V2 V4
## 1:  4  0  B
## 2:  1  0  C

DF.loc['B'].head(1)
# ?

##     V1  V2 V4
## V4           
## B   16   0  B

Select the last matching row

DT["A", on = "V4", mult = "last"]

##    V1 V2 V4
## 1:  1  7  A

DF.loc['A'].tail(1)

##     V1  V2 V4
## V4           
## A    1   7  A

Nomatch argument

# (default) returns a row with "D" even if not found
DT[c("A", "D"), on = "V4", nomatch = NA]
# no rows for unmatched values
DT[c("A", "D"), on = "V4", nomatch = 0]

##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4: NA NA  D
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A

#deprecate-loc-reindex-listlike
#DF.loc[['A', 'D']]
#DF.loc[['A', 'D']].dropna()

Apply a function on the matching rows

DT[c("A", "C"), sum(V1), on = "V4"]

## [1] 12

DF.loc[['A', 'C'],'V1'].sum()

## 36

Modify values for matching rows

DT["A", V1 := 0, on = "V4"]
DT

##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  5  C
## 9:  1  9  C

DF.loc['A', 'V1'] = 0
DF

##     V1  V2 V4
## V4           
## A    0   0  A
## A    0   4  A
## A    0   7  A
## B   16   0  B
## B    1   5  B
## B   16   8  B
## C    1   0  C
## C   16   5  C
## C    1   9  C

Use keys in by

DT[!"B", sum(V1), on = "V4", by = .EACHI]
DT[V4 != "B",
   by = V4,
   sum(V1)]   # same

##    V4 V1
## 1:  A  0
## 2:  C  6
##    V4 V1
## 1:  A  0
## 2:  C  6

(DF.query("index != 'B'")
  .groupby(level=0)
  .agg({'V1':np.sum}))
#DF.loc[~(DF.index == 'B')].groupby(level = 0).agg({'V1':'sum'})

##     V1
## V4    
## A    0
## C   18

Set keys/indices for multiple columns

setkey(DT, V4, V1) # or setkeyv(DT, c("V4", "V1"))
setindex(DT, V4, V1) # setindexv(DT, c("V4", "V1"))
DT

##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C

DF.set_index(['V4', 'V1'], drop = False, inplace = True)
DF.sort_index(inplace = True)
DF

##        V1  V2 V4
## V4 V1           
## A  0    0   0  A
##    0    0   4  A
##    0    0   7  A
## B  1    1   5  B
##    16  16   0  B
##    16  16   8  B
## C  1    1   0  C
##    1    1   9  C
##    16  16   5  C

Subset using multiple keys/indices

DT[.("C", 1), on = .(V4, V1)]
DT[.(c("B", "C"), 1), on = .(V4, V1)]
# using which = TRUE only returns the matching rows indices
DT[.(c("B", "C"), 1), on = .(V4, V1), which = TRUE]

##    V1 V2 V4
## 1:  1  0  C
## 2:  1  9  C
##    V1 V2 V4
## 1:  1  5  B
## 2:  1  0  C
## 3:  1  9  C
## [1] 4 7 8

DF.loc[('C', 1)]

##        V1  V2 V4
## V4 V1           
## C  1    1   0  C
##    1    1   9  C

DF.loc[(('B', 'C'), 1), :]
#DF.query("V4 in ('B', 'C') and V1 == 1")

##        V1  V2 V4
## V4 V1           
## B  1    1   5  B
## C  1    1   0  C
##    1    1   9  C

list(np.where(DF.V4.isin(['B', 'C']) & DF.V1 == 1))

## [array([3, 6, 7])]

DF.index.isin({'V4': ['B', 'C'], 'V1': [1]}) # wrong

## array([False, False, False, False, False, False, False, False, False])

Remove keys/indices

setkey(DT, NULL)
setindex(DT, NULL)
DT

##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C

DF.reset_index(inplace = True, drop = True)
DF

##    V1  V2 V4
## 0   0   0  A
## 1   0   4  A
## 2   0   7  A
## 3   1   5  B
## 4  16   0  B
## 5  16   8  B
## 6   1   0  C
## 7   1   9  C
## 8  16   5  C

`set*()` modifications

Replace values

set(DT, i = 1L, j = 2L, value = 3L)
DT

##    V1 V2 V4
## 1:  0  3  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C

DF.iloc[0, 1] = 3
DF

##    V1  V2 V4
## 0   0   3  A
## 1   0   4  A
## 2   0   7  A
## 3   1   5  B
## 4  16   0  B
## 5  16   8  B
## 6   1   0  C
## 7   1   9  C
## 8  16   5  C

Reorder rows

setorder(DT, V4, -V1)
setorderv(DT, c("V4", "V1"), c(1, -1))

DF.sort_values(['V4','V1'], ascending = [True, False], inplace = True)

Modify colnames

setnames(DT, old = "V2", new = "v2")
setnames(DT, old = -(c(1, 3)), new = "V2")
DT

##    V1 V2 V4
## 1:  0  3  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  4  8  B
## 6:  1  5  B
## 7:  4  5  C
## 8:  1  0  C
## 9:  1  9  C

DF.rename(columns = {'V2':'v2'}, inplace = True)
cols = DF.columns.values; cols[1] = 'V2'
DF.columns = cols
DF

##    V1  V2 V4
## 0   0   3  A
## 1   0   4  A
## 2   0   7  A
## 4  16   0  B
## 5  16   8  B
## 3   1   5  B
## 8  16   5  C
## 6   1   0  C
## 7   1   9  C

Reorder columns

setcolorder(DT, c("V4", "V1", "V2"))
DT

##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9

DF = DF[['V4', 'V1', 'V2']]
DF

##   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7
## 4  B  16   0
## 5  B  16   8
## 3  B   1   5
## 8  C  16   5
## 6  C   1   0
## 7  C   1   9

Convert data

# ?setDT # data.frame or list to data.table
# ?setDF # data.table to data.frame
# ?setattr # modify attributes

pass

Advanced use of by

Select first/last/… row by group

DT[, .SD[1], by = V4]
DT[, .SD[c(1, .N)], by = V4]
DT[, tail(.SD, 2), by = V4]

##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  4  5
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  7
## 3:  B  4  0
## 4:  B  1  5
## 5:  C  4  5
## 6:  C  1  9
##    V4 V1 V2
## 1:  A  0  4
## 2:  A  0  7
## 3:  B  4  8
## 4:  B  1  5
## 5:  C  1  0
## 6:  C  1  9

DF.groupby('V4').head(1)

##   V4  V1  V2
## 0  A   0   3
## 4  B  16   0
## 8  C  16   5

DF.groupby('V4').nth([0, -1])

##     V1  V2
## V4        
## A    0   3
## A    0   7
## B   16   0
## B    1   5
## C   16   5
## C    1   9

DF.groupby('V4').nth([-1, -2])

##     V1  V2
## V4        
## A    0   4
## A    0   7
## B   16   8
## B    1   5
## C    1   0
## C    1   9

Select rows using a nested query

DT[, .SD[which.min(V2)], by = V4]

##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  1  0

DF.loc[DF.groupby('V4').V2.idxmin()]

##   V4  V1  V2
## 0  A   0   3
## 4  B  16   0
## 6  C   1   0

Add a group counter column

DT[, Grp := .GRP, by = .(V4, V1)][]
DT[, Grp := NULL] # delete for consistency

##    V4 V1 V2 Grp
## 1:  A  0  3   1
## 2:  A  0  4   1
## 3:  A  0  7   1
## 4:  B  4  0   2
## 5:  B  4  8   2
## 6:  B  1  5   3
## 7:  C  4  5   4
## 8:  C  1  0   5
## 9:  C  1  9   5

DF['Grp'] = DF.groupby(['V4', 'V1']).ngroup()

## /home/xu/anaconda3/bin/python:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

DF

##   V4  V1  V2  Grp
## 0  A   0   3    0
## 1  A   0   4    0
## 2  A   0   7    0
## 4  B  16   0    2
## 5  B  16   8    2
## 3  B   1   5    1
## 8  C  16   5    4
## 6  C   1   0    3
## 7  C   1   9    3

del DF['Grp']

Get row number of first (and last) observation by group

DT[, .I, by = V4] # returns a data.table
DT[, .I[1], by = V4]
DT[, .I[c(1, .N)], by = V4]

##    V4 I
## 1:  A 1
## 2:  A 2
## 3:  A 3
## 4:  B 4
## 5:  B 5
## 6:  B 6
## 7:  C 7
## 8:  C 8
## 9:  C 9
##    V4 V1
## 1:  A  1
## 2:  B  4
## 3:  C  7
##    V4 V1
## 1:  A  1
## 2:  A  3
## 3:  B  4
## 4:  B  6
## 5:  C  7
## 6:  C  9

pd.DataFrame(DF.groupby('V4').indices).melt()

##   variable  value
## 0        A      0
## 1        A      1
## 2        A      2
## 3        B      3
## 4        B      4
## 5        B      5
## 6        C      6
## 7        C      7
## 8        C      8

pd.DataFrame(DF.groupby('V4').indices).loc[0]

## A    0
## B    3
## C    6
## Name: 0, dtype: int64

pd.DataFrame(DF.groupby('V4').indices).iloc[[0,-1]].melt()

##   variable  value
## 0        A      0
## 1        A      2
## 2        B      3
## 3        B      5
## 4        C      6
## 5        C      8

Handle list-columns by group

DT[, .(.(V1)),  by = V4]  # return V1 as a list
DT[, .(.(.SD)), by = V4] # subsets of the data

##    V4    V1
## 1:  A 0,0,0
## 2:  B 4,4,1
## 3:  C 4,1,1
##    V4                V1
## 1:  A <data.table[3x2]>
## 2:  B <data.table[3x2]>
## 3:  C <data.table[3x2]>

DF.groupby('V4')['V1'].apply(list)
# ?

## V4
## A      [0, 0, 0]
## B    [16, 16, 1]
## C     [16, 1, 1]
## Name: V1, dtype: object

Grouping sets (multiple by at once)

rollup(DT,
       .(SumV2 = sum(V2)),
       by = c("V1", "V4"))

rollup(DT,
       .(SumV2 = sum(V2), .N),
       by = c("V1", "V4"),
       id = TRUE)

cube(DT,
     .(SumV2 = sum(V2), .N),
     by = c("V1", "V4"),
     id = TRUE)

groupingsets(DT,
             .(SumV2 = sum(V2), .N),
             by   = c("V1", "V4"),
             sets = list("V1", c("V1", "V4")),
             id   = TRUE)

##    V1   V4 SumV2
## 1:  0    A    14
## 2:  4    B     8
## 3:  1    B     5
## 4:  4    C     5
## 5:  1    C     9
## 6:  0 <NA>    14
## 7:  4 <NA>    13
## 8:  1 <NA>    14
## 9: NA <NA>    41
##    grouping V1   V4 SumV2 N
## 1:        0  0    A    14 3
## 2:        0  4    B     8 2
## 3:        0  1    B     5 1
## 4:        0  4    C     5 1
## 5:        0  1    C     9 2
## 6:        1  0 <NA>    14 3
## 7:        1  4 <NA>    13 3
## 8:        1  1 <NA>    14 3
## 9:        3 NA <NA>    41 9
##     grouping V1   V4 SumV2 N
##  1:        0  0    A    14 3
##  2:        0  4    B     8 2
##  3:        0  1    B     5 1
##  4:        0  4    C     5 1
##  5:        0  1    C     9 2
##  6:        1  0 <NA>    14 3
##  7:        1  4 <NA>    13 3
##  8:        1  1 <NA>    14 3
##  9:        2 NA    A    14 3
## 10:        2 NA    B    13 3
## 11:        2 NA    C    14 3
## 12:        3 NA <NA>    41 9
##    grouping V1   V4 SumV2 N
## 1:        1  0 <NA>    14 3
## 2:        1  4 <NA>    13 3
## 3:        1  1 <NA>    14 3
## 4:        0  0    A    14 3
## 5:        0  4    B     8 2
## 6:        0  1    B     5 1
## 7:        0  4    C     5 1
## 8:        0  1    C     9 2

dd = pd.pivot_table(DF, values=['V2'], \
        index =['V1'], columns=['V4'], aggfunc = np.sum, margins = True)
pd.melt(dd, col_level = 1)
# ?

##      V4  value
## 0     A   14.0
## 1     A    NaN
## 2     A    NaN
## 3     A   14.0
## 4     B    NaN
## 5     B    5.0
## 6     B    8.0
## 7     B   13.0
## 8     C    NaN
## 9     C    9.0
## 10    C    5.0
## 11    C   14.0
## 12  All   14.0
## 13  All   14.0
## 14  All   13.0
## 15  All   41.0

Miscellaneous

Read / Write data

Write data to a csv file

fwrite(DT, "DT.csv")

DF.to_csv('DF.csv', index = False)

Write data to a tab-delimited file

fwrite(DT, "DT.txt", sep = "\t")

DF.to_csv('DF.txt', sep='\t', index = False)

Write list-column data to a csv file

fwrite(setDT(list(0, list(1:5))), "DT2.csv")

pd.DataFrame({'V1':0, 'V2':[[1,2,3,4,5]]}).to_csv('DF2.csv', index = False)

Read a csv / tab-delimited file

fread("DT.csv")
# fread("DT.csv", verbose = TRUE) # full details
fread("DT.txt", sep = "\t")

##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9

pd.read_csv('DF.csv')

##   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7
## 3  B  16   0
## 4  B  16   8
## 5  B   1   5
## 6  C  16   5
## 7  C   1   0
## 8  C   1   9

pd.read_csv('DF.txt', sep = '\t')

##   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7
## 3  B  16   0
## 4  B  16   8
## 5  B   1   5
## 6  C  16   5
## 7  C   1   0
## 8  C   1   9

Read a csv file selecting / droping columns

fread("DT.csv", select = c("V1", "V4"))
fread("DT.csv", drop = "V4")

##    V1 V4
## 1:  0  A
## 2:  0  A
## 3:  0  A
## 4:  4  B
## 5:  4  B
## 6:  1  B
## 7:  4  C
## 8:  1  C
## 9:  1  C
##    V1 V2
## 1:  0  3
## 2:  0  4
## 3:  0  7
## 4:  4  0
## 5:  4  8
## 6:  1  5
## 7:  4  5
## 8:  1  0
## 9:  1  9

pd.read_csv('DF.csv', usecols=['V1', 'V4'])

##   V4  V1
## 0  A   0
## 1  A   0
## 2  A   0
## 3  B  16
## 4  B  16
## 5  B   1
## 6  C  16
## 7  C   1
## 8  C   1

pd.read_csv('DF.csv', usecols = lambda x: x != 'V4')

##    V1  V2
## 0   0   3
## 1   0   4
## 2   0   7
## 3  16   0
## 4  16   8
## 5   1   5
## 6  16   5
## 7   1   0
## 8   1   9

Read and rbind several files

rbindlist(lapply(c("DT.csv", "DT.csv"), fread))
# c("DT.csv", "DT.csv") %>% lapply(fread) %>% rbindlist

##     V4 V1 V2
##  1:  A  0  3
##  2:  A  0  4
##  3:  A  0  7
##  4:  B  4  0
##  5:  B  4  8
##  6:  B  1  5
##  7:  C  4  5
##  8:  C  1  0
##  9:  C  1  9
## 10:  A  0  3
## 11:  A  0  4
## 12:  A  0  7
## 13:  B  4  0
## 14:  B  4  8
## 15:  B  1  5
## 16:  C  4  5
## 17:  C  1  0
## 18:  C  1  9

li = [pd.read_csv(fi) for fi in ['DF.csv', 'DF.csv']]
pd.concat(li, axis = 0, ignore_index = True)

##    V4  V1  V2
## 0   A   0   3
## 1   A   0   4
## 2   A   0   7
## 3   B  16   0
## 4   B  16   8
## 5   B   1   5
## 6   C  16   5
## 7   C   1   0
## 8   C   1   9
## 9   A   0   3
## 10  A   0   4
## 11  A   0   7
## 12  B  16   0
## 13  B  16   8
## 14  B   1   5
## 15  C  16   5
## 16  C   1   0
## 17  C   1   9

Reshape data

Melt data (from wide to long)

melt(DT, id.vars = "V4")
mDT <- melt(DT,
            id.vars       = "V4",
            measure.vars  = c("V1", "V2"),
            variable.name = "Variable",
            value.name    = "Value")

##     V4 variable value
##  1:  A       V1     0
##  2:  A       V1     0
##  3:  A       V1     0
##  4:  B       V1     4
##  5:  B       V1     4
##  6:  B       V1     1
##  7:  C       V1     4
##  8:  C       V1     1
##  9:  C       V1     1
## 10:  A       V2     3
## 11:  A       V2     4
## 12:  A       V2     7
## 13:  B       V2     0
## 14:  B       V2     8
## 15:  B       V2     5
## 16:  C       V2     5
## 17:  C       V2     0
## 18:  C       V2     9

pd.melt(DF, id_vars = ['V4'])

##    V4 variable  value
## 0   A       V1      0
## 1   A       V1      0
## 2   A       V1      0
## 3   B       V1     16
## 4   B       V1     16
## 5   B       V1      1
## 6   C       V1     16
## 7   C       V1      1
## 8   C       V1      1
## 9   A       V2      3
## 10  A       V2      4
## 11  A       V2      7
## 12  B       V2      0
## 13  B       V2      8
## 14  B       V2      5
## 15  C       V2      5
## 16  C       V2      0
## 17  C       V2      9

mDF = pd.melt(DF,
              id_vars    = 'V4',
              value_vars = ('V1', 'V2'), 
              var_name   = 'Variable',
              value_name = 'Value')

Cast data (from long to wide)

dcast(mDT, V4 ~ Variable) # aggregate by count

## Using 'Value' as value column. Use 'value.var' to override

## Aggregate function missing, defaulting to 'length'

dcast(mDT, V4 ~ Variable, fun.aggregate = sum)

## Using 'Value' as value column. Use 'value.var' to override

dcast(mDT, V4 ~ Value > 5)

## Using 'Value' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'

# see ?dcast: multiple values / fun.aggregate

##    V4 V1 V2
## 1:  A  3  3
## 2:  B  3  3
## 3:  C  3  3
##    V4 V1 V2
## 1:  A  0 14
## 2:  B  9 13
## 3:  C  6 14
##    V4 FALSE TRUE
## 1:  A     5    1
## 2:  B     5    1
## 3:  C     5    1

pd.crosstab(mDF.V4, mDF.Variable)

## Variable  V1  V2
## V4              
## A          3   3
## B          3   3
## C          3   3

pd.crosstab(mDF.V4, mDF.Variable, values = mDF.Value, aggfunc = 'sum')

## Variable  V1  V2
## V4              
## A          0  14
## B         33  13
## C         18  14

pd.crosstab(mDF.V4, mDF.Value > 5)

## Value  False  True 
## V4                 
## A          5      1
## B          3      3
## C          4      2

Split

split(DT, by = "V4") # S3 method

## $A
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 
## $B
##    V4 V1 V2
## 1:  B  4  0
## 2:  B  4  8
## 3:  B  1  5
## 
## $C
##    V4 V1 V2
## 1:  C  4  5
## 2:  C  1  0
## 3:  C  1  9

list(DF.groupby('V4'))

## [('A',   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7), ('B',   V4  V1  V2
## 4  B  16   0
## 5  B  16   8
## 3  B   1   5), ('C',   V4  V1  V2
## 8  C  16   5
## 6  C   1   0
## 7  C   1   9)]

Split and transpose a vector/column

vec <- c("A:a", "B:b", "C:c")
tstrsplit(vec, split = ":", keep = 2L) # works on vector
setDT(tstrsplit(vec, split = ":"))[]

## [[1]]
## [1] "a" "b" "c"
## 
##    V1 V2
## 1:  A  a
## 2:  B  b
## 3:  C  c

vec = ['A:a', 'B:b', 'C:c']
[i.split(':')[1] for i in vec]

## ['a', 'b', 'c']

pd.DataFrame([i.split(':') for i in vec])

##    0  1
## 0  A  a
## 1  B  b
## 2  C  c

Other

Check package installation

# test.data.table()
# There's more lines of test code in data.table than there is code!

pass

List data.tables/tibbles

tables()

##    NAME NROW NCOL MB              COLS KEY
## 1:   DT    9    3  0          V4,V1,V2    
## 2:  mDT   18    3  0 V4,Variable,Value    
## Total: 0MB

# %whos DataFrame # IPython
for i in dir():
    if type(globals()[i]) == pd.DataFrame:
        print(i)

## DF
## dd
## mDF

Get/Set number of threads when parallelized

getDTthreads() # setDTthreads()

## [1] 2

pass

Lead/Lag

shift(1:10, n = 1,   fill = NA, type = "lag")
shift(1:10, n = 1:2, fill = NA, type = "lag") # multiple
shift(1:10, n = 1,   fill = NA, type = "lead")

##  [1] NA  1  2  3  4  5  6  7  8  9
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
## 
##  [1]  2  3  4  5  6  7  8  9 10 NA

pd.Series(np.arange(1, 11)).shift(periods = 1)

## 0    NaN
## 1    1.0
## 2    2.0
## 3    3.0
## 4    4.0
## 5    5.0
## 6    6.0
## 7    7.0
## 8    8.0
## 9    9.0
## dtype: float64

[pd.Series(np.arange(1, 11)).shift(periods = i) for i in [1,2]]

## [0    NaN
## 1    1.0
## 2    2.0
## 3    3.0
## 4    4.0
## 5    5.0
## 6    6.0
## 7    7.0
## 8    8.0
## 9    9.0
## dtype: float64, 0    NaN
## 1    NaN
## 2    1.0
## 3    2.0
## 4    3.0
## 5    4.0
## 6    5.0
## 7    6.0
## 8    7.0
## 9    8.0
## dtype: float64]

pd.Series(np.arange(1, 11)).shift(periods = -1)

## 0     2.0
## 1     3.0
## 2     4.0
## 3     5.0
## 4     6.0
## 5     7.0
## 6     8.0
## 7     9.0
## 8    10.0
## 9     NaN
## dtype: float64

Generate run-length ids

rleid(rep(c("a", "b", "a"), each = 3)) # see also ?rleidv
rleid(rep(c("a", "b", "a"), each = 3), prefix = "G")

## [1] 1 1 1 2 2 2 3 3 3
## [1] "G1" "G1" "G1" "G2" "G2" "G2" "G3" "G3" "G3"

pass

Vectorised `ifelse` statements

x <- 1:10
fcase(
  x %% 6 == 0, "fizz buzz",
  x %% 2 == 0, "fizz",
  x %% 3 == 0, "buzz",
  default = NA_character_
)

##  [1] NA          "fizz"      "buzz"      "fizz"      NA          "fizz buzz"
##  [7] NA          "fizz"      "buzz"      "fizz"

pass

Rolling functions

# todo

pass

Join/Bind data sets

Join

x <- data.table(Id  = c("A", "B", "C", "C"),
                X1  = c(1L, 3L, 5L, 7L),
                XY  = c("x2", "x4", "x6", "x8"),
                key = "Id")

y <- data.table(Id  = c("A", "B", "B", "D"),
                Y1  = c(1L, 3L, 5L, 7L),
                XY  = c("y1", "y3", "y5", "y7"),
                key = "Id")

x = pd.DataFrame(
  {"Id" : ['A', 'B', 'C', 'C'],
   "X1" : [1, 3, 5, 7],
   "XY" : ['x2', 'x4', 'x6', 'x8']})
y = pd.DataFrame(
  {"Id" : ['A', 'B', 'B', 'D'],
   "Y1" : [1, 3, 5, 7],
   "XY" : ['y1', 'y3', 'y5', 'y7']})

Join matching rows from y to x

y[x, on = "Id"]
merge(x, y, all.x = TRUE, by = "Id")
y[x] # requires keys

##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8

pd.merge(x, y, how = 'left', on = 'Id')

##   Id  X1 XY_x   Y1 XY_y
## 0  A   1   x2  1.0   y1
## 1  B   3   x4  3.0   y3
## 2  B   3   x4  5.0   y5
## 3  C   5   x6  NaN  NaN
## 4  C   7   x8  NaN  NaN

Join matching rows from x to y

x[y, on = "Id"]
merge(x, y, all.y = TRUE, by = "Id")
x[y] # requires keys

##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7

pd.merge(x, y, how = 'right', on = 'Id')

##   Id   X1 XY_x  Y1 XY_y
## 0  A  1.0   x2   1   y1
## 1  B  3.0   x4   3   y3
## 2  B  3.0   x4   5   y5
## 3  D  NaN  NaN   7   y7

Join matching rows from both x and y

x[y, on = "Id", nomatch = 0]
merge(x, y)
x[y, nomatch = 0] # requires keys

##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5

pd.merge(x, y, how = 'inner', on = 'Id')

##   Id  X1 XY_x  Y1 XY_y
## 0  A   1   x2   1   y1
## 1  B   3   x4   3   y3
## 2  B   3   x4   5   y5

Join keeping all the rows

merge(x, y, all = TRUE, by = "Id")

##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
## 6:  D NA <NA>  7   y7

pd.merge(x, y, how = 'outer', on = 'Id')

##   Id   X1 XY_x   Y1 XY_y
## 0  A  1.0   x2  1.0   y1
## 1  B  3.0   x4  3.0   y3
## 2  B  3.0   x4  5.0   y5
## 3  C  5.0   x6  NaN  NaN
## 4  C  7.0   x8  NaN  NaN
## 5  D  NaN  NaN  7.0   y7

Return rows from x matching y

unique(x[y$Id, on = "Id", nomatch = 0])
unique(x[y$Id, nomatch = 0]) # requires keys

##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4

x[x.Id.isin(y.Id)]

##   Id  X1  XY
## 0  A   1  x2
## 1  B   3  x4

Return rows from x not matching y

x[!y, on = "Id"]
x[!y] # requires keys

##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8
##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8

x[~x.Id.isin(y.Id)]

##   Id  X1  XY
## 2  C   5  x6
## 3  C   7  x8

More joins

Select columns while joining

x[y, .(Id, X1, i.XY)]   # i. prefix refers to cols in y
x[y, .(Id, x.XY, i.XY)] # x. prefix refers to cols in x

##    Id X1 i.XY
## 1:  A  1   y1
## 2:  B  3   y3
## 3:  B  3   y5
## 4:  D NA   y7
##    Id x.XY i.XY
## 1:  A   x2   y1
## 2:  B   x4   y3
## 3:  B   x4   y5
## 4:  D <NA>   y7

pd.merge(x[['Id','X1']],
         y[['Id','XY']],
         how = 'right',
         on  = 'Id')

##   Id   X1  XY
## 0  A  1.0  y1
## 1  B  3.0  y3
## 2  B  3.0  y5
## 3  D  NaN  y7

pd.merge(x[['Id','XY']],
         y[['Id','XY']],
         how = 'right',
         on  = 'Id')

##   Id XY_x XY_y
## 0  A   x2   y1
## 1  B   x4   y3
## 2  B   x4   y5
## 3  D  NaN   y7

Aggregate columns while joining

y[x, .(X1Y1 = sum(Y1) * X1), by = .EACHI]

##    Id X1Y1
## 1:  A    1
## 2:  B   24
## 3:  C   NA
## 4:  C   NA

(y.groupby('Id')['Y1']
    .agg('sum')
    .to_frame()
    .rename(columns={'Y1': 'SumY1'})
    .merge(x, how = 'right', on = 'Id')
    .assign(X1Y1 = lambda df: df.SumY1 * df.X1)
    .loc[:, ['Id', 'X1Y1']])

##   Id  X1Y1
## 0  A   1.0
## 1  B  24.0
## 2  C   NaN
## 3  C   NaN

Update columns while joining

y[x, SqX1 := i.X1^2]
y[, SqX1 := x[.BY, X1^2, on = "Id"], by = Id] # more memory-efficient
y[, SqX1 := NULL] # rm column for consistency

(x.loc[:, ['Id', 'X1']]
    .assign(SqX1 = lambda df: df.X1**2)
    .merge(y, how = 'right', on = 'Id')
    .loc[:, list(y.columns) + ['SqX1']])

##   Id  Y1  XY  SqX1
## 0  A   1  y1   1.0
## 1  B   3  y3   9.0
## 2  B   5  y5   9.0
## 3  D   7  y7   NaN

Adds a list column with rows from y matching x (nest-join)

x[, y := .(.(y[.BY, on = "Id"])), by = Id]
x[, y := NULL] # rm column for consistency

pass

Update columns while joining (using vectors of colnames)

cols  <- c("NewXY", "NewX1")
icols <- paste0("i.", c("XY", "X1"))

y[x, (cols) := mget(icols)]

y[, (cols) := NULL] # rm columns for consistency

pass

Join passing columns to match in the `on` argument

z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x

##    ID Z1 Z2
## 1:  C  5 z5
## 2:  C  6 z6
## 3:  C  7 z7
## 4:  C  8 z8
## 5:  C  9 z9
##    Id X1 XY X2
## 1:  A  1 x2 x1
## 2:  B  3 x4 x3
## 3:  C  5 x6 x5
## 4:  C  7 x8 x7

x[z, on = "X1 == Z1"]
x[z, on = .(X1 == Z1)] # same
x[z, on = .(Id == ID, X1 == Z1)] # using two columns

##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x6   x5 z5
## 2:  C  6 <NA> <NA> z6
## 3:  C  7   x8   x7 z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9

z = pd.DataFrame(
  {"ID" : ['C', 'C', 'C', 'C', 'C'],
   "Z1" : [5, 6, 7, 8, 9],
   "Z2" : ['z5', 'z6', 'z7', 'z8', 'z9']})
x = x.assign(X2 = 'x' + x.X1.astype(str))
#
pd.merge(x, z, how = 'right', left_on = 'X1', right_on = 'Z1')

##     Id   X1   XY   X2 ID  Z1  Z2
## 0    C  5.0   x6   x5  C   5  z5
## 1  NaN  NaN  NaN  NaN  C   6  z6
## 2    C  7.0   x8   x7  C   7  z7
## 3  NaN  NaN  NaN  NaN  C   8  z8
## 4  NaN  NaN  NaN  NaN  C   9  z9

pd.merge(x, z, how = 'right', left_on = ['Id', 'X1'], right_on = ['ID', 'Z1'])

##     Id   X1   XY   X2 ID  Z1  Z2
## 0    C  5.0   x6   x5  C   5  z5
## 1  NaN  NaN  NaN  NaN  C   6  z6
## 2    C  7.0   x8   x7  C   7  z7
## 3  NaN  NaN  NaN  NaN  C   8  z8
## 4  NaN  NaN  NaN  NaN  C   9  z9

Non-equi joins

x[z, on = .(Id == ID, X1 <= Z1)]
x[z, on = .(Id == ID, X1 > Z1)]
x[z, on = .(X1 < Z1), allow.cartesian = TRUE] # allows 'numerous' matching values

##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x6 x5 z7
## 4:  C  7 x8 x7 z7
## 5:  C  8 x6 x5 z8
## 6:  C  8 x8 x7 z8
## 7:  C  9 x6 x5 z9
## 8:  C  9 x8 x7 z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x8   x7 z5
## 2:  C  6   x8   x7 z6
## 3:  C  7 <NA> <NA> z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9
##     Id X1 XY X2 ID Z2
##  1:  A  5 x2 x1  C z5
##  2:  B  5 x4 x3  C z5
##  3:  A  6 x2 x1  C z6
##  4:  B  6 x4 x3  C z6
##  5:  C  6 x6 x5  C z6
##  6:  A  7 x2 x1  C z7
##  7:  B  7 x4 x3  C z7
##  8:  C  7 x6 x5  C z7
##  9:  A  8 x2 x1  C z8
## 10:  B  8 x4 x3  C z8
## 11:  C  8 x6 x5  C z8
## 12:  C  8 x8 x7  C z8
## 13:  A  9 x2 x1  C z9
## 14:  B  9 x4 x3  C z9
## 15:  C  9 x6 x5  C z9
## 16:  C  9 x8 x7  C z9

(pd.merge(x, z, how = 'right', left_on = 'Id', right_on = 'ID')
    .query('X1 <= Z1')
    .eval('X1 = Z1')
    .drop(columns = ['ID', 'Z1']))
# ?
# ?

##   Id  X1  XY  X2  Z2
## 0  C   5  x6  x5  z5
## 2  C   6  x6  x5  z6
## 4  C   7  x6  x5  z7
## 5  C   7  x8  x7  z7
## 6  C   8  x6  x5  z8
## 7  C   8  x8  x7  z8
## 8  C   9  x6  x5  z9
## 9  C   9  x8  x7  z9

Rolling joins/subsets (performed on the last numeric column)

# Nearest
x[z, on = .(Id == ID, X1 == Z1), roll = "nearest"]
## below, simplified examples with ad hoc subsets on a keyed data.table
setkey(x, Id, X1)
x[.("C", 5:9), roll = "nearest"]

##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x8 x7 z7
## 4:  C  8 x8 x7 z8
## 5:  C  9 x8 x7 z9
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7

pass

# Last Observation Carried Forward
x[.("C", 5:9), roll = Inf]
x[.("C", 5:9), roll = 0.5]  # bounded
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, TRUE)]  # default
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, FALSE)] # ends not rolled

##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x6   x5
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>

pass

# Next Observation Carried Backward
x[.("C", 5:9), roll = -Inf]
x[.("C", 5:9), roll = -0.5] # bounded
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, FALSE)]
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, TRUE)]  # roll both ends

##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x8 x7
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7

pass

Cross join (`CJ` ~ `expand.grid`)

CJ(c(2, 1, 1), 3:2)
CJ(c(2, 1, 1), 3:2, sorted = FALSE, unique = TRUE)

##    V1 V2
## 1:  1  2
## 2:  1  2
## 3:  1  3
## 4:  1  3
## 5:  2  2
## 6:  2  3
##    V1 V2
## 1:  2  3
## 2:  2  2
## 3:  1  3
## 4:  1  2

(pd.DataFrame([(x, y) for x in [2, 1, 1] for y in [3, 2]],
              columns = ['V1', 'V2'])
   .sort_values(['V1', 'V2']))

##    V1  V2
## 3   1   2
## 5   1   2
## 2   1   3
## 4   1   3
## 1   2   2
## 0   2   3

(pd.DataFrame([(x, y) for x in [2, 1, 1] for y in [3, 2]],
              columns = ['V1', 'V2'])
   .drop_duplicates())

##    V1  V2
## 0   2   3
## 1   2   2
## 2   1   3
## 3   1   2

Bind

x <- data.table(1:3)
y <- data.table(4:6)
z <- data.table(7:9, 0L)

x = pd.DataFrame({"V1" : [1, 2, 3]})
y = pd.DataFrame({"V1" : [4, 5, 6]})
z = pd.DataFrame({"V1" : [7, 8, 9], "V2" : [0, 0, 0]})

Bind rows

rbind(x, y)
rbind(x, z, fill = TRUE)

##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
##    V1 V2
## 1:  1 NA
## 2:  2 NA
## 3:  3 NA
## 4:  7  0
## 5:  8  0
## 6:  9  0

pd.concat([x, y])

##    V1
## 0   1
## 1   2
## 2   3
## 0   4
## 1   5
## 2   6

pd.concat([x, z])

##    V1   V2
## 0   1  NaN
## 1   2  NaN
## 2   3  NaN
## 0   7  0.0
## 1   8  0.0
## 2   9  0.0

Bind rows using a list

rbindlist(list(x, y), idcol = TRUE)

##    .id V1
## 1:   1  1
## 2:   1  2
## 3:   1  3
## 4:   2  4
## 5:   2  5
## 6:   2  6

pass

Bind columns

base::cbind(x, y)

##    V1 V1
## 1:  1  4
## 2:  2  5
## 3:  3  6

pd.concat([x, y], axis = 1)

##    V1  V1
## 0   1   4
## 1   2   5
## 2   3   6

Set operations

x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))

Intersection

fintersect(x, y)
fintersect(x, y, all = TRUE)

##    V1
## 1:  2
## 2:  3
##    V1
## 1:  2
## 2:  2
## 3:  3

pass

Difference

fsetdiff(x, y)
fsetdiff(x, y, all = TRUE)

##    V1
## 1:  1
##    V1
## 1:  1
## 2:  3

pass

Union

funion(x, y)
funion(x, y, all = TRUE)

##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
##     V1
##  1:  1
##  2:  2
##  3:  2
##  4:  3
##  5:  3
##  6:  2
##  7:  2
##  8:  3
##  9:  4
## 10:  4

pass

Equality

fsetequal(x, x[order(-V1),])
all.equal(x, x) # S3 method

## [1] TRUE
## [1] TRUE

pass

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /home/xu/anaconda3/lib/libmkl_rt.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.13.0 knitr_1.26       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3      bookdown_0.16   lattice_0.20-41 digest_0.6.23  
##  [5] rappdirs_0.3.1  grid_3.6.3      jsonlite_1.6    magrittr_1.5   
##  [9] evaluate_0.14   blogdown_0.17   rlang_0.4.6     stringi_1.4.3  
## [13] Matrix_1.2-18   reticulate_1.16 rmarkdown_2.0   tools_3.6.3    
## [17] stringr_1.4.0   xfun_0.11       yaml_2.2.0      compiler_3.6.3 
## [21] htmltools_0.4.0

← → Top

A data.table and pandas small stroll

Introduction

About

Update

Create example data

Basic operations

Filter rows

Filter rows using indices

Discard rows using negative indices

Filter rows using a logical expression

Filter rows using multiple conditions

Filter unique rows

Discard rows with missing values

Other filters

Sort rows

Sort rows by column

Sort rows in decreasing order

Sort rows based on several columns

Select columns

Select one column using an index (not recommended)

Select one column using column name

Select several columns

Exclude columns

Select/Exclude columns using a character vector

Other selections

Summarise data

Summarise one column

Summarise several columns

Summarise several columns and assign column names

Summarise a subset of rows

Add/update/delete columns

Modify a column

Add one column

Add several columns

Create one column and remove the others

Remove one column

Remove several columns

Remove columns using a vector of colnames

Replace values for rows matching a condition

by

By group

By several groups

Calling function in by

Assigning column name in by

Using a condition in by

By on a subset of rows

Count number of observations for each group

Add a column with number of observations for each group

Retrieve the first/last/nth observation for each group

Going further

Advanced columns manipulation

Summarise all the columns

Summarise several columns

Summarise several columns by group

Summarise with more than one function by group

Summarise using a condition

Modify all the columns

Modify several columns (dropping the others)

Modify several columns (keeping the others)

Modify columns using a condition (dropping the others)

Modify columns using a condition (keeping the others)

Use a complex expression

Use multiple expressions (with DT[,{j}])

Chain expressions

Expression chaining using DT[][] (recommended)

Indexing and Keys

Set the key/index

Select the matching rows

Select the first matching row

Select the last matching row

Nomatch argument

Apply a function on the matching rows

Modify values for matching rows

Use keys in by

Set keys/indices for multiple columns

Subset using multiple keys/indices

Remove keys/indices

set*() modifications

Replace values

Reorder rows

Use multiple expressions (with `DT[,{j}]`)

Expression chaining using `DT[][]` (recommended)

`set*()` modifications

Vectorised `ifelse` statements

Join passing columns to match in the `on` argument

Cross join (`CJ` ~ `expand.grid`)