A data.table and pandas small stroll




Introduction

About

  • Last year, I published a post showing how to use the data.table and dplyr R packages to execute data munging tasks.
  • I have learned some Python lately and thought it would be a good exercise to “translate” the commands using pandas.
  • I am less familiar with Python and pandas than I am with R and data.table, and presenting the two libraries together in a logical way is a bit challenging. So, rather than going into convoluted explanations, I chose the easy (and time-saving) solution: only the code is presented, no further explanation is given - hence the title.
  • That’s all it is. Just a programming exercise, not an A-vs-B post. It is incomplete and not fully polished, but I publish it in case it can be useful to someone and to better keep track of the files.
  • The same R code is used below and in my “data.table and dplyr tour”. The Python code, available here, is structured and labeled in the same way.

Update

  • 2020-08-10: The Python code has been improved and completed. Thanks a lot to Samuel Oranyeli for his careful and insightful feedback.


Create example data

library(data.table)
set.seed(1L)

## Create a data table
DT <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                 V2 = 1:9,
                 V3 = c(0.5, 1.0, 1.5),
                 V4 = rep(LETTERS[1:3], 3))

class(DT)
DT
## [1] "data.table" "data.frame"
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C
import pandas as pd
import numpy as np

## Create a pandas DataFrame
DF = pd.DataFrame(
  {"V1" : [1, 2, 1, 2, 1, 2, 1, 2, 1],
   "V2" : [1, 2, 3, 4, 5, 6, 7, 8, 9], 
   "V3" : [0.5, 1.0, 1.5] * 3, 
   "V4" : ['A', 'B', 'C'] * 3}) 
type(DF)
## <class 'pandas.core.frame.DataFrame'>
DF
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

Below, the data.table code uses DT and the pandas code uses DF.




Basic operations

Filter rows

Filter rows using indices

DT[3:4,]
DT[3:4] # same
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
DF.iloc[2:4]
##    V1  V2   V3 V4
## 2   1   3  1.5  C
## 3   2   4  0.5  A

Discard rows using negative indices

DT[!3:7,]
DT[-(3:7)] # same
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
DF[~DF.index.isin(range(2, 7))]
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 7   2   8  1.0  B
## 8   1   9  1.5  C

Filter rows using a logical expression

DT[V2 > 5]
DT[V4 %chin% c("A", "C")] # fast %in% for character
##    V1 V2  V3 V4
## 1:  2  6 1.5  C
## 2:  1  7 0.5  A
## 3:  2  8 1.0  B
## 4:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  1  3 1.5  C
## 3:  2  4 0.5  A
## 4:  2  6 1.5  C
## 5:  1  7 0.5  A
## 6:  1  9 1.5  C
DF[DF.V2 > 5]
##    V1  V2   V3 V4
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C
DF.query('V2>5')
##    V1  V2   V3 V4
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C
DF[DF.V4.isin(['A', 'C'])]
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 8   1   9  1.5  C

Filter rows using multiple conditions

DT[V1 == 1 & V4 == "A"]
# any logical criteria can be used
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  1  7 0.5  A
DF.loc[(DF.V1 == 1) & (DF.V4 == 'A')]
# DF.query("V1 == 1 and V4 == 'A'")
# any logical criteria can be used
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 6   1   7  0.5  A

Filter unique rows

unique(DT)
unique(DT, by = c("V1", "V4")) # returns all cols
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
DF.drop_duplicates()
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C
DF.drop_duplicates(subset = ['V1', 'V4'])
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C

Discard rows with missing values

na.omit(DT, cols = 1:4)  # fast S3 method with cols argument
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
## 4:  2  4 0.5  A
## 5:  1  5 1.0  B
## 6:  2  6 1.5  C
## 7:  1  7 0.5  A
## 8:  2  8 1.0  B
## 9:  1  9 1.5  C
DF.dropna(subset = DF.columns)
# dropna has more options
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C
DF.dropna(axis = 0, how = 'any', thresh = None, subset = None, inplace = False)
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C

Other filters

DT[sample(.N, 3)] # .N = nb of rows in DT
DT[sample(.N, .N / 2)]
DT[frankv(-V1, ties.method = "dense") < 2]
##    V1 V2  V3 V4
## 1:  1  9 1.5  C
## 2:  2  4 0.5  A
## 3:  1  7 0.5  A
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  5 1.0  B
## 4:  1  3 1.5  C
##    V1 V2  V3 V4
## 1:  2  2 1.0  B
## 2:  2  4 0.5  A
## 3:  2  6 1.5  C
## 4:  2  8 1.0  B
DF.sample(n = 3)
##    V1  V2   V3 V4
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 1   2   2  1.0  B
DF.sample(frac = 0.5)
# DF[DF['V1'].nlargest(1)] # no keep = 'all'
##    V1  V2   V3 V4
## 7   2   8  1.0  B
## 0   1   1  0.5  A
## 6   1   7  0.5  A
## 2   1   3  1.5  C
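The commented nlargest line above points at the missing keep-all-ties behaviour. A dense rank mirrors the frankv() call directly; this is a standalone sketch that rebuilds the example data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

# keep every row tied for the highest V1, like frankv(-V1, ties.method = "dense") < 2
res = DF[DF.V1.rank(method="dense", ascending=False) < 2]
```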
DT[V4 %like% "^B"]
DT[V2 %between% c(3, 5)]
DT[data.table::between(V2, 3, 5, incbounds = FALSE)]
DT[V2 %inrange% list(-1:1, 1:3)] # see also ?inrange
##    V1 V2 V3 V4
## 1:  2  2  1  B
## 2:  1  5  1  B
## 3:  2  8  1  B
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  4 0.5  A
## 3:  1  5 1.0  B
##    V1 V2  V3 V4
## 1:  2  4 0.5  A
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  2 1.0  B
## 3:  1  3 1.5  C
DF[DF.V4.str.startswith('B')] # pass a tuple to match several prefixes
##    V1  V2   V3 V4
## 1   2   2  1.0  B
## 4   1   5  1.0  B
## 7   2   8  1.0  B
DF[DF['V2'].between(3, 5)]
##    V1  V2   V3 V4
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
DF[DF['V2'].between(3, 5, inclusive = False)] # inclusive = "neither" in pandas >= 1.3
#DF[DF.V2 >= list(range(-1, 2)) & DF.V2 <= list(range(1,4))]
##    V1  V2   V3 V4
## 3   2   4  0.5  A
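The commented %inrange% translation above does not work as written. Assuming %inrange% keeps rows whose V2 falls in any of the paired intervals (which matches the R output), a standalone sketch on rebuilt data:

```python
import numpy as np
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

# V2 %inrange% list(-1:1, 1:3): pair the lower and upper bounds,
# then keep rows where V2 falls inside any interval
bounds = list(zip(range(-1, 2), range(1, 4)))   # [(-1, 1), (0, 2), (1, 3)]
mask = np.logical_or.reduce([DF.V2.between(lo, hi) for lo, hi in bounds])
res = DF[mask]
```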


Sort rows

Sort rows by column

DT[order(V3)]  # see also setorder
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  2  4 0.5  A
## 3:  1  7 0.5  A
## 4:  2  2 1.0  B
## 5:  1  5 1.0  B
## 6:  2  8 1.0  B
## 7:  1  3 1.5  C
## 8:  2  6 1.5  C
## 9:  1  9 1.5  C
DF.sort_values('V3')
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 3   2   4  0.5  A
## 6   1   7  0.5  A
## 1   2   2  1.0  B
## 4   1   5  1.0  B
## 7   2   8  1.0  B
## 2   1   3  1.5  C
## 5   2   6  1.5  C
## 8   1   9  1.5  C

Sort rows in decreasing order

DT[order(-V3)]
##    V1 V2  V3 V4
## 1:  1  3 1.5  C
## 2:  2  6 1.5  C
## 3:  1  9 1.5  C
## 4:  2  2 1.0  B
## 5:  1  5 1.0  B
## 6:  2  8 1.0  B
## 7:  1  1 0.5  A
## 8:  2  4 0.5  A
## 9:  1  7 0.5  A
DF.sort_values('V3', ascending = False)
##    V1  V2   V3 V4
## 2   1   3  1.5  C
## 5   2   6  1.5  C
## 8   1   9  1.5  C
## 1   2   2  1.0  B
## 4   1   5  1.0  B
## 7   2   8  1.0  B
## 0   1   1  0.5  A
## 3   2   4  0.5  A
## 6   1   7  0.5  A

Sort rows based on several columns

DT[order(V1, -V2)]
##    V1 V2  V3 V4
## 1:  1  9 1.5  C
## 2:  1  7 0.5  A
## 3:  1  5 1.0  B
## 4:  1  3 1.5  C
## 5:  1  1 0.5  A
## 6:  2  8 1.0  B
## 7:  2  6 1.5  C
## 8:  2  4 0.5  A
## 9:  2  2 1.0  B
DF.sort_values(['V1', 'V2'], ascending = [True, False])
##    V1  V2   V3 V4
## 8   1   9  1.5  C
## 6   1   7  0.5  A
## 4   1   5  1.0  B
## 2   1   3  1.5  C
## 0   1   1  0.5  A
## 7   2   8  1.0  B
## 5   2   6  1.5  C
## 3   2   4  0.5  A
## 1   2   2  1.0  B


Select columns

Select one column using column name

DT[, list(V2)] # returns a data.table
DT[, .(V2)]    # returns a data.table
# . is an alias for list
DT[, "V2"]     # returns a data.table
DT[, V2]       # returns a vector
DT[["V2"]]     # returns a vector
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
##    V2
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
## 7:  7
## 8:  8
## 9:  9
## [1] 1 2 3 4 5 6 7 8 9
## [1] 1 2 3 4 5 6 7 8 9
DF.V2           # returns a pandas Series
## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## 6    7
## 7    8
## 8    9
## Name: V2, dtype: int64
DF['V2']        # returns a pandas Series
## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## 6    7
## 7    8
## 8    9
## Name: V2, dtype: int64
DF.loc[:, 'V2'] # returns a pandas Series
## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## 5    6
## 6    7
## 7    8
## 8    9
## Name: V2, dtype: int64
DF[['V2']]      # returns a DataFrame
##    V2
## 0   1
## 1   2
## 2   3
## 3   4
## 4   5
## 5   6
## 6   7
## 7   8
## 8   9

Select several columns

DT[, .(V2, V3, V4)]
DT[, list(V2, V3, V4)]
DT[, V2:V4] # select columns between V2 and V4
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
##    V2  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  3 1.5  C
## 4:  4 0.5  A
## 5:  5 1.0  B
## 6:  6 1.5  C
## 7:  7 0.5  A
## 8:  8 1.0  B
## 9:  9 1.5  C
DF[['V2', 'V3', 'V4']]
##    V2   V3 V4
## 0   1  0.5  A
## 1   2  1.0  B
## 2   3  1.5  C
## 3   4  0.5  A
## 4   5  1.0  B
## 5   6  1.5  C
## 6   7  0.5  A
## 7   8  1.0  B
## 8   9  1.5  C
DF.loc[:, 'V2':'V4'] # select columns between V2 and V4
##    V2   V3 V4
## 0   1  0.5  A
## 1   2  1.0  B
## 2   3  1.5  C
## 3   4  0.5  A
## 4   5  1.0  B
## 5   6  1.5  C
## 6   7  0.5  A
## 7   8  1.0  B
## 8   9  1.5  C

Exclude columns

DT[, !c("V2", "V3")]
##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C
DF.drop(columns = ['V2', 'V3'])
##    V1 V4
## 0   1  A
## 1   2  B
## 2   1  C
## 3   2  A
## 4   1  B
## 5   2  C
## 6   1  A
## 7   2  B
## 8   1  C

Select/Exclude columns using a character vector

cols <- c("V2", "V3")
DT[, ..cols] # .. prefix means 'one-level up'
DT[, !..cols] # or DT[, -..cols]
##    V2  V3
## 1:  1 0.5
## 2:  2 1.0
## 3:  3 1.5
## 4:  4 0.5
## 5:  5 1.0
## 6:  6 1.5
## 7:  7 0.5
## 8:  8 1.0
## 9:  9 1.5
##    V1 V4
## 1:  1  A
## 2:  2  B
## 3:  1  C
## 4:  2  A
## 5:  1  B
## 6:  2  C
## 7:  1  A
## 8:  2  B
## 9:  1  C
cols = ['V2', 'V3']
DF.loc[:, cols]
##    V2   V3
## 0   1  0.5
## 1   2  1.0
## 2   3  1.5
## 3   4  0.5
## 4   5  1.0
## 5   6  1.5
## 6   7  0.5
## 7   8  1.0
## 8   9  1.5
DF.loc[:, DF.columns.difference(cols)]
##    V1 V4
## 0   1  A
## 1   2  B
## 2   1  C
## 3   2  A
## 4   1  B
## 5   2  C
## 6   1  A
## 7   2  B
## 8   1  C

Other selections

cols <- paste0("V", 1:2)
cols <- union("V4", names(DT))
cols <- grep("V",   names(DT))
cols <- grep("3$",  names(DT))
cols <- grep(".2",  names(DT))
cols <- grep("^V1|X$",  names(DT))
cols <- grep("^(?!V2)", names(DT), perl = TRUE)
DT[, ..cols]
##    V1  V3 V4
## 1:  1 0.5  A
## 2:  2 1.0  B
## 3:  1 1.5  C
## 4:  2 0.5  A
## 5:  1 1.0  B
## 6:  2 1.5  C
## 7:  1 0.5  A
## 8:  2 1.0  B
## 9:  1 1.5  C
cols = ['V' + str(x) for x in [1, 2]] # then select with DF.loc[:, cols]
# ?
DF.filter(regex = 'V')
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1   2   2  1.0  B
## 2   1   3  1.5  C
## 3   2   4  0.5  A
## 4   1   5  1.0  B
## 5   2   6  1.5  C
## 6   1   7  0.5  A
## 7   2   8  1.0  B
## 8   1   9  1.5  C
DF.filter(regex = '3$')
##     V3
## 0  0.5
## 1  1.0
## 2  1.5
## 3  0.5
## 4  1.0
## 5  1.5
## 6  0.5
## 7  1.0
## 8  1.5
DF.filter(regex = '.2')
##    V2
## 0   1
## 1   2
## 2   3
## 3   4
## 4   5
## 5   6
## 6   7
## 7   8
## 8   9
DF.filter(regex = '^V1|X$')
##    V1
## 0   1
## 1   2
## 2   1
## 3   2
## 4   1
## 5   2
## 6   1
## 7   2
## 8   1
DF.filter(regex = '^(?!V2)')
##    V1   V3 V4
## 0   1  0.5  A
## 1   2  1.0  B
## 2   1  1.5  C
## 3   2  0.5  A
## 4   1  1.0  B
## 5   2  1.5  C
## 6   1  0.5  A
## 7   2  1.0  B
## 8   1  1.5  C
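The union("V4", names(DT)) selection is left unanswered (# ?) above. One hedged possibility is a plain list comprehension that moves V4 to the front while keeping each column once; a standalone sketch on rebuilt data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

# V4 first, then the remaining columns in their original order
cols = ['V4'] + [c for c in DF.columns if c != 'V4']
res = DF.loc[:, cols]
```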


Summarise data

Summarise one column

DT[, sum(V1)]    # returns a vector
DT[, .(sum(V1))] # returns a data.table
DT[, .(sumV1 = sum(V1))] # returns a data.table
## [1] 13
##    V1
## 1: 13
##    sumV1
## 1:    13
DF.V1.sum()   # returns a numpy scalar
## 13
DF[['V1']].sum()   # returns a pandas Series
## V1    13
## dtype: int64
DF.loc[:, ['V1']].agg(['sum']).set_axis(['sumV1'], axis = 1)
##      sumV1
## sum     13

Summarise several columns

DT[, .(sum(V1), sd(V3))]
##    V1        V2
## 1: 13 0.4330127
DF.agg({'V1': 'sum', 'V3': 'std'}) # series
## V1    13.000000
## V3     0.433013
## dtype: float64

Summarise several columns and assign column names

DT[, .(sumv1 = sum(V1),
       sdv3  = sd(V3))]
##    sumv1      sdv3
## 1:    13 0.4330127
DF.agg(sumV1 = ('V1','sum'), sdv3 = ('V3','std'))
#res = list(DF.agg({'V1': 'sum', 'V3': 'std'}))
#pd.DataFrame([res], columns=['sumV1','sdv3'])
##          V1        V3
## sumV1  13.0       NaN
## sdv3    NaN  0.433013

Summarise a subset of rows

DT[1:4, sum(V1)]
## [1] 6
DF.loc[:3, 'V1'].sum()
## 6
DT[, data.table::first(V3)]
DT[, data.table::last(V3)]
DT[5, V3]
DT[, uniqueN(V4)]
uniqueN(DT)
## [1] 0.5
## [1] 1.5
## [1] 1
## [1] 3
## [1] 9
DF.loc[0,'V3']
## 0.5
DF.at[DF.index[-1], 'V3']
# .at is preferred over .loc for scalar access
## 1.5
DF.V3.iloc[4]
## 1.0
DF['V4'].nunique()
## 3
len(DF.drop_duplicates())
## 9


Add/update/delete columns

Modify a column

DT[, V1 := V1^2]
DT
##    V1 V2  V3 V4
## 1:  1  1 0.5  A
## 2:  4  2 1.0  B
## 3:  1  3 1.5  C
## 4:  4  4 0.5  A
## 5:  1  5 1.0  B
## 6:  4  6 1.5  C
## 7:  1  7 0.5  A
## 8:  4  8 1.0  B
## 9:  1  9 1.5  C
DF.loc[:, 'V1'] = DF.loc[:, 'V1'] ** 2
#DF['V1'] = DF['V1']**2
DF.eval('V1 = V1**2', inplace = True)
# NB: both assignments above run, so V1 is squared twice (2 -> 16)
# and DF diverges from DT (where V1 is 4) in the outputs below
DF
##    V1  V2   V3 V4
## 0   1   1  0.5  A
## 1  16   2  1.0  B
## 2   1   3  1.5  C
## 3  16   4  0.5  A
## 4   1   5  1.0  B
## 5  16   6  1.5  C
## 6   1   7  0.5  A
## 7  16   8  1.0  B
## 8   1   9  1.5  C

Add one column

DT[, v5 := log(V1)][] # adding [] prints the result
##    V1 V2  V3 V4       v5
## 1:  1  1 0.5  A 0.000000
## 2:  4  2 1.0  B 1.386294
## 3:  1  3 1.5  C 0.000000
## 4:  4  4 0.5  A 1.386294
## 5:  1  5 1.0  B 0.000000
## 6:  4  6 1.5  C 1.386294
## 7:  1  7 0.5  A 0.000000
## 8:  4  8 1.0  B 1.386294
## 9:  1  9 1.5  C 0.000000
DF = DF.assign(v5 = np.log(DF.V1))

Add several columns

DT[, c("v6", "v7") := .(sqrt(V1), "X")]

DT[, ':='(v6 = sqrt(V1),
          v7 = "X")]     # same, functional form
DF = DF.assign(v6 = np.sqrt(DF.V1), v7 = 'X')

Create one column and remove the others

DT[, .(v8 = V3 + 1)]
##     v8
## 1: 1.5
## 2: 2.0
## 3: 2.5
## 4: 1.5
## 5: 2.0
## 6: 2.5
## 7: 1.5
## 8: 2.0
## 9: 2.5
DF.loc[:, ['V3']].add(1).rename(columns = {'V3':'V8'})
# pd.DataFrame({'v8' : DF.V3 + 1})
##     V8
## 0  1.5
## 1  2.0
## 2  2.5
## 3  1.5
## 4  2.0
## 5  2.5
## 6  1.5
## 7  2.0
## 8  2.5

Remove one column

DT[, v5 := NULL]
del DF['v5'] # or: DF = DF.drop(columns = 'v5')

Remove several columns

DT[, c("v6", "v7") := NULL]
DF = DF.drop(columns = ['v6', 'v7'])

Remove columns using a vector of colnames

cols <- c("V3")
DT[, (cols) := NULL] # ! not DT[, cols := NULL]
cols = 'V3'
del DF[cols]

Replace values for rows matching a condition

DT[V2 < 4, V2 := 0L]
DT
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C
DF.loc[DF.loc[:, 'V2'] < 4, 'V2'] = 0


by

By group

# one-liner:
DT[, .(sumV2 = sum(V2)), by = "V4"]
# reordered and indented:
DT[, by = V4,
     .(sumV2 = sum(V2))]
##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15
##    V4 sumV2
## 1:  A    11
## 2:  B    13
## 3:  C    15
(DF.groupby(['V4'], as_index = False)
  .agg(sumV2 = ('V2', 'sum')))
##   V4  sumV2
## 0  A     11
## 1  B     13
## 2  C     15

By several groups

DT[, keyby = .(V4, V1),
     .(sumV2 = sum(V2))]
##    V4 V1 sumV2
## 1:  A  1     7
## 2:  A  4     4
## 3:  B  1     5
## 4:  B  4     8
## 5:  C  1     9
## 6:  C  4     6
(DF.groupby(['V4', 'V1'], as_index = False)
  .agg(sumV2 = ('V2', 'sum')))
##   V4  V1  sumV2
## 0  A   1      7
## 1  A  16      4
## 2  B   1      5
## 3  B  16      8
## 4  C   1      9
## 5  C  16      6

Calling function in by

DT[, by = tolower(V4),
     .(sumV1 = sum(V1))]
##    tolower sumV1
## 1:       a     6
## 2:       b     9
## 3:       c     6
(DF.assign(V4 = DF.V4.str.lower())
  .groupby(['V4'], as_index = False)
  .agg(sumV1=('V1', 'sum')))
##   V4  sumV1
## 0  a     18
## 1  b     33
## 2  c     18

Assigning column name in by

DT[, keyby = .(abc = tolower(V4)),
     .(sumV1 = sum(V1))]
##    abc sumV1
## 1:   a     6
## 2:   b     9
## 3:   c     6
(DF.assign(abc = lambda x: x.V4.str.lower())
  .groupby(['abc'], as_index = False)
  .agg(sumV1 = ('V1', 'sum')))
##   abc  sumV1
## 0   a     18
## 1   b     33
## 2   c     18

Using a condition in by

DT[, keyby = V4 == "A",
     sum(V1)]
##       V4 V1
## 1: FALSE 15
## 2:  TRUE  6
DF.groupby(DF.V4 == 'A').V1.sum()
## V4
## False    51
## True     18
## Name: V1, dtype: int64

By on a subset of rows

DT[1:5,                # i
   .(sumV1 = sum(V1)), # j
   by = V4]            # by
## complete DT[i, j, by] expression!
##    V4 sumV1
## 1:  A     5
## 2:  B     5
## 3:  C     1
DF.iloc[:5].groupby('V4').agg(sumV1 = ('V1','sum'))
##     sumV1
## V4       
## A      17
## B      17
## C       1

Count number of observations for each group

DT[, .N, by = V4]
##    V4 N
## 1:  A 3
## 2:  B 3
## 3:  C 3
DF.groupby('V4').size()
## V4
## A    3
## B    3
## C    3
## dtype: int64

Add a column with number of observations for each group

DT[, n := .N, by = V1][]
DT[, n := NULL] # rm column for consistency
##    V1 V2 V4 n
## 1:  1  0  A 5
## 2:  4  0  B 4
## 3:  1  0  C 5
## 4:  4  4  A 4
## 5:  1  5  B 5
## 6:  4  6  C 4
## 7:  1  7  A 5
## 8:  4  8  B 4
## 9:  1  9  C 5
DF.assign(n = lambda x: x.groupby('V1').V4.transform('count'))
##    V1  V2 V4  n
## 0   1   0  A  5
## 1  16   0  B  4
## 2   1   0  C  5
## 3  16   4  A  4
## 4   1   5  B  5
## 5  16   6  C  4
## 6   1   7  A  5
## 7  16   8  B  4
## 8   1   9  C  5

Retrieve the first/last/nth observation for each group

DT[, data.table::first(V2), by = V4]
DT[, data.table::last(V2), by = V4]
DT[, V2[2], by = V4]
##    V4 V1
## 1:  A  0
## 2:  B  0
## 3:  C  0
##    V4 V1
## 1:  A  7
## 2:  B  8
## 3:  C  9
##    V4 V1
## 1:  A  4
## 2:  B  5
## 3:  C  6
DF.groupby('V4')['V2'].first()
## V4
## A    0
## B    0
## C    0
## Name: V2, dtype: int64
DF.groupby('V4')['V2'].last()
## V4
## A    7
## B    8
## C    9
## Name: V2, dtype: int64
DF.groupby('V4')['V2'].nth(1) # 0-based
## V4
## A    4
## B    5
## C    6
## Name: V2, dtype: int64



Going further

Advanced columns manipulation

Summarise all the columns

DT[, lapply(.SD, max)]
##    V1 V2 V4
## 1:  4  9  C
DF.agg(['max'])
##      V1  V2 V4
## max  16   9  C

Summarise several columns

DT[, lapply(.SD, mean),
     .SDcols = c("V1", "V2")]
##          V1       V2
## 1: 2.333333 4.333333
DF.loc[:, ['V1', 'V2']].agg(['mean'])
##             V1        V2
## mean  7.666667  4.333333

Summarise several columns by group

DT[, by = V4,
     lapply(.SD, mean),
     .SDcols = c("V1", "V2")]
## using patterns (regex)
DT[, by = V4,
     lapply(.SD, mean),
     .SDcols = patterns("V1|V2|Z0")]
##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000
##    V4 V1       V2
## 1:  A  2 3.666667
## 2:  B  3 4.333333
## 3:  C  2 5.000000
(DF.groupby('V4')
  .agg(**{f'{col}_mean':(col, 'mean') for col in ['V1', 'V2']}))
## using regex
##     V1_mean   V2_mean
## V4                   
## A         6  3.666667
## B        11  4.333333
## C         6  5.000000
(DF.filter(regex=('V1|V2|V4'))
   .groupby('V4')
   .apply(np.mean, axis = 0)
   .reset_index())
##   V4    V1        V2
## 0  A   6.0  3.666667
## 1  B  11.0  4.333333
## 2  C   6.0  5.000000

Summarise with more than one function by group

DT[, by = V4,
     c(lapply(.SD, sum),
       lapply(.SD, mean))]
##    V4 V1 V2 V1       V2
## 1:  A  6 11  2 3.666667
## 2:  B  9 13  3 4.333333
## 3:  C  6 15  2 5.000000
DF.groupby('V4').agg(['sum', 'mean'])
##     V1       V2          
##    sum mean sum      mean
## V4                       
## A   18    6  11  3.666667
## B   33   11  13  4.333333
## C   18    6  15  5.000000
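The pandas result above has hierarchical columns, unlike data.table's flat output. A common follow-up (not in the original post) is to flatten the MultiIndex into col_fn names; a standalone sketch on rebuilt data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

# aggregate with two functions, then join the two column levels with '_'
res = DF.groupby('V4').agg(['sum', 'mean'])
res.columns = ['_'.join(pair) for pair in res.columns]
```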

Summarise using a condition

DT[, lapply(.SD, mean),
     .SDcols = is.numeric]
foo <- function(x) {is.numeric(x) && mean(x) > 3}
DT[, lapply(.SD, mean),
   .SDcols = foo]
##          V1       V2
## 1: 2.333333 4.333333
##          V2
## 1: 4.333333
DF.select_dtypes(include = 'number').agg(['mean'])
##             V1        V2
## mean  7.666667  4.333333

Modify all the columns

DT[, lapply(.SD, rev)]
##    V1 V2 V4
## 1:  1  9  C
## 2:  4  8  B
## 3:  1  7  A
## 4:  4  6  C
## 5:  1  5  B
## 6:  4  4  A
## 7:  1  0  C
## 8:  4  0  B
## 9:  1  0  A
DF.iloc[::-1]
#DF.apply(lambda x: x[::-1], axis = 0)
##    V1  V2 V4
## 8   1   9  C
## 7  16   8  B
## 6   1   7  A
## 5  16   6  C
## 4   1   5  B
## 3  16   4  A
## 2   1   0  C
## 1  16   0  B
## 0   1   0  A

Modify several columns (dropping the others)

DT[, lapply(.SD, sqrt),
     .SDcols = V1:V2]
DT[, lapply(.SD, exp),
     .SDcols = !"V4"]
##    V1       V2
## 1:  1 0.000000
## 2:  2 0.000000
## 3:  1 0.000000
## 4:  2 2.000000
## 5:  1 2.236068
## 6:  2 2.449490
## 7:  1 2.645751
## 8:  2 2.828427
## 9:  1 3.000000
##           V1         V2
## 1:  2.718282    1.00000
## 2: 54.598150    1.00000
## 3:  2.718282    1.00000
## 4: 54.598150   54.59815
## 5:  2.718282  148.41316
## 6: 54.598150  403.42879
## 7:  2.718282 1096.63316
## 8: 54.598150 2980.95799
## 9:  2.718282 8103.08393
DF.filter(['V1', 'V2']).agg(np.sqrt)
##     V1        V2
## 0  1.0  0.000000
## 1  4.0  0.000000
## 2  1.0  0.000000
## 3  4.0  2.000000
## 4  1.0  2.236068
## 5  4.0  2.449490
## 6  1.0  2.645751
## 7  4.0  2.828427
## 8  1.0  3.000000
DF.filter(regex = '[^V4]').agg(np.exp) # matches names with a char other than 'V' or '4'
##              V1           V2
## 0  2.718282e+00     1.000000
## 1  8.886111e+06     1.000000
## 2  2.718282e+00     1.000000
## 3  8.886111e+06    54.598150
## 4  2.718282e+00   148.413159
## 5  8.886111e+06   403.428793
## 6  2.718282e+00  1096.633158
## 7  8.886111e+06  2980.957987
## 8  2.718282e+00  8103.083928

Modify several columns (keeping the others)

DT[, c("V1", "V2") := lapply(.SD, sqrt),
     .SDcols = c("V1", "V2")]

cols <- setdiff(names(DT), "V4")
DT[, (cols) := lapply(.SD, "^", 2L),
     .SDcols = cols]
DT
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C
DF.loc[:, 'V1':'V2'] = DF.filter(['V1', 'V2']).agg(np.sqrt)
cols = DF.columns.difference(['V4'])
DF.loc[:, cols] = DF.filter(cols).agg(lambda x: pow(x,2))
DF
##      V1   V2 V4
## 0   1.0  0.0  A
## 1  16.0  0.0  B
## 2   1.0  0.0  C
## 3  16.0  4.0  A
## 4   1.0  5.0  B
## 5  16.0  6.0  C
## 6   1.0  7.0  A
## 7  16.0  8.0  B
## 8   1.0  9.0  C

Modify columns using a condition (dropping the others)

DT[, .SD - 1,
     .SDcols = is.numeric]
DT
##    V1 V2
## 1:  0 -1
## 2:  3 -1
## 3:  0 -1
## 4:  3  3
## 5:  0  4
## 6:  3  5
## 7:  0  6
## 8:  3  7
## 9:  0  8
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  0  B
## 3:  1  0  C
## 4:  4  4  A
## 5:  1  5  B
## 6:  4  6  C
## 7:  1  7  A
## 8:  4  8  B
## 9:  1  9  C
DF.select_dtypes("number").sub(1)
#cols = DF.select_dtypes(include = [np.number]).columns
#DF.loc[:, cols].apply(lambda x: x - 1)
##      V1   V2
## 0   0.0 -1.0
## 1  15.0 -1.0
## 2   0.0 -1.0
## 3  15.0  3.0
## 4   0.0  4.0
## 5  15.0  5.0
## 6   0.0  6.0
## 7  15.0  7.0
## 8   0.0  8.0

Modify columns using a condition (keeping the others)

DT[, (cols) := lapply(.SD, as.integer),
     .SDcols = is.numeric]
DF.loc[:, cols] = DF.loc[:, cols].astype(int)
# astype(int) truncates: sqrt(6)**2 lands just under 6, hence the 5 in row 5 below
DF
##    V1  V2 V4
## 0   1   0  A
## 1  16   0  B
## 2   1   0  C
## 3  16   4  A
## 4   1   5  B
## 5  16   5  C
## 6   1   7  A
## 7  16   8  B
## 8   1   9  C

Use a complex expression

DT[, by = V4,
     .(V1[1:2], "X")]
##    V4 V1 V2
## 1:  A  1  X
## 2:  A  4  X
## 3:  B  4  X
## 4:  B  1  X
## 5:  C  1  X
## 6:  C  4  X
(DF.groupby('V4')
  .head(2)
  .assign(V2 = 'X')
  .sort_values('V4'))
##    V1 V2 V4
## 0   1  X  A
## 3  16  X  A
## 1  16  X  B
## 4   1  X  B
## 2   1  X  C
## 5  16  X  C

Use multiple expressions (with DT[,{j}])

DT[, {print(V1) #  comments here!
      print(summary(V1))
      x <- V1 + sum(V2)
     .(A = 1:.N, B = x) # last list returned as a data.table
     }]
## [1] 1 4 1 4 1 4 1 4 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.333   4.000   4.000 
##    A  B
## 1: 1 39
## 2: 2 42
## 3: 3 39
## 4: 4 42
## 5: 5 39
## 6: 6 42
## 7: 7 39
## 8: 8 42
## 9: 9 39
pass
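pandas has no direct counterpart to data.table's {j} blocks. One hedged workaround is to wrap the steps in a throwaway function (multi_step is a made-up name) and pipe the frame through it; the values differ from the R output above because DT was modified along the tour, while this standalone sketch rebuilds the original data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

def multi_step(d):
    # side effects first, like print() inside {j}
    print(d.V1.tolist())
    print(d.V1.describe())
    x = d.V1 + d.V2.sum()
    # the returned frame plays the role of the final list in {j}
    return pd.DataFrame({'A': range(1, len(d) + 1), 'B': x.to_numpy()})

res = DF.pipe(multi_step)
```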


Chain expressions
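data.table chains queries as DT[...][...]; the idiomatic pandas counterpart is method chaining inside parentheses. A minimal sketch on a rebuilt copy of the original data (filter, then aggregate, in one expression):

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

# one chained expression: keep V2 > 2, then sum V2 by group
res = (DF[DF.V2 > 2]
         .groupby('V4', as_index=False)
         .agg(sumV2=('V2', 'sum')))
```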

Indexing and Keys

Set the key/index

setkey(DT, V4)
setindex(DT, V4)
DT
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  5  C
## 9:  1  9  C
DF.set_index('V4', drop = False, inplace = True)
DF.sort_index(inplace = True)
DF
##     V1  V2 V4
## V4           
## A    1   0  A
## A   16   4  A
## A    1   7  A
## B   16   0  B
## B    1   5  B
## B   16   8  B
## C    1   0  C
## C   16   5  C
## C    1   9  C

Select the matching rows

DT["A", on = "V4"]
DT[c("A", "C"), on = .(V4)] # same as on = "V4"
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4:  1  0  C
## 5:  4  5  C
## 6:  1  9  C
DF.loc['A']
##     V1  V2 V4
## V4           
## A    1   0  A
## A   16   4  A
## A    1   7  A
DF.loc[['A', 'C']]
##     V1  V2 V4
## V4           
## A    1   0  A
## A   16   4  A
## A    1   7  A
## C    1   0  C
## C   16   5  C
## C    1   9  C

Select the first matching row

DT["B", on = "V4", mult = "first"]
DT[c("B", "C"), on = "V4", mult = "first"]
##    V1 V2 V4
## 1:  4  0  B
##    V1 V2 V4
## 1:  4  0  B
## 2:  1  0  C
DF.loc['B'].head(1)
# ?
##     V1  V2 V4
## V4           
## B   16   0  B
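For the multi-key case marked # ? above, one hedged option is to select both labels and keep the head of each group; a standalone sketch that rebuilds and re-indexes the original data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

# index by V4, select both keys, then keep the first row per key
DFi = DF.set_index('V4', drop=False)
res = DFi.loc[['B', 'C']].groupby(level=0).head(1)
```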

Select the last matching row

DT["A", on = "V4", mult = "last"]
##    V1 V2 V4
## 1:  1  7  A
DF.loc['A'].tail(1)
##     V1  V2 V4
## V4           
## A    1   7  A

Nomatch argument

# (default) returns a row with "D" even if not found
DT[c("A", "D"), on = "V4", nomatch = NA]
# no rows for unmatched values
DT[c("A", "D"), on = "V4", nomatch = 0]
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
## 4: NA NA  D
##    V1 V2 V4
## 1:  1  0  A
## 2:  4  4  A
## 3:  1  7  A
# .loc with a list containing missing labels is deprecated in pandas
# ("Deprecate loc reindex listlike"), so the direct translations fail:
#DF.loc[['A', 'D']]
#DF.loc[['A', 'D']].dropna()
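Since .loc with missing labels no longer works, a join against a frame of lookup keys (keys is a made-up name) reproduces both nomatch behaviours; a standalone sketch on rebuilt data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

keys = pd.DataFrame({'V4': ['A', 'D']})
# nomatch = NA: a left join keeps the unmatched "D" key, filled with NaN
res_na = keys.merge(DF, on='V4', how='left')
# nomatch = 0: an inner join drops unmatched keys
res_inner = keys.merge(DF, on='V4', how='inner')
```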

Apply a function on the matching rows

DT[c("A", "C"), sum(V1), on = "V4"]
## [1] 12
DF.loc[['A', 'C'],'V1'].sum() 
## 36

Modify values for matching rows

DT["A", V1 := 0, on = "V4"]
DT
##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  1  5  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  4  5  C
## 9:  1  9  C
DF.loc['A', 'V1'] = 0
DF
##     V1  V2 V4
## V4           
## A    0   0  A
## A    0   4  A
## A    0   7  A
## B   16   0  B
## B    1   5  B
## B   16   8  B
## C    1   0  C
## C   16   5  C
## C    1   9  C

Use keys in by

DT[!"B", sum(V1), on = "V4", by = .EACHI]
DT[V4 != "B",
   by = V4,
   sum(V1)]   # same
##    V4 V1
## 1:  A  0
## 2:  C  6
##    V4 V1
## 1:  A  0
## 2:  C  6
(DF.query("index != 'B'")
  .groupby(level=0)
  .agg({'V1':np.sum}))
#DF.loc[~(DF.index == 'B')].groupby(level = 0).agg({'V1':'sum'})
##     V1
## V4    
## A    0
## C   18

Set keys/indices for multiple columns

setkey(DT, V4, V1) # or setkeyv(DT, c("V4", "V1"))
setindex(DT, V4, V1) # setindexv(DT, c("V4", "V1"))
DT
##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C
DF.set_index(['V4', 'V1'], drop = False, inplace = True)
DF.sort_index(inplace = True)
DF
##        V1  V2 V4
## V4 V1           
## A  0    0   0  A
##    0    0   4  A
##    0    0   7  A
## B  1    1   5  B
##    16  16   0  B
##    16  16   8  B
## C  1    1   0  C
##    1    1   9  C
##    16  16   5  C

Subset using multiple keys/indices

DT[.("C", 1), on = .(V4, V1)]
DT[.(c("B", "C"), 1), on = .(V4, V1)]
# using which = TRUE only returns the matching rows indices
DT[.(c("B", "C"), 1), on = .(V4, V1), which = TRUE]
##    V1 V2 V4
## 1:  1  0  C
## 2:  1  9  C
##    V1 V2 V4
## 1:  1  5  B
## 2:  1  0  C
## 3:  1  9  C
## [1] 4 7 8
DF.loc[('C', 1)]
##        V1  V2 V4
## V4 V1           
## C  1    1   0  C
##    1    1   9  C
DF.loc[(('B', 'C'), 1), :]
#DF.query("V4 in ('B', 'C') and V1 == 1")
##        V1  V2 V4
## V4 V1           
## B  1    1   5  B
## C  1    1   0  C
##    1    1   9  C
list(np.where(DF.V4.isin(['B', 'C']) & (DF.V1 == 1))) # parentheses needed: & binds tighter than ==
## [array([3, 6, 7])]
DF.index.isin({'V4': ['B', 'C'], 'V1': [1]}) # wrong
## array([False, False, False, False, False, False, False, False, False])
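A membership test that does work on a multi-key index (unlike the dict attempt marked # wrong) passes tuples to Index.isin; a standalone sketch that rebuilds and re-indexes the original data:

```python
import pandas as pd

# rebuild the example data so the snippet runs on its own
DF = pd.DataFrame({"V1": [1, 2, 1, 2, 1, 2, 1, 2, 1],
                   "V2": range(1, 10),
                   "V3": [0.5, 1.0, 1.5] * 3,
                   "V4": list("ABC") * 3})

DFi = DF.set_index(['V4', 'V1'], drop=False).sort_index()
# tuples of key values, not a dict, are what MultiIndex.isin expects
res = DFi[DFi.index.isin([('B', 1), ('C', 1)])]
```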

Remove keys/indices

setkey(DT, NULL)
setindex(DT, NULL)
DT
##    V1 V2 V4
## 1:  0  0  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C
DF.reset_index(inplace = True, drop = True)
DF
##    V1  V2 V4
## 0   0   0  A
## 1   0   4  A
## 2   0   7  A
## 3   1   5  B
## 4  16   0  B
## 5  16   8  B
## 6   1   0  C
## 7   1   9  C
## 8  16   5  C


set*() modifications

Replace values

set(DT, i = 1L, j = 2L, value = 3L)
DT
##    V1 V2 V4
## 1:  0  3  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  1  5  B
## 5:  4  0  B
## 6:  4  8  B
## 7:  1  0  C
## 8:  1  9  C
## 9:  4  5  C
DF.iloc[0, 1] = 3
DF
##    V1  V2 V4
## 0   0   3  A
## 1   0   4  A
## 2   0   7  A
## 3   1   5  B
## 4  16   0  B
## 5  16   8  B
## 6   1   0  C
## 7   1   9  C
## 8  16   5  C

Reorder rows

setorder(DT, V4, -V1)
setorderv(DT, c("V4", "V1"), c(1, -1))
DF.sort_values(['V4','V1'], ascending = [True, False], inplace = True)

Modify colnames

setnames(DT, old = "V2", new = "v2")
setnames(DT, old = -(c(1, 3)), new = "V2")
DT
##    V1 V2 V4
## 1:  0  3  A
## 2:  0  4  A
## 3:  0  7  A
## 4:  4  0  B
## 5:  4  8  B
## 6:  1  5  B
## 7:  4  5  C
## 8:  1  0  C
## 9:  1  9  C
DF.rename(columns = {'V2':'v2'}, inplace = True)
cols = DF.columns.values
cols[1] = 'V2'
DF.columns = cols
DF
##    V1  V2 V4
## 0   0   3  A
## 1   0   4  A
## 2   0   7  A
## 4  16   0  B
## 5  16   8  B
## 3   1   5  B
## 8  16   5  C
## 6   1   0  C
## 7   1   9  C

Reorder columns

setcolorder(DT, c("V4", "V1", "V2"))
DT
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9
DF = DF[['V4', 'V1', 'V2']]
DF
##   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7
## 4  B  16   0
## 5  B  16   8
## 3  B   1   5
## 8  C  16   5
## 6  C   1   0
## 7  C   1   9

Convert data

# ?setDT # data.frame or list to data.table
# ?setDF # data.table to data.frame
# ?setattr # modify attributes
pass


Advanced use of by

Select first/last/… row by group

DT[, .SD[1], by = V4]
DT[, .SD[c(1, .N)], by = V4]
DT[, tail(.SD, 2), by = V4]
##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  4  5
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  7
## 3:  B  4  0
## 4:  B  1  5
## 5:  C  4  5
## 6:  C  1  9
##    V4 V1 V2
## 1:  A  0  4
## 2:  A  0  7
## 3:  B  4  8
## 4:  B  1  5
## 5:  C  1  0
## 6:  C  1  9
DF.groupby('V4').head(1)
##   V4  V1  V2
## 0  A   0   3
## 4  B  16   0
## 8  C  16   5
DF.groupby('V4').nth([0, -1])
##     V1  V2
## V4        
## A    0   3
## A    0   7
## B   16   0
## B    1   5
## C   16   5
## C    1   9
DF.groupby('V4').nth([-1, -2])
##     V1  V2
## V4        
## A    0   4
## A    0   7
## B   16   8
## B    1   5
## C    1   0
## C    1   9

Select rows using a nested query

DT[, .SD[which.min(V2)], by = V4]
##    V4 V1 V2
## 1:  A  0  3
## 2:  B  4  0
## 3:  C  1  0
DF.loc[DF.groupby('V4').V2.idxmin()]
##   V4  V1  V2
## 0  A   0   3
## 4  B  16   0
## 6  C   1   0

Add a group counter column

DT[, Grp := .GRP, by = .(V4, V1)][]
DT[, Grp := NULL] # delete for consistency
##    V4 V1 V2 Grp
## 1:  A  0  3   1
## 2:  A  0  4   1
## 3:  A  0  7   1
## 4:  B  4  0   2
## 5:  B  4  8   2
## 6:  B  1  5   3
## 7:  C  4  5   4
## 8:  C  1  0   5
## 9:  C  1  9   5
DF = DF.assign(Grp = DF.groupby(['V4', 'V1']).ngroup())
DF
##   V4  V1  V2  Grp
## 0  A   0   3    0
## 1  A   0   4    0
## 2  A   0   7    0
## 4  B  16   0    2
## 5  B  16   8    2
## 3  B   1   5    1
## 8  C  16   5    4
## 6  C   1   0    3
## 7  C   1   9    3
del DF['Grp']

Get row number of first (and last) observation by group

DT[, .I, by = V4] # returns a data.table
DT[, .I[1], by = V4]
DT[, .I[c(1, .N)], by = V4]
##    V4 I
## 1:  A 1
## 2:  A 2
## 3:  A 3
## 4:  B 4
## 5:  B 5
## 6:  B 6
## 7:  C 7
## 8:  C 8
## 9:  C 9
##    V4 V1
## 1:  A  1
## 2:  B  4
## 3:  C  7
##    V4 V1
## 1:  A  1
## 2:  A  3
## 3:  B  4
## 4:  B  6
## 5:  C  7
## 6:  C  9
pd.DataFrame(DF.groupby('V4').indices).melt()
##   variable  value
## 0        A      0
## 1        A      1
## 2        A      2
## 3        B      3
## 4        B      4
## 5        B      5
## 6        C      6
## 7        C      7
## 8        C      8
pd.DataFrame(DF.groupby('V4').indices).loc[0]
## A    0
## B    3
## C    6
## Name: 0, dtype: int64
pd.DataFrame(DF.groupby('V4').indices).iloc[[0,-1]].melt()
##   variable  value
## 0        A      0
## 1        A      2
## 2        B      3
## 3        B      5
## 4        C      6
## 5        C      8

Handle list-columns by group

DT[, .(.(V1)),  by = V4]  # return V1 as a list
DT[, .(.(.SD)), by = V4] # subsets of the data
##    V4    V1
## 1:  A 0,0,0
## 2:  B 4,4,1
## 3:  C 4,1,1
##    V4                V1
## 1:  A <data.table[3x2]>
## 2:  B <data.table[3x2]>
## 3:  C <data.table[3x2]>
DF.groupby('V4')['V1'].apply(list)
# ?
## V4
## A      [0, 0, 0]
## B    [16, 16, 1]
## C     [16, 1, 1]
## Name: V1, dtype: object
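The `# ?` above flags the missing counterpart of `DT[, .(.(.SD)), by = V4]`. pandas has no list-column proper, but a plain dict of sub-DataFrames is a reasonable stand-in (a sketch on data mirroring DF at this point):

```python
import pandas as pd

# Mirrors DF at this point of the post
DF = pd.DataFrame({"V4": list("AAABBBCCC"),
                   "V1": [0, 0, 0, 16, 16, 1, 16, 1, 1],
                   "V2": [3, 4, 7, 0, 8, 5, 5, 0, 9]})

# One sub-DataFrame per group, standing in for DT[, .(.(.SD)), by = V4]
sub = {k: g.drop(columns="V4") for k, g in DF.groupby("V4")}
```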

Grouping sets (multiple by at once)

rollup(DT,
       .(SumV2 = sum(V2)),
       by = c("V1", "V4"))

rollup(DT,
       .(SumV2 = sum(V2), .N),
       by = c("V1", "V4"),
       id = TRUE)

cube(DT,
     .(SumV2 = sum(V2), .N),
     by = c("V1", "V4"),
     id = TRUE)

groupingsets(DT,
             .(SumV2 = sum(V2), .N),
             by   = c("V1", "V4"),
             sets = list("V1", c("V1", "V4")),
             id   = TRUE)
##    V1   V4 SumV2
## 1:  0    A    14
## 2:  4    B     8
## 3:  1    B     5
## 4:  4    C     5
## 5:  1    C     9
## 6:  0 <NA>    14
## 7:  4 <NA>    13
## 8:  1 <NA>    14
## 9: NA <NA>    41
##    grouping V1   V4 SumV2 N
## 1:        0  0    A    14 3
## 2:        0  4    B     8 2
## 3:        0  1    B     5 1
## 4:        0  4    C     5 1
## 5:        0  1    C     9 2
## 6:        1  0 <NA>    14 3
## 7:        1  4 <NA>    13 3
## 8:        1  1 <NA>    14 3
## 9:        3 NA <NA>    41 9
##     grouping V1   V4 SumV2 N
##  1:        0  0    A    14 3
##  2:        0  4    B     8 2
##  3:        0  1    B     5 1
##  4:        0  4    C     5 1
##  5:        0  1    C     9 2
##  6:        1  0 <NA>    14 3
##  7:        1  4 <NA>    13 3
##  8:        1  1 <NA>    14 3
##  9:        2 NA    A    14 3
## 10:        2 NA    B    13 3
## 11:        2 NA    C    14 3
## 12:        3 NA <NA>    41 9
##    grouping V1   V4 SumV2 N
## 1:        1  0 <NA>    14 3
## 2:        1  4 <NA>    13 3
## 3:        1  1 <NA>    14 3
## 4:        0  0    A    14 3
## 5:        0  4    B     8 2
## 6:        0  1    B     5 1
## 7:        0  4    C     5 1
## 8:        0  1    C     9 2
dd = pd.pivot_table(DF, values=['V2'], \
        index =['V1'], columns=['V4'], aggfunc = np.sum, margins = True)
pd.melt(dd, col_level = 1)
# ?
##      V4  value
## 0     A   14.0
## 1     A    NaN
## 2     A    NaN
## 3     A   14.0
## 4     B    NaN
## 5     B    5.0
## 6     B    8.0
## 7     B   13.0
## 8     C    NaN
## 9     C    9.0
## 10    C    5.0
## 11    C   14.0
## 12  All   14.0
## 13  All   14.0
## 14  All   13.0
## 15  All   41.0
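pandas has no direct `rollup`/`cube`/`groupingsets` equivalent; one workaround is to aggregate once per grouping set and stack the results (a sketch mirroring the `groupingsets()` call above, on data matching DF at this point):

```python
import pandas as pd

DF = pd.DataFrame({"V4": list("AAABBBCCC"),
                   "V1": [0, 0, 0, 16, 16, 1, 16, 1, 1],
                   "V2": [3, 4, 7, 0, 8, 5, 5, 0, 9]})

# One aggregation per grouping set, stacked, as in groupingsets()
sets = [["V1"], ["V1", "V4"]]
parts = [DF.groupby(s, as_index=False).agg(SumV2=("V2", "sum")) for s in sets]
gs = pd.concat(parts, ignore_index=True, sort=False)
```

Rows coming from the `["V1"]` set carry NaN in V4, like the `<NA>` rows of the data.table output.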



Miscellaneous

Read / Write data

Write data to a csv file

fwrite(DT, "DT.csv")
DF.to_csv('DF.csv', index = False)

Write data to a tab-delimited file

fwrite(DT, "DT.txt", sep = "\t")
DF.to_csv('DF.txt', sep='\t', index = False)

Write list-column data to a csv file

fwrite(setDT(list(0, list(1:5))), "DT2.csv")
pd.DataFrame({'V1':0, 'V2':[[1,2,3,4,5]]}).to_csv('DF2.csv', index = False)

Read a csv / tab-delimited file

fread("DT.csv")
# fread("DT.csv", verbose = TRUE) # full details
fread("DT.txt", sep = "\t")
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 4:  B  4  0
## 5:  B  4  8
## 6:  B  1  5
## 7:  C  4  5
## 8:  C  1  0
## 9:  C  1  9
pd.read_csv('DF.csv')
##   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7
## 3  B  16   0
## 4  B  16   8
## 5  B   1   5
## 6  C  16   5
## 7  C   1   0
## 8  C   1   9
pd.read_csv('DF.txt', sep = '\t')
##   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7
## 3  B  16   0
## 4  B  16   8
## 5  B   1   5
## 6  C  16   5
## 7  C   1   0
## 8  C   1   9

Read a csv file selecting / dropping columns

fread("DT.csv", select = c("V1", "V4"))
fread("DT.csv", drop = "V4")
##    V1 V4
## 1:  0  A
## 2:  0  A
## 3:  0  A
## 4:  4  B
## 5:  4  B
## 6:  1  B
## 7:  4  C
## 8:  1  C
## 9:  1  C
##    V1 V2
## 1:  0  3
## 2:  0  4
## 3:  0  7
## 4:  4  0
## 5:  4  8
## 6:  1  5
## 7:  4  5
## 8:  1  0
## 9:  1  9
pd.read_csv('DF.csv', usecols=['V1', 'V4'])
##   V4  V1
## 0  A   0
## 1  A   0
## 2  A   0
## 3  B  16
## 4  B  16
## 5  B   1
## 6  C  16
## 7  C   1
## 8  C   1
pd.read_csv('DF.csv', usecols = lambda x: x != 'V4')
##    V1  V2
## 0   0   3
## 1   0   4
## 2   0   7
## 3  16   0
## 4  16   8
## 5   1   5
## 6  16   5
## 7   1   0
## 8   1   9

Read and rbind several files

rbindlist(lapply(c("DT.csv", "DT.csv"), fread))
# c("DT.csv", "DT.csv") %>% lapply(fread) %>% rbindlist
##     V4 V1 V2
##  1:  A  0  3
##  2:  A  0  4
##  3:  A  0  7
##  4:  B  4  0
##  5:  B  4  8
##  6:  B  1  5
##  7:  C  4  5
##  8:  C  1  0
##  9:  C  1  9
## 10:  A  0  3
## 11:  A  0  4
## 12:  A  0  7
## 13:  B  4  0
## 14:  B  4  8
## 15:  B  1  5
## 16:  C  4  5
## 17:  C  1  0
## 18:  C  1  9
li = [pd.read_csv(fi) for fi in ['DF.csv', 'DF.csv']]
pd.concat(li, axis = 0, ignore_index = True)
##    V4  V1  V2
## 0   A   0   3
## 1   A   0   4
## 2   A   0   7
## 3   B  16   0
## 4   B  16   8
## 5   B   1   5
## 6   C  16   5
## 7   C   1   0
## 8   C   1   9
## 9   A   0   3
## 10  A   0   4
## 11  A   0   7
## 12  B  16   0
## 13  B  16   8
## 14  B   1   5
## 15  C  16   5
## 16  C   1   0
## 17  C   1   9


Reshape data

Melt data (from wide to long)

melt(DT, id.vars = "V4")
mDT <- melt(DT,
            id.vars       = "V4",
            measure.vars  = c("V1", "V2"),
            variable.name = "Variable",
            value.name    = "Value")
##     V4 variable value
##  1:  A       V1     0
##  2:  A       V1     0
##  3:  A       V1     0
##  4:  B       V1     4
##  5:  B       V1     4
##  6:  B       V1     1
##  7:  C       V1     4
##  8:  C       V1     1
##  9:  C       V1     1
## 10:  A       V2     3
## 11:  A       V2     4
## 12:  A       V2     7
## 13:  B       V2     0
## 14:  B       V2     8
## 15:  B       V2     5
## 16:  C       V2     5
## 17:  C       V2     0
## 18:  C       V2     9
pd.melt(DF, id_vars = ['V4'])
##    V4 variable  value
## 0   A       V1      0
## 1   A       V1      0
## 2   A       V1      0
## 3   B       V1     16
## 4   B       V1     16
## 5   B       V1      1
## 6   C       V1     16
## 7   C       V1      1
## 8   C       V1      1
## 9   A       V2      3
## 10  A       V2      4
## 11  A       V2      7
## 12  B       V2      0
## 13  B       V2      8
## 14  B       V2      5
## 15  C       V2      5
## 16  C       V2      0
## 17  C       V2      9
mDF = pd.melt(DF,
              id_vars    = 'V4',
              value_vars = ('V1', 'V2'), 
              var_name   = 'Variable',
              value_name = 'Value')

Cast data (from long to wide)

dcast(mDT, V4 ~ Variable) # aggregate by count
## Using 'Value' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'
dcast(mDT, V4 ~ Variable, fun.aggregate = sum)
## Using 'Value' as value column. Use 'value.var' to override
dcast(mDT, V4 ~ Value > 5)
## Using 'Value' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'
# see ?dcast: multiple values / fun.aggregate
##    V4 V1 V2
## 1:  A  3  3
## 2:  B  3  3
## 3:  C  3  3
##    V4 V1 V2
## 1:  A  0 14
## 2:  B  9 13
## 3:  C  6 14
##    V4 FALSE TRUE
## 1:  A     5    1
## 2:  B     5    1
## 3:  C     5    1
pd.crosstab(mDF.V4, mDF.Variable)
## Variable  V1  V2
## V4              
## A          3   3
## B          3   3
## C          3   3
pd.crosstab(mDF.V4, mDF.Variable, values = mDF.Value, aggfunc = 'sum')
## Variable  V1  V2
## V4              
## A          0  14
## B         33  13
## C         18  14
pd.crosstab(mDF.V4, mDF.Value > 5)
## Value  False  True 
## V4                 
## A          5      1
## B          3      3
## C          4      2

Split

split(DT, by = "V4") # S3 method
## $A
##    V4 V1 V2
## 1:  A  0  3
## 2:  A  0  4
## 3:  A  0  7
## 
## $B
##    V4 V1 V2
## 1:  B  4  0
## 2:  B  4  8
## 3:  B  1  5
## 
## $C
##    V4 V1 V2
## 1:  C  4  5
## 2:  C  1  0
## 3:  C  1  9
list(DF.groupby('V4'))
## [('A',   V4  V1  V2
## 0  A   0   3
## 1  A   0   4
## 2  A   0   7), ('B',   V4  V1  V2
## 4  B  16   0
## 5  B  16   8
## 3  B   1   5), ('C',   V4  V1  V2
## 8  C  16   5
## 6  C   1   0
## 7  C   1   9)]

Split and transpose a vector/column

vec <- c("A:a", "B:b", "C:c")
tstrsplit(vec, split = ":", keep = 2L) # works on vector
setDT(tstrsplit(vec, split = ":"))[]
## [[1]]
## [1] "a" "b" "c"
## 
##    V1 V2
## 1:  A  a
## 2:  B  b
## 3:  C  c
vec = ['A:a', 'B:b', 'C:c']
[i.split(':')[1] for i in vec]
## ['a', 'b', 'c']
pd.DataFrame([i.split(':') for i in vec])
##    0  1
## 0  A  a
## 1  B  b
## 2  C  c


Other

Check package installation

# test.data.table()
# There are more lines of test code in data.table than lines of code!
pass

List data.tables/DataFrames

tables()
##    NAME NROW NCOL MB              COLS KEY
## 1:   DT    9    3  0          V4,V1,V2    
## 2:  mDT   18    3  0 V4,Variable,Value    
## Total: 0MB
# %whos DataFrame # IPython
for i in dir():
    if type(globals()[i]) == pd.DataFrame:
        print(i)
## DF
## dd
## mDF

Get/set the number of threads used in parallelized operations

getDTthreads() # setDTthreads()
## [1] 2
pass

Lead/Lag

shift(1:10, n = 1,   fill = NA, type = "lag")
shift(1:10, n = 1:2, fill = NA, type = "lag") # multiple
shift(1:10, n = 1,   fill = NA, type = "lead")
##  [1] NA  1  2  3  4  5  6  7  8  9
## [[1]]
##  [1] NA  1  2  3  4  5  6  7  8  9
## 
## [[2]]
##  [1] NA NA  1  2  3  4  5  6  7  8
## 
##  [1]  2  3  4  5  6  7  8  9 10 NA
pd.Series(np.arange(1, 11)).shift(periods = 1)
## 0    NaN
## 1    1.0
## 2    2.0
## 3    3.0
## 4    4.0
## 5    5.0
## 6    6.0
## 7    7.0
## 8    8.0
## 9    9.0
## dtype: float64
[pd.Series(np.arange(1, 11)).shift(periods = i) for i in [1,2]]
## [0    NaN
## 1    1.0
## 2    2.0
## 3    3.0
## 4    4.0
## 5    5.0
## 6    6.0
## 7    7.0
## 8    8.0
## 9    9.0
## dtype: float64, 0    NaN
## 1    NaN
## 2    1.0
## 3    2.0
## 4    3.0
## 5    4.0
## 6    5.0
## 7    6.0
## 8    7.0
## 9    8.0
## dtype: float64]
pd.Series(np.arange(1, 11)).shift(periods = -1)
## 0     2.0
## 1     3.0
## 2     4.0
## 3     5.0
## 4     6.0
## 5     7.0
## 6     8.0
## 7     9.0
## 8    10.0
## 9     NaN
## dtype: float64

Generate run-length ids

rleid(rep(c("a", "b", "a"), each = 3)) # see also ?rleidv
rleid(rep(c("a", "b", "a"), each = 3), prefix = "G")
## [1] 1 1 1 2 2 2 3 3 3
## [1] "G1" "G1" "G1" "G2" "G2" "G2" "G3" "G3" "G3"
pass
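A pandas counterpart of `rleid()` can be built from `shift` and `cumsum` (a sketch; a new id starts at each change of value):

```python
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "b", "a", "a", "a"])
# A new run starts whenever the value differs from the previous one
rleid = s.ne(s.shift()).cumsum()
rleid_g = "G" + rleid.astype(str)  # with a prefix, as in rleid(..., prefix = "G")
```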

Vectorised ifelse statements

x <- 1:10
fcase(
  x %% 6 == 0, "fizz buzz",
  x %% 2 == 0, "fizz",
  x %% 3 == 0, "buzz",
  default = NA_character_
)
##  [1] NA          "fizz"      "buzz"      "fizz"      NA          "fizz buzz"
##  [7] NA          "fizz"      "buzz"      "fizz"
pass
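`np.select` plays a similar role to `fcase`, evaluating the conditions in order (a sketch of the fizz-buzz example above; `default=None` stands in for `NA_character_` and yields an object array):

```python
import numpy as np

x = np.arange(1, 11)
# Conditions are evaluated in order, like fcase()
res = np.select([x % 6 == 0, x % 2 == 0, x % 3 == 0],
                ["fizz buzz", "fizz", "buzz"],
                default=None)
```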

Rolling functions

# todo
pass
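On the R side, data.table provides `frollmean()`/`frollsum()`; pandas covers this with the `.rolling()` accessor (a minimal sketch):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
roll_mean = s.rolling(window=3).mean()               # NaN until the window is full
roll_sum  = s.rolling(window=3, min_periods=1).sum() # partial windows allowed
```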



Join/Bind data sets

Join

x <- data.table(Id  = c("A", "B", "C", "C"),
                X1  = c(1L, 3L, 5L, 7L),
                XY  = c("x2", "x4", "x6", "x8"),
                key = "Id")

y <- data.table(Id  = c("A", "B", "B", "D"),
                Y1  = c(1L, 3L, 5L, 7L),
                XY  = c("y1", "y3", "y5", "y7"),
                key = "Id")
x = pd.DataFrame(
  {"Id" : ['A', 'B', 'C', 'C'],
   "X1" : [1, 3, 5, 7],
   "XY" : ['x2', 'x4', 'x6', 'x8']})
y = pd.DataFrame(
  {"Id" : ['A', 'B', 'B', 'D'],
   "Y1" : [1, 3, 5, 7],
   "XY" : ['y1', 'y3', 'y5', 'y7']})

Join matching rows from y to x

y[x, on = "Id"]
merge(x, y, all.x = TRUE, by = "Id")
y[x] # requires keys
##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
##    Id Y1   XY X1 i.XY
## 1:  A  1   y1  1   x2
## 2:  B  3   y3  3   x4
## 3:  B  5   y5  3   x4
## 4:  C NA <NA>  5   x6
## 5:  C NA <NA>  7   x8
pd.merge(x, y, how = 'left', on = 'Id')
##   Id  X1 XY_x   Y1 XY_y
## 0  A   1   x2  1.0   y1
## 1  B   3   x4  3.0   y3
## 2  B   3   x4  5.0   y5
## 3  C   5   x6  NaN  NaN
## 4  C   7   x8  NaN  NaN

Join matching rows from x to y

x[y, on = "Id"]
merge(x, y, all.y = TRUE, by = "Id")
x[y] # requires keys
##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
##    Id X1   XY Y1 i.XY
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  D NA <NA>  7   y7
pd.merge(x, y, how = 'right', on = 'Id')
##   Id   X1 XY_x  Y1 XY_y
## 0  A  1.0   x2   1   y1
## 1  B  3.0   x4   3   y3
## 2  B  3.0   x4   5   y5
## 3  D  NaN  NaN   7   y7

Join matching rows from both x and y

x[y, on = "Id", nomatch = 0]
merge(x, y)
x[y, nomatch = 0] # requires keys
##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
##    Id X1 XY Y1 i.XY
## 1:  A  1 x2  1   y1
## 2:  B  3 x4  3   y3
## 3:  B  3 x4  5   y5
pd.merge(x, y, how = 'inner', on = 'Id')
##   Id  X1 XY_x  Y1 XY_y
## 0  A   1   x2   1   y1
## 1  B   3   x4   3   y3
## 2  B   3   x4   5   y5

Join keeping all the rows

merge(x, y, all = TRUE, by = "Id")
##    Id X1 XY.x Y1 XY.y
## 1:  A  1   x2  1   y1
## 2:  B  3   x4  3   y3
## 3:  B  3   x4  5   y5
## 4:  C  5   x6 NA <NA>
## 5:  C  7   x8 NA <NA>
## 6:  D NA <NA>  7   y7
pd.merge(x, y, how = 'outer', on = 'Id')
##   Id   X1 XY_x   Y1 XY_y
## 0  A  1.0   x2  1.0   y1
## 1  B  3.0   x4  3.0   y3
## 2  B  3.0   x4  5.0   y5
## 3  C  5.0   x6  NaN  NaN
## 4  C  7.0   x8  NaN  NaN
## 5  D  NaN  NaN  7.0   y7

Return rows from x matching y

unique(x[y$Id, on = "Id", nomatch = 0])
unique(x[y$Id, nomatch = 0]) # requires keys
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
##    Id X1 XY
## 1:  A  1 x2
## 2:  B  3 x4
x[x.Id.isin(y.Id)]
##   Id  X1  XY
## 0  A   1  x2
## 1  B   3  x4

Return rows from x not matching y

x[!y, on = "Id"]
x[!y] # requires keys
##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8
##    Id X1 XY
## 1:  C  5 x6
## 2:  C  7 x8
x[~x.Id.isin(y.Id)]
##   Id  X1  XY
## 2  C   5  x6
## 3  C   7  x8


More joins

Select columns while joining

x[y, .(Id, X1, i.XY)]   # i. prefix refers to cols in y
x[y, .(Id, x.XY, i.XY)] # x. prefix refers to cols in x
##    Id X1 i.XY
## 1:  A  1   y1
## 2:  B  3   y3
## 3:  B  3   y5
## 4:  D NA   y7
##    Id x.XY i.XY
## 1:  A   x2   y1
## 2:  B   x4   y3
## 3:  B   x4   y5
## 4:  D <NA>   y7
pd.merge(x[['Id','X1']],
         y[['Id','XY']],
         how = 'right',
         on  = 'Id')
##   Id   X1  XY
## 0  A  1.0  y1
## 1  B  3.0  y3
## 2  B  3.0  y5
## 3  D  NaN  y7
pd.merge(x[['Id','XY']],
         y[['Id','XY']],
         how = 'right',
         on  = 'Id')
##   Id XY_x XY_y
## 0  A   x2   y1
## 1  B   x4   y3
## 2  B   x4   y5
## 3  D  NaN   y7

Aggregate columns while joining

y[x, .(X1Y1 = sum(Y1) * X1), by = .EACHI]
##    Id X1Y1
## 1:  A    1
## 2:  B   24
## 3:  C   NA
## 4:  C   NA
(y.groupby('Id')['Y1']
    .agg('sum')
    .to_frame()
    .rename(columns={'Y1': 'SumY1'})
    .merge(x, how = 'right', on = 'Id')
    .assign(X1Y1 = lambda df: df.SumY1 * df.X1)
    .loc[:, ['Id', 'X1Y1']])
##   Id  X1Y1
## 0  A   1.0
## 1  B  24.0
## 2  C   NaN
## 3  C   NaN

Update columns while joining

y[x, SqX1 := i.X1^2]
y[, SqX1 := x[.BY, X1^2, on = "Id"], by = Id] # more memory-efficient
y[, SqX1 := NULL] # rm column for consistency
(x.loc[:, ['Id', 'X1']]
    .assign(SqX1 = lambda df: df.X1**2)
    .merge(y, how = 'right', on = 'Id')
    .loc[:, list(y.columns) + ['SqX1']])
##   Id  Y1  XY  SqX1
## 0  A   1  y1   1.0
## 1  B   3  y3   9.0
## 2  B   5  y5   9.0
## 3  D   7  y7   NaN

Add a list column with rows from y matching x (nest-join)

x[, y := .(.(y[.BY, on = "Id"])), by = Id]
x[, y := NULL] # rm column for consistency
pass
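pandas has no nest-join, but the matching rows of y can be stored per row of x as a list of records (a sketch; the `nested` name is ad hoc):

```python
import pandas as pd

x = pd.DataFrame({"Id": ["A", "B", "C", "C"],
                  "X1": [1, 3, 5, 7],
                  "XY": ["x2", "x4", "x6", "x8"]})
y = pd.DataFrame({"Id": ["A", "B", "B", "D"],
                  "Y1": [1, 3, 5, 7],
                  "XY": ["y1", "y3", "y5", "y7"]})

# For each row of x, keep the matching rows of y as a list of records
nested = x.copy()
nested["y"] = [y[y.Id == i].to_dict("records") for i in x.Id]
```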

Update columns while joining (using vectors of colnames)

cols  <- c("NewXY", "NewX1")
icols <- paste0("i.", c("XY", "X1"))

y[x, (cols) := mget(icols)]

y[, (cols) := NULL] # rm columns for consistency
pass
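With pandas, the same effect can be sketched by renaming the source columns and left-joining them onto y (`src`/`new` mirror `icols`/`cols`):

```python
import pandas as pd

x = pd.DataFrame({"Id": ["A", "B", "C", "C"],
                  "X1": [1, 3, 5, 7],
                  "XY": ["x2", "x4", "x6", "x8"]})
y = pd.DataFrame({"Id": ["A", "B", "B", "D"],
                  "Y1": [1, 3, 5, 7],
                  "XY": ["y1", "y3", "y5", "y7"]})

src, new = ["XY", "X1"], ["NewXY", "NewX1"]
# Left-join the renamed source columns onto y, mimicking y[x, (cols) := mget(icols)]
out = y.merge(x[["Id"] + src].rename(columns=dict(zip(src, new))),
              how="left", on="Id")
```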

Join passing columns to match in the on argument

z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
x[, X2 := paste0("x", X1)] # used to track the results
z; x
##    ID Z1 Z2
## 1:  C  5 z5
## 2:  C  6 z6
## 3:  C  7 z7
## 4:  C  8 z8
## 5:  C  9 z9
##    Id X1 XY X2
## 1:  A  1 x2 x1
## 2:  B  3 x4 x3
## 3:  C  5 x6 x5
## 4:  C  7 x8 x7
x[z, on = "X1 == Z1"]
x[z, on = .(X1 == Z1)] # same
x[z, on = .(Id == ID, X1 == Z1)] # using two columns
##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##      Id X1   XY   X2 ID Z2
## 1:    C  5   x6   x5  C z5
## 2: <NA>  6 <NA> <NA>  C z6
## 3:    C  7   x8   x7  C z7
## 4: <NA>  8 <NA> <NA>  C z8
## 5: <NA>  9 <NA> <NA>  C z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x6   x5 z5
## 2:  C  6 <NA> <NA> z6
## 3:  C  7   x8   x7 z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9
z = pd.DataFrame(
  {"ID" : ['C', 'C', 'C', 'C', 'C'],
   "Z1" : [5, 6, 7, 8, 9],
   "Z2" : ['z5', 'z6', 'z7', 'z8', 'z9']})
x = x.assign(X2 = 'x' + x.X1.astype(str))
#
pd.merge(x, z, how = 'right', left_on = 'X1', right_on = 'Z1')
##     Id   X1   XY   X2 ID  Z1  Z2
## 0    C  5.0   x6   x5  C   5  z5
## 1  NaN  NaN  NaN  NaN  C   6  z6
## 2    C  7.0   x8   x7  C   7  z7
## 3  NaN  NaN  NaN  NaN  C   8  z8
## 4  NaN  NaN  NaN  NaN  C   9  z9
pd.merge(x, z, how = 'right', left_on = ['Id', 'X1'], right_on = ['ID', 'Z1'])
##     Id   X1   XY   X2 ID  Z1  Z2
## 0    C  5.0   x6   x5  C   5  z5
## 1  NaN  NaN  NaN  NaN  C   6  z6
## 2    C  7.0   x8   x7  C   7  z7
## 3  NaN  NaN  NaN  NaN  C   8  z8
## 4  NaN  NaN  NaN  NaN  C   9  z9

Non-equi joins

x[z, on = .(Id == ID, X1 <= Z1)]
x[z, on = .(Id == ID, X1 > Z1)]
x[z, on = .(X1 < Z1), allow.cartesian = TRUE] # allows 'numerous' matching values
##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x6 x5 z7
## 4:  C  7 x8 x7 z7
## 5:  C  8 x6 x5 z8
## 6:  C  8 x8 x7 z8
## 7:  C  9 x6 x5 z9
## 8:  C  9 x8 x7 z9
##    Id X1   XY   X2 Z2
## 1:  C  5   x8   x7 z5
## 2:  C  6   x8   x7 z6
## 3:  C  7 <NA> <NA> z7
## 4:  C  8 <NA> <NA> z8
## 5:  C  9 <NA> <NA> z9
##     Id X1 XY X2 ID Z2
##  1:  A  5 x2 x1  C z5
##  2:  B  5 x4 x3  C z5
##  3:  A  6 x2 x1  C z6
##  4:  B  6 x4 x3  C z6
##  5:  C  6 x6 x5  C z6
##  6:  A  7 x2 x1  C z7
##  7:  B  7 x4 x3  C z7
##  8:  C  7 x6 x5  C z7
##  9:  A  8 x2 x1  C z8
## 10:  B  8 x4 x3  C z8
## 11:  C  8 x6 x5  C z8
## 12:  C  8 x8 x7  C z8
## 13:  A  9 x2 x1  C z9
## 14:  B  9 x4 x3  C z9
## 15:  C  9 x6 x5  C z9
## 16:  C  9 x8 x7  C z9
(pd.merge(x, z, how = 'right', left_on = 'Id', right_on = 'ID')
    .query('X1 <= Z1')
    .eval('X1 = Z1')
    .drop(columns = ['ID', 'Z1']))
# ?
# ?
##   Id  X1  XY  X2  Z2
## 0  C   5  x6  x5  z5
## 2  C   6  x6  x5  z6
## 4  C   7  x6  x5  z7
## 5  C   7  x8  x7  z7
## 6  C   8  x6  x5  z8
## 7  C   8  x8  x7  z8
## 8  C   9  x6  x5  z9
## 9  C   9  x8  x7  z9
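The two `# ?` variants can be brute-forced with a cross join followed by a filter (a sketch for the `X1 > Z1` case; `how = 'cross'` needs pandas >= 1.2, and unlike `x[z]` this drops the unmatched rows of z):

```python
import pandas as pd

x = pd.DataFrame({"Id": ["A", "B", "C", "C"],
                  "X1": [1, 3, 5, 7],
                  "XY": ["x2", "x4", "x6", "x8"]})
z = pd.DataFrame({"ID": ["C"] * 5,
                  "Z1": [5, 6, 7, 8, 9],
                  "Z2": ["z5", "z6", "z7", "z8", "z9"]})

# Cross join then filter on the non-equi condition
gt = x.merge(z, how="cross").query("Id == ID and X1 > Z1")
```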

Rolling joins/subsets (roll applied to the last join column)

# Nearest
x[z, on = .(Id == ID, X1 == Z1), roll = "nearest"]
## below, simplified examples with ad hoc subsets on a keyed data.table
setkey(x, Id, X1)
x[.("C", 5:9), roll = "nearest"]
##    Id X1 XY X2 Z2
## 1:  C  5 x6 x5 z5
## 2:  C  6 x6 x5 z6
## 3:  C  7 x8 x7 z7
## 4:  C  8 x8 x7 z8
## 5:  C  9 x8 x7 z9
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
pass
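pandas offers `merge_asof` for rolling joins; `direction = 'nearest'` approximates `roll = "nearest"` (a sketch on simplified data; both key columns must be sorted, and the by-columns restrict matches to equal ids):

```python
import pandas as pd

x = pd.DataFrame({"Id": ["A", "B", "C", "C"],
                  "X1": [1, 3, 5, 7],
                  "XY": ["x2", "x4", "x6", "x8"]})
z = pd.DataFrame({"ID": ["C"] * 5, "Z1": [5, 6, 7, 8, 9]})

# Both key columns must be sorted; by-columns restrict matches to equal ids
near = pd.merge_asof(z.sort_values("Z1"), x.sort_values("X1"),
                     left_on="Z1", right_on="X1",
                     left_by="ID", right_by="Id",
                     direction="nearest")
```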
# Last Observation Carried Forward
x[.("C", 5:9), roll = Inf]
x[.("C", 5:9), roll = 0.5]  # bounded
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, TRUE)]  # default
x[.("C", 5:9), roll = Inf, rollends = c(FALSE, FALSE)] # ends not rolled
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x6 x5
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x6   x5
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
pass
# Next Observation Carried Backward
x[.("C", 5:9), roll = -Inf]
x[.("C", 5:9), roll = -0.5] # bounded
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, FALSE)]
x[.("C", 5:9), roll = -Inf, rollends = c(TRUE, TRUE)]  # roll both ends
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6 <NA> <NA>
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1   XY   X2
## 1:  C  5   x6   x5
## 2:  C  6   x8   x7
## 3:  C  7   x8   x7
## 4:  C  8 <NA> <NA>
## 5:  C  9 <NA> <NA>
##    Id X1 XY X2
## 1:  C  5 x6 x5
## 2:  C  6 x8 x7
## 3:  C  7 x8 x7
## 4:  C  8 x8 x7
## 5:  C  9 x8 x7
pass
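`merge_asof` with `direction = 'backward'`/`'forward'` approximates `roll = Inf` (LOCF) and `roll = -Inf` (NOCB); a sketch on the simplified "C" subset used above:

```python
import pandas as pd

x = pd.DataFrame({"X1": [5, 7], "XY": ["x6", "x8"]})
z = pd.DataFrame({"Z1": [5, 6, 7, 8, 9]})

# roll = Inf: last observation carried forward (X1 <= Z1)
locf = pd.merge_asof(z, x, left_on="Z1", right_on="X1", direction="backward")
# roll = -Inf: next observation carried backward (X1 >= Z1)
nocb = pd.merge_asof(z, x, left_on="Z1", right_on="X1", direction="forward")
```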

Cross join (CJ ~ expand.grid)

CJ(c(2, 1, 1), 3:2)
CJ(c(2, 1, 1), 3:2, sorted = FALSE, unique = TRUE)
##    V1 V2
## 1:  1  2
## 2:  1  2
## 3:  1  3
## 4:  1  3
## 5:  2  2
## 6:  2  3
##    V1 V2
## 1:  2  3
## 2:  2  2
## 3:  1  3
## 4:  1  2
(pd.DataFrame([(x, y) for x in [2, 1, 1] for y in [3, 2]],
              columns = ['V1', 'V2'])
   .sort_values(['V1', 'V2']))
##    V1  V2
## 3   1   2
## 5   1   2
## 2   1   3
## 4   1   3
## 1   2   2
## 0   2   3
(pd.DataFrame([(x, y) for x in [2, 1, 1] for y in [3, 2]],
              columns = ['V1', 'V2'])
   .drop_duplicates())
##    V1  V2
## 0   2   3
## 1   2   2
## 2   1   3
## 3   1   2


Bind

x <- data.table(1:3)
y <- data.table(4:6)
z <- data.table(7:9, 0L)
x = pd.DataFrame({"V1" : [1, 2, 3]})
y = pd.DataFrame({"V1" : [4, 5, 6]})
z = pd.DataFrame({"V1" : [7, 8, 9], "V2" : [0, 0, 0]})

Bind rows

rbind(x, y)
rbind(x, z, fill = TRUE)
##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
## 5:  5
## 6:  6
##    V1 V2
## 1:  1 NA
## 2:  2 NA
## 3:  3 NA
## 4:  7  0
## 5:  8  0
## 6:  9  0
pd.concat([x, y])
##    V1
## 0   1
## 1   2
## 2   3
## 0   4
## 1   5
## 2   6
pd.concat([x, z])
##    V1   V2
## 0   1  NaN
## 1   2  NaN
## 2   3  NaN
## 0   7  0.0
## 1   8  0.0
## 2   9  0.0

Bind rows using a list

rbindlist(list(x, y), idcol = TRUE)
##    .id V1
## 1:   1  1
## 2:   1  2
## 3:   1  3
## 4:   2  4
## 5:   2  5
## 6:   2  6
pass
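`pd.concat` can add an id level via `keys` (a sketch mimicking `rbindlist(..., idcol = TRUE)`):

```python
import pandas as pd

x = pd.DataFrame({"V1": [1, 2, 3]})
y = pd.DataFrame({"V1": [4, 5, 6]})

# keys= adds an id level, turned into a column with reset_index
out = (pd.concat([x, y], keys=[1, 2], names=[".id", None])
         .reset_index(level=".id"))
```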

Bind columns

base::cbind(x, y)
##    V1 V1
## 1:  1  4
## 2:  2  5
## 3:  3  6
pd.concat([x, y], axis = 1)
##    V1  V1
## 0   1   4
## 1   2   5
## 2   3   6


Set operations

x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))

Intersection

fintersect(x, y)
fintersect(x, y, all = TRUE)
##    V1
## 1:  2
## 2:  3
##    V1
## 1:  2
## 2:  2
## 3:  3
pass
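A distinct intersection can be sketched with an inner merge (the `all = TRUE` multiset version has no pandas one-liner):

```python
import pandas as pd

x = pd.DataFrame({"V1": [1, 2, 2, 3, 3]})
y = pd.DataFrame({"V1": [2, 2, 3, 4, 4]})

# Distinct intersection, like fintersect(x, y)
inter = x.merge(y).drop_duplicates().reset_index(drop=True)
```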

Difference

fsetdiff(x, y)
fsetdiff(x, y, all = TRUE)
##    V1
## 1:  1
##    V1
## 1:  1
## 2:  3
pass
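A distinct difference can be sketched with `isin` (for multi-column tables, a left merge with `indicator = True` filtered on `'left_only'` is the usual alternative):

```python
import pandas as pd

x = pd.DataFrame({"V1": [1, 2, 2, 3, 3]})
y = pd.DataFrame({"V1": [2, 2, 3, 4, 4]})

# Distinct difference, like fsetdiff(x, y)
diff = x[~x.V1.isin(y.V1)].drop_duplicates().reset_index(drop=True)
```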

Union

funion(x, y)
funion(x, y, all = TRUE)
##    V1
## 1:  1
## 2:  2
## 3:  3
## 4:  4
##     V1
##  1:  1
##  2:  2
##  3:  2
##  4:  3
##  5:  3
##  6:  2
##  7:  2
##  8:  3
##  9:  4
## 10:  4
pass
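Unions map naturally onto `pd.concat` (a sketch; with and without duplicates, like `funion()` and `funion(..., all = TRUE)`):

```python
import pandas as pd

x = pd.DataFrame({"V1": [1, 2, 2, 3, 3]})
y = pd.DataFrame({"V1": [2, 2, 3, 4, 4]})

union = pd.concat([x, y]).drop_duplicates().reset_index(drop=True)  # funion
union_all = pd.concat([x, y], ignore_index=True)                    # all = TRUE
```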

Equality

fsetequal(x, x[order(-V1),])
all.equal(x, x) # S3 method
## [1] TRUE
## [1] TRUE
pass
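Order-insensitive equality can be sketched by sorting both frames before `equals()` (the helper name `set_equal` is ad hoc):

```python
import pandas as pd

x = pd.DataFrame({"V1": [1, 2, 2, 3, 3]})
shuffled = x.sort_values("V1", ascending=False)  # same rows, different order

# Compare contents ignoring row order, like fsetequal()
def set_equal(a, b):
    key = list(a.columns)
    return (a.sort_values(key).reset_index(drop=True)
             .equals(b.sort_values(key).reset_index(drop=True)))

same = set_equal(x, shuffled)
```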





## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /home/xu/anaconda3/lib/libmkl_rt.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.13.0 knitr_1.26       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3      bookdown_0.16   lattice_0.20-41 digest_0.6.23  
##  [5] rappdirs_0.3.1  grid_3.6.3      jsonlite_1.6    magrittr_1.5   
##  [9] evaluate_0.14   blogdown_0.17   rlang_0.4.6     stringi_1.4.3  
## [13] Matrix_1.2-18   reticulate_1.16 rmarkdown_2.0   tools_3.6.3    
## [17] stringr_1.4.0   xfun_0.11       yaml_2.2.0      compiler_3.6.3 
## [21] htmltools_0.4.0