class do(*args, **kwargs)

Do arbitrary operations on a dataframe

In the split-apply-combine data manipulation strategy, do provides a place for the complex apply actions and gives control over the form of the results when they are combined.

data : dataframe, optional

Useful when not using the >> operator.

func : function, optional

A single function to apply to each group. The function should accept a dataframe and return a dataframe.

kwargs : dict, optional

{name: function} pairs. Each function should accept a dataframe and return an array, which becomes the column called name.


You cannot pass both a positional argument and keyword arguments.
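The two calling forms described above can be sketched in plain pandas. This is only an illustration of the semantics, not the library's implementation: do_sketch is a hypothetical helper, it handles a single grouping column, and the TypeError raised when both forms are mixed is an assumption (this section does not say which error the real verb raises).

```python
import numpy as np
import pandas as pd


def do_sketch(df, by, func=None, **kwargs):
    """Hypothetical pandas-only stand-in for the two calling forms."""
    if func is not None and kwargs:
        # Assumption: the error type of the real verb is not documented here
        raise TypeError("pass one function or name=function pairs, not both")
    pieces = []
    for key, gdf in df.groupby(by):          # split
        if func is not None:
            # Form 1: the function returns a whole dataframe for the group
            out = func(gdf).reset_index(drop=True)
        else:
            # Form 2: each function returns the values of one named column
            out = pd.DataFrame({name: np.atleast_1d(f(gdf))
                                for name, f in kwargs.items()})
        out.insert(0, by, key)               # keep the group column
        pieces.append(out)
    return pd.concat(pieces, ignore_index=True)  # combine


df = pd.DataFrame({'x': [1, 2, 2, 3],
                   'y': [2, 3, 4, 3],
                   'z': list('aabb')})

# One function returning a dataframe per group
whole = do_sketch(df, 'z', lambda gdf: pd.DataFrame({'x_sum': [gdf['x'].sum()]}))

# One function per output column
named = do_sketch(df, 'z', x_sum=lambda gdf: gdf['x'].sum())
```

Both calls produce one row per group with the columns z and x_sum.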


>>> import pandas as pd
>>> import numpy as np
>>> from plydata import do, group_by
>>> df = pd.DataFrame({'x': [1, 2, 2, 3],
...                    'y': [2, 3, 4, 3],
...                    'z': list('aabb')})

Define a function that uses numpy to do a least squares fit. It takes a dataframe as input and returns a dataframe. gdf is a dataframe that contains only the rows of the current group.

>>> def least_squares(gdf):
...     X = np.vstack([gdf.x, np.ones(len(gdf))]).T
...     (m, c), _, _, _ = np.linalg.lstsq(X, gdf.y, rcond=None)
...     return pd.DataFrame({'intercept': c, 'slope': [m]})

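Define functions that take x and y values and compute the slope and intercept of the line through the first two points. Each group in df has exactly two rows, so this line is also the least squares fit.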

>>> def slope(x, y):
...     return np.diff(y)[0] / np.diff(x)[0]
>>> def intercept(x, y):
...     return y.values[0] - slope(x, y) * x.values[0]
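The slope and intercept helpers only look at the first two rows of a group, so they agree with a least squares fit only when a group has exactly two points. The following check is a sketch of that point; the dataframes two and three are made up here for illustration and are not part of the example data.

```python
import numpy as np
import pandas as pd


def slope(x, y):
    # Slope of the line through the first two points
    return np.diff(y)[0] / np.diff(x)[0]


def intercept(x, y):
    return y.values[0] - slope(x, y) * x.values[0]


def lstsq_fit(gdf):
    # Least squares fit of y = m*x + c
    X = np.vstack([gdf.x, np.ones(len(gdf))]).T
    (m, c), _, _, _ = np.linalg.lstsq(X, gdf.y, rcond=None)
    return m, c


# Two points: the line through them is also the least squares line
two = pd.DataFrame({'x': [1, 2], 'y': [2, 3]})
m2, c2 = lstsq_fit(two)
same = np.isclose(m2, slope(two.x, two.y)) and np.isclose(c2, intercept(two.x, two.y))

# Three points that are not collinear: the two approaches disagree
three = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 3, 5]})
m3, c3 = lstsq_fit(three)
differs = not np.isclose(m3, slope(three.x, three.y))
```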

Demonstrating do

>>> df >> group_by('z') >> do(least_squares)
groups: ['z']
   z  intercept  slope
0  a        1.0    1.0
1  b        6.0   -1.0

We can get the same result by passing separate functions that calculate the columns independently.

>>> df >> group_by('z') >> do(
...     intercept=lambda gdf: intercept(gdf.x, gdf.y),
...     slope=lambda gdf: slope(gdf.x, gdf.y))
groups: ['z']
   z  intercept  slope
0  a        1.0    1.0
1  b        6.0   -1.0

The functions need not return numerical values. Pandas columns can hold any type of object. You could store result objects from more complicated models. Each model would be linked to a group. Notice that the group columns (z in the above cases) are included in the result.
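As a rough illustration of storing result objects, the sketch below keeps one fitted object per group in an object column using plain pandas. LineFit is a hypothetical stand-in for a more complicated model class, and the loop over groupby is used here in place of do.

```python
import numpy as np
import pandas as pd


class LineFit:
    """Hypothetical result object standing in for a fitted model."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def predict(self, x):
        return self.slope * x + self.intercept


df = pd.DataFrame({'x': [1, 2, 2, 3],
                   'y': [2, 3, 4, 3],
                   'z': list('aabb')})

keys, fits = [], []
for key, gdf in df.groupby('z'):
    # Degree-1 polynomial fit returns (slope, intercept)
    m, c = np.polyfit(gdf['x'], gdf['y'], deg=1)
    keys.append(key)
    fits.append(LineFit(m, c))

# One row per group; the 'model' column holds Python objects
models = pd.DataFrame({'z': keys, 'model': fits})

# Look up the model linked to group 'b' and use it
pred_b = models.loc[models['z'] == 'b', 'model'].iloc[0].predict(4)
```

Because each row pairs a group key with its model, the stored objects can later be retrieved by group and used for prediction or inspection.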