plydata.one_table_verbs.do

class plydata.one_table_verbs.do(*args, **kwargs)

Do arbitrary operations on a dataframe

In the split-apply-combine data manipulation strategy, do provides a place for complex apply actions and gives control over the form of the results when they are combined.

Parameters
data : dataframe, optional

Useful when not using the >> operator.

func : function, optional

A single function to apply to each group. The function should accept a dataframe and return a dataframe.

kwargs : dict, optional

{name: function} pairs. Each function should accept a dataframe and return an array, which becomes the column called name.

Notes

You cannot pass both a positional argument and keyword arguments.
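The two calling forms described above can be pictured with a pandas-only sketch of split-apply-combine. This is not plydata's implementation; the helper names do_with_func and do_with_columns are hypothetical and exist only for illustration, and the sketch assumes a single grouping column.

```python
import numpy as np
import pandas as pd


def do_with_func(df, by, func):
    """Hypothetical helper: apply func to each group's dataframe,
    prepend the group key column, and stack the per-group results."""
    pieces = []
    for key, gdf in df.groupby(by, sort=True):
        out = func(gdf).reset_index(drop=True)
        out.insert(0, by, key)  # keep the group column in the result
        pieces.append(out)
    return pd.concat(pieces, ignore_index=True)


def do_with_columns(df, by, **kwargs):
    """Hypothetical helper: each keyword names a column and maps to a
    function that computes that column from a group's dataframe."""
    def combined(gdf):
        return pd.DataFrame(
            {name: np.atleast_1d(f(gdf)) for name, f in kwargs.items()})
    return do_with_func(df, by, combined)


df = pd.DataFrame({'x': [1, 2, 2, 3],
                   'y': [2, 3, 4, 3],
                   'z': list('aabb')})

# Single-function form: the function returns a dataframe per group.
counts = do_with_func(df, 'z', lambda gdf: pd.DataFrame({'n': [len(gdf)]}))

# Keyword form: one function per output column.
totals = do_with_columns(df, 'z', total=lambda gdf: gdf.y.sum())
```

In both forms the combine step is the same; the keyword form only changes how each group's result dataframe is assembled.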

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import group_by, do
>>> df = pd.DataFrame({'x': [1, 2, 2, 3],
...                    'y': [2, 3, 4, 3],
...                    'z': list('aabb')})

Define a function that uses numpy to do a least squares fit. It takes a dataframe as input and returns a dataframe. gdf is a dataframe that contains only the rows of the current group.

>>> def least_squares(gdf):
...     # Design matrix with a column of x values and a column of ones
...     X = np.vstack([gdf.x, np.ones(len(gdf))]).T
...     (m, c), _, _, _ = np.linalg.lstsq(X, gdf.y, rcond=None)
...     return pd.DataFrame({'intercept': [c], 'slope': [m]})

Define functions that take x and y values and compute the slope and intercept of the line through the first two points.

>>> def slope(x, y):
...     return np.diff(y)[0] / np.diff(x)[0]
...
>>> def intercept(x, y):
...     return y.values[0] - slope(x, y) * x.values[0]
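As a quick sanity check that does not involve plydata, the functions above can be called directly on the rows of a single group. The sketch below repeats the definitions so that it runs on its own and uses the two rows of group 'a'. The two-point helpers look only at the first two rows, so they agree with the least squares fit only when a group has exactly two points, as in this example.

```python
import numpy as np
import pandas as pd


def least_squares(gdf):
    X = np.vstack([gdf.x, np.ones(len(gdf))]).T
    (m, c), _, _, _ = np.linalg.lstsq(X, gdf.y, rcond=None)
    return pd.DataFrame({'intercept': [c], 'slope': [m]})


def slope(x, y):
    return np.diff(y)[0] / np.diff(x)[0]


def intercept(x, y):
    return y.values[0] - slope(x, y) * x.values[0]


# Rows of group 'a' from the example dataframe
gdf = pd.DataFrame({'x': [1, 2], 'y': [2, 3]})

fit = least_squares(gdf)
two_point = (intercept(gdf.x, gdf.y), slope(gdf.x, gdf.y))

# Both approaches describe the same line for a two-point group.
same_line = np.allclose(fit[['intercept', 'slope']].values[0], two_point)
```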

Demonstrating do

>>> df >> group_by('z') >> do(least_squares)
groups: ['z']
   z  intercept  slope
0  a        1.0    1.0
1  b        6.0   -1.0

We can get the same result by passing separate functions that calculate the columns independently.

>>> df >> group_by('z') >> do(
...     intercept=lambda gdf: intercept(gdf.x, gdf.y),
...     slope=lambda gdf: slope(gdf.x, gdf.y))
groups: ['z']
   z  intercept  slope
0  a        1.0    1.0
1  b        6.0   -1.0

The functions need not return numerical values. Pandas columns can hold any type of object. You could store result objects from more complicated models. Each model would be linked to a group. Notice that the group columns (z in the above cases) are included in the result.
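To illustrate a column of model objects linked to groups, here is a pandas-only sketch. It does not call do, so it shows the idea rather than plydata's behaviour. The LineFit class and the fit_line function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class LineFit:
    """Hypothetical result object holding a fitted straight line."""
    slope: float
    intercept: float

    def predict(self, x):
        return self.slope * x + self.intercept


def fit_line(gdf):
    # Degree-1 polynomial fit; coefficients come back highest power first.
    m, c = np.polyfit(gdf.x, gdf.y, deg=1)
    return LineFit(slope=float(m), intercept=float(c))


df = pd.DataFrame({'x': [1, 2, 2, 3],
                   'y': [2, 3, 4, 3],
                   'z': list('aabb')})

keys, fits = [], []
for key, gdf in df.groupby('z'):
    keys.append(key)
    fits.append(fit_line(gdf))

# One row per group; the 'model' column has object dtype.
models = pd.DataFrame({'z': keys,
                       'model': pd.Series(fits, dtype=object)})

# Each stored model can be used later for its own group.
prediction_a = models['model'][0].predict(3)
```

Keeping the group column next to each model makes it straightforward to look up the model that belongs to a given group later on.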