plydata.helper_verbs.summarize_at

class plydata.helper_verbs.summarize_at(*args, **kwargs)

Summarize select columns

Parameters
data : dataframe, optional

Useful when not using the >> operator.

names : tuple or dict

Names of columns in dataframe. If a tuple, the items should be names of columns. If a dict, the keys must be among the following:

  • startswith : str or tuple, optional

    All column names that start with this string will be included.

  • endswith : str or tuple, optional

    All column names that end with this string will be included.

  • contains : str or tuple, optional

    All column names that contain this string will be included.

  • matches : str or regex or tuple, optional

    All column names that match the string or a compiled regex pattern will be included. A tuple can be used to match multiple regexes.

  • drop : bool, optional

    If True, the selection is inverted. The unspecified/unmatched columns are returned instead. Default is False.
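The select parameters above can be pictured with a small pure-Python sketch. This is not plydata's implementation: select_columns is a hypothetical helper, and it assumes that the criteria are combined as a union and that matches is anchored at the start of the name (re.match).

```python
import re


def select_columns(columns, startswith=(), endswith=(), contains=(),
                   matches=(), drop=False):
    """Hypothetical helper: return the column names picked by the criteria."""
    def as_tuple(value):
        # Accept a single string/pattern or a tuple of them
        return value if isinstance(value, tuple) else (value,)

    picked = []
    for col in columns:
        # Assumption: a column is selected if ANY criterion matches it
        hit = (
            any(col.startswith(s) for s in as_tuple(startswith))
            or any(col.endswith(s) for s in as_tuple(endswith))
            or any(s in col for s in as_tuple(contains))
            or any(re.match(p, col) for p in as_tuple(matches))
        )
        # drop=True inverts the selection
        if hit != drop:
            picked.append(col)
    return picked


columns = ['alpha', 'beta', 'theta', 'x', 'y', 'z']
print(select_columns(columns, endswith='ta'))
print(select_columns(columns, endswith='ta', drop=True))
```

With these column names, endswith='ta' picks beta and theta, and adding drop=True returns the remaining four columns instead.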

functions : callable or tuple or dict or str

Functions to alter the columns:

  • function (any callable) - Function is applied to the column and the result columns replace the original columns.

  • tuple of functions - Each function is applied to all of the columns and the name (__name__) of the function is postfixed to resulting column names.

  • dict of the form {'name': function} - Allows you to apply one or more functions and also control the postfix to the name.

  • str - You can use this to access the aggregation functions provided in summarize:

    # Those that accept a single argument.
    'min'
    'max'
    'sum'
    'cumsum'
    'mean'
    'median'
    'std'
    'first'
    'last'
    'n_distinct'
    'n_unique'
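The naming rules for the callable, tuple and dict forms can be sketched in plain Python. summarize_columns is a hypothetical helper working on a dict of lists, not plydata's code, and the string form is left out. The standard library's mean and pstdev stand in for the aggregation functions.

```python
from statistics import mean, pstdev


def summarize_columns(data, names, functions):
    """Hypothetical helper mimicking the naming rules described above."""
    if callable(functions):
        # Single function: the result keeps the original column name
        return {name: functions(data[name]) for name in names}
    if isinstance(functions, tuple):
        # Tuple: the suffix is each function's __name__
        functions = {fn.__name__: fn for fn in functions}
    # Dict: the suffix is the key chosen by the caller
    return {f"{name}_{suffix}": fn(data[name])
            for suffix, fn in functions.items()
            for name in names}


data = {'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}
print(summarize_columns(data, ('x', 'y'), (mean, pstdev)))
print(summarize_columns(data, ('x', 'y'), {'avg': mean}))
```

The tuple call yields keys x_mean, y_mean, x_pstdev and y_pstdev, while the dict call yields x_avg and y_avg, which shows how the dict form lets you control the suffix.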
    
args : tuple

Arguments to the functions. The arguments are passed to all functions.

kwargs : dict

Keyword arguments to the functions. The keyword arguments are passed to all functions.
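Because the same extra arguments go to every function, each function has to accept them. A minimal pure-Python sketch of that forwarding follows; apply_all, top and bottom are hypothetical names used only for illustration.

```python
def top(col, n):
    """Largest n values of a column, in ascending order."""
    return sorted(col)[-n:]


def bottom(col, n):
    """Smallest n values of a column, in ascending order."""
    return sorted(col)[:n]


def apply_all(column, functions, *args, **kwargs):
    # Every function receives the same extra positional and keyword arguments
    return {fn.__name__: fn(column, *args, **kwargs) for fn in functions}


result = apply_all([7, 9, 11, 8, 10, 12], (top, bottom), n=2)
print(result)
```

If one of the functions did not accept n, the call would fail with a TypeError, so shared arguments are best kept to functions with compatible signatures.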

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })

One variable

>>> df >> summarize_at('x', ('mean', np.std))
   x_mean     x_std
0     3.5  1.707825

Many variables

>>> df >> summarize_at(('x', 'y', 'z'), ('mean', np.std))
   x_mean  y_mean  z_mean     x_std     y_std     z_std
0     3.5     3.5     9.5  1.707825  1.707825  1.707825

Group by and many variables

>>> (df
...  >> group_by('theta')
...  >> summarize_at(('x', 'y', 'z'), ('mean', np.std))
... )
  theta  x_mean  y_mean  z_mean  x_std  y_std  z_std
0     c     2.5     4.5     7.5    1.5    1.5    0.5
1     d     3.5     3.5     9.5    1.5    1.5    0.5
2     e     4.5     2.5    11.5    1.5    1.5    0.5

Using select parameters

>>> (df
...  >> group_by('alpha')
...  >> summarize_at(
...         dict(endswith='ta'),
...         dict(unique_count=lambda col: len(pd.unique(col)))
...     )
... )
  alpha  beta_unique_count  theta_unique_count
0     a                  2                   3
1     b                  3                   3

For this data, we can achieve the same using summarize.

>>> (df
...  >> group_by('alpha')
...  >> summarize(
...         beta_unique_count='len(pd.unique(beta))',
...         theta_unique_count='len(pd.unique(theta))'
...     )
... )
  alpha  beta_unique_count  theta_unique_count
0     a                  2                   3
1     b                  3                   3