plydata.helper_verbs.summarize_at

class plydata.helper_verbs.summarize_at(*args, **kwargs)

Summarize select columns
Parameters
- data : dataframe, optional
  Useful when not using the >> operator.
- names : tuple or dict
  Names of columns in dataframe. If a tuple, they should be names of columns. If a dict, the keys must be among the following:
  - startswith : str or tuple, optional
    All column names that start with this string will be included.
  - endswith : str or tuple, optional
    All column names that end with this string will be included.
  - contains : str or tuple, optional
    All column names that contain this string will be included.
  - matches : str or regex or tuple, optional
    All column names that match the string or a compiled regex pattern will be included. A tuple can be used to match multiple regexes.
  - drop : bool, optional
    If True, the selection is inverted. The unspecified/unmatched columns are returned instead. Default is False.
- functions : callable or tuple or dict or str
  Functions to alter the columns:
  - function (any callable) - Function is applied to the column and the result columns replace the original columns.
  - tuple of functions - Each function is applied to all of the columns and the name (__name__) of the function is postfixed to resulting column names.
  - dict of the form {'name': function} - Allows you to apply one or more functions and also control the postfix to the name.
  - str - You can use this to access the aggregation functions provided in summarize:
      # Those that accept a single argument.
      'min'
      'max'
      'sum'
      'cumsum'
      'mean'
      'median'
      'std'
      'first'
      'last'
      'n_distinct'
      'n_unique'
- args : tuple
  Arguments to the functions. The arguments are passed to all functions.
- kwargs : dict
  Keyword arguments to the functions. The keyword arguments are passed to all functions.
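The naming rules for the functions parameter can be illustrated without plydata. The following is only a sketch using plain dictionaries and the standard library. The helper summarize_columns is hypothetical and is not part of plydata. It covers the callable, tuple and dict forms (not the str form) and shows args and kwargs being forwarded to every function.

```python
from statistics import mean, pstdev


def summarize_columns(data, names, functions, *args, **kwargs):
    """Hypothetical helper (not plydata API): mimic the documented naming rules."""
    if callable(functions):
        # A single callable: results keep the original column names.
        named = [(None, functions)]
    elif isinstance(functions, dict):
        # A dict: the keys become the postfixes.
        named = list(functions.items())
    else:
        # A tuple: each function's __name__ becomes the postfix.
        named = [(func.__name__, func) for func in functions]

    result = {}
    for postfix, func in named:  # function-major order, as in the Examples output
        for name in names:
            key = name if postfix is None else f"{name}_{postfix}"
            # args and kwargs are forwarded unchanged to every function.
            result[key] = func(data[name], *args, **kwargs)
    return result


data = {'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}

by_tuple = summarize_columns(data, ('x', 'y'), (mean, pstdev))
by_dict = summarize_columns(data, ('x', 'y'), {'avg': mean})
by_callable = summarize_columns(data, ('x',), max)
```

Here by_tuple has the keys x_mean, y_mean, x_pstdev and y_pstdev, by_dict has x_avg and y_avg, and by_callable keeps the plain name x.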
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })
One variable
>>> df >> summarize_at('x', ('mean', np.std))
   x_mean     x_std
0     3.5  1.707825
Many variables
>>> df >> summarize_at(('x', 'y', 'z'), ('mean', np.std))
   x_mean  y_mean  z_mean     x_std     y_std     z_std
0     3.5     3.5     9.5  1.707825  1.707825  1.707825
Group by and many variables
>>> (df
...  >> group_by('theta')
...  >> summarize_at(('x', 'y', 'z'), ('mean', np.std))
... )
  theta  x_mean  y_mean  z_mean  x_std  y_std  z_std
0     c     2.5     4.5     7.5    1.5    1.5    0.5
1     d     3.5     3.5     9.5    1.5    1.5    0.5
2     e     4.5     2.5    11.5    1.5    1.5    0.5
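As a cross-check of the grouped output above, the same numbers can be reproduced with the standard library alone. This is a sketch and not how plydata computes the result. It relies on np.std defaulting to the population standard deviation (ddof=0), for which statistics.pstdev is the standard-library counterpart, so the postfix in this sketch is pstdev rather than std.

```python
from statistics import mean, pstdev

theta = list('cdecde')
columns = {
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, 12],
}

# Collect the row positions that belong to each group key.
groups = {}
for i, key in enumerate(theta):
    groups.setdefault(key, []).append(i)

# For each group, apply every function to every column and
# postfix the function's __name__ to the column name.
summary = {}
for key, rows in groups.items():
    row = {}
    for func in (mean, pstdev):
        for name, values in columns.items():
            row[f"{name}_{func.__name__}"] = func([values[i] for i in rows])
    summary[key] = row
```

Each entry of summary corresponds to one row of the grouped result, for example summary['c'] holds the values for the theta group c.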
Using select parameters
>>> (df
...  >> group_by('alpha')
...  >> summarize_at(
...      dict(endswith='ta'),
...      dict(unique_count=lambda col: len(pd.unique(col)))
...  )
... )
  alpha  beta_unique_count  theta_unique_count
0     a                  2                   3
1     b                  3                   3
For this data, we can achieve the same using summarize.
>>> (df
...  >> group_by('alpha')
...  >> summarize(
...      beta_unique_count='len(pd.unique(beta))',
...      theta_unique_count='len(pd.unique(theta))'
...  )
... )
  alpha  beta_unique_count  theta_unique_count
0     a                  2                   3
1     b                  3                   3
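To see why dict(endswith='ta') picks out beta and theta, the select parameters can be sketched as a small name filter. The helper select_names is hypothetical and is not part of plydata. Two details are assumptions rather than documented behaviour: the criteria are combined as a union, and matches is applied with re.match (anchored at the start of the name).

```python
import re


def select_names(columns, startswith=None, endswith=None,
                 contains=None, matches=None, drop=False):
    """Hypothetical helper (not plydata API): filter column names."""
    def as_tuple(value):
        # Each criterion may be a single value or a tuple of values.
        if value is None:
            return ()
        return value if isinstance(value, tuple) else (value,)

    selected = []
    for col in columns:
        # Assumption: a column is selected if ANY criterion matches it.
        hit = (
            any(col.startswith(s) for s in as_tuple(startswith))
            or any(col.endswith(s) for s in as_tuple(endswith))
            or any(s in col for s in as_tuple(contains))
            or any(re.match(p, col) is not None for p in as_tuple(matches))
        )
        # drop=True inverts the selection.
        if hit != drop:
            selected.append(col)
    return selected


names = ['alpha', 'beta', 'theta', 'x', 'y', 'z']
ends_with_ta = select_names(names, endswith='ta')
inverted = select_names(names, endswith='ta', drop=True)
single_letter = select_names(names, matches=re.compile(r'[xyz]$'))
```

With the example columns, ends_with_ta contains beta and theta, and inverted contains the remaining names.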