plydata.one_table_verbs.summarize

class plydata.one_table_verbs.summarize(*args, **kwargs)

Summarise multiple values to a single value

Parameters

data (dataframe, optional): Useful when not using the >> operator.
args (strs, tuples, optional): Expressions or (name, expression) pairs. Use the (name, expression) form when the name is not a valid Python variable name. The expression should be of type str or an iterable with the same number of elements as the dataframe.
kwargs (dict, optional): {name: expression} pairs.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import summarize, group_by
>>> df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
...                    'y': [1, 2, 3, 4, 5, 6, 5],
...                    'z': [1, 3, 3, 4, 5, 5, 5]})

summarize accepts only positional arguments, only keyword arguments, or both.

>>> df >> summarize('np.sum(x)', max='np.max(x)')
   np.sum(x)  max
0         18    5

When summarizing after a group_by operation, the group columns are retained.

>>> df >> group_by('y', 'z') >> summarize(mean_x='np.mean(x)')
   y  z  mean_x
0  1  1     1.0
1  2  3     5.0
2  3  3     2.0
3  4  4     2.0
4  5  5     4.0
5  6  5     0.0
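For readers more familiar with plain pandas, the grouped summary above can be approximated as follows. This is a rough equivalent under the assumption that group keys are sorted and kept as ordinary columns; it does not show how plydata implements summarize.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
                   'y': [1, 2, 3, 4, 5, 6, 5],
                   'z': [1, 3, 3, 4, 5, 5, 5]})

# as_index=False keeps the grouping keys 'y' and 'z' as regular columns,
# mirroring how the group columns are retained in the summary.
# mean_x=('x', 'mean') is pandas named aggregation: output name, (column, function).
result = df.groupby(['y', 'z'], as_index=False).agg(mean_x=('x', 'mean'))
print(result)
```

Each distinct (y, z) pair yields one row, so the seven input rows collapse to six.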

Aggregate Functions

When summarizing, the following functions can be used. Each takes an array and returns a single number.

min(x) - Alias of numpy.amin() (a.k.a. numpy.min).
max(x) - Alias of numpy.amax() (a.k.a. numpy.max).
sum(x) - Alias of numpy.sum().
cumsum(x) - Alias of numpy.cumsum().
mean(x) - Alias of numpy.mean().
median(x) - Alias of numpy.median().
std(x) - Alias of numpy.std().
first(x) - First element of x.
last(x) - Last element of x.
nth(x, n) - nth value of x or numpy.nan.
n_distinct(x) - Number of distinct elements in x.
n_unique(x) - Alias of n_distinct.
n() - Number of elements in current group.
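The non-NumPy helpers in the list above can be pictured with a small pure-Python sketch. The functions below are illustrative stand-ins written for this page, not plydata's source. They assume that nth uses 0-based positions, which matches the nth(x, 3) output shown in the example further down.

```python
import math


# Illustrative stand-ins only; plydata's own implementations may differ.
def first(x):
    """Return the first element of x."""
    return x[0]


def last(x):
    """Return the last element of x."""
    return x[-1]


def nth(x, n):
    """Return the element at 0-based position n, or NaN if n is out of range."""
    try:
        return x[n]
    except IndexError:
        return math.nan


def n_distinct(x):
    """Return the number of distinct elements in x."""
    return len(set(x))


x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 2, 3]

print(first(x), last(x), nth(x, 3), nth(x, 10), n_distinct(y))
```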

The aliases of the NumPy functions save a few keystrokes and give better column names, e.g. min(x) instead of np.min(x) or numpy.min(x), the forms you would write with NumPy imported.

>>> df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
...                    'y': [0, 0, 1, 1, 2, 3]})
>>> df >> summarize('min(x)', 'max(x)', 'mean(x)', 'sum(x)',
...                 'first(x)', 'last(x)', 'nth(x, 3)')
   min(x)  max(x)  mean(x)  sum(x)  first(x)  last(x)  nth(x, 3)
0       0       5      2.5      15         0        5          3

Summarizing groups with aggregate functions

>>> df >> group_by('y') >> summarize('mean(x)')
   y  mean(x)
0  0      0.5
1  1      2.5
2  2      4.0
3  3      5.0

>>> df >> group_by('y') >> summarize(y_count='n()')
   y  y_count
0  0        2
1  1        2
2  2        1
3  3        1

You can use n() even when there are no groups.

>>> df >> summarize('n()')
   n()
0    6
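As a point of comparison, the counts that n() produces correspond roughly to group sizes in plain pandas. The sketch below uses only standard pandas calls; the variable names grouped_counts and total_rows are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
                   'y': [0, 0, 1, 1, 2, 3]})

# Grouped case: number of rows in each group of 'y'.
grouped_counts = df.groupby('y').size().reset_index(name='y_count')
print(grouped_counts)

# Ungrouped case: the whole dataframe is treated as a single group.
total_rows = len(df)
print(total_rows)
```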