plydata.one_table_verbs.summarize

class plydata.one_table_verbs.summarize(*args, **kwargs)

Summarise multiple values to a single value

Parameters
data : dataframe, optional

Useful when not using the >> operator.

args : strs, tuples, optional

Expressions or (name, expression) pairs. Use the pair form when the name is not a valid Python variable name. Each expression should be a str or an iterable with the same number of elements as the dataframe.

kwargs : dict, optional

{name: expression} pairs.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import summarize, group_by
>>> df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
...                    'y': [1, 2, 3, 4, 5, 6, 5],
...                    'z': [1, 3, 3, 4, 5, 5, 5]})

Can take only positional, only keyword arguments or both.

>>> df >> summarize('np.sum(x)', max='np.max(x)')
   np.sum(x)  max
0         18    5

When summarizing after a group_by operation, the group columns are retained.

>>> df >> group_by('y', 'z') >> summarize(mean_x='np.mean(x)')
   y  z  mean_x
0  1  1     1.0
1  2  3     5.0
2  3  3     2.0
3  4  4     2.0
4  5  5     4.0
5  6  5     0.0
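For comparison, the grouped summary above can be approximated in plain pandas. This is only a sketch of an equivalent computation, not how plydata implements summarize; the variable name result is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
                   'y': [1, 2, 3, 4, 5, 6, 5],
                   'z': [1, 3, 3, 4, 5, 5, 5]})

# Group on y and z, then reduce x to its mean within each group.
# as_index=False keeps the group keys as ordinary columns, which
# mirrors how summarize retains the group columns.
result = (df.groupby(['y', 'z'], as_index=False)
            .agg(mean_x=('x', 'mean')))
print(result)
```

Named aggregation (mean_x=('x', 'mean')) plays the role of the keyword argument mean_x='np.mean(x)' in the plydata call.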

Aggregate Functions

When summarizing, the following functions can be used. They take an array and return a single number.

  • min(x) - Alias of numpy.amin() (a.k.a. numpy.min).

  • max(x) - Alias of numpy.amax() (a.k.a. numpy.max).

  • sum(x) - Alias of numpy.sum().

  • cumsum(x) - Alias of numpy.cumsum().

  • mean(x) - Alias of numpy.mean().

  • median(x) - Alias of numpy.median().

  • std(x) - Alias of numpy.std().

  • first(x) - First element of x.

  • last(x) - Last element of x.

  • nth(x, n) - nth value of x or numpy.nan.

  • n_distinct(x) - Number of distinct elements in x.

  • n_unique(x) - Alias of n_distinct.

  • n() - Number of elements in current group.

The aliases of the NumPy functions save you from typing 3 or 5 keystrokes and give you better column names, i.e. min(x) instead of np.min(x) or numpy.min(x), if you have NumPy imported.

>>> df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
...                    'y': [0, 0, 1, 1, 2, 3]})
>>> df >> summarize('min(x)', 'max(x)', 'mean(x)', 'sum(x)',
...                 'first(x)', 'last(x)', 'nth(x, 3)')
   min(x)  max(x)  mean(x)  sum(x)  first(x)  last(x)  nth(x, 3)
0       0       5      2.5      15         0        5          3
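The non-NumPy helpers can be pictured with a few lines of NumPy. The functions below are hypothetical re-implementations written for illustration, not plydata's source. They assume that nth uses zero-based indexing (as the output above suggests, where nth(x, 3) is 3) and that numpy.nan is returned when the index is out of range.

```python
import numpy as np

def first_sketch(x):
    """First element of x (hypothetical stand-in for first)."""
    return np.asarray(x)[0]

def last_sketch(x):
    """Last element of x (hypothetical stand-in for last)."""
    return np.asarray(x)[-1]

def nth_sketch(x, n):
    """Element at zero-based position n, or numpy.nan if out of range."""
    arr = np.asarray(x)
    try:
        return arr[n]
    except IndexError:
        return np.nan

def n_distinct_sketch(x):
    """Number of distinct elements in x."""
    return len(np.unique(np.asarray(x)))

x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 2, 3]
print(first_sketch(x), last_sketch(x), nth_sketch(x, 3))
print(nth_sketch(x, 10))       # index past the end of x
print(n_distinct_sketch(y))
```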

Summarizing groups with aggregate functions:

>>> df >> group_by('y') >> summarize('mean(x)')
   y  mean(x)
0  0      0.5
1  1      2.5
2  2      4.0
3  3      5.0
>>> df >> group_by('y') >> summarize(y_count='n()')
   y  y_count
0  0        2
1  1        2
2  2        1
3  3        1

You can use n() even when there are no groups.

>>> df >> summarize('n()')
   n()
0    6
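As a cross-check, the counts produced by n() can be reproduced with plain pandas. This is a sketch of an equivalent computation under the assumption that n() simply counts rows; the names counts and total are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
                   'y': [0, 0, 1, 1, 2, 3]})

# Grouped: number of rows in each group of y, with y kept as a column.
counts = df.groupby('y').size().reset_index(name='y_count')
print(counts)

# Ungrouped: the whole dataframe is treated as a single group.
total = len(df)
print(total)
```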