plydata.one_table_verbs.summarize
class plydata.one_table_verbs.summarize(*args, **kwargs)
Summarise multiple values to a single value
Parameters
- data : dataframe, optional
  Useful when not using the >> operator.
- args : strs, tuples, optional
  Expressions or (name, expression) pairs. The pair form should be used when the name is not a valid Python variable name. The expression should be of type str or an iterable with the same number of elements as the dataframe.
- kwargs : dict, optional
  {name: expression} pairs.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from plydata import summarize, group_by
>>> df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
...                    'y': [1, 2, 3, 4, 5, 6, 5],
...                    'z': [1, 3, 3, 4, 5, 5, 5]})
Can take only positional, only keyword arguments or both.
>>> df >> summarize('np.sum(x)', max='np.max(x)')
   np.sum(x)  max
0         18    5
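The naming rules described under Parameters can be sketched without plydata. The helper below, `resolve_columns`, is hypothetical and not part of plydata. It only illustrates how a bare expression string, a (name, expression) pair and a keyword argument would each map to an output column name, assuming that a bare expression is used as its own column name (as the output above suggests).

```python
def resolve_columns(*args, **kwargs):
    """Hypothetical helper (not plydata API): map summarize-style
    arguments to an ordered {column name: expression} dict."""
    columns = {}
    for arg in args:
        if isinstance(arg, str):
            # Assumption: a bare expression doubles as its own column name.
            columns[arg] = arg
        else:
            # A (name, expression) pair allows names that are not
            # valid Python identifiers, e.g. 'max x'.
            name, expr = arg
            columns[name] = expr
    # Keyword arguments are {name: expression} pairs.
    columns.update(kwargs)
    return columns


resolved = resolve_columns('np.sum(x)', ('max x', 'np.max(x)'), mean='np.mean(x)')
print(resolved)
```

The pair form is the only one of the three that can produce a column name such as 'max x', since a keyword argument must be a valid identifier.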
When summarizing after a group_by operation the group columns are retained.
>>> df >> group_by('y', 'z') >> summarize(mean_x='np.mean(x)')
   y  z  mean_x
0  1  1     1.0
1  2  3     5.0
2  3  3     2.0
3  4  4     2.0
4  5  5     4.0
5  6  5     0.0
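To make the grouped result concrete, here is a dependency-free sketch of what the group-then-summarize step computes for this data. It uses only the Python standard library rather than plydata or pandas, so it illustrates the semantics and not the library's implementation. The variable names are chosen for this example only.

```python
from collections import defaultdict
from statistics import mean

# Same data as the example above, stored as plain lists.
x = [1, 5, 2, 2, 4, 0, 4]
y = [1, 2, 3, 4, 5, 6, 5]
z = [1, 3, 3, 4, 5, 5, 5]

# Collect the x values that belong to each (y, z) group.
groups = defaultdict(list)
for xi, yi, zi in zip(x, y, z):
    groups[(yi, zi)].append(xi)

# One output row per group: the group keys are kept,
# followed by the summary value (the mean of x in that group).
rows = [(yk, zk, float(mean(values)))
        for (yk, zk), values in sorted(groups.items())]
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the table above: the two group keys followed by mean_x.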
Aggregate Functions
When summarizing, the following functions can be used. They take an array and return a single number.
- min(x) - Alias of numpy.amin() (a.k.a. numpy.min).
- max(x) - Alias of numpy.amax() (a.k.a. numpy.max).
- sum(x) - Alias of numpy.sum().
- cumsum(x) - Alias of numpy.cumsum().
- mean(x) - Alias of numpy.mean().
- median(x) - Alias of numpy.median().
- std(x) - Alias of numpy.std().
- first(x) - First element of x.
- last(x) - Last element of x.
- nth(x, n) - nth value of x or numpy.nan.
- n_distinct(x) - Number of distinct elements in x.
- n_unique(x) - Alias of n_distinct.
- n() - Number of elements in current group.
The aliases of the NumPy functions save you 3 or 5 keystrokes and give you better column names, i.e. min(x) instead of np.min(x) or numpy.min(x) if you have NumPy imported.
>>> df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
...                    'y': [0, 0, 1, 1, 2, 3]})
>>> df >> summarize('min(x)', 'max(x)', 'mean(x)', 'sum(x)',
...                 'first(x)', 'last(x)', 'nth(x, 3)')
   min(x)  max(x)  mean(x)  sum(x)  first(x)  last(x)  nth(x, 3)
0       0       5      2.5      15         0        5          3
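The positional helpers (first, last, nth, n_distinct) have no direct NumPy counterpart, so a rough sketch of their behaviour may help. The functions below are hypothetical re-implementations written for illustration, not plydata's source. They assume that nth uses zero-based indexing (consistent with nth(x, 3) returning 3 above) and that an out-of-range index yields NaN.

```python
import math


def first(x):
    """First element of x."""
    return x[0]


def last(x):
    """Last element of x."""
    return x[-1]


def nth(x, n):
    """Element at zero-based position n, or NaN if n is out of range
    (assumed behaviour, based on the description above)."""
    try:
        return x[n]
    except IndexError:
        return math.nan


def n_distinct(x):
    """Number of distinct elements in x."""
    return len(set(x))


x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 2, 3]

print(first(x), last(x), nth(x, 3))
print(nth(x, 10))
print(n_distinct(y))
```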
Summarizing groups with aggregate functions
>>> df >> group_by('y') >> summarize('mean(x)')
   y  mean(x)
0  0      0.5
1  1      2.5
2  2      4.0
3  3      5.0
>>> df >> group_by('y') >> summarize(y_count='n()')
   y  y_count
0  0        2
1  1        2
2  2        1
3  3        1
You can use n() even when there are no groups.
>>> df >> summarize('n()')
   n()
0    6