plydata.one_table_verbs.summarize

class plydata.one_table_verbs.summarize(*args, **kwargs)

Summarise multiple values to a single value
Parameters

data : dataframe, optional
    Useful when not using the >> operator.
args : strs, tuples, optional
    Expressions or (name, expression) pairs. Use the pair form when the name is not a valid Python variable name. The expression should be of type str or an iterable with the same number of elements as the dataframe.
kwargs : dict, optional
    {name: expression} pairs.
Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import group_by, summarize
>>> df = pd.DataFrame({'x': [1, 5, 2, 2, 4, 0, 4],
...                    'y': [1, 2, 3, 4, 5, 6, 5],
...                    'z': [1, 3, 3, 4, 5, 5, 5]})
Can take only positional, only keyword arguments or both.
>>> df >> summarize('np.sum(x)', max='np.max(x)')
   np.sum(x)  max
0         18    5
When summarizing after a group_by operation the group columns are retained.

>>> df >> group_by('y', 'z') >> summarize(mean_x='np.mean(x)')
   y  z  mean_x
0  1  1     1.0
1  2  3     5.0
2  3  3     2.0
3  4  4     2.0
4  5  5     4.0
5  6  5     0.0
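As a rough mental model of the grouped case, the sketch below reproduces the same numbers in plain Python: rows are bucketed by their (y, z) key, x is averaged within each bucket, and the key columns are carried into each output row. The helper name grouped_mean is hypothetical, and this is not how plydata implements summarize; it only illustrates why the group columns appear in the result.

```python
from collections import defaultdict
from statistics import mean


def grouped_mean(rows, keys, value):
    """Average `value` within each group defined by `keys`.

    `rows` is a list of dicts. Each output row keeps the key columns
    and adds a 'mean_<value>' column. Groups are sorted by key only
    to make the output easy to compare.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row[value])

    result = []
    for key in sorted(buckets):
        out = dict(zip(keys, key))  # group columns are retained
        out['mean_' + value] = float(mean(buckets[key]))
        result.append(out)
    return result


# Same data as the dataframe in the example above.
xs = [1, 5, 2, 2, 4, 0, 4]
ys = [1, 2, 3, 4, 5, 6, 5]
zs = [1, 3, 3, 4, 5, 5, 5]
rows = [{'x': x, 'y': y, 'z': z} for x, y, z in zip(xs, ys, zs)]

for summary_row in grouped_mean(rows, ['y', 'z'], 'x'):
    print(summary_row)
```

Seven input rows collapse to six output rows because the two rows with y=5, z=5 fall into the same group.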
Aggregate Functions
When summarizing, the following functions can be used. They take an array and return a single number.
min(x)
    Alias of numpy.amin() (a.k.a. numpy.min).
max(x)
    Alias of numpy.amax() (a.k.a. numpy.max).
sum(x)
    Alias of numpy.sum().
cumsum(x)
    Alias of numpy.cumsum().
mean(x)
    Alias of numpy.mean().
median(x)
    Alias of numpy.median().
std(x)
    Alias of numpy.std().
first(x)
    First element of x.
last(x)
    Last element of x.
nth(x, n)
    nth value of x or numpy.nan.
n_distinct(x)
    Number of distinct elements in x.
n_unique(x)
    Alias of n_distinct.
n()
    Number of elements in current group.
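To make the helpers that are not NumPy aliases more concrete, here is a minimal pure-Python sketch of what first, last, nth and n_distinct are described as computing. The *_sketch names are hypothetical stand-ins written for illustration, not plydata's implementation. Two details are assumptions: that n is a zero-based position (consistent with the nth(x, 3) example on this page) and that numpy.nan is returned when position n does not exist.

```python
import math


def first_sketch(values):
    """First element of a non-empty sequence."""
    return values[0]


def last_sketch(values):
    """Last element of a non-empty sequence."""
    return values[-1]


def nth_sketch(values, n):
    """Element at zero-based position n, or NaN if n is out of range.

    Assumption: out-of-range is the case in which NaN is returned.
    """
    try:
        return values[n]
    except IndexError:
        return math.nan


def n_distinct_sketch(values):
    """Number of distinct elements."""
    return len(set(values))


x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 2, 3]
print(first_sketch(x), last_sketch(x), nth_sketch(x, 3))
print(nth_sketch(x, 10))
print(n_distinct_sketch(y))
```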
The aliases of the NumPy functions save you 3 or 6 keystrokes and give you better column names, i.e. min(x) instead of np.min(x) or numpy.min(x) if you have NumPy imported.

>>> df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5],
...                    'y': [0, 0, 1, 1, 2, 3]})
>>> df >> summarize('min(x)', 'max(x)', 'mean(x)', 'sum(x)',
...                 'first(x)', 'last(x)', 'nth(x, 3)')
   min(x)  max(x)  mean(x)  sum(x)  first(x)  last(x)  nth(x, 3)
0       0       5      2.5      15         0        5          3
Summarizing groups with aggregate functions
>>> df >> group_by('y') >> summarize('mean(x)')
   y  mean(x)
0  0      0.5
1  1      2.5
2  2      4.0
3  3      5.0

>>> df >> group_by('y') >> summarize(y_count='n()')
   y  y_count
0  0        2
1  1        2
2  2        1
3  3        1
You can use n() even when there are no groups.

>>> df >> summarize('n()')
   n()
0    6
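One way to read the two n() results is that an ungrouped dataframe behaves like a single group containing every row. The plain-Python sketch below, which uses only the standard library and is not plydata code, computes the same counts for the y column used above.

```python
from collections import Counter

y = [0, 0, 1, 1, 2, 3]

# Grouped by y: one count per distinct value, sorted by value.
grouped_counts = sorted(Counter(y).items())

# Ungrouped: the whole column is treated as one group.
total_count = len(y)

print(grouped_counts)
print(total_count)
```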