plydata.cat_tools.cat_lump_n

plydata.cat_tools.cat_lump_n(c, n, w=None, other_category='other', ties_method='min')

Lump together all categories except the n most/least common

Parameters
c : list-like

Values that will make up the categorical.

n : int

Number of most/least common values to preserve (not lumped together). Positive n preserves the most common, negative n preserves the least common.

Lumping happens on condition that the lumped category "other" will have the smallest number of items.

w : list[int|float] (optional)

Weights for the frequency of each value. It should be the same length as c.

other_category : object (default: 'other')

Value used for the 'other' values. It is placed at the end of the categories.

ties_method : {'min', 'max', 'average', 'first', 'dense'} (default: 'min')

How to treat categories that occur the same number of times (i.e. ties):

* min: lowest rank in the group
* max: highest rank in the group
* average: average rank of the group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
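The tie-breaking options can be made concrete with a small plain-Python sketch. It is only an illustration of the five ranking rules described for ties_method, not plydata's implementation; the helper name rank_counts is hypothetical, and rank 1 is assumed to mean "most common".

```python
from collections import Counter


def rank_counts(counts, method="min"):
    """Rank categories by descending count; rank 1 is the most common.

    `counts` maps category -> frequency, in order of first appearance.
    Hypothetical helper for illustration only, not part of plydata.
    """
    # sorted() is stable, so tied categories keep first-appearance order.
    ordered = sorted(counts, key=lambda k: -counts[k])
    ranks = {}
    dense = 0
    i = 0
    while i < len(ordered):
        # Collect the group of categories tied at this count.
        j = i
        while j < len(ordered) and counts[ordered[j]] == counts[ordered[i]]:
            j += 1
        lo, hi = i + 1, j  # 1-based positions covered by the tied group
        dense += 1
        for offset, cat in enumerate(ordered[i:j]):
            if method == "min":
                ranks[cat] = lo
            elif method == "max":
                ranks[cat] = hi
            elif method == "average":
                ranks[cat] = (lo + hi) / 2
            elif method == "first":
                ranks[cat] = lo + offset
            elif method == "dense":
                ranks[cat] = dense
            else:
                raise ValueError(f"unknown method: {method!r}")
        i = j
    return ranks


counts = Counter("abccdd")  # a=1, b=1, c=2, d=2
for method in ("min", "max", "average", "first", "dense"):
    print(method, rank_counts(counts, method))
```

With these counts, 'min' gives c and d rank 1 and a and b rank 3, while 'first' gives the four categories the distinct ranks 1 to 4, so the choice of method can change how many categories fall within the first n ranks.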

Examples

>>> import pandas as pd
>>> from plydata.cat_tools import cat_lump_n
>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump_n(c, 1)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
>>> cat_lump_n(c, 2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

The n least common categories

>>> cat_lump_n(c, -2)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']

When no category falls outside the n most/least common, nothing is lumped.

>>> cat_lump_n(c, 3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_n(c, -3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
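One way to read the outputs above is as a rank threshold. The sketch below is plain Python and rests on an assumption that is not stated in this page: a category is kept when its 'min' rank is at most |n|, with ranks counted from the most common for positive n and from the least common for negative n. The names min_rank and lump_n_sketch are hypothetical and the code is not plydata's implementation.

```python
from collections import Counter


def min_rank(counts, most_common=True):
    """Rank categories so that tied categories share the lowest rank."""
    sign = -1 if most_common else 1
    values = sorted(sign * v for v in counts.values())
    # list.index returns the first position of an equal value, so every
    # member of a tied group receives the same (lowest) rank.
    return {cat: values.index(sign * v) + 1 for cat, v in counts.items()}


def lump_n_sketch(values, n, other="other"):
    """Replace every value whose category ranks beyond |n| with `other`.

    Assumption: a category is kept when its rank is <= |n|.
    """
    ranks = min_rank(Counter(values), most_common=n > 0)
    return [v if ranks[v] <= abs(n) else other for v in values]


values = list("abccdd")
print(lump_n_sketch(values, 2))   # a and b rank 3, so they are lumped
print(lump_n_sketch(values, 3))   # no rank exceeds 3, nothing is lumped
print(lump_n_sketch(values, -2))  # c and d rank 3 from the bottom
```

Under this reading, n=3 leaves the data unchanged because the tied pair a and b shares rank 3, which does not exceed n.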

Order of categoricals is maintained

>>> c = pd.Categorical(
...     list('abccdd'),
...     categories=list('adcb'),
...     ordered=True
... )
>>> cat_lump_n(c, 2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['d' < 'c' < 'other']

Weighted lumping

>>> c = list('abcd')
>>> weights = [3, 2, 1, 1]
>>> cat_lump_n(c, n=2)  # No lumping
['a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_n(c, n=2, w=weights)
['a', 'b', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
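The effect of the weights in this example can be sketched in plain Python by summing the weights per value instead of counting occurrences. This is an illustration under the assumption that weighted frequency means the sum of weights for each value; weighted_counts is a hypothetical helper, not a plydata function.

```python
from collections import defaultdict


def weighted_counts(values, weights=None):
    """Sum weights per value; with no weights, every item counts as 1."""
    if weights is None:
        weights = [1] * len(values)
    if len(weights) != len(values):
        raise ValueError("weights must be the same length as values")
    totals = defaultdict(int)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


c = list("abcd")
print(weighted_counts(c))                # every category has frequency 1
print(weighted_counts(c, [3, 2, 1, 1]))  # a=3, b=2, c=1, d=1
```

Without weights all four categories tie, so none of them falls outside the two most common. With the weights, a and b stand out as the two most frequent, which is consistent with c and d being lumped into 'other'.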