plydata.cat_tools.cat_lump

plydata.cat_tools.cat_lump(c, n=None, prop=None, w=None, other_category='other', ties_method='min')

Lump together least or most common categories

This is a general method that calls one of cat_lump_n(), cat_lump_prop() or cat_lump_lowfreq(), depending on the parameters.

Parameters

c : list-like

Values that will make up the categorical.

n : int (optional)

Number of most/least common values to preserve (not lumped together). Positive n preserves the most common, negative n preserves the least common.

Lumping happens on the condition that the lumped category "other" will have the smallest number of items. Specify only one of n or prop.

prop : float (optional)

Proportion above/below which the values of a category will be preserved (not lumped together). Positive prop preserves categories whose proportion of values is more than prop. Negative prop preserves categories whose proportion of values is less than prop.

Lumping happens on the condition that the lumped category "other" will have the smallest number of items. Specify only one of n or prop.

w : list[int|float] (optional)

Weights for the frequency of each value. It should be the same length as c.

other_category : object (default: 'other')

Value used for the 'other' values. It is placed at the end of the categories.

ties_method : {'min', 'max', 'average', 'first', 'dense'} (default: 'min')

How to treat categories that occur the same number of times (i.e. ties):

* min: lowest rank in the group
* max: highest rank in the group
* average: average rank of the group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups.
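
The five tie rules can be illustrated with a small pure-Python ranking helper. This is only a sketch that follows the descriptions above: `rank_values` is a hypothetical name, not part of plydata, and ascending ranking is an assumption made for the illustration.

```python
def rank_values(values, method="min"):
    """Rank values in ascending order (1-based), resolving ties by `method`.

    Hypothetical helper for illustration; not plydata's implementation.
    """
    # Stable sort, so tied values keep the order in which they appear.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    dense = 0
    pos = 0
    while pos < len(order):
        # Find the end of the current group of tied values.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        dense += 1
        for offset, i in enumerate(order[pos:end + 1]):
            if method == "min":
                ranks[i] = pos + 1
            elif method == "max":
                ranks[i] = end + 1
            elif method == "average":
                ranks[i] = (pos + 1 + end + 1) / 2
            elif method == "first":
                ranks[i] = pos + 1 + offset
            elif method == "dense":
                ranks[i] = dense
            else:
                raise ValueError(f"unknown ties method: {method!r}")
        pos = end + 1
    return ranks


# Two categories occur once and two occur twice.
counts = [1, 1, 2, 2]
for method in ("min", "max", "average", "first", "dense"):
    print(method, rank_values(counts, method))
```

With counts of 1, 1, 2, 2 the 'min' ranks are 1, 1, 3, 3 while the 'dense' ranks are 1, 1, 2, 2, which is the difference the last bullet describes.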

Examples

>>> import pandas as pd
>>> from plydata.cat_tools import cat_lump
>>> cat_lump(list('abbccc'))
['other', 'b', 'b', 'c', 'c', 'c']
Categories (3, object): ['b', 'c', 'other']

Lumping is skipped when the least common categories, put together, are not smaller than the next smallest group (first call below); otherwise they are lumped (second call).

>>> cat_lump(list('abcddd'))
['a', 'b', 'c', 'd', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump(list('abcdddd'))
['other', 'other', 'other', 'd', 'd', 'd', 'd']
Categories (2, object): ['d', 'other']
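
One way to read the rule behind these three calls is sketched below in plain Python. It is an assumption about the logic, not plydata's code: `lump_lowfreq` is a hypothetical name, and the sketch returns a plain list rather than a categorical.

```python
from collections import Counter


def lump_lowfreq(values, other="other"):
    """Lump the rarest values only if their combined count stays the smallest.

    Hypothetical sketch; not plydata's implementation.
    """
    counts = Counter(values)
    # Most common first; ties keep first-seen order because sorted() is stable.
    ordered = sorted(counts, key=lambda k: -counts[k])
    remaining = len(values)
    keep = set(ordered)  # default: nothing is lumped
    for i, key in enumerate(ordered):
        remaining -= counts[key]
        # Once a category outnumbers everything after it, the rest can be
        # lumped and "other" will still be the smallest group.
        if counts[key] > remaining:
            keep = set(ordered[:i + 1])
            break
    return [v if v in keep else other for v in values]


print(lump_lowfreq(list("abbccc")))
print(lump_lowfreq(list("abcddd")))
print(lump_lowfreq(list("abcdddd")))
```

Under this reading, 'abcddd' is left alone because a, b and c together (3 items) would tie with d, while in 'abcdddd' they are lumped because 3 is less than 4.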

>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump(c, n=1)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

>>> cat_lump(c, n=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

A negative n preserves the n least common categories

>>> cat_lump(c, n=-2)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']

There are fewer than n categories that are the most/least common, so nothing is lumped.

>>> cat_lump(c, n=3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump(c, n=-3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
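
The outputs for n above are consistent with ranking the categories by count using the default 'min' tie rule and preserving those whose rank is at most |n|. The sketch below assumes that reading; `lump_n` is a hypothetical name, it supports only 'min' ties, and it returns a plain list.

```python
from collections import Counter


def lump_n(values, n, other="other"):
    """Keep categories whose 'min' rank by count is within abs(n).

    Hypothetical sketch; not plydata's implementation.
    """
    counts = Counter(values)
    # Positive n ranks the most common first, negative n the least common.
    sign = 1 if n > 0 else -1

    def min_rank(key):
        # 'min' rank: 1 + number of categories strictly ahead of this one.
        return 1 + sum(1 for c in counts.values() if sign * c > sign * counts[key])

    keep = {k for k in counts if min_rank(k) <= abs(n)}
    return [v if v in keep else other for v in values]


c = list("abccdd")
print(lump_n(c, 1))   # c and d tie for rank 1, so both are kept
print(lump_n(c, -2))  # a and b tie for rank 1 among the least common
print(lump_n(c, 3))   # a and b share rank 3, so nothing is lumped
```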

By proportion: a positive prop preserves categories that make up more than prop fraction of the items, a negative prop those that make up less.

>>> cat_lump(c, prop=1/3.01)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
>>> cat_lump(c, prop=-1/3.01)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
>>> cat_lump(c, prop=1/2)
['other', 'other', 'other', 'other', 'other', 'other']
Categories (1, object): ['other']
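
The proportion rule can be sketched with a plain counter. `lump_prop` is a hypothetical name used only for this illustration, and the strict comparisons are taken from the parameter description ("more than" and "less than").

```python
from collections import Counter


def lump_prop(values, prop, other="other"):
    """Keep categories above (prop > 0) or below (prop < 0) a share of items.

    Hypothetical sketch; not plydata's implementation.
    """
    counts = Counter(values)
    total = len(values)
    if prop >= 0:
        keep = {k for k, c in counts.items() if c / total > prop}
    else:
        keep = {k for k, c in counts.items() if c / total < -prop}
    return [v if v in keep else other for v in values]


c = list("abccdd")
print(lump_prop(c, 1 / 3.01))   # c and d each hold 1/3 of the items
print(lump_prop(c, -1 / 3.01))  # a and b each hold 1/6 of the items
print(lump_prop(c, 1 / 2))      # no category holds more than half
```

The threshold 1/3.01 is slightly below 1/3, which is why c and d, each with exactly one third of the items, pass the strict "more than" test.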

Order of categoricals is maintained

>>> c = pd.Categorical(
...     list('abccdd'),
...     categories=list('adcb'),
...     ordered=True
... )
>>> cat_lump(c, n=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['d' < 'c' < 'other']
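
The resulting category order can be pictured as filtering the original category list and appending the "other" value at the end. A minimal sketch, assuming that reading; `lumped_categories` is a hypothetical helper, not a plydata function.

```python
def lumped_categories(categories, keep, other="other"):
    """Preserve the original category order and place `other` last.

    Hypothetical helper for illustration; not plydata's implementation.
    """
    kept = [c for c in categories if c in keep]
    # Add the "other" category only when something was actually lumped.
    return kept + [other] if len(kept) < len(categories) else kept


print(lumped_categories(list("adcb"), keep={"c", "d"}))
```

For the categories a, d, c, b with c and d preserved, this gives d, c, other, matching the order in the output above.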

Weighted lumping

>>> c = list('abcd')
>>> weights = [3, 2, 1, 1]
>>> cat_lump(c, n=2)  # No lumping
['a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump(c, n=2, w=weights)
['a', 'b', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
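
The effect of w can be seen by totalling the weights per value instead of counting occurrences. The sketch below uses a hypothetical `weighted_counts` helper and only shows the frequencies that the lumping would then rank; it does not call plydata.

```python
from collections import defaultdict


def weighted_counts(values, weights=None):
    """Sum the weights per value; every item weighs 1 when weights is None.

    Hypothetical helper for illustration; not plydata's implementation.
    """
    if weights is None:
        weights = [1] * len(values)
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


c = list("abcd")
print(weighted_counts(c))                # all four values tie at 1.0
print(weighted_counts(c, [3, 2, 1, 1]))  # a and b now stand out
```

Without weights all four values tie, so n=2 cannot separate them and nothing is lumped; with the weights, a and b have the two largest totals and c and d become "other".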