plydata.cat_tools.cat_lump_min

plydata.cat_tools.cat_lump_min(c, min, w=None, other_category='other')

Lump categories, preserving those that appear at least min times

Parameters
c : list-like

Values that will make up the categorical.

min : int

Minimum number of times a category must be represented to be preserved.

w : list[int|float] (optional)

Weights for the frequency of each value. It should be the same length as c.

other_category : object (default: 'other')

Category that the lumped values are assigned to. It is placed at the end of the categories.

Examples

>>> from plydata.cat_tools import cat_lump_min
>>> c = list('abccdd')
>>> cat_lump_min(c, min=1)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_min(c, min=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

Weighted Lumping

>>> weights = [2, 2, .5, .5, 1, 1]
>>> cat_lump_min(c, min=2, w=weights)
['a', 'b', 'other', 'other', 'd', 'd']
Categories (4, object): ['a', 'b', 'd', 'other']
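
The weighted result above can be checked by hand: each category's frequency is the sum of the weights of its values, and that sum is compared against min. The snippet below is only an illustrative tally in plain Python, not plydata's implementation; the names totals and below_min are hypothetical.

```python
from collections import defaultdict

c = list("abccdd")
weights = [2, 2, 0.5, 0.5, 1, 1]

# Sum the weights that belong to each distinct value.
totals = defaultdict(float)
for value, weight in zip(c, weights):
    totals[value] += weight

# Categories whose weighted total falls short of min=2 get lumped.
below_min = sorted(v for v, total in totals.items() if total < 2)
```

Here 'a', 'b' and 'd' each total 2, while 'c' totals 1, so only 'c' is lumped into 'other'.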

Unlike cat_lump(), cat_lump_min() can lump values into an 'other' category that is larger than any of the preserved categories.

>>> c = list('abxyzccdd')
>>> cat_lump_min(c, min=2)
['other', 'other', 'other', 'other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
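
The behaviour shown in these examples can be approximated in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the library's code: the function name lump_min_sketch is hypothetical, it returns a (values, categories) pair of lists rather than a pandas Categorical, it sorts the preserved categories, and it assumes the 'other' label is not already one of the values.

```python
from collections import defaultdict


def lump_min_sketch(values, min_count, weights=None, other="other"):
    """Illustrative approximation of the lumping rule (hypothetical helper)."""
    if weights is None:
        weights = [1] * len(values)  # unweighted: every value counts once
    if len(weights) != len(values):
        raise ValueError("weights must be the same length as values")

    # Weighted frequency of each distinct value.
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight

    # Preserve categories that reach the threshold; lump the rest.
    kept = sorted(v for v, total in totals.items() if total >= min_count)
    lumped = [v if v in kept else other for v in values]

    # The 'other' label goes last, and only if something was lumped.
    categories = kept + ([other] if len(kept) < len(totals) else [])
    return lumped, categories


values, categories = lump_min_sketch(list("abxyzccdd"), 2)
```

With the input above, five of the nine values end up as 'other' while 'c' and 'd' keep two values each, which is the situation the last example describes.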