plydata.cat_tools.cat_lump_prop

plydata.cat_tools.cat_lump_prop(c, prop, w=None, other_category='other')

Lump together least or most common categories by proportion

Parameters
c : list-like

Values that will make up the categorical.

prop : float

Proportion above/below which the values of a category will be preserved (not lumped together). A positive prop preserves categories whose proportion of values is more than prop. A negative prop preserves categories whose proportion of values is less than the absolute value of prop.

Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.

w : list[int|float] (optional)

Weights for the frequency of each value. It should be the same length as c.

other_category : object (default: 'other')

Value used for the 'other' values. It is placed at the end of the categories.
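
The threshold rule described for prop and w can be sketched with pandas alone. This is a minimal illustration, not the library's implementation: preserved_by_prop is a hypothetical helper, it assumes that weights are summed per category and divided by the total weight, and it ignores the condition on the size of the "other" category.

```python
import pandas as pd


def preserved_by_prop(values, prop, w=None):
    """Hypothetical helper: list the categories that `prop` would preserve.

    Assumption: a category's proportion is its summed weight divided by
    the total weight (every value weighs 1 when `w` is None).
    """
    s = pd.Series(list(values))
    if w is None:
        weights = pd.Series(1.0, index=s.index)
    else:
        weights = pd.Series(list(w), dtype=float)

    totals = weights.groupby(s).sum()   # summed weight per category
    props = totals / totals.sum()       # proportion per category

    if prop >= 0:
        keep = props[props > prop]      # strictly more than prop
    else:
        keep = props[props < -prop]     # strictly less than |prop|
    return sorted(keep.index)


values = list('abccdd')
print(preserved_by_prop(values, 1/3.01))    # ['c', 'd']
print(preserved_by_prop(values, -1/3.01))   # ['a', 'b']
# Weighting the single 'a' by 3 gives it 3/8 of the total weight.
print(preserved_by_prop(values, 1/3.01, w=[3, 1, 1, 1, 1, 1]))  # ['a']
```

Under these assumptions the unweighted calls agree with the categories preserved in the examples below.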

Examples

Lump by proportion, preserving the categories that make up more than a prop fraction of the items.

>>> import pandas as pd
>>> from plydata.cat_tools import cat_lump_prop
>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump_prop(c, 1/3.01)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
>>> cat_lump_prop(c, -1/3.01)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
>>> cat_lump_prop(c, 1/2)
['other', 'other', 'other', 'other', 'other', 'other']
Categories (1, object): ['other']
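
The examples use 1/3.01 rather than 1/3 because 'c' and 'd' each make up exactly one third of the items and the comparison is "more than prop". The following sketch checks that arithmetic with pandas only; it assumes the comparison is strict, as the parameter description states, and does not call plydata.

```python
import pandas as pd

c = pd.Categorical(list('abccdd'))

# Proportion of items in each category: a and b are 1/6, c and d are 1/3.
props = pd.Series(c).value_counts(normalize=True)

c_above_301 = bool(props['c'] > 1/3.01)    # slightly lower threshold
c_above_third = bool(props['c'] > 1/3)     # threshold equal to the proportion
a_below_301 = bool(props['a'] < 1/3.01)    # the test a negative prop applies

print(c_above_301)    # True
print(c_above_third)  # False
print(a_below_301)    # True
```

With a threshold of exactly 1/3, a strict comparison would not preserve 'c' or 'd', which is why the examples lower it slightly.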