plydata.cat_tools.cat_lump_n

plydata.cat_tools.cat_lump_n(c, n, w=None, other_category='other', ties_method='min')

Lump together all categories except the n most (or least) common.
- Parameters
- c : list-like
  Values that will make up the categorical.
- n : int
  Number of most/least common values to preserve (not lumped together). Positive n preserves the most common, negative n preserves the least common.
  Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.
- w : list[int|float] (optional)
  Weights for the frequency of each value. It should be the same length as c.
- other_category : object (default: 'other')
  Value used for the 'other' values. It is placed at the end of the categories.
- ties_method : {'min', 'max', 'average', 'first', 'dense'} (default: 'min')
  How to treat categories that occur the same number of times (i.e. ties):
  * min: lowest rank in the group
  * max: highest rank in the group
  * average: average rank of the group
  * first: ranks assigned in order they appear in the array
  * dense: like 'min', but rank always increases by 1 between groups.
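To make the ties_method options concrete, here is a minimal pure-Python sketch of how each option ranks tied counts, with rank 1 going to the most common category. The helper name `rank_desc` is hypothetical and is not part of plydata; the sketch only illustrates the ranking rules listed above and makes no claim about how the library computes them internally.

```python
def rank_desc(counts, method="min"):
    """Rank categories from most to least common (rank 1 = most common).

    `counts` maps category -> frequency. Hypothetical helper for
    illustration only; it is not plydata's implementation.
    """
    # Sort by descending count. Python's sort is stable, so tied
    # categories keep the order in which they appear in `counts`.
    ordered = sorted(counts, key=lambda k: -counts[k])
    ranks = {}
    pos = 0    # number of categories ranked so far
    dense = 0  # number of distinct count groups seen so far
    while pos < len(ordered):
        # After sorting, categories tied on a count are contiguous.
        top = counts[ordered[pos]]
        group = [k for k in ordered[pos:] if counts[k] == top]
        dense += 1
        lo, hi = pos + 1, pos + len(group)
        for offset, k in enumerate(group):
            if method == "min":
                ranks[k] = lo
            elif method == "max":
                ranks[k] = hi
            elif method == "average":
                ranks[k] = (lo + hi) / 2
            elif method == "first":
                ranks[k] = lo + offset
            elif method == "dense":
                ranks[k] = dense
            else:
                raise ValueError(f"unknown ties method: {method!r}")
        pos += len(group)
    return ranks


# Counts for the values 'a', 'b', 'c', 'c', 'd', 'd'.
counts = {"a": 1, "b": 1, "c": 2, "d": 2}
for method in ("min", "max", "average", "first", "dense"):
    print(method, rank_desc(counts, method))
```

With 'min', the tied categories c and d both receive rank 1, while 'max' gives them both rank 2, so the choice of method can change which categories fall within the top n.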
Examples
>>> import pandas as pd
>>> from plydata.cat_tools import cat_lump_n
>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump_n(c, 1)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
>>> cat_lump_n(c, 2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
n least common categories

>>> cat_lump_n(c, -2)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
There are fewer than n categories that are the most/least common.

>>> cat_lump_n(c, 3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_n(c, -3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
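The unweighted results above can be reproduced by a small pure-Python sketch that ranks categories with the 'min' rule and lumps every category whose rank exceeds |n|. The function `lump_n_sketch` is a hypothetical name used only for illustration. It is an assumption about the selection rule that happens to match these examples, not plydata's implementation, and it ignores w, the other ties_method options, and category ordering.

```python
from collections import Counter


def lump_n_sketch(values, n, other="other"):
    """Replace values outside the n most (or least) common with `other`.

    Illustrative sketch only, assuming 'min' ranking of ties.
    """
    counts = Counter(values)
    if n >= 0:
        # 'min' rank among the most common: 1 + number of categories
        # that are strictly more common.
        rank = {k: 1 + sum(v > counts[k] for v in counts.values())
                for k in counts}
    else:
        # 'min' rank among the least common: 1 + number of categories
        # that are strictly less common.
        rank = {k: 1 + sum(v < counts[k] for v in counts.values())
                for k in counts}
    # Categories ranked beyond |n| are lumped together.
    lumped = {k for k, r in rank.items() if r > abs(n)}
    return [other if v in lumped else v for v in values]


values = list("abccdd")
print(lump_n_sketch(values, 1))
print(lump_n_sketch(values, -2))
print(lump_n_sketch(values, 3))
```

Because c and d tie for rank 1, asking for n=1 or n=2 keeps both of them, and n=3 lumps nothing since no category ranks beyond 3.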
Order of categoricals is maintained
>>> c = pd.Categorical(
...     list('abccdd'),
...     categories=list('adcb'),
...     ordered=True
... )
>>> cat_lump_n(c, 2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['d' < 'c' < 'other']
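One way to picture the resulting category order: the surviving categories keep their original relative order and the 'other' category is appended at the end. The helper `remaining_categories` below is hypothetical and only mirrors the ordering seen in this example; it is not taken from plydata.

```python
def remaining_categories(categories, lumped, other="other"):
    """Return the category order after lumping (illustrative sketch)."""
    # Surviving categories keep their original relative order.
    kept = [cat for cat in categories if cat not in lumped]
    # The `other` category goes last, and only if something was lumped.
    return kept + [other] if lumped else kept


# Original order a < d < c < b, with a and b lumped together.
print(remaining_categories(list("adcb"), {"a", "b"}))
```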
Weighted lumping
>>> c = list('abcd')
>>> weights = [3, 2, 1, 1]
>>> cat_lump_n(c, n=2)  # No lumping
['a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_n(c, n=2, w=weights)
['a', 'b', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
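The difference between the two calls comes down to the frequencies being compared. The sketch below tallies frequencies with and without weights in pure Python; `weighted_counts` is a hypothetical helper for illustration, not a plydata function.

```python
from collections import defaultdict


def weighted_counts(values, weights=None):
    """Sum the weight of each value; every value weighs 1 by default."""
    if weights is None:
        weights = [1] * len(values)
    if len(weights) != len(values):
        raise ValueError("weights must be the same length as values")
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


values = list("abcd")
print(weighted_counts(values))                # every category tallies 1.0
print(weighted_counts(values, [3, 2, 1, 1]))  # a and b now stand out
```

Without weights all four categories tie, so under the 'min' rule none ranks beyond 2 and nothing is lumped. With the weights, a and b take ranks 1 and 2, and c and d tie at rank 3, which is why they are lumped when n=2.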