plydata.cat_tools.cat_lump_n
plydata.cat_tools.cat_lump_n(c, n, w=None, other_category='other', ties_method='min')

Lump together most/least common n categories
- Parameters
  - c : list-like
    Values that will make up the categorical.
  - n : int
    Number of most/least common values to preserve (not lumped together). Positive n preserves the most common, negative n preserves the least common.
    Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.
  - w : list[int|float] (optional)
    Weights for the frequency of each value. It should be the same length as c.
  - other_category : object (default: 'other')
    Value used for the 'other' values. It is placed at the end of the categories.
  - ties_method : {'min', 'max', 'average', 'first', 'dense'} (default: min)
    How to treat categories that occur the same number of times (i.e. ties):
    * min: lowest rank in the group
    * max: highest rank in the group
    * average: average rank of the group
    * first: ranks assigned in the order they appear in the array
    * dense: like 'min', but rank always increases by 1 between groups
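The ties_method names match the method options of pandas' Series.rank, so the sketch below uses that call to show what each option does to the frequencies of list('abccdd'). It only illustrates the ranking vocabulary; that cat_lump_n ranks frequencies this way internally is an assumption, not something this page states.

```python
import pandas as pd

# Frequencies of the values in list('abccdd'):
# 'a' and 'b' occur once, 'c' and 'd' occur twice.
counts = pd.Series(list('abccdd')).value_counts().sort_index()

# Rank categories from most common (rank 1) to least common
# under each ties method. Assumption: this mirrors the meaning
# of ties_method, it is not plydata's own code.
ranks_by_method = {
    method: counts.rank(method=method, ascending=False).to_dict()
    for method in ['min', 'max', 'average', 'first', 'dense']
}

for method, ranks in ranks_by_method.items():
    print(method, ranks)
```

With 'min', the tied categories 'c' and 'd' both get rank 1 and 'a' and 'b' both get rank 3. If categories are preserved when their rank is at most n (again an assumption), this is consistent with the examples on this page, where n=1 and n=2 keep 'c' and 'd' while n=3 keeps everything.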
Examples
>>> import pandas as pd
>>> from plydata.cat_tools import cat_lump_n
>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump_n(c, 1)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

>>> cat_lump_n(c, 2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
Least common n categories

>>> cat_lump_n(c, -2)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
There are fewer than n categories that are the most/least common.

>>> cat_lump_n(c, 3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_n(c, -3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
Order of categoricals is maintained
>>> c = pd.Categorical(
...     list('abccdd'),
...     categories=list('adcb'),
...     ordered=True
... )
>>> cat_lump_n(c, 2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['d' < 'c' < 'other']
Weighted lumping
>>> c = list('abcd')
>>> weights = [3, 2, 1, 1]
>>> cat_lump_n(c, n=2)  # No lumping
['a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump_n(c, n=2, w=weights)
['a', 'b', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
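A rough sketch of why the weights change the result in the weighted example: without weights every value has frequency 1, so all four categories tie, while with weights the frequencies become 3, 2, 1, 1. The helper kept below is hypothetical and assumes that a category is preserved when its 'min' rank is at most n; it is not plydata's implementation.

```python
import pandas as pd

c = list('abcd')
weights = [3, 2, 1, 1]

# Weighted frequency: sum the weights of each distinct value.
weighted = pd.Series(weights, index=c).groupby(level=0).sum()
# Unweighted frequency: every occurrence counts as 1.
unweighted = pd.Series(1, index=c).groupby(level=0).sum()


def kept(freq, n):
    """Hypothetical helper: categories whose 'min' rank is <= n.

    Assumption: stands in for the preservation rule, using the
    default ties_method='min' and a positive n.
    """
    ranks = freq.rank(method='min', ascending=False)
    return list(ranks.index[ranks <= n])


print(kept(unweighted, 2))  # all four categories tie at rank 1
print(kept(weighted, 2))    # only the two heaviest categories
```

Under this assumption the unweighted call preserves all four categories (no lumping) and the weighted call preserves only 'a' and 'b', which matches the outputs shown in the example.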