plydata.cat_tools.cat_lump

plydata.cat_tools.cat_lump(c, n=None, prop=None, w=None, other_category='other', ties_method='min')

Lump together least or most common categories
This is a general method that calls one of cat_lump_n(), cat_lump_prop() or cat_lump_lowfreq(), depending on the parameters.

Parameters
- c : list-like
  Values that will make up the categorical.
- n : int (optional)
  Number of most/least common values to preserve (not lumped together). Positive n preserves the most common, negative n preserves the least common.
  Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.
- prop : float (optional)
  Proportion above/below which the values of a category will be preserved (not lumped together). Positive prop preserves categories whose proportion of values is more than prop. Negative prop preserves categories whose proportion of values is less than abs(prop).
  Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.
- w : list[int|float] (optional)
  Weights for the frequency of each value. It should be the same length as c.
- other_category : object (default: 'other')
  Value used for the 'other' values. It is placed at the end of the categories.
- ties_method : {'min', 'max', 'average', 'first', 'dense'} (default: 'min')
  How to treat categories that occur the same number of times (i.e. ties):
  * min: lowest rank in the group
  * max: highest rank in the group
  * average: average rank of the group
  * first: ranks assigned in order they appear in the array
  * dense: like 'min', but rank always increases by 1 between groups.
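The tie options follow the usual ranking conventions. As a rough illustration of how they differ, the pure-Python sketch below ranks a list of category counts in ascending order under each method. The helper `rank_with_ties` is hypothetical and not part of plydata, and how `cat_lump` applies the ranks internally is not shown here.

```python
def rank_with_ties(values, method="min"):
    """Rank values ascending (1 = smallest) using the given tie rule.

    Hypothetical helper for illustration; not part of plydata.
    """
    # Indices sorted by value; sorted() is stable, so ties keep input order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    start = 0      # 0-based position of the first member of a tie group
    group_no = 0   # number of distinct values seen so far (for 'dense')
    while start < len(order):
        # Extend `end` to the last position holding the same value.
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        group_no += 1
        for offset, idx in enumerate(order[start:end + 1]):
            if method == "min":
                ranks[idx] = start + 1
            elif method == "max":
                ranks[idx] = end + 1
            elif method == "average":
                ranks[idx] = (start + 1 + end + 1) / 2
            elif method == "first":
                ranks[idx] = start + 1 + offset
            elif method == "dense":
                ranks[idx] = group_no
            else:
                raise ValueError(f"unknown ties method: {method!r}")
        start = end + 1
    return ranks


counts = [1, 1, 2, 2]  # counts of 'a', 'b', 'c', 'd' in list('abccdd')
for method in ("min", "max", "average", "first", "dense"):
    print(method, rank_with_ties(counts, method))
```

With two pairs of tied counts, 'min' and 'dense' give both members of a pair the same rank, while 'first' separates them by order of appearance.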
Examples
>>> from plydata.cat_tools import cat_lump
>>> cat_lump(list('abbccc'))
['other', 'b', 'b', 'c', 'c', 'c']
Categories (3, object): ['b', 'c', 'other']
No lumping happens when the least common categories, put together, are not smaller than the next smallest group.

>>> cat_lump(list('abcddd'))
['a', 'b', 'c', 'd', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump(list('abcdddd'))
['other', 'other', 'other', 'd', 'd', 'd', 'd']
Categories (2, object): ['d', 'other']
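The three results above are consistent with a simple rule: lump the largest set of least common categories whose combined count is still smaller than the count of the next category. The sketch below states that rule in plain Python. `categories_to_lump` is a hypothetical helper written for illustration, and the library's actual implementation may differ.

```python
from collections import Counter


def categories_to_lump(values):
    """Return the least common categories that could be merged into 'other'.

    Hypothetical helper, not part of plydata. A set of the k smallest
    categories qualifies when its combined count is strictly smaller than
    the count of the next category; the largest qualifying set is returned.
    """
    # Categories ordered from least to most common.
    ordered = sorted(Counter(values).items(), key=lambda item: item[1])
    best = []
    running, total = [], 0
    # Never consider lumping every category, hence ordered[:-1].
    for i, (category, count) in enumerate(ordered[:-1]):
        running.append(category)
        total += count
        next_count = ordered[i + 1][1]
        if total < next_count:
            best = list(running)
    return best


print(categories_to_lump(list('abbccc')))   # ['a']
print(categories_to_lump(list('abcddd')))   # []
print(categories_to_lump(list('abcdddd')))  # ['a', 'b', 'c']
```

In `list('abcddd')` the three singletons add up to 3, which is not smaller than the 3 occurrences of 'd', so nothing is lumped.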
>>> import pandas as pd
>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump(c, n=1)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
>>> cat_lump(c, n=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
The n least common categories

>>> cat_lump(c, n=-2)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
There are fewer than n categories that are the most/least common.

>>> cat_lump(c, n=3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump(c, n=-3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
By proportions, categories that make up more than prop fraction of the items.

>>> cat_lump(c, prop=1/3.01)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
>>> cat_lump(c, prop=-1/3.01)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
>>> cat_lump(c, prop=1/2)
['other', 'other', 'other', 'other', 'other', 'other']
Categories (1, object): ['other']
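To see why these thresholds give these results, it can help to compute the proportions by hand. The sketch below does so in plain Python and applies the comparison described for the prop parameter. The helper `preserved` is hypothetical, and treating a negative prop as a threshold on its absolute value is an assumption drawn from the outputs above rather than from the library source.

```python
from collections import Counter

values = list('abccdd')
counts = Counter(values)
total = sum(counts.values())
# a and b each make up 1/6 of the items; c and d each make up 1/3.
proportions = {cat: cnt / total for cat, cnt in counts.items()}


def preserved(proportions, prop):
    """Categories that would be kept (not lumped) for a given prop.

    Hypothetical helper, not part of plydata. Assumes a strict comparison
    and that a negative prop is compared by its absolute value.
    """
    if prop >= 0:
        return [cat for cat, p in proportions.items() if p > prop]
    return [cat for cat, p in proportions.items() if p < -prop]


print(preserved(proportions, 1 / 3.01))   # ['c', 'd']
print(preserved(proportions, -1 / 3.01))  # ['a', 'b']
print(preserved(proportions, 1 / 2))      # []
```

Under this sketch's strict comparison, a threshold of exactly 1/3 would not preserve 'c' or 'd', since their proportion equals the threshold. A value slightly below 1/3, such as 1/3.01, does preserve them.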
Order of categoricals is maintained
>>> c = pd.Categorical(
...     list('abccdd'),
...     categories=list('adcb'),
...     ordered=True
... )
>>> cat_lump(c, n=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['d' < 'c' < 'other']
Weighted lumping
>>> c = list('abcd')
>>> weights = [3, 2, 1, 1]
>>> cat_lump(c, n=2)  # No lumping
['a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> cat_lump(c, n=2, w=weights)
['a', 'b', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
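One way to read the weighted example is that each value contributes its weight, rather than 1, to the frequency of its category. The sketch below computes such weighted frequencies in plain Python. `weighted_counts` is a hypothetical helper, and summing weights per category is an assumption about how w is used, not a statement of the library's implementation.

```python
from collections import defaultdict


def weighted_counts(values, weights=None):
    """Sum the weights of each distinct value (each weight defaults to 1).

    Hypothetical helper for illustration; not part of plydata.
    """
    if weights is None:
        weights = [1] * len(values)
    if len(weights) != len(values):
        raise ValueError("weights must be the same length as values")
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


c = list('abcd')
print(weighted_counts(c))                # every category has frequency 1.0
print(weighted_counts(c, [3, 2, 1, 1]))  # 'a' -> 3.0, 'b' -> 2.0, 'c' and 'd' -> 1.0
```

Without weights all four categories tie, so no two of them stand out as the most common. With the weights, 'a' and 'b' are clearly the top two, which matches the lumping of 'c' and 'd' shown above.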