plydata.cat_tools.cat_lump

plydata.cat_tools.cat_lump(c, n=None, prop=None, w=None, other_category='other', ties_method='min')

Lump together least or most common categories
This is a general method that calls one of cat_lump_n(), cat_lump_prop() or cat_lump_lowfreq() depending on the parameters.

Parameters
c : list-like
    Values that will make up the categorical.
n : int (optional)
    Number of most/least common values to preserve (not lumped together). Positive n preserves the most common, negative n preserves the least common.

    Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.
prop : float (optional)
    Proportion above/below which the values of a category will be preserved (not lumped together). Positive prop preserves categories whose proportion of values is more than prop. Negative prop preserves categories whose proportion of values is less than prop.

    Lumping happens on condition that the lumped category "other" will have the smallest number of items. You should only specify one of n or prop.
w : list[int|float] (optional)
    Weights for the frequency of each value. It should be the same length as c.
other_category : object (default: 'other')
    Value used for the 'other' values. It is placed at the end of the categories.
ties_method : {'min', 'max', 'average', 'first', 'dense'} (default: min)
    How to treat categories that occur the same number of times (i.e. ties):
    * min: lowest rank in the group
    * max: highest rank in the group
    * average: average rank of the group
    * first: ranks assigned in order they appear in the array
    * dense: like 'min', but rank always increases by 1 between groups.
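The five ties_method options can be compared with a small pure-Python sketch. The rank() helper below is hypothetical and written only for illustration; it is not plydata's code, and it assumes the options behave as their descriptions above say (the names are the same as those accepted by pandas' Series.rank).

```python
def rank(values, method="min"):
    """Rank values in ascending order (1 = smallest).

    Hypothetical helper for illustration; not part of plydata.
    """
    methods = ("min", "max", "average", "first", "dense")
    if method not in methods:
        raise ValueError(f"method must be one of {methods}")

    # Indices sorted by value; sorted() is stable, so ties keep input order
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    group = 0  # number of distinct values seen so far (used by 'dense')
    start = 0
    while start < len(order):
        # Find the end of the run of equal values that begins at `start`
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        group += 1
        for offset, i in enumerate(order[start:end + 1]):
            if method == "min":
                ranks[i] = start + 1
            elif method == "max":
                ranks[i] = end + 1
            elif method == "average":
                ranks[i] = (start + end + 2) / 2
            elif method == "first":
                ranks[i] = start + 1 + offset
            else:  # dense
                ranks[i] = group
        start = end + 1
    return ranks


# Counts of a, b, c, d in list('abccdd'), negated so that rank 1 is
# the most common category
neg_counts = [-1, -1, -2, -2]
for method in ("min", "max", "average", "first", "dense"):
    print(method, rank(neg_counts, method))
```

With 'min', the tied categories c and d both get rank 1, whereas with 'first' only c gets rank 1 and d gets rank 2. The choice of method therefore decides whether tied categories fall inside or outside the n that are preserved.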
Examples
>>> from plydata.cat_tools import cat_lump
>>> cat_lump(list('abbccc'))
['other', 'b', 'b', 'c', 'c', 'c']
Categories (3, object): ['b', 'c', 'other']
No lumping happens when the least common categories, put together, are not smaller than the next smallest group. Once they are smaller, they are lumped.

>>> cat_lump(list('abcddd'))
['a', 'b', 'c', 'd', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> cat_lump(list('abcdddd'))
['other', 'other', 'other', 'd', 'd', 'd', 'd']
Categories (2, object): ['d', 'other']
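The condition that "other" must end up as the smallest group can be sketched in plain Python. lump_lowfreq() below is a hypothetical helper, not plydata's implementation. It assumes a cutoff rule of walking the categories from most to least common and stopping at the first one that is larger than everything after it combined. That rule reproduces the three default examples above, but the library may handle other inputs differently.

```python
from collections import Counter


def lump_lowfreq(values, other="other"):
    """Lump the least common values while 'other' stays the smallest group.

    Hypothetical helper for illustration; not part of plydata.
    """
    counts = Counter(values)
    # Categories from most to least common (stable for equal counts)
    ranked = sorted(counts, key=lambda k: -counts[k])
    remaining = len(values)
    keep = set(ranked)  # default: nothing is lumped
    for i, category in enumerate(ranked):
        remaining -= counts[category]
        # Everything after this category, put together, is smaller than it
        if counts[category] > remaining:
            keep = set(ranked[:i + 1])
            break
    return [v if v in keep else other for v in values]


print(lump_lowfreq(list('abbccc')))
print(lump_lowfreq(list('abcddd')))
print(lump_lowfreq(list('abcdddd')))
```

In 'abcddd' the three singletons add up to 3, which is not less than the 3 occurrences of d, so the loop only stops at the last category and every value is kept.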
>>> import pandas as pd
>>> c = pd.Categorical(list('abccdd'))
>>> cat_lump(c, n=1)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

>>> cat_lump(c, n=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']
n least common categories

>>> cat_lump(c, n=-2)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
There are fewer than n categories that are the most/least common, so nothing is lumped.

>>> cat_lump(c, n=3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> cat_lump(c, n=-3)
['a', 'b', 'c', 'c', 'd', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
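The n examples above are consistent with a simple rank rule, sketched here in plain Python. lump_n() is a hypothetical helper, not plydata's code. It assumes 'min' ranking of ties and that a category is preserved when its rank is at most |n|; any further conditions the library applies are not modelled.

```python
from collections import Counter


def lump_n(values, n, other="other"):
    """Preserve the n most (n > 0) or least (n < 0) common values.

    Hypothetical helper for illustration; not part of plydata.
    Ties are ranked with the 'min' rule.
    """
    counts = Counter(values)
    if n < 0:
        # Rank 1 = least common: 1 + number of strictly rarer categories
        rank = {k: 1 + sum(c < counts[k] for c in counts.values())
                for k in counts}
    else:
        # Rank 1 = most common: 1 + number of strictly more common categories
        rank = {k: 1 + sum(c > counts[k] for c in counts.values())
                for k in counts}
    keep = {k for k in counts if rank[k] <= abs(n)}
    return [v if v in keep else other for v in values]


values = list('abccdd')
print(lump_n(values, 1))
print(lump_n(values, -2))
print(lump_n(values, 3))
```

For n=3, a and b tie at rank 3 under the 'min' rule, so all four categories are within the first three ranks and nothing is replaced.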
By proportions, categories that make up more than prop fraction of the items.

>>> cat_lump(c, prop=1/3.01)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['c', 'd', 'other']

>>> cat_lump(c, prop=-1/3.01)
['a', 'b', 'other', 'other', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']

>>> cat_lump(c, prop=1/2)
['other', 'other', 'other', 'other', 'other', 'other']
Categories (1, object): ['other']
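The proportion threshold can be illustrated the same way. lump_prop() is a hypothetical helper, not plydata's implementation. It assumes strict comparisons ("more than" / "less than"), as the parameter description states, and ignores weights.

```python
from collections import Counter


def lump_prop(values, prop, other="other"):
    """Preserve values whose share is above prop (or below -prop if prop < 0).

    Hypothetical helper for illustration; not part of plydata.
    """
    counts = Counter(values)
    total = len(values)
    if prop < 0:
        keep = {k for k, c in counts.items() if c / total < -prop}
    else:
        keep = {k for k, c in counts.items() if c / total > prop}
    return [v if v in keep else other for v in values]


values = list('abccdd')  # shares: a = b = 1/6, c = d = 1/3
print(lump_prop(values, 1 / 3.01))
print(lump_prop(values, -1 / 3.01))
print(lump_prop(values, 1 / 2))
```

The examples use 1/3.01 rather than 1/3 because the comparison is strict: c and d each make up exactly 1/3 of the items, so a threshold slightly below 1/3 is needed for them to count as "more than prop".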
Order of categoricals is maintained

>>> c = pd.Categorical(
...     list('abccdd'),
...     categories=list('adcb'),
...     ordered=True
... )
>>> cat_lump(c, n=2)
['other', 'other', 'c', 'c', 'd', 'd']
Categories (3, object): ['d' < 'c' < 'other']
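How the resulting category order comes about can be sketched without pandas. lumped_categories() is a hypothetical helper, not plydata's code. It assumes that preserved categories keep their original relative order and that the 'other' value is appended at the end, as the other_category description states.

```python
def lumped_categories(categories, keep, other="other"):
    """Return the category order after lumping.

    Hypothetical helper for illustration; not part of plydata.
    """
    # Preserved categories keep their original relative order
    result = [c for c in categories if c in keep]
    # The lumped value goes last, but only if something was lumped
    if len(result) < len(categories):
        result.append(other)
    return result


print(lumped_categories(list('adcb'), keep={'c', 'd'}))
```

Here d comes before c because that is their order in the original categories, 'adcb', not because of how often they occur.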
Weighted lumping

>>> c = list('abcd')
>>> weights = [3, 2, 1, 1]
>>> cat_lump(c, n=2)  # No lumping
['a', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

>>> cat_lump(c, n=2, w=weights)
['a', 'b', 'other', 'other']
Categories (3, object): ['a', 'b', 'other']
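The effect of w can be sketched as replacing each value's count of 1 with its weight before ranking. weighted_counts() and lump_top_n() are hypothetical helpers, not plydata's implementation; ties are assumed to use the 'min' rule and only positive n is handled.

```python
from collections import defaultdict


def weighted_counts(values, weights=None):
    """Sum the weights per value; every value weighs 1 if weights is None.

    Hypothetical helper for illustration; not part of plydata.
    """
    if weights is None:
        weights = [1] * len(values)
    if len(weights) != len(values):
        raise ValueError("weights must be the same length as values")
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


def lump_top_n(values, n, weights=None, other="other"):
    """Preserve the n values with the largest (weighted) frequency."""
    counts = weighted_counts(values, weights)
    # 'min' rank, 1 = largest weighted frequency
    rank = {k: 1 + sum(c > counts[k] for c in counts.values())
            for k in counts}
    keep = {k for k in counts if rank[k] <= n}
    return [v if v in keep else other for v in values]


c = list('abcd')
print(lump_top_n(c, 2))                        # all tied at rank 1
print(lump_top_n(c, 2, weights=[3, 2, 1, 1]))  # a and b outrank c and d
```

Without weights every category occurs once and ties at rank 1, so nothing is lumped. The weights break the tie and push c and d to rank 3, outside the two that are preserved.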