plydata.tidy.extract

class plydata.tidy.extract(*args, **kwargs)

Split a column using a regular expression with capturing groups.

If the groups don't match, or the input is NA, the output will be NA.
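
The behaviour described above can be sketched with the standard library alone. This is an illustration, not plydata's implementation: `extract_groups` is a hypothetical helper, `None` stands in for NA, and the use of `re.search` (rather than `re.match` or `re.fullmatch`) is an assumption.

```python
import re

def extract_groups(values, into, regex):
    """Hypothetical helper: one output column per capture group."""
    pattern = re.compile(regex)
    columns = {name: [] for name in into}
    for value in values:
        # A missing input or a failed match yields None in every output column.
        match = pattern.search(value) if value is not None else None
        groups = match.groups() if match else (None,) * len(into)
        for name, group in zip(into, groups):
            columns[name].append(group)
    return columns

result = extract_groups(['a,1', None, 'c'], ['A', 'B'], r'(\w+),(\w+)')
print(result)
```

Here the second value is missing and the third has no comma for the pattern to match, so both rows come out as `None` in `A` and `B`.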

Parameters

data (dataframe, optional): Useful when not using the >> operator.
col (str | int): Column name or position of variable to separate.
into (list-like): Column names. Use None to omit the variable from the output.
regex (str | regex): Pattern used to extract columns from col. There should be only one group (defined by ()) for each element of into.
remove (bool): If True remove input column from output frame.
convert (bool): If True convert result columns to int, float or bool where appropriate.
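
To make the relationship between `into` and `regex` concrete, the following standard-library sketch pairs each capture group with a name and drops any group whose name is `None`. The helper `extract_named` is hypothetical, and the error raised on a group-count mismatch is an assumption about reasonable behaviour, not something the documentation above specifies.

```python
import re

def extract_named(value, into, regex):
    """Hypothetical helper: pair groups with names, skipping None names."""
    pattern = re.compile(regex)
    # One capture group is expected per element of `into`, None entries included.
    if pattern.groups != len(into):
        raise ValueError("regex must have one capture group per element of into")
    match = pattern.search(value)
    groups = match.groups() if match else (None,) * len(into)
    # A name of None means "match this group but leave it out of the output".
    return {name: group for name, group in zip(into, groups) if name is not None}

row = extract_named('a,1', [None, 'B'], r'(\w+),(\w+)')
print(row)
```

The first group still has to match for the row to produce a value; it is simply not kept.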

Examples

>>> import pandas as pd
>>> from plydata.tidy import extract
>>> df = pd.DataFrame({
...    'alpha': 1,
...    'x': ['a,1', 'b,2', 'c,3'],
...    'zeta': 6
... })
>>> df
   alpha    x  zeta
0      1  a,1     6
1      1  b,2     6
2      1  c,3     6
>>> df >> extract('x', into='A')
   alpha  A  zeta
0      1  a     6
1      1  b     6
2      1  c     6
>>> df >> extract('x', into=['A', 'B'], regex=r'(\w+),(\w+)')
   alpha  A  B  zeta
0      1  a  1     6
1      1  b  2     6
2      1  c  3     6

>>> df >> extract('x', into=['A', 'B'], regex=r'(\w+),(\w+)', remove=False)
   alpha    x  A  B  zeta
0      1  a,1  a  1     6
1      1  b,2  b  2     6
2      1  c,3  c  3     6

Convert extracted columns to appropriate data types.

>>> result = df >> extract(
...    'x', into=['A', 'B'], regex=r'(\w+),(\w+)', convert=True)
>>> result['B'].dtype
dtype('int64')
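
As a rough picture of what type conversion involves, the sketch below tries progressively looser casts on a column of extracted strings. How plydata performs the conversion is not shown on this page; `convert_column` is a hypothetical helper, it ignores missing values, and treating the literal strings 'True' and 'False' as booleans is an assumption.

```python
def convert_column(values):
    """Hypothetical helper: cast a list of strings to int, float or bool."""
    for cast in (int, float):
        try:
            # Succeeds only if every value in the column can be cast.
            return [cast(v) for v in values]
        except ValueError:
            continue
    if values and set(values) <= {'True', 'False'}:
        return [v == 'True' for v in values]
    # Nothing applied, so the column is left as strings.
    return values

ints = convert_column(['1', '2', '3'])
floats = convert_column(['1.5', '2'])
bools = convert_column(['True', 'False'])
strings = convert_column(['a', 'b'])
```

The cast is applied to the whole column or not at all, which is why a single non-numeric value keeps a column as strings.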

The whole regex must match, not just some of its groups. A row where the pattern fails gets NaN in every output column.

>>> df >> extract('x', into=['A', 'B'], regex=r'(\w+),([12]+)')
   alpha    A    B  zeta
0      1    a    1     6
1      1    b    2     6
2      1  NaN  NaN     6
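
The same effect can be reproduced with Python's `re` module, which helps explain why the last row is empty in both columns. This is only an illustration with the pattern and values from the example above; it does not show plydata's internals.

```python
import re

pattern = re.compile(r'(\w+),([12]+)')

results = {}
for value in ['a,1', 'b,2', 'c,3']:
    match = pattern.search(value)
    # No match object means there are no groups at all, not a partial result.
    results[value] = match.groups() if match else None

# The first group on its own would match 'c', but that is not enough.
first_group_alone = re.search(r'(\w+)', 'c,3').group(1)

print(results)
print(first_group_alone)
```

For 'c,3' the second group cannot match '3', so the search as a whole fails and nothing is captured for either group.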