plydata.tidy.extract

class plydata.tidy.extract(*args, **kwargs)

Split a column using a regular expression with capturing groups.

If the groups don't match, or the input is NA, the output will be NA.

Parameters
data : dataframe, optional

Dataframe to operate on. Useful when not using the >> operator.

col : str | int

Column name or position of variable to separate.

into : list-like

Names of the new columns. Use None to omit a variable from the output.

regex : str | regex

Pattern used to extract columns from col. There should be exactly one capturing group (defined by ()) for each element of into.

remove : bool

If True, remove the input column from the output frame.

convert : bool

If True, convert the result columns to int, float or bool where appropriate.

Examples

>>> import pandas as pd
>>> from plydata.tidy import extract
>>> df = pd.DataFrame({
...    'alpha': 1,
...    'x': ['a,1', 'b,2', 'c,3'],
...    'zeta': 6
... })
>>> df
   alpha    x  zeta
0      1  a,1     6
1      1  b,2     6
2      1  c,3     6
>>> df >> extract('x', into='A')
   alpha  A  zeta
0      1  a     6
1      1  b     6
2      1  c     6
>>> df >> extract('x', into=['A', 'B'], regex=r'(\w+),(\w+)')
   alpha  A  B  zeta
0      1  a  1     6
1      1  b  2     6
2      1  c  3     6
>>> df >> extract('x', into=['A', 'B'], regex=r'(\w+),(\w+)', remove=False)
   alpha    x  A  B  zeta
0      1  a,1  a  1     6
1      1  b,2  b  2     6
2      1  c,3  c  3     6
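
The pairing between capturing groups and the names in into, including the effect of a None entry, can be sketched with the standard re module alone. This is only an illustration of the documented behaviour, not plydata's implementation; map_groups is a hypothetical helper, and the use of re.search is an assumption.

```python
import re


def map_groups(value, into, regex):
    """Hypothetical helper: pair each capturing group with a name from `into`."""
    match = re.search(regex, value)  # assumption: search semantics
    if match is None:
        # No match: every requested name gets a missing value.
        return {name: None for name in into if name is not None}
    return {
        name: group
        for name, group in zip(into, match.groups())
        if name is not None  # a None entry drops that group from the result
    }


both = map_groups('a,1', into=['A', 'B'], regex=r'(\w+),(\w+)')
only_b = map_groups('a,1', into=[None, 'B'], regex=r'(\w+),(\w+)')
print(both)    # {'A': 'a', 'B': '1'}
print(only_b)  # {'B': '1'}
```

Each group is matched to a name by position, so into needs one entry per group even when some of them are None.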

Convert extracted columns to appropriate data types.

>>> result = df >> extract(
...    'x', into=['A', 'B'], regex=r'(\w+),(\w+)', convert=True)
>>> result['B'].dtype
dtype('int64')
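
The idea behind convert can be sketched in plain Python as trying progressively more general types until one fits every value. The helper below is hypothetical and simplified: it covers int and float only, leaves out bool, and is not how plydata itself performs the conversion.

```python
def convert_values(values):
    """Hypothetical helper: try int, then float, else keep the strings."""
    for cast in (int, float):
        try:
            # The cast must succeed for every value in the column.
            return [cast(v) for v in values]
        except ValueError:
            continue
    return list(values)


ints = convert_values(['1', '2', '3'])
floats = convert_values(['1.5', '2'])
strings = convert_values(['a', 'b'])
print(ints)     # [1, 2, 3]
print(floats)   # [1.5, 2.0]
print(strings)  # ['a', 'b']
```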

The whole regex must match, not just the individual groups. Rows where it does not match are filled with NaN.

>>> df >> extract('x', into=['A', 'B'], regex=r'(\w+),([12]+)')
   alpha    A    B  zeta
0      1    a    1     6
1      1    b    2     6
2      1  NaN  NaN     6
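
The all-or-nothing behaviour in the last example can be reproduced with the standard re module. This is a minimal sketch under the assumption of re.search semantics, using None to stand in for NA; it is not plydata's code.

```python
import re

values = ['a,1', 'b,2', 'c,3', None]
pattern = re.compile(r'(\w+),([12]+)')

rows = []
for value in values:
    # A missing input never matches; otherwise search for the whole pattern.
    match = pattern.search(value) if value is not None else None
    # If the pattern as a whole fails, every group is missing, even though
    # the first group (\w+) could have matched 'c' on its own.
    rows.append(match.groups() if match else (None, None))

print(rows)
# [('a', '1'), ('b', '2'), (None, None), (None, None)]
```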