plydata.helper_verbs.query_if¶

class plydata.helper_verbs.query_if(*args, **kwargs)[source]¶

Query all columns that match a predicate

Parameters

datadataframe, optional

Useful when not using the >> operator.

predicatefunction

A predicate function to be applied to the columns of the dataframe. Good candidates for predicate functions are those that check the type of the column. Such function are avaible at pandas.api.dtypes, for example pandas.api.types.is_numeric_dtype().

For convenience, you can reference the is_*_dtype functions with shorter strings:

'is_bool'             # pandas.api.types.is_bool_dtype
'is_categorical'      # pandas.api.types.is_categorical_dtype
'is_complex'          # pandas.api.types.is_complex_dtype
'is_datetime64_any'   # pandas.api.types.is_datetime64_any_dtype
'is_datetime64'       # pandas.api.types.is_datetime64_dtype
'is_datetime64_ns'    # pandas.api.types.is_datetime64_ns_dtype
'is_datetime64tz'     # pandas.api.types.is_datetime64tz_dtype
'is_float'            # pandas.api.types.is_float_dtype
'is_int64'            # pandas.api.types.is_int64_dtype
'is_integer'          # pandas.api.types.is_integer_dtype
'is_interval'         # pandas.api.types.is_interval_dtype
'is_numeric'          # pandas.api.types.is_numeric_dtype
'is_object'           # pandas.api.types.is_object_dtype
'is_period'           # pandas.api.types.is_period_dtype
'is_signed_integer'   # pandas.api.types.is_signed_integer_dtype
'is_string'           # pandas.api.types.is_string_dtype
'is_timedelta64'      # pandas.api.types.is_timedelta64_dtype
'is_timedelta64_ns'   # pandas.api.types.is_timedelta64_ns_dtype
'is_unsigned_integer' # pandas.api.types.is_unsigned_integer_dtype

No other string values are allowed.

all_varsstr, optional

A predicate statement to evaluate. It should conform to python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). You should use {_} to refer to the column names.

After the statement is evaluated for all columns selected by the predicate, the union (|), is used to select the output rows.

any_varsstr, optional

After the statement is evaluated for all columns selected by the predicate, intersection (&), is used to select the output rows.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })

Select all rows where any of the entries along the integer columns is a 4.

>>> df >> query_if('is_integer', any_vars='({_} == 4)')
  alpha beta theta  x  y   z
2     a    b     e  3  4  11
3     b    r     c  4  3   8

The opposite, select all rows where none of the entries along the integer columns is a 4.

>>> df >> query_if('is_integer', all_vars='({_} != 4)')
  alpha beta theta  x  y   z
0     a    b     c  1  6   7
1     a    a     d  2  5   9
4     b    u     d  5  2  10
5     b    q     e  6  1  12

For something more complicated, group-wise selection.

Select groups where any of the columns a large (> 28) sum. First by using summarize_if, we see that there is one such group. Then using query_if selects it.

>>> (df
...  >> group_by('alpha')
...  >> summarize_if('is_integer', 'sum'))
  alpha   x   y   z
0     a   6  15  27
1     b  15   6  30
>>> (df
...  >> group_by('alpha')
...  >> query_if('is_integer', any_vars='(sum({_}) > 28)'))
groups: ['alpha']
  alpha beta theta  x  y   z
3     b    r     c  4  3   8
4     b    u     d  5  2  10
5     b    q     e  6  1  12

Note that sum({_}) > 28 is a column operation, it returns a single number for the whole column. Therefore the whole column is either selected or not selected. Column operations are what enable group-wise selection.