plydata.helper_verbs.query_if

class plydata.helper_verbs.query_if(*args, **kwargs)

Query all columns that match a predicate
Parameters
- data : dataframe, optional
  Useful when not using the >> operator.
- predicate : function
  A predicate function to be applied to the columns of the dataframe. Good candidates for predicate functions are those that check the type of the column. Such functions are available in pandas.api.types, for example pandas.api.types.is_numeric_dtype().

  For convenience, you can reference the is_*_dtype functions with shorter strings:

      'is_bool'              # pandas.api.types.is_bool_dtype
      'is_categorical'       # pandas.api.types.is_categorical_dtype
      'is_complex'           # pandas.api.types.is_complex_dtype
      'is_datetime64_any'    # pandas.api.types.is_datetime64_any_dtype
      'is_datetime64'        # pandas.api.types.is_datetime64_dtype
      'is_datetime64_ns'     # pandas.api.types.is_datetime64_ns_dtype
      'is_datetime64tz'      # pandas.api.types.is_datetime64tz_dtype
      'is_float'             # pandas.api.types.is_float_dtype
      'is_int64'             # pandas.api.types.is_int64_dtype
      'is_integer'           # pandas.api.types.is_integer_dtype
      'is_interval'          # pandas.api.types.is_interval_dtype
      'is_numeric'           # pandas.api.types.is_numeric_dtype
      'is_object'            # pandas.api.types.is_object_dtype
      'is_period'            # pandas.api.types.is_period_dtype
      'is_signed_integer'    # pandas.api.types.is_signed_integer_dtype
      'is_string'            # pandas.api.types.is_string_dtype
      'is_timedelta64'       # pandas.api.types.is_timedelta64_dtype
      'is_timedelta64_ns'    # pandas.api.types.is_timedelta64_ns_dtype
      'is_unsigned_integer'  # pandas.api.types.is_unsigned_integer_dtype

  No other string values are allowed.
- all_vars : str, optional
  A predicate statement to evaluate. It should conform to Python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). Use {_} to refer to the column names.

  After the statement is evaluated for every column selected by the predicate, the intersection (&) of the results is used to select the output rows, so a row is kept only if the statement holds for all of those columns.
- any_vars : str, optional
  A predicate statement to evaluate. It should conform to Python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). Use {_} to refer to the column names.

  After the statement is evaluated for every column selected by the predicate, the union (|) of the results is used to select the output rows, so a row is kept if the statement holds for at least one of those columns.
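The shortcut strings listed for the predicate parameter differ from the pandas function names only by the missing _dtype suffix. The sketch below illustrates that mapping and how a predicate picks columns, using plain pandas. It is not plydata's implementation: resolve_predicate, matching_columns and the float column w are hypothetical names introduced only for this illustration.

```python
import pandas as pd
import pandas.api.types as pdtypes


def resolve_predicate(predicate):
    """Hypothetical helper: map a shortcut such as 'is_numeric' to the
    pandas.api.types function of the same name plus '_dtype'.
    Callables are returned unchanged."""
    if callable(predicate):
        return predicate
    return getattr(pdtypes, predicate + '_dtype')


def matching_columns(df, predicate):
    """Return the names of the columns that satisfy the predicate."""
    func = resolve_predicate(predicate)
    return [name for name in df.columns if func(df[name])]


df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'x': [1, 2, 3, 4, 5, 6],
    'w': [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],  # assumed extra float column
})

integer_columns = matching_columns(df, 'is_integer')
numeric_columns = matching_columns(df, 'is_numeric')
float_columns = matching_columns(df, pdtypes.is_float_dtype)
```

Here integer_columns contains only x, numeric_columns contains x and w, and passing the function directly behaves the same as passing its shortcut string.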
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })
Select all rows where any of the entries along the integer columns is a 4.
>>> df >> query_if('is_integer', any_vars='({_} == 4)')
  alpha beta theta  x  y   z
2     a    b     e  3  4  11
3     b    r     c  4  3   8
The opposite: select all rows where none of the entries along the integer columns is a 4.
>>> df >> query_if('is_integer', all_vars='({_} != 4)')
  alpha beta theta  x  y   z
0     a    b     c  1  6   7
1     a    a     d  2  5   9
4     b    u     d  5  2  10
5     b    q     e  6  1  12
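To make the role of {_} concrete, the two queries above can be approximated in plain pandas by substituting each integer column name into the statement and joining the results with | (for any_vars) or & (for all_vars). This is a sketch of the documented semantics only. combined_statement is a hypothetical helper, and the use of DataFrame.query is an assumption, not a description of how plydata evaluates the statement internally.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, 12],
})

# Columns picked by the 'is_integer' predicate: x, y and z
columns = [name for name in df.columns if is_integer_dtype(df[name])]


def combined_statement(statement, columns, operator):
    """Hypothetical helper: substitute every column name for {_} and
    join the resulting expressions with the given boolean operator."""
    return f' {operator} '.join(statement.format(_=name) for name in columns)


# any_vars: a row qualifies if the statement holds for at least one column
any_expr = combined_statement('({_} == 4)', columns, '|')
# all_vars: a row qualifies only if the statement holds for every column
all_expr = combined_statement('({_} != 4)', columns, '&')

any_rows = df.query(any_expr)   # rows with index 2 and 3
all_rows = df.query(all_expr)   # rows with index 0, 1, 4 and 5
```

The expanded any_vars expression is (x == 4) | (y == 4) | (z == 4), which selects the same rows as the first query_if example.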
For something more complicated, consider group-wise selection.

Select groups where any of the columns has a large (> 28) sum. First, using summarize_if, we see that there is one such group. Then query_if selects it.

>>> (df
...  >> group_by('alpha')
...  >> summarize_if('is_integer', 'sum'))
  alpha   x   y   z
0     a   6  15  27
1     b  15   6  30
>>> (df
...  >> group_by('alpha')
...  >> query_if('is_integer', any_vars='(sum({_}) > 28)'))
groups: ['alpha']
  alpha beta theta  x  y   z
3     b    r     c  4  3   8
4     b    u     d  5  2  10
5     b    q     e  6  1  12
Note that sum({_}) > 28 is a column operation: it reduces the whole column to a single value. Therefore, within each group, either every row is selected or none is. Column operations are what enable group-wise selection.
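The same group-wise selection can be sketched in plain pandas, which may help when checking what a column operation does inside each group. This is an illustrative equivalent built on groupby and transform, not plydata's implementation, and the variable names are chosen for this example only.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, 12],
})

# Columns picked by the 'is_integer' predicate: x, y and z
columns = [name for name in df.columns if is_integer_dtype(df[name])]

# Per-group column sums, repeated on every row of the group
group_sums = df.groupby('alpha')[columns].transform('sum')

# any_vars semantics: keep a row if any column's group sum exceeds 28.
# Every row of a group shares the same sums, so a group is kept or
# dropped as a whole.
mask = (group_sums > 28).any(axis=1)
selected = df[mask]
```

Only group b qualifies, because its z column sums to 30, so selected holds the rows with index 3, 4 and 5.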