plydata.helper_verbs.query_at¶
-
class
plydata.helper_verbs.
query_at
(*args, **kwargs)[source]¶ Query specific columns
- Parameters
- data
dataframe
, optional Useful when not using the
>>
operator.- names
tuple
ordict
Names of columns in dataframe. If a tuple, they should be names of columns. If a
dict
, they keys must be in.- startswithstr or tuple, optional
All column names that start with this string will be included.
- endswithstr or tuple, optional
All column names that end with this string will be included.
- containsstr or tuple, optional
All column names that contain with this string will be included.
- matchesstr or regex or tuple, optional
All column names that match the string or a compiled regex pattern will be included. A tuple can be used to match multiple regexs.
- dropbool, optional
If
True
, the selection is inverted. The unspecified/unmatched columns are returned instead. Default isFalse
.
- all_vars
str
, optional A predicate statement to evaluate. It should conform to python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). You should use
{_}
to refer to the column names.After the statement is evaluated for all columns selected by the names specification, the union (
|
), is used to select the output rows.- any_vars
str
, optional A predicate statement to evaluate. It should conform to python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). You should use
{_}
to refer to the column names.After the statement is evaluated for all columns selected by the names specification, intersection (
&
), is used to select the output rows.
- data
Examples
>>> import pandas as pd >>> import numpy as np >>> from plydata import * >>> df = pd.DataFrame({ ... 'alpha': list('aaabbb'), ... 'beta': list('babruq'), ... 'theta': list('cdecde'), ... 'x': [1, 2, 3, 4, 5, 6], ... 'y': [6, 5, 4, 3, 2, 1], ... 'z': [7, 9, 11, 8, 10, 12] ... })
Select all rows where any of the entries along the integer columns is a 4.
>>> df >> query_at(('x', 'y', 'z'), any_vars='({_} == 4)') alpha beta theta x y z 2 a b e 3 4 11 3 b r c 4 3 8
The opposit, select all rows where none of the entries along the integer columns is a 4.
>>> df >> query_at(('x', 'y', 'z'), all_vars='({_} != 4)') alpha beta theta x y z 0 a b c 1 6 7 1 a a d 2 5 9 4 b u d 5 2 10 5 b q e 6 1 12
For something more complicated, group-wise selection.
Select groups where any of the columns a large (> 28) sum. First by using
summarize_at
, we see that there is one such group. Then usingquery_at
selects it.>>> (df ... >> group_by('alpha') ... >> summarize_at(('x', 'y', 'z'), 'sum')) alpha x y z 0 a 6 15 27 1 b 15 6 30 >>> (df ... >> group_by('alpha') ... >> query_at(('x', 'y', 'z'), any_vars='(sum({_}) > 28)')) groups: ['alpha'] alpha beta theta x y z 3 b r c 4 3 8 4 b u d 5 2 10 5 b q e 6 1 12
Note that
sum({_}) > 28
is a column operation, it returns a single number for the whole column. Therefore the whole column is either selected or not selected. Column operations are what enable group-wise selection.