plydata.helper_verbs.query_at¶

class plydata.helper_verbs.query_at(*args, **kwargs)[source]¶

Query specific columns

Parameters

datadataframe, optional

Useful when not using the >> operator.

namestuple or dict

Names of columns in dataframe. If a tuple, they should be names of columns. If a dict, they keys must be in.

startswithstr or tuple, optional
All column names that start with this string will be included.
endswithstr or tuple, optional
All column names that end with this string will be included.
containsstr or tuple, optional
All column names that contain with this string will be included.
matchesstr or regex or tuple, optional
All column names that match the string or a compiled regex pattern will be included. A tuple can be used to match multiple regexs.
dropbool, optional
If True, the selection is inverted. The unspecified/unmatched columns are returned instead. Default is False.

all_varsstr, optional

A predicate statement to evaluate. It should conform to python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). You should use {_} to refer to the column names.

After the statement is evaluated for all columns selected by the names specification, the union (|), is used to select the output rows.

any_varsstr, optional

After the statement is evaluated for all columns selected by the names specification, intersection (&), is used to select the output rows.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })

Select all rows where any of the entries along the integer columns is a 4.

>>> df >> query_at(('x', 'y', 'z'), any_vars='({_} == 4)')
  alpha beta theta  x  y   z
2     a    b     e  3  4  11
3     b    r     c  4  3   8

The opposit, select all rows where none of the entries along the integer columns is a 4.

>>> df >> query_at(('x', 'y', 'z'), all_vars='({_} != 4)')
  alpha beta theta  x  y   z
0     a    b     c  1  6   7
1     a    a     d  2  5   9
4     b    u     d  5  2  10
5     b    q     e  6  1  12

For something more complicated, group-wise selection.

Select groups where any of the columns a large (> 28) sum. First by using summarize_at, we see that there is one such group. Then using query_at selects it.

>>> (df
...  >> group_by('alpha')
...  >> summarize_at(('x', 'y', 'z'), 'sum'))
  alpha   x   y   z
0     a   6  15  27
1     b  15   6  30
>>> (df
...  >> group_by('alpha')
...  >> query_at(('x', 'y', 'z'), any_vars='(sum({_}) > 28)'))
groups: ['alpha']
  alpha beta theta  x  y   z
3     b    r     c  4  3   8
4     b    u     d  5  2  10
5     b    q     e  6  1  12

Note that sum({_}) > 28 is a column operation, it returns a single number for the whole column. Therefore the whole column is either selected or not selected. Column operations are what enable group-wise selection.