plydata.helper_verbs.query_all

class plydata.helper_verbs.query_all(*args, **kwargs)

Query all columns

Parameters

data : dataframe, optional

Useful when not using the >> operator.

all_vars : str, optional

A predicate statement to evaluate. It should be valid Python syntax and should return either an array of boolean values (one for every item in the column) or a single boolean (for the whole column). Use {_} as a placeholder for the column name.

After the statement is evaluated for all columns, the intersection (&) of the results is used to select the output rows, so a row is kept only if the predicate holds for every column.

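any_vars : str, optional

A predicate statement to evaluate, with the same requirements as for all_vars.

After the statement is evaluated for all columns, the union (|) of the results is used to select the output rows, so a row is kept if the predicate holds for at least one column.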
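The evaluate-then-combine behavior described by these parameters can be mimicked in plain pandas. The following is only a rough sketch, not plydata's implementation: `select_rows` and its `how` argument are hypothetical names, column names are assumed to be valid Python identifiers, and `any` is taken to mean "at least one column satisfies the predicate" while `all` means "every column does", which is what the examples on this page show.

```python
import operator
from functools import reduce

import pandas as pd


def select_rows(df, predicate, how):
    """Hypothetical helper (not part of plydata): per-column predicate, then combine."""
    columns = {name: df[name] for name in df.columns}
    masks = []
    for name in df.columns:
        # Substitute the column name for the {_} placeholder, then evaluate.
        # eval is used for illustration only; never run it on untrusted input.
        result = eval(predicate.format(_=name), {}, columns)
        if not isinstance(result, pd.Series):
            # A single boolean applies to the whole column.
            result = pd.Series(bool(result), index=df.index)
        masks.append(result)
    # 'all' keeps rows where every column passes (&);
    # anything else keeps rows where at least one column passes (|).
    combine = operator.and_ if how == 'all' else operator.or_
    return df[reduce(combine, masks)]


df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, 12],
})

any_rows = select_rows(df, '({_} == 4)', how='any')
all_rows = select_rows(df, '({_} != 4)', how='all')
print(list(any_rows.index))  # [2, 3]
print(list(all_rows.index))  # [0, 1, 4, 5]
```

Only numeric columns are used here to keep the comparison semantics simple.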

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })

Select all rows where any of the entries along the columns is a 4.

>>> df >> query_all(any_vars='({_} == 4)')
  alpha beta theta  x  y   z
2     a    b     e  3  4  11
3     b    r     c  4  3   8

The opposite: select all rows where none of the entries along the columns is a 4.

>>> df >> query_all(all_vars='({_} != 4)')
  alpha beta theta  x  y   z
0     a    b     c  1  6   7
1     a    a     d  2  5   9
4     b    u     d  5  2  10
5     b    q     e  6  1  12
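The two selections above are complements of each other: "no column equals 4" is the negation of "some column equals 4". As a quick check, assuming only the numeric columns of the example frame, the same masks can be built directly in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, 12],
})

any_four = (df == 4).any(axis=1)  # rows with at least one 4
no_four = (df != 4).all(axis=1)   # rows with no 4 at all

# Every row falls into exactly one of the two selections.
is_complement = bool((no_four == ~any_four).all())
print(is_complement)  # True
```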

For something more complicated, consider group-wise selection.

Select groups where any of the columns has a large (> 28) sum. First, using summarize_all, we see that there is one such group. Then query_all selects it.

>>> (df
...  >> group_by('alpha')
...  >> select('x', 'y', 'z')
...  >> summarize_all('sum'))
  alpha   x   y   z
0     a   6  15  27
1     b  15   6  30
>>> (df
...  >> group_by('alpha')
...  >> select('x', 'y', 'z')
...  >> query_all(any_vars='(sum({_}) > 28)'))
groups: ['alpha']
  alpha  x  y   z
3     b  4  3   8
4     b  5  2  10
5     b  6  1  12

Note that sum({_}) > 28 is a column operation: it reduces the whole column to a single value. Within each group, the whole column is therefore either selected or not selected, and the rows of the group are kept or dropped together. Column operations are what enable group-wise selection.
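For comparison, the same group-wise selection can be sketched in plain pandas. This is an illustrative equivalent under the assumption that a per-group result is broadcast to every row of the group; it is not how plydata implements the verb, and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, 12],
})
value_cols = ['x', 'y', 'z']

# Per-group column sums, repeated on every row of the group.
group_sums = df.groupby('alpha')[value_cols].transform('sum')

# Each comparison is constant within a group, so a group passes or fails as a whole.
# Keep the rows where at least one column sum exceeds 28.
mask = (group_sums > 28).any(axis=1)
selected = df[mask]
print(list(selected.index))  # [3, 4, 5]
```

Because the mask is constant within each group, the selection can never split a group.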