plydata.helper_verbs.query_at

class plydata.helper_verbs.query_at(*args, **kwargs)[source]

Query specific columns

Parameters
datadataframe, optional

Useful when not using the >> operator.

namestuple or dict

Names of columns in dataframe. If a tuple, they should be names of columns. If a dict, they keys must be in.

  • startswithstr or tuple, optional

    All column names that start with this string will be included.

  • endswithstr or tuple, optional

    All column names that end with this string will be included.

  • containsstr or tuple, optional

    All column names that contain with this string will be included.

  • matchesstr or regex or tuple, optional

    All column names that match the string or a compiled regex pattern will be included. A tuple can be used to match multiple regexs.

  • dropbool, optional

    If True, the selection is inverted. The unspecified/unmatched columns are returned instead. Default is False.

all_varsstr, optional

A predicate statement to evaluate. It should conform to python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). You should use {_} to refer to the column names.

After the statement is evaluated for all columns selected by the names specification, the union (|), is used to select the output rows.

any_varsstr, optional

A predicate statement to evaluate. It should conform to python syntax and should return an array of boolean values (one for every item in the column) or a single boolean (for the whole column). You should use {_} to refer to the column names.

After the statement is evaluated for all columns selected by the names specification, intersection (&), is used to select the output rows.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from plydata import *
>>> df = pd.DataFrame({
...     'alpha': list('aaabbb'),
...     'beta': list('babruq'),
...     'theta': list('cdecde'),
...     'x': [1, 2, 3, 4, 5, 6],
...     'y': [6, 5, 4, 3, 2, 1],
...     'z': [7, 9, 11, 8, 10, 12]
... })

Select all rows where any of the entries along the integer columns is a 4.

>>> df >> query_at(('x', 'y', 'z'), any_vars='({_} == 4)')
  alpha beta theta  x  y   z
2     a    b     e  3  4  11
3     b    r     c  4  3   8

The opposit, select all rows where none of the entries along the integer columns is a 4.

>>> df >> query_at(('x', 'y', 'z'), all_vars='({_} != 4)')
  alpha beta theta  x  y   z
0     a    b     c  1  6   7
1     a    a     d  2  5   9
4     b    u     d  5  2  10
5     b    q     e  6  1  12

For something more complicated, group-wise selection.

Select groups where any of the columns a large (> 28) sum. First by using summarize_at, we see that there is one such group. Then using query_at selects it.

>>> (df
...  >> group_by('alpha')
...  >> summarize_at(('x', 'y', 'z'), 'sum'))
  alpha   x   y   z
0     a   6  15  27
1     b  15   6  30
>>> (df
...  >> group_by('alpha')
...  >> query_at(('x', 'y', 'z'), any_vars='(sum({_}) > 28)'))
groups: ['alpha']
  alpha beta theta  x  y   z
3     b    r     c  4  3   8
4     b    u     d  5  2  10
5     b    q     e  6  1  12

Note that sum({_}) > 28 is a column operation, it returns a single number for the whole column. Therefore the whole column is either selected or not selected. Column operations are what enable group-wise selection.