Filtering DataFrames with the .query() Method in Pandas

Posted 2021-02-19 • Last updated 2021-10-15

Pandas provides a .query() method on DataFrames with a convenient string syntax for filtering. Think of the .query() syntax like the where clause in SQL.

Here’s a basic example:

import pandas as pd

df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv")
df.query("region == 'APAC' and revenue < 300")

# Expected result
#           date region  revenue
# 3   1999-01-06   APAC      135
# 9   1999-01-18   APAC      147
# 11  1999-01-24   APAC      100
# 24  1999-03-20   APAC      108

The query string "region == 'APAC' and revenue < 300" selects the rows where region is 'APAC' and revenue is less than 300.

Pretty simple! You can also reference local variables by prefixing them with @. If we wanted to get examples where revenue was 1 standard deviation above the mean, we could first compute these values and then reference them in our query string:

avg_revenue = df.revenue.mean()
std_revenue = df.revenue.std()
df.query("revenue > @avg_revenue + @std_revenue")

# Expected result
#           date region  revenue
# 0   1999-01-02   APAC      928
# 8   1999-01-16   APAC      970
# 19  1999-02-16   EMEA      918
# 25  1999-03-23   AMER      972
# 26  1999-03-24   AMER      956
# 27  1999-03-24   EMEA      954
# 29  1999-03-26   AMER      994

But you can also call methods on the columns inside the string. Here’s the same query without pre-computing the mean and standard deviation:

df.query("revenue > revenue.mean() + revenue.std()")

# Expected result
#           date region  revenue
# 0   1999-01-02   APAC      928
# 8   1999-01-16   APAC      970
# 19  1999-02-16   EMEA      918
# 25  1999-03-23   AMER      972
# 26  1999-03-24   AMER      956
# 27  1999-03-24   EMEA      954
# 29  1999-03-26   AMER      994

It’s worth mentioning one other cool trick. You can check whether a column value is in a local list:

valid_dates = ["1999-01-02", "1999-01-03", "1999-01-04"]
df.query("date in @valid_dates")

# Expected result
#          date region  revenue
# 0  1999-01-02   APAC      928
# 1  1999-01-03   AMER      526
# 2  1999-01-04   EMEA      497

A few other things to be aware of:

  • You can’t reference columns if they share a name with Python keywords.
  • You can use backticks, e.g. hello world to reference a columns that aren’t valid Python variables.
  • The result is a new DataFrame, unless you pass inplace=True, in which case it modifies the existing DataFrame.
  • Performance of .query() will often be better than complex masking operations (such as df[(df.region == "APAC") & (df.revenue < 300)]), because .query() doesn’t create intermediate objects, leaving everything in C.

Check out my Jupyter notebook if you want to play around with the .query() method!

If you want to improve your knowledge of Pandas, I recommend Chapter 3 of the Python Data Science Handbook. I get commissions for purchases made through this link. So you can learn more about Pandas and support the blog at the same time!

Connect

Contact

ben [at] sparrow [dot] dev

Email List