Pandas provides a .query()
method on DataFrames with a convenient string syntax for filtering. Think of the .query()
syntax like the where
clause in SQL.
Here’s a basic example:
import pandas as pd
df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv")
df.query("region == 'APAC' and revenue < 300")
# Expected result
# date region revenue
# 3 1999-01-06 APAC 135
# 9 1999-01-18 APAC 147
# 11 1999-01-24 APAC 100
# 24 1999-03-20 APAC 108
The query string "region == 'APAC' and revenue < 300"
selects the rows where region
is 'APAC'
and revenue
is less than 300.
Pretty simple! You can also reference local variables by prefixing them with @
. If we wanted to get examples where revenue
was 1 standard deviation above the mean, we could first compute these values and then reference them in our query string:
avg_revenue = df.revenue.mean()
std_revenue = df.revenue.std()
df.query("revenue > @avg_revenue + @std_revenue")
# Expected result
# date region revenue
# 0 1999-01-02 APAC 928
# 8 1999-01-16 APAC 970
# 19 1999-02-16 EMEA 918
# 25 1999-03-23 AMER 972
# 26 1999-03-24 AMER 956
# 27 1999-03-24 EMEA 954
# 29 1999-03-26 AMER 994
But you can also call methods on the columns inside the string. Here’s the same query without pre-computing the mean and standard deviation:
df.query("revenue > revenue.mean() + revenue.std()")
# Expected result
# date region revenue
# 0 1999-01-02 APAC 928
# 8 1999-01-16 APAC 970
# 19 1999-02-16 EMEA 918
# 25 1999-03-23 AMER 972
# 26 1999-03-24 AMER 956
# 27 1999-03-24 EMEA 954
# 29 1999-03-26 AMER 994
It’s worth mentioning one other cool trick. You can check whether a column value is in a local list:
valid_dates = ["1999-01-02", "1999-01-03", "1999-01-04"]
df.query("date in @valid_dates")
# Expected result
# date region revenue
# 0 1999-01-02 APAC 928
# 1 1999-01-03 AMER 526
# 2 1999-01-04 EMEA 497
A few other things to be aware of:
- You can’t reference columns if they share a name with Python keywords.
- You can use backticks, e.g.
hello world
to reference a columns that aren’t valid Python variables. - The result is a new DataFrame, unless you pass
inplace=True
, in which case it modifies the existing DataFrame. - Performance of
.query()
will often be better than complex masking operations (such asdf[(df.region == "APAC") & (df.revenue < 300)]
), because.query()
doesn’t create intermediate objects, leaving everything in C.
Check out my Jupyter notebook if you want to play around with the .query()
method!