df.drop()
The easiest way to drop rows and columns from a Pandas DataFrame is with the .drop()
method, which accepts one or more labels passed in as index=<rows to drop>
and/or columns=<cols to drop>
:
import pandas as pd
df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv").head()
df
# Expected result
# date region revenue
# 0 1999-01-02 APAC 928
# 1 1999-01-03 AMER 526
# 2 1999-01-04 EMEA 497
# 3 1999-01-06 APAC 135
# 4 1999-01-07 APAC 829
df.drop(columns='region', index=[0, 2, 4])
# Expected result
# date revenue
# 1 1999-01-03 526
# 3 1999-01-06 135
The above command drops both the 'region'
column and rows [0, 2, 4]
. Notice the columns
and index
arguments both accept a single label or a “list-like” sequence of labels. A couple things to be aware of:
- “list-like” does not include tuples. For reasons I don’t fully understand, if you pass a tuple into
columns
orindex
you’ll get aKeyError
. You can pass inrange
objects. - Before version 0.21.0, you need to drop rows and columns separately using the
axis
argument, e.g.df.drop('region', axis=1)
. Probably better to upgrade Pandas 🙂
Dropping by index
Even if your axis is not labeled with an integer index, you can still drop rows and columns by index: just slice the labels.
For example, to drop the odd-numbered rows and the even-numbered columns:
df.drop(index=df.index[1::2], columns=df.columns[::2])
# Expected result
# region
# 0 APAC
# 2 EMEA
# 4 APAC
The del operator
You can also drop individual columns with the del
operator. Be careful because this actually deletes the column (which is similar to passing inplace=True
to .drop()
). Here, I’ll copy the DataFrame first so we can use it again later:
copied_df = df.copy()
del copied_df["date"]
copied_df
# region revenue
# 0 APAC 928
# 1 AMER 526
# 2 EMEA 497
# 3 APAC 135
# 4 APAC 829
Row-dropping utilities
Pandas also provides a couple convenience methods for dropping duplicate rows and rows with missing data.
You can drop duplicate rows with df.drop_duplicates()
:
df.drop_duplicates(subset='region')
# Expected result
# date region revenue
# 0 1999-01-02 APAC 928
# 1 1999-01-03 AMER 526
# 2 1999-01-04 EMEA 497
And you can use df.dropna()
to drop rows with missing data. My dummy dataset doesn’t have any missing data so I’ll leave this one as an exercise for the reader! If you do this, you should think about:
- How am I going to insert a handful of missing values?
- How can I drop rows where only some columns have a missing value?
- How can I drop rows where any column has a missing value?
Happy tinkering!