The easiest way to drop rows and columns from a Pandas DataFrame is with the
.drop() method, which accepts one or more labels passed in as
index=<rows to drop> and/or
columns=<cols to drop>:
import pandas as pd df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv").head() df # Expected result # date region revenue # 0 1999-01-02 APAC 928 # 1 1999-01-03 AMER 526 # 2 1999-01-04 EMEA 497 # 3 1999-01-06 APAC 135 # 4 1999-01-07 APAC 829 df.drop(columns='region', index=[0, 2, 4]) # Expected result # date revenue # 1 1999-01-03 526 # 3 1999-01-06 135
The above command drops both the
'region' column and rows
[0, 2, 4]. Notice the
index arguments both accept a single label or a “list-like” sequence of labels. A couple things to be aware of:
- “list-like” does not include tuples. For reasons I don’t fully understand, if you pass a tuple into
indexyou’ll get a
KeyError. You can pass in
- Before version 0.21.0, you need to drop rows and columns separately using the
df.drop('region', axis=1). Probably better to upgrade Pandas 🙂
Dropping by index
Even if your axis is not labeled with an integer index, you can still drop rows and columns by index: just slice the labels.
For example, to drop the odd-numbered rows and the even-numbered columns:
df.drop(index=df.index[1::2], columns=df.columns[::2]) # Expected result # region # 0 APAC # 2 EMEA # 4 APAC
The del operator
You can also drop individual columns with the
del operator. Be careful because this actually deletes the column (which is similar to passing
.drop()). Here, I’ll copy the DataFrame first so we can use it again later:
copied_df = df.copy() del copied_df["date"] copied_df # region revenue # 0 APAC 928 # 1 AMER 526 # 2 EMEA 497 # 3 APAC 135 # 4 APAC 829
Pandas also provides a couple convenience methods for dropping duplicate rows and rows with missing data.
You can drop duplicate rows with
df.drop_duplicates(subset='region') # Expected result # date region revenue # 0 1999-01-02 APAC 928 # 1 1999-01-03 AMER 526 # 2 1999-01-04 EMEA 497
And you can use
df.dropna() to drop rows with missing data. My dummy dataset doesn’t have any missing data so I’ll leave this one as an exercise for the reader! If you do this, you should think about:
- How am I going to insert a handful of missing values?
- How can I drop rows where only some columns have a missing value?
- How can I drop rows where any column has a missing value?