df.drop()
The easiest way to drop rows and columns from a Pandas DataFrame is with the .drop() method, which accepts one or more labels passed in as index=<rows to drop> and/or columns=<cols to drop>:
import pandas as pd
df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv").head()
df
# Expected result
#          date region  revenue
# 0  1999-01-02   APAC      928
# 1  1999-01-03   AMER      526
# 2  1999-01-04   EMEA      497
# 3  1999-01-06   APAC      135
# 4  1999-01-07   APAC      829
df.drop(columns='region', index=[0, 2, 4])
# Expected result
#          date  revenue
# 1  1999-01-03      526
# 3  1999-01-06      135The above command drops both the 'region' column and rows [0, 2, 4]. Notice the columns and index arguments both accept a single label or a “list-like” sequence of labels. A couple things to be aware of:
- “list-like” does not include tuples. For reasons I don’t fully understand, if you pass a tuple into columnsorindexyou’ll get aKeyError. You can pass inrangeobjects.
- Before version 0.21.0, you need to drop rows and columns separately using the axisargument, e.g.df.drop('region', axis=1). Probably better to upgrade Pandas 🙂
Dropping by index
Even if your axis is not labeled with an integer index, you can still drop rows and columns by index: just slice the labels.
For example, to drop the odd-numbered rows and the even-numbered columns:
df.drop(index=df.index[1::2], columns=df.columns[::2])
# Expected result
#   region
# 0   APAC
# 2   EMEA
# 4   APACThe del operator
You can also drop individual columns with the del operator. Be careful because this actually deletes the column (which is similar to passing inplace=True to .drop()). Here, I’ll copy the DataFrame first so we can use it again later:
copied_df = df.copy()
del copied_df["date"]
copied_df
#   region  revenue
# 0   APAC      928
# 1   AMER      526
# 2   EMEA      497
# 3   APAC      135
# 4   APAC      829Row-dropping utilities
Pandas also provides a couple convenience methods for dropping duplicate rows and rows with missing data.
You can drop duplicate rows with df.drop_duplicates():
df.drop_duplicates(subset='region')
# Expected result
#          date region  revenue
# 0  1999-01-02   APAC      928
# 1  1999-01-03   AMER      526
# 2  1999-01-04   EMEA      497And you can use df.dropna() to drop rows with missing data. My dummy dataset doesn’t have any missing data so I’ll leave this one as an exercise for the reader! If you do this, you should think about:
- How am I going to insert a handful of missing values?
- How can I drop rows where only some columns have a missing value?
- How can I drop rows where any column has a missing value?
Happy tinkering!
