Generally, you should avoid iterating over rows in a DataFrame. Python for loops are slow and Pandas is not designed for this type of access pattern. Instead, you should be applying vectorized operations. But look, sometimes, for whatever reason, you’re going to want to do it anyway.
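As a rough illustration of what "vectorized" means here, the sketch below computes the same per-row product two ways. The DataFrame and its column names (`units`, `unit_price`, `total`) are made up for this example and are not part of the dummy sales dataset.

```python
import pandas as pd

# Hypothetical data; these column names are illustrative only.
df = pd.DataFrame({"units": [3, 5, 2], "unit_price": [2.0, 1.5, 4.0]})

# Row-by-row: a Python-level loop that touches every row individually.
totals_loop = [row.units * row.unit_price for row in df.itertuples()]

# Vectorized: a single expression over whole columns, no explicit loop.
df["total"] = df["units"] * df["unit_price"]
```

Both approaches give the same numbers, but the vectorized version pushes the loop down into Pandas, which is usually what you want on larger frames.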
If you absolutely must iterate over the rows in your DataFrame, use the .itertuples() method:
import pandas as pd
df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv").head()
for row in df.itertuples():
    print(row.Index, row.date)
# Expected result
# 0 1999-01-02
# 1 1999-01-03
# 2 1999-01-04
# 3 1999-01-06
# 4 1999-01-07
This is better than .iterrows() because:
- .itertuples() preserves the data type of the column values. .iterrows() returns each row as a Series, which has a single dtype, so mixed columns get upcast.
- It’s faster.
- You get namedtuples, which let you access the column values as attributes, e.g. row.date. Pandas also includes the index as row.Index for good measure.
- The namedtuples are immutable. This is good because modifying rows generated by .iterrows(), another popular method for iterating through Pandas rows, is unreliable: depending on the data types you may get a copy rather than a view, so writes can silently have no effect.
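To make the data type and immutability points concrete, here is a minimal sketch. It uses a small made-up DataFrame (the `units` and `price` columns are assumptions for illustration, not columns from the dummy dataset).

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
df = pd.DataFrame({"units": [1, 2], "price": [1.5, 2.5]})

# .iterrows() yields each row as a Series. A Series has one dtype,
# so the integer value is upcast to float to sit next to the float column.
_, series_row = next(df.iterrows())
units_from_iterrows = series_row["units"]

# .itertuples() yields a namedtuple, so each field keeps its own type.
tuple_row = next(df.itertuples())
units_from_itertuples = tuple_row.units

# namedtuples are immutable: assigning to a field raises AttributeError.
try:
    tuple_row.units = 99
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```

Here `units_from_iterrows` comes back as a float while `units_from_itertuples` stays an integer, and the attempted assignment fails loudly instead of quietly doing nothing.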
One additional thing to point out is that if your columns have names that don’t allow them to be accessed as attributes, such as "bad column name", you can still use .itertuples(), but it becomes a little uglier. In this case, pass index=False so that positions in each tuple line up with column positions, and then you can access values by position or with df.columns.get_loc("bad column name"). Here’s an example with both approaches on our dummy dataset:
for row in df.itertuples(index=False):
    print(row[0], row[df.columns.get_loc("region")])
# Expected result
# 1999-01-02 APAC
# 1999-01-03 AMER
# 1999-01-04 EMEA
# 1999-01-06 APAC
# 1999-01-07 APAC
That’s not a huge deal, but it’s probably better to rename your column instead.
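Renaming could look something like the sketch below. The DataFrame here is made up for illustration, and `good_column_name` is just a placeholder for whatever valid identifier you prefer.

```python
import pandas as pd

# Hypothetical frame with a column name that is not a valid Python identifier.
df = pd.DataFrame(
    {
        "date": ["1999-01-02", "1999-01-03"],
        "bad column name": ["APAC", "AMER"],
    }
)

# rename() returns a new DataFrame; the original is left untouched.
renamed = df.rename(columns={"bad column name": "good_column_name"})

# Attribute access now works on the namedtuples.
values = [row.good_column_name for row in renamed.itertuples(index=False)]
```

After the rename you are back to plain attribute access and no longer need the positional lookup.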