For loop in Pandas a.k.a. df.apply(). Why you shouldn't iterate over rows.
Posted on Mon 08 April 2019 in Data Science
A wise man once said:
If you use for loop in Pandas, something smells fishy.
Why?! It works and my output is exactly what I wanted! - you might think. In this article I will show you why the wise man is right.
How does .iterrows() work?
If you have ever iterated over rows, which is the most popular use case of a for loop in Pandas, there is a good chance that you used this construction:
:::python
for index, row in df.iterrows():
    # Access any cell in row and set it to 0
    df.loc[index, 'column_name'] = 0
It's very convenient, because one line of code gives you access both to the index of the DataFrame and to all values in the row. However, what if we wanted to assign a particular value only when a given condition is met? We would probably write something like this:
:::python
for index, row in df.iterrows():
    # Check if the value in the cell fulfils the condition
    if df.loc[index, 'column_name'] == 1:
        # Set the matching cell to 0
        df.loc[index, 'column_name'] = 0
It still works. So why should we care about .apply() if we can do everything with for loops? Let's see how long each approach takes to execute. To do that, we will use a pre-generated dataset with daily real estate prices. The DataFrame contains 3 columns: date, price per day, and price per night. Source code and input dataset can be downloaded here.
df.head()
+---+------------+-----------+-------------+
| | date | price_day | price_night |
+---+------------+-----------+-------------+
| 0 | 2018-11-06 | 60.0 | 54.0 |
| 1 | 2018-11-05 | 60.0 | 54.0 |
| 2 | 2018-11-04 | 60.0 | 54.0 |
| 3 | 2018-11-03 | 60.0 | 54.0 |
| 4 | 2018-11-02 | 65.0 | 58.5 |
+---+------------+-----------+-------------+
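If you don't have the original file at hand, a stand-in with the same shape is enough to follow along. The sketch below is only an assumption about the layout: the prices are made up, and the 90% night-to-day ratio is inferred from the rows shown above, not taken from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's dataset: 4000 consecutive days,
# newest first, with made-up prices (this is NOT the original data).
rng = np.random.default_rng(seed=0)
n_rows = 4000
dates = pd.date_range(end="2018-11-06", periods=n_rows, freq="D")[::-1]
price_day = rng.choice([55.0, 60.0, 65.0, 70.0], size=n_rows)

df = pd.DataFrame({
    "date": dates,
    "price_day": price_day,
    # Assumption: the night price is 90% of the day price, as in the sample rows.
    "price_night": price_day * 0.9,
})

print(df.head())
```

The date column holds real datetime values, so `.month` works on each cell in the examples that follow.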
This DataFrame has 4000 rows, so it is neither big nor small. Let's see how .iterrows() performs on it in terms of execution time:
:::python
%%time
# Assumes numpy is already imported as np
# Create new column and assign default value to it
df['month'] = np.nan
# Use of iterrows, almost always not preferable!
# Always try to use .apply instead!
for index, row in df.iterrows():
    df.loc[index, 'month'] = df.loc[index, 'date'].month
We created a new column 'month' and assigned np.nan as its initial value.
We iterated over rows, which means that we went over all rows in the DataFrame one by one.
With .loc[], we extracted the month from the date value in each row and assigned it to the df['month'] column.
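Part of the cost of such a loop comes from what .iterrows() hands back on each pass. A minimal sketch on a tiny, made-up frame (the `demo` name and its values are only for illustration) shows the type of each yielded row:

```python
import pandas as pd

# Tiny hypothetical frame, used only to inspect what .iterrows() yields.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-11-06", "2018-11-05"]),
    "price_day": [60.0, 60.0],
})

for index, row in demo.iterrows():
    # Each row arrives as a newly built Series keyed by column name.
    print(index, type(row).__name__, row["date"].month)
```

Building one Series per row, and then writing back through .loc on every pass, adds up quickly as the number of rows grows.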
We achieved the desired result in 1.92 seconds, so this approach runs on the order of seconds. Now let's see how you can decrease this time by more than 2 orders of magnitude!
Use of .apply()
.apply() is the Pandas way to perform iterations on columns/rows. Strictly speaking, it is not vectorized: the function is still called once per element or row. In the example below it is much faster than the loop above mainly because it avoids building a Series for every row and making a label-based .loc assignment on every iteration. Moreover, its syntax is easy to understand once you see a few examples. A more detailed explanation of apply (how it works under the hood) can be found in this excellent post here. Now, let's go back to our dataset and see it in action. The same operation done with .apply():
%%time
df['month'] = df['date'].apply(lambda x: x.month)
We created a new column named "month".
We called .apply on the date column with a lambda function that returns the month from a datetime.
Execution time on the same data set: 8 ms. It's way faster, and it will stay that way as long as you use it right. The official Pandas documentation describes DataFrame.apply as follows: "Apply a function along an axis of the DataFrame."
As we know, the axis can be either rows or columns, and you control this with the axis parameter. What is important to remember is that the function you apply should usually return a single value in order to work correctly. We can rewrite the previous example to be more verbose:
:::python
def extract_month(x):
    month = x.month
    return month

df['month'] = df.date.apply(extract_month)
We created a function that returns exactly one value (month).
We passed this function as a parameter to .apply, which calls it on each value in the date column.
.apply() with condition
I am sure that you are already starting to see the advantages of this approach. Inside the function that you pass to .apply you can do literally anything as long as you return a single value. Let's check how we can apply a condition.
:::python
# Create function that checks multiple conditions
def extract_month(x):
    month = x.month
    if month == 11:
        return "It's november. So cold!"
    elif month == 6:
        return "It's june. I love sun!"
    else:
        return "It's ok."

# Apply function on column
df['month'] = df.date.apply(extract_month)
Easy, clear and, above all, very, very fast.
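The conditional loop from the .iterrows() section can be written the same way. This is a minimal sketch on made-up data; `df_flags` and `reset_ones` are hypothetical names, and 'column_name' mirrors the placeholder used earlier.

```python
import pandas as pd

# Hypothetical data; 'column_name' mirrors the placeholder used earlier.
df_flags = pd.DataFrame({"column_name": [1, 2, 1, 3]})


def reset_ones(value):
    # Return 0 where the cell equals 1, otherwise keep the value unchanged.
    if value == 1:
        return 0
    return value


df_flags["column_name"] = df_flags["column_name"].apply(reset_ones)
print(df_flags["column_name"].tolist())  # [0, 2, 0, 3]
```

The function still returns a single value for every cell, so cells that do not match the condition have to be returned unchanged rather than skipped.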
.apply() on multiple cells in row
You have probably been waiting for this part. You don't have to limit yourself to applying a function to only one column. You can apply it to the whole DataFrame as well. Let's go straight to the example:
:::python
# Calculate mean price per day
def mean_cost_per_day(row):
    mean_cost = np.mean([row.price_day, row.price_night])
    return mean_cost

df['mean_cost_per_day'] = df.apply(mean_cost_per_day, axis=1)
The main difference lies in how the function is executed: now it takes each row (or column, if you wish) as its input. It then accesses the values in individual cells by column name after a dot. In the last line, note that instead of calling apply on a single column, we called it on the whole DataFrame. We also set the axis parameter to 1, which tells Pandas to pass each row to the function.
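The effect of the axis parameter is easiest to see on a tiny frame. The sketch below uses two made-up rows shaped like the dataset (the `prices` name is only for illustration) and applies one function per column and another per row:

```python
import pandas as pd

# Two hypothetical rows shaped like the article's dataset.
prices = pd.DataFrame({
    "price_day": [60.0, 65.0],
    "price_night": [54.0, 58.5],
})

# axis=0 (the default): the function receives each column as a Series.
per_column = prices.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row as a Series.
per_row = prices.apply(lambda row: row.mean(), axis=1)

print(per_column)  # one value per column: 65.0 and 58.5
print(per_row)     # one value per row: 57.0 and 61.75
```

With axis=0 the result is indexed by column names, and with axis=1 it is indexed by the row index, which is why the row-wise result can be assigned straight to a new column.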
Summary
In this post, we examined the use of df.apply(). It's the Pandas way to do row/column iteration for the following reasons:
It's very fast, especially as your data grows.
You can "iterate" over both columns and rows by setting the axis parameter.
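The speed claim can be checked with a rough benchmark outside a notebook. The sketch below is not the article's original measurement: the helper names and the generated data are hypothetical, and the timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd


def month_with_iterrows(frame):
    # Row-by-row loop, in the spirit of the first timing example.
    out = frame.copy()
    out["month"] = np.nan
    for index, row in frame.iterrows():
        out.loc[index, "month"] = row["date"].month
    return out


def month_with_apply(frame):
    # Same result computed with Series.apply.
    out = frame.copy()
    out["month"] = out["date"].apply(lambda x: x.month)
    return out


def time_call(func, frame):
    # Wall-clock seconds for a single call; a rough measure, not a rigorous benchmark.
    start = time.perf_counter()
    func(frame)
    return time.perf_counter() - start


timings = {}
for n_rows in (500, 2000):
    # Hypothetical data: consecutive days with a constant price.
    frame = pd.DataFrame({
        "date": pd.date_range("2018-01-01", periods=n_rows, freq="D"),
        "price_day": 60.0,
    })
    timings[n_rows] = (
        time_call(month_with_iterrows, frame),
        time_call(month_with_apply, frame),
    )
    print(n_rows, "rows (iterrows s, apply s):", timings[n_rows])
```

Increasing the row counts in the loop is a simple way to see how the gap between the two approaches changes with the size of the data.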
From now on, every time you think of row iteration, you should consider using df.apply() instead :)!
Happy coding!