GroupBy on steroids, unleash GroupBy's full potential with the use of .agg()
Posted on pon 15 kwietnia 2019 in Data Science
We all use .groupby
and it saves tons of our time and effort. But do you know how one can:
- use
.groupby()
and apply more than one aggregation function on the same column? - apply different aggregation functions over different columns in the same line of code?
- use a custom function as an aggregator?
That's when .agg()
comes into play! Let's see how you can use it in your code.
Our input dataset - AirBnB prices in Barcelona
We will use the dataset provided by Inside Airbnb, describing all apartments that were listed on AirBnB in Barcelona in 2018. Dataset contains 4 columns:
::python
df = pd.read_csv('listings.csv', dtype={'name': str})
df.head(5)
Output:
+---+-------------------------------------------------+---------------------+-----------------+---------------------+
| | name | neighbourhood_group | price_per_night | annual_availability |
+---+-------------------------------------------------+---------------------+-----------------+---------------------+
| 0 | Flat with Sunny Terrace | Sant Martí | 130 | 23 |
| 1 | Huge flat for 8 people close to Sagrada Familia | Eixample | 60 | 61 |
| 2 | Nice and sunny duble room | Sant Martí | 33 | 300 |
| 3 | NICE ROOM AVAILABLE IN THE HEART OF GRACIA | Gràcia | 31 | 36 |
| 4 | Room for 2, Sagrada Famili | Horta-Guinardó | 42 | 190 |
+---+-------------------------------------------------+---------------------+-----------------+---------------------+
In name
column there are names of all listings from AirBnB in Barcelona. neighbourhood_group
describes where a particular listings is located.
price_per_night
tells how much does a rental cost and annual_availability
specifies how many days per year a particular listing was available.
.agg()
can help us quickly synthesize this data especially if we need to write more complex aggregations on DataFrame.
Use of .agg()
.agg()
is an alias for aggregate. Usually it is used in combination with .groupby()
when we need more complex grouping to be applied.
It allows you to specify an exact aggregation function (it can be more than one) that should be used to calculate
results on a particular axis.
Enough talk, let's see an example:
::python
df_mean_prices = (
df.groupby(by=['neighbourhood_group'])
# Pass a dictionary with the structure: {column_name: [operation]}
.agg({'price_per_night': ['mean']})
)
df_mean_prices
Output:
price_per_night
--------------------- -----------------
mean
neighbourhood_group
Ciutat Vella 70.306712
Eixample 142.417469
Gràcia 89.263447
Horta-Guinardó 76.709565
Les Corts 120.997409
Nou Barris 35.365957
Sant Andreu 60.475610
Sant Martí 94.208556
Sants-Montjuïc 77.347344
Sarrià-Sant Gervasi 107.404908
Notice that thanks to the dictionary that we passed to .agg()
we can not only control which columns are interesting
to us, but also what is the aggregation function that we want to apply.
As a result we got a summary with mean prices per night in Barcelona for each separate district.
Right now, we would get the same result by using df.groupby(by=['neighbourhood_group']).price_per_night.mean()
.
But .agg()
gives use more flexibility.
.agg()
- different operations on columns
Let's build on the previous example and see why we should care about .agg()
:::python
df_prices = (
df.groupby(by=['neighbourhood_group'])
# Use different aggregation function for each column
.agg({'price_per_night': ['mean', 'min', 'max'],
'annual_availability': ['mean', 'count']})
)
df_prices
Output:
price_per_night annual_availability
--------------------- ----------------- --------------------- ------ ------------ ------
mean min max mean count
neighbourhood_group
Ciutat Vella 70.306712 8 2000 183.585497 4082
Eixample 142.417469 8 6000 203.462267 6228
Gràcia 89.263447 7 5000 189.768949 1636
Horta-Guinardó 76.709565 9 1800 184.514783 575
Les Corts 120.997409 9 7777 182.072539 386
Nou Barris 35.365957 11 175 191.540426 235
Sant Andreu 60.475610 11 2000 192.426829 328
Sant Martí 94.208556 8 3000 193.895965 2057
Sants-Montjuïc 77.347344 8 8000 185.933025 2165
Sarrià-Sant Gervasi 107.404908 9 2000 213.748466 652
This time dictionary that we passed to .agg()
contains list of functions that we selected to apply on a particular
column groups. Very powerful when you want to quickly synthesize dataset using several aggregation functions.
But this is not the end...
.agg()
- use of custom aggregation function
I saved the best part for the end, so I hope that your ready for it. Aggregation function doesn't have to be pre-defined.
You can create your own one as long as it returns a single value!
In our dataset, we have names
column that contains all listing names. Let's say that we would like to check how many times
word "sunny" and "friendly" was mentioned in the listing's name.
We start from defining out aggregation functions:
:::python
def count_sunny_in_description(titles):
count = 0
for title in titles:
if 'sunny' in str(title):
count += 1
return count
def count_friendly_in_description(titles):
count = 0
for title in titles:
if 'friendly' in str(title):
count += 1
return count
Pretty straightforward, we return count
which is an integer stating how many times a particular word appeared in the name.
Now, having defined our aggregation functions, we can use these in the same manner as in the previous example:
:::python
df_sunny_rainy = (
df.groupby(by=['neighbourhood_group'])
# Use custom aggregation function for each column
.agg({'name': [count_sunny_in_description, count_friendly_in_description]})
)
df_sunny_rainy
Output:
name
---------------------------- -------------------------------------------------------------
count_sunny_in_description count_friendly_in_description
neighbourhood_group
Ciutat Vella 30 7
Eixample 46 8
Gràcia 19 1
Horta-Guinardó 3 0
Les Corts 8 1
Nou Barris 0 0
Sant Andreu 2 3
Sant Martí 21 7
Sants-Montjuïc 26 3
Sarrià-Sant Gervasi 10 0
And voilà, now you can analyze all types of data using .groupby()
and .agg()
as long as you write an appropriate aggregation function!
Summary
In this post, we examined the use of .agg()
, especially in combination with .groupby()
.
No more merging of several .groupby()
results!
From now on, you should always use .agg()
when you need more than one or custom aggregation operations over
the DataFrame's axis.
Happy coding!