GroupBy on steroids, unleash GroupBy's full potential with the use of .agg()

Posted on pon 15 kwietnia 2019 in Data Science

We all use .groupby and it saves tons of our time and effort. But do you know how one can:

  • use .groupby() and apply more than one aggregation function on the same column?
  • apply different aggregation functions over different columns in the same line of code?
  • use a custom function as an aggregator?

That's when .agg() comes into play! Let's see how you can use it in your code.


Our input dataset - AirBnB prices in Barcelona

We will use the dataset provided by Inside Airbnb, describing all apartments that were listed on AirBnB in Barcelona in 2018. Dataset contains 4 columns:

::python
df = pd.read_csv('listings.csv', dtype={'name': str})
df.head(5)

Output:

+---+-------------------------------------------------+---------------------+-----------------+---------------------+
|   |                      name                       | neighbourhood_group | price_per_night | annual_availability |
+---+-------------------------------------------------+---------------------+-----------------+---------------------+
| 0 | Flat with Sunny Terrace                         | Sant Martí          |             130 |                  23 |
| 1 | Huge flat for 8 people close to Sagrada Familia | Eixample            |              60 |                  61 |
| 2 | Nice and sunny duble room                       | Sant Martí          |              33 |                 300 |
| 3 | NICE ROOM AVAILABLE IN THE HEART OF GRACIA      | Gràcia              |              31 |                  36 |
| 4 | Room for 2, Sagrada Famili                      | Horta-Guinardó      |              42 |                 190 |
+---+-------------------------------------------------+---------------------+-----------------+---------------------+

In name column there are names of all listings from AirBnB in Barcelona. neighbourhood_group describes where a particular listings is located. price_per_night tells how much does a rental cost and annual_availability specifies how many days per year a particular listing was available.

.agg() can help us quickly synthesize this data especially if we need to write more complex aggregations on DataFrame.


Use of .agg()

.agg() is an alias for aggregate. Usually it is used in combination with .groupby() when we need more complex grouping to be applied. It allows you to specify an exact aggregation function (it can be more than one) that should be used to calculate results on a particular axis.

Enough talk, let's see an example:

::python
df_mean_prices = ( 
            df.groupby(by=['neighbourhood_group'])
              # Pass a dictionary with the structure: {column_name: [operation]}
              .agg({'price_per_night': ['mean']})
     )

df_mean_prices

Output:

                        price_per_night  
 --------------------- ----------------- 
                              mean       
  neighbourhood_group                    
  Ciutat Vella                70.306712  
  Eixample                   142.417469  
  Gràcia                      89.263447  
  Horta-Guinardó              76.709565  
  Les Corts                  120.997409  
  Nou Barris                  35.365957  
  Sant Andreu                 60.475610  
  Sant Martí                  94.208556  
  Sants-Montjuïc              77.347344  
  Sarrià-Sant Gervasi        107.404908  

Notice that thanks to the dictionary that we passed to .agg() we can not only control which columns are interesting to us, but also what is the aggregation function that we want to apply. As a result we got a summary with mean prices per night in Barcelona for each separate district.

Right now, we would get the same result by using df.groupby(by=['neighbourhood_group']).price_per_night.mean(). But .agg() gives use more flexibility.


.agg() - different operations on columns

Let's build on the previous example and see why we should care about .agg()

:::python
df_prices = (
            df.groupby(by=['neighbourhood_group'])
              # Use different aggregation function for each column
              .agg({'price_per_night': ['mean', 'min', 'max'],
                    'annual_availability': ['mean', 'count']})
     )

df_prices

Output:

                        price_per_night                               annual_availability
 --------------------- ----------------- --------------------- ------ ------------ ------ 
                        mean              min                   max    mean        count  
  neighbourhood_group                                                                     
  Ciutat Vella          70.306712         8                     2000   183.585497   4082  
  Eixample              142.417469        8                     6000   203.462267   6228  
  Gràcia                89.263447         7                     5000   189.768949   1636  
  Horta-Guinardó        76.709565         9                     1800   184.514783    575  
  Les Corts             120.997409        9                     7777   182.072539    386  
  Nou Barris            35.365957         11                    175    191.540426    235  
  Sant Andreu           60.475610         11                    2000   192.426829    328  
  Sant Martí            94.208556         8                     3000   193.895965   2057  
  Sants-Montjuïc        77.347344         8                     8000   185.933025   2165  
  Sarrià-Sant Gervasi   107.404908        9                     2000   213.748466    652  

This time dictionary that we passed to .agg() contains list of functions that we selected to apply on a particular column groups. Very powerful when you want to quickly synthesize dataset using several aggregation functions.

But this is not the end...


.agg() - use of custom aggregation function

I saved the best part for the end, so I hope that your ready for it. Aggregation function doesn't have to be pre-defined. You can create your own one as long as it returns a single value! In our dataset, we have names column that contains all listing names. Let's say that we would like to check how many times word "sunny" and "friendly" was mentioned in the listing's name.

We start from defining out aggregation functions:

:::python
def count_sunny_in_description(titles):
    count = 0
    for title in titles:
        if 'sunny' in str(title):
            count += 1
    return count

def count_friendly_in_description(titles):
    count = 0
    for title in titles:
        if 'friendly' in str(title):
            count += 1
    return count

Pretty straightforward, we return count which is an integer stating how many times a particular word appeared in the name.

Now, having defined our aggregation functions, we can use these in the same manner as in the previous example:

:::python
df_sunny_rainy = (
        df.groupby(by=['neighbourhood_group'])
              # Use custom aggregation function for each column
              .agg({'name': [count_sunny_in_description, count_friendly_in_description]})
     )

df_sunny_rainy

Output:

                              name                                
 ---------------------------- ------------------------------------------------------------- 
                              count_sunny_in_description      count_friendly_in_description
  neighbourhood_group                                             
  Ciutat Vella                 30                              7  
  Eixample                     46                              8  
  Gràcia                       19                              1  
  Horta-Guinardó               3                               0  
  Les Corts                    8                               1  
  Nou Barris                   0                               0  
  Sant Andreu                  2                               3  
  Sant Martí                   21                              7  
  Sants-Montjuïc               26                              3  
  Sarrià-Sant Gervasi          10                              0  

And voilà, now you can analyze all types of data using .groupby() and .agg() as long as you write an appropriate aggregation function!


Summary

In this post, we examined the use of .agg(), especially in combination with .groupby(). No more merging of several .groupby() results!

From now on, you should always use .agg() when you need more than one or custom aggregation operations over the DataFrame's axis.

Happy coding!