Pandas in 2019 - let's see what's new!

Posted on Thu 05 September 2019 in Data Science

In case you forgot to read the release notes for Pandas versions 0.24 & 0.25, don't worry. Below I have summarized the new functionalities that will make your coding faster, easier and more effective. All thanks to the Pandas open source community that is constantly improving the package!

In particular, we are going to look at the following examples:

  • Int64 type in Pandas
  • Named aggregations
  • pandas .explode method
  • JSON normalization
  • how .rename can raise KeyError
  • Using your favourite plotting backend directly in Pandas
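
All code snippets below assume that pandas (version 0.25 or newer, for the features described here) and NumPy have been imported first:

:::python
import pandas as pd
import numpy as np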

How int can live with None

:::python
cities = pd.DataFrame({'continent': ['Europe', 'Europe', 'Europe', 'North America', 'North America'],
                       'cities': ['barcelona', 'warsaw', 'paris', 'montreal', 'toronto'],
                       'residents': [1620000, 1777000, 2140000, None, 2731000],
                       'area': [101, 517, 105, None, 630]
                      })
cities
+---+---------------+-----------+-----------+-------+
|   |   continent   |  cities   | residents | area  |
+---+---------------+-----------+-----------+-------+
| 0 | Europe        | barcelona | 1620000.0 | 101.0 |
| 1 | Europe        | warsaw    | 1777000.0 | 517.0 |
| 2 | Europe        | paris     | 2140000.0 | 105.0 |
| 3 | North America | montreal  | NaN       | NaN   |
| 4 | North America | toronto   | 2731000.0 | 630.0 |
+---+---------------+-----------+-----------+-------+

As we can see, the residents and area columns are of float type even though we passed ints as input. That's because the fourth row contains NaN values. Luckily, we can now change that with the nullable Int64 data type.

:::python
cities[['residents', 'area']] = cities[['residents', 'area']].astype('Int64')
cities
+---+---------------+-----------+-----------+------+
|   |   continent   |  cities   | residents | area |
+---+---------------+-----------+-----------+------+
| 0 | Europe        | barcelona | 1620000   | 101  |
| 1 | Europe        | warsaw    | 1777000   | 517  |
| 2 | Europe        | paris     | 2140000   | 105  |
| 3 | North America | montreal  | NaN       | NaN  |
| 4 | North America | toronto   | 2731000   | 630  |
+---+---------------+-----------+-----------+------+

Works like a charm!
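
If you know up front that a column will hold integers with missing values, you can also build the nullable array directly instead of converting afterwards. A minimal sketch using pd.array (available since 0.24):

:::python
# pd.array creates an IntegerArray; None is stored as a missing value and the dtype stays Int64
pd.array([1620000, 1777000, 2140000, None, 2731000], dtype='Int64')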


Named aggregations

Named aggregation will make your groupby/agg/rename workflow much faster - now you can do it in one step.

This is how you can use it on a DataFrame:

:::python
# Named aggregation on a DataFrame
cities.groupby('continent').agg(
    avg_habitants=('residents', 'mean'),
    max_area=('area', 'max'),
)

Result:

+---------------+---------------+----------+
|               | avg_habitants | max_area |
+---------------+---------------+----------+
| continent     |               |          |
| Europe        | 1.845667e+06  |      517 |
| North America | 2.731000e+06  |      630 |
+---------------+---------------+----------+
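
For comparison, this is roughly the pattern that named aggregation replaces in older versions - aggregate first, then rename the resulting columns in a second step:

:::python
# Pre-0.25 equivalent: aggregate with a dict, then rename the columns
cities.groupby('continent') \
      .agg({'residents': 'mean', 'area': 'max'}) \
      .rename(columns={'residents': 'avg_habitants', 'area': 'max_area'})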

And on a Series:

:::python
# On Series
cities.groupby('continent').area.agg(
    max_area='max',
    min_area=np.min,
)

Result:

+---------------+----------+----------+
|               | max_area | min_area |
+---------------+----------+----------+
| continent     |          |          |
| Europe        |      517 |      101 |
| North America |      630 |      630 |
+---------------+----------+----------+
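
In the DataFrame case the keyword arguments are just (column, aggfunc) pairs; if you prefer to be explicit, pandas also ships pd.NamedAgg. A sketch equivalent to the DataFrame example above:

:::python
# Same result as the DataFrame example, written with pd.NamedAgg
cities.groupby('continent').agg(
    avg_habitants=pd.NamedAgg(column='residents', aggfunc='mean'),
    max_area=pd.NamedAgg(column='area', aggfunc='max'),
)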

Lists to multiple rows

Personally, I don't like storing lists in DataFrame cells, but let's be honest: it sometimes happens. Now we can use .explode to handle the transformation:

:::python
# Data column contains a list of two values
barcelona = pd.DataFrame({'city': ['Barcelona'],
                          'data': [[1620000, 101]]})
barcelona
+---+-----------+----------------+
|   |   city    |      data      |
+---+-----------+----------------+
| 0 | Barcelona | [1620000, 101] |
+---+-----------+----------------+

Here is how you can transform the data column to long format and get rid of the lists:

:::python
barcelona.explode('data')
+---+-----------+---------+
|   |   city    |  data   |
+---+-----------+---------+
| 0 | Barcelona | 1620000 |
| 0 | Barcelona |     101 |
+---+-----------+---------+

I bet that this is going to help you a lot!
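
By the way, .explode is also available on a plain Series and keeps the original index, which makes it easy to join the exploded values back to the source frame. A quick sketch:

:::python
# Series.explode repeats the original index (0 here) for every element of the list
barcelona['data'].explode()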


JSON as columns

We use JSON a lot; it has its pros and cons, but so far there was no convenient option in Pandas to read a nested JSON file such as the following:

:::python
data = [{
    'City': {'name': 'Barcelona',
             'area': 101,
             'residents': {'2018': 1600000, '2019': 1620000}},
    'Country': {'name': 'Spain',
                'continent': 'Europe'}
}]

Note that the residents attribute is nested inside the City attribute. But here comes Pandas with json_normalize and its new max_level parameter!

:::python
from pandas.io.json import json_normalize
json_normalize(data, max_level=2)
+---+-----------+-----------+---------------------+---------------------+--------------+-------------------+
|   | City.name | City.area | City.residents.2018 | City.residents.2019 | Country.name | Country.continent |
+---+-----------+-----------+---------------------+---------------------+--------------+-------------------+
| 0 | Barcelona |       101 |             1600000 |             1620000 | Spain        | Europe            |
+---+-----------+-----------+---------------------+---------------------+--------------+-------------------+

Note that the max_level parameter controls how many levels of nesting are flattened into columns.
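
To see the effect, you can lower max_level. If I read the docs correctly, with max_level=1 the residents dictionary is kept as a single City.residents column holding the raw dict instead of being split per year:

:::python
# Flatten only one level of nesting; deeper dicts stay as Python objects in their column
json_normalize(data, max_level=1)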


.rename raises errors

Earlier, such errors were silent. Now, if we try to rename a column that does not exist and set the appropriate flag, this happens:

:::python
cities.rename(columns={'does_not_exist': 'will_raise_an_error'}, errors='raise')

Result:

KeyError: "['does_not_exist'] not found in axis"
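
The default is still errors='ignore', so existing code keeps its old, silent behaviour - a missing key is simply skipped and the frame comes back unchanged:

:::python
# Default behaviour: the nonexistent column is silently ignored, no exception is raised
cities.rename(columns={'does_not_exist': 'will_not_raise'})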

Various plotting backends

This is something that I personally have been waiting for for a long time. Pandas now allows you to use different plotting backends straight from its API. To do so, you first need to find your favourite plotting package. Important: the package has to support being used as a Pandas backend. So far, I have tested two plotting packages:

  • Pandas_Bokeh -> source code and docs
  • Plotly -> unfortunately, I haven't found their Pandas implementation :(

To use it as a plotting backend (and say goodbye to Matplotlib...), we need to set the following option:

:::python
import pandas_bokeh
pd.set_option('plotting.backend', 'pandas_bokeh')

Now you can use it to create your favourite plots:

:::python
cities.plot_bokeh.scatter(
    x="residents",
    y="area",
    category="cities",
    title="Cities information",
    show_figure=True
)

Voilà!
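
If you ever want to go back to the default Matplotlib-based plotting in the same session, simply reset the option:

:::python
# Switch the plotting backend back to the built-in Matplotlib implementation
pd.set_option('plotting.backend', 'matplotlib')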


Summary

If you want to read the full release notes, you can find them here. And remember that you can also contribute to open source by either donating some money or pushing your own commits. Just go to the Pandas (or any other open source package) repo, find your first issue and create your first pull request! :)

Happy coding!