Pandas in 2019 - let's see what's new!
Posted on Thu, 05 September 2019 in Data Science
In case you forgot to read the release notes for Pandas versions 0.24 & 0.25, don't worry. Below I have summarized the new functionality that will make your coding faster, easier and more effective. All thanks to the Pandas open source community, which is constantly improving the package!
In particular, we are going to see the following examples:

- the Int64 type in Pandas
- named aggregations
- the pandas .explode method
- JSON normalization
- how .rename can raise KeyError
- using your favourite plotting backend directly in Pandas

How int can live with None
:::python
import pandas as pd

cities = pd.DataFrame({'continent': ['Europe', 'Europe', 'Europe', 'North America', 'North America'],
                       'cities': ['barcelona', 'warsaw', 'paris', 'montreal', 'toronto'],
                       'residents': [1620000, 1777000, 2140000, None, 2731000],
                       'area': [101, 517, 105, None, 630]})
cities
+---+---------------+-----------+-----------+-------+
| | continent | cities | residents | area |
+---+---------------+-----------+-----------+-------+
| 0 | Europe | barcelona | 1620000.0 | 101.0 |
| 1 | Europe | warsaw | 1777000.0 | 517.0 |
| 2 | Europe | paris | 2140000.0 | 105.0 |
| 3 | North America | montreal | NaN | NaN |
| 4 | North America | toronto | 2731000.0 | 630.0 |
+---+---------------+-----------+-----------+-------+
As we can see, the residents and area columns are of float type even though we passed int values as input. That's because the fourth row contains NaN. Luckily, we can now fix that with the Int64 data type.
:::python
cities[['residents', 'area']] = cities[['residents', 'area']].astype('Int64')
cities
+---+---------------+-----------+-----------+------+
| | continent | cities | residents | area |
+---+---------------+-----------+-----------+------+
| 0 | Europe | barcelona | 1620000 | 101 |
| 1 | Europe | warsaw | 1777000 | 517 |
| 2 | Europe | paris | 2140000 | 105 |
| 3 | North America | montreal | NaN | NaN |
| 4 | North America | toronto | 2731000 | 630 |
+---+---------------+-----------+-----------+------+
Works like a charm!
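You can also build a nullable-integer Series directly, without the float detour. A minimal sketch (the sample values are my own, taken from the table above):

```python
import pandas as pd

# Construct a Series with the nullable Int64 dtype from the start;
# None becomes a proper missing value instead of forcing a float cast
s = pd.Series([1620000, None, 2731000], dtype='Int64')
print(s.dtype)   # Int64
print(s.sum())   # missing values are skipped: 4351000
```

Aggregations like sum or mean skip the missing entries by default, just as they do for float columns with NaN.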
Named aggregations
Named aggregations will make your groupby/rename workflow much faster. Now you can do it in one line:
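For comparison, here is a sketch of the pre-0.25 two-step pattern this replaces: aggregate first, then rename the resulting columns (the trimmed-down frame below is just for illustration):

```python
import pandas as pd

cities = pd.DataFrame({'continent': ['Europe', 'Europe', 'North America'],
                       'residents': [1620000, 1777000, 2731000]})

# Old pattern: .agg with a dict, then a separate .rename call
old = (cities.groupby('continent')
             .agg({'residents': 'mean'})
             .rename(columns={'residents': 'avg_habitants'}))
print(old)
```

Named aggregations collapse both steps into a single .agg call, as shown next.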
This is how you can use it on DataFrame:
:::python
# Named aggregations on DataFrame
cities.groupby('continent').agg(
avg_habitants = ('residents', 'mean'),
max_area = ('area', 'max')
)
Result:
+---------------+---------------+----------+
| | avg_habitants | max_area |
+---------------+---------------+----------+
| continent | | |
| Europe | 1.845667e+06 | 517 |
| North America | 2.731000e+06 | 630 |
+---------------+---------------+----------+
And on series:
:::python
# On Series
cities.groupby('continent').area.agg(
max_area='max',
min_area=np.min,
)
Result:
+---------------+----------+----------+
| | max_area | min_area |
+---------------+----------+----------+
| continent | | |
| Europe | 517 | 101 |
| North America | 630 | 630 |
+---------------+----------+----------+
list to multiple rows
Personally, I don't like storing lists in DataFrame cells, but let's be honest: it sometimes happens.
Now we can use .explode to handle the transformation:
:::python
# Data column contains a list of two values
barcelona = pd.DataFrame({'city': ['Barcelona'],
'data': [[1620000,101]]})
barcelona
+---+-----------+----------------+
| | city | data |
+---+-----------+----------------+
| 0 | Barcelona | [1620000, 101] |
+---+-----------+----------------+
Here is how you can transform the data column to long format and get rid of the lists:
:::python
barcelona.explode('data')
+---+-----------+---------+
| | city | data |
+---+-----------+---------+
| 0 | Barcelona | 1620000 |
| 0 | Barcelona | 101 |
+---+-----------+---------+
I bet that this is going to help you a lot!
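If you would rather spread each list into separate columns than extra rows, a common idiom (my own sketch, not a 0.25 feature; the residents/area column names are assumptions) is to expand the lists with tolist():

```python
import pandas as pd

barcelona = pd.DataFrame({'city': ['Barcelona'],
                          'data': [[1620000, 101]]})

# Turn each two-element list into its own pair of columns
wide = pd.DataFrame(barcelona['data'].tolist(),
                    columns=['residents', 'area'],
                    index=barcelona.index)
result = barcelona[['city']].join(wide)
print(result)
```

This keeps one row per city, which is often what you want for further analysis.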
JSON as columns
We use JSON a lot; it has its pros and cons, but so far there was no convenient option in Pandas to read a nested JSON file such as the following:
:::python
data = [{
'City': {'name': 'Barcelona',
'area': 101,
'residents': {'2018': 1600000, '2019': 1620000}},
'Country': {'name': 'Spain',
'continent': 'Europe'}
}]
See that the residents attribute is a descendant of the City attribute.
But here comes Pandas with its new json_normalize method!
:::python
from pandas.io.json import json_normalize
json_normalize(data, max_level=2)
+---+-----------+-----------+---------------------+---------------------+--------------+-------------------+
| | City.name | City.area | City.residents.2018 | City.residents.2019 | Country.name | Country.continent |
+---+-----------+-----------+---------------------+---------------------+--------------+-------------------+
| 0 | Barcelona | 101 | 1600000 | 1620000 | Spain | Europe |
+---+-----------+-----------+---------------------+---------------------+--------------+-------------------+
Note that the max_level parameter controls how many levels of nesting are flattened into columns.
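Lowering max_level keeps deeper structures intact as single cells. A sketch (using a trimmed version of the data above; note that since pandas 1.0 the function is also available at the top level as pd.json_normalize):

```python
import pandas as pd

data = [{'City': {'name': 'Barcelona',
                  'residents': {'2018': 1600000, '2019': 1620000}}}]

# max_level=1 flattens only one level: the residents dict
# stays as a plain dict inside its cell
df = pd.json_normalize(data, max_level=1)
print(df.columns.tolist())  # ['City.name', 'City.residents']
```

This is handy when you want to keep part of the JSON for later, more targeted processing.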
.rename raises errors
Previously, such errors were silent. Now, if you try to rename a column that does not exist and set the appropriate flag, this happens:
:::python
cities.rename({'does_not_exist': 'will_raise_an_error'}, errors='raise')
:::python
KeyError: "['does_not_exist'] not found in axis"
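Without the flag, the default errors='ignore' still swallows unknown labels, which is easy to verify with a small sketch (the tiny frame below is just for illustration):

```python
import pandas as pd

cities = pd.DataFrame({'continent': ['Europe'], 'cities': ['paris']})

# Default errors='ignore': the unknown label is silently skipped
same = cities.rename(columns={'does_not_exist': 'new_name'})
print(same.columns.tolist())  # ['continent', 'cities']

# errors='raise' turns the same typo into a KeyError
try:
    cities.rename(columns={'does_not_exist': 'new_name'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

Raising on typos is a small change, but it catches a whole class of silent bugs in data pipelines.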
Various plotting backends
This is something I personally have been waiting for for a long time. Pandas now allows you to use different plotting backends straight from its API. To do so, you first need to find your favourite plotting package. Important: the package has to provide a Pandas backend. So far, I have tested two plotting packages:

- Pandas_Bokeh -> Source code and docs
- Plotly -> unfortunately, I haven't found their Pandas implementation :(
To use it as a plotting backend (and say goodbye to Matplotlib...), we need to set the following option:
:::python
import pandas_bokeh
pd.set_option('plotting.backend', 'pandas_bokeh')
Now you can use it to create your favourite plots:
:::python
cities.plot_bokeh.scatter(
x="residents",
y="area",
category="cities",
title="Cities information",
show_figure=True
)
Voilà!
Summary
If you want to read the whole release notes, you can find them here. And remember that you can also contribute to open source, either by donating some money or by pushing your commits. Just go to the Pandas (or any other open source package) repo, find your first issue and create your first pull request! :)
Happy coding!