Can Pandas understand that tall is taller than short? Use of categorical variables.
Posted on Mon 22 April 2019 in Data Science
Why categorical variables?
As your data science projects grow, sooner or later you will realize that the amount of RAM your computer has is not enough. You probably know that categorical variables let you reduce the memory used by columns that contain categorical data. Examples of categorical data are:
- Column: height, values: short, tall
- Column: country, values: USA, Poland
- Column: food, values: pizza, kebab
Usually these will be strings containing some information on things that can't be written numerically.
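To make the memory claim concrete, here is a minimal sketch that compares the footprint of the same column stored as plain strings and as a categorical. The sample size, the random seed and the variable names are assumptions chosen for illustration, and the exact byte counts depend on your Pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows drawn from just two country names
rng = np.random.default_rng(0)
country = pd.Series(rng.choice(["USA", "Poland"], size=100_000), name="country")

# The same values stored as a categorical column
country_cat = country.astype("category")

# deep=True also counts the memory held by the string values themselves
string_bytes = country.memory_usage(deep=True)
category_bytes = country_cat.memory_usage(deep=True)

print(f"strings:     {string_bytes:,} bytes")
print(f"categorical: {category_bytes:,} bytes")
```

The categorical version stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.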
Apart from memory optimization, the categorical data type lets you tell Pandas about the semantics of your data, and this is what we are going to focus on in this post.
In particular, I am going to show you how to:
- convert a column to categorical data and use df.describe() on it
- take advantage of the pd.cut() Pandas method
- include data semantics and be able to execute the statement tall > short
From continuous data to categorical description - use of pd.cut()
Often our datasets contain a column of numerical data with a clear semantic meaning. An example is a column containing the height of a person in centimeters. Sometimes we only care about the quick information this column conveys, such as whether a person is short, medium or tall. Let's see how we can do this in Pandas with the pd.cut() method.
In order to understand how it works, let's see the following example:
::python
import numpy as np
import pandas as pd
# Create a DataFrame with a height column containing heights in centimeters
df = pd.DataFrame({'height': np.random.randint(100, 200, 20)})
# Create labels for your bins
labels = ["short", "medium", "tall"]
# Create ranges for your categories (the bins parameter). Notice that the bins need to cover the whole
# numerical range of the x column.
df['group'] = pd.cut(x=df.height, bins=[99, 150, 180, 200], labels=labels)
# In this case our bins will be created in the following way:
# short -> (99,150]
# medium -> (150,180]
# tall -> (180, 200]
df.head()
Output:
+---+--------+--------+
| | height | group |
+---+--------+--------+
| 0 | 178 | medium |
| 1 | 104 | short |
| 2 | 178 | medium |
| 3 | 121 | short |
| 4 | 157 | medium |
+---+--------+--------+
A few things to highlight:
- labels are passed as a list
- the bins parameter needs to cover the whole numerical range of the analyzed column; values that fall outside the bins become NaN
- the bins list needs exactly one more element than the labels list, because each label corresponds to the range between two consecutive bin edges.
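The sketch below illustrates the points about bins on a few hand-picked heights (the values are made up for illustration). It relies on the default right=True behaviour of pd.cut(), where every bin includes its right edge.

```python
import pandas as pd

labels = ["short", "medium", "tall"]

# Hypothetical heights: 95 and 210 lie outside the bins, 150 and 180 sit on bin edges
heights = pd.Series([95, 120, 150, 151, 180, 200, 210])

# Four bin edges give three ranges: (99, 150], (150, 180], (180, 200]
groups = pd.cut(heights, bins=[99, 150, 180, 200], labels=labels)
print(pd.DataFrame({"height": heights, "group": groups}))

# Three labels with only three edges (two ranges) is rejected by pd.cut()
mismatch_error = None
try:
    pd.cut(heights, bins=[99, 150, 200], labels=labels)
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)
```

Heights outside the outermost edges end up as NaN rather than raising an error, so it is worth checking the result with isna() when the range of the data is not known in advance.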
Use of .describe()
Once we have our group column containing the semantic meaning of height, it's a great moment to introduce Pandas categorical variables.
Pandas provides a special object, CategoricalDtype, to create this type of variable. We initialize it by providing a list of categories as a parameter. Then we convert the column with the .astype() method, passing the category object as a parameter. Finally, we can use the .describe() method, which I am sure you already call on columns with numerical values. It has its application for categorical columns as well.
::python
from pandas.api.types import CategoricalDtype
height_order = CategoricalDtype(categories=['short', 'medium', 'tall'], ordered=True)
df.group = df.group.astype(height_order)
df.group.describe()
Output:
count 20
unique 3
top medium
freq 9
Name: group, dtype: object
We see that .describe() now describes our categorical column perfectly. It tells us how many non-missing values the column has (count), how many unique categories occur in it (unique; here short, medium and tall), which category is the most frequent (top), and how many times that category occurs in the column (freq).
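Note that .describe() reports only the number of unique categories, not the labels themselves. One way to list them is the .cat accessor together with value_counts(). A small sketch on a hand-made column (the sample values are assumptions for illustration):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

height_order = CategoricalDtype(categories=["short", "medium", "tall"], ordered=True)

# Hypothetical column with five already-labelled rows
group = pd.Series(["medium", "short", "medium", "tall", "medium"]).astype(height_order)

# Summary statistics: count, unique, top, freq
summary = group.describe()
print(summary)

# The category labels in their declared order
categories = list(group.cat.categories)
print(categories)

# Number of rows per category
counts = group.value_counts()
print(counts)
```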
Categorical columns - specifying order and taking advantage of it
There is one thing that we haven't discussed yet. Notice that when we created our CategoricalDtype, we set the ordered parameter to True.
By doing so, we tell Pandas that our category has the following semantics: short < medium < tall. From now on Pandas knows that tall is taller than short and medium!
This comes in handy when we want to sort a DataFrame by a categorical column.
::python
# Using the category order to sort values (rows within the same group are ordered by height)
df.sort_values(by=['group', 'height'], ascending=False).head()
Output:
+----+--------+--------+
| | height | group |
+----+--------+--------+
| 17 | 196 | tall |
| 8 | 195 | tall |
| 9 | 194 | tall |
| 5 | 190 | tall |
| 0 | 178 | medium |
+----+--------+--------+
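To see what the ordered flag changes, compare sorting the same labels as plain strings and as an ordered categorical. This is a minimal sketch on made-up values:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels stored as plain strings
labels = pd.Series(["tall", "short", "medium", "short"])

# Plain strings are sorted alphabetically, so "medium" comes before "short"
alphabetical = labels.sort_values().tolist()
print(alphabetical)

# With an ordered categorical the declared order short < medium < tall is used
height_order = CategoricalDtype(categories=["short", "medium", "tall"], ordered=True)
semantic = labels.astype(height_order).sort_values().tolist()
print(semantic)
```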
Moreover, we can use methods such as .min() and .max() on categorical columns:
::python
df.group.min(), df.group.max()
Output:
('short', 'tall')
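These methods depend on the ordered flag. The following sketch, built on made-up values, suggests what happens when the same categories are declared without an order:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

values = ["tall", "short", "medium"]
categories = ["short", "medium", "tall"]

ordered = pd.Series(values).astype(CategoricalDtype(categories, ordered=True))
unordered = pd.Series(values).astype(CategoricalDtype(categories, ordered=False))

# With ordered=True the declared order decides the minimum and maximum
lowest, highest = ordered.min(), ordered.max()
print(lowest, highest)

# Without an order Pandas refuses to pick a minimum
try:
    unordered.min()
    unordered_min_supported = True
except TypeError as exc:
    unordered_min_supported = False
    print(exc)
```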
Finally, we can also create regular masks for categorical columns. We do that in the following way:
::python
# Comparing categorical variables
df = df.assign(is_short = df.group == "short")
df = df.assign(taller_than_short = df.group > "short")
df.head()
Output:
+---+--------+--------+----------+-------------------+
| | height | group | is_short | taller_than_short |
+---+--------+--------+----------+-------------------+
| 0 | 113 | short | True | False |
| 1 | 152 | medium | False | True |
| 2 | 177 | medium | False | True |
| 3 | 163 | medium | False | True |
| 4 | 179 | medium | False | True |
+---+--------+--------+----------+-------------------+
The important thing to remember here is that you must compare a categorical column with a value that was previously defined in the categories for that column. An ordering comparison such as > against any other value results in a TypeError.
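A short sketch of this behaviour, using "giant" as a hypothetical label that is not among the categories. The equality result shown here is what recent Pandas versions return, so treat it as an assumption if you run an older release:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

height_order = CategoricalDtype(categories=["short", "medium", "tall"], ordered=True)
group = pd.Series(["short", "medium", "tall"]).astype(height_order)

# Comparing against a known category uses the declared order
taller_than_short = (group > "short").tolist()
print(taller_than_short)

# Equality against an unknown label does not raise, it is simply False everywhere
is_giant = (group == "giant").tolist()
print(is_giant)

# An ordering comparison against an unknown label raises TypeError
try:
    group > "giant"
    comparison_failed = False
except TypeError as exc:
    comparison_failed = True
    print(exc)
```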
Summary
In this post we checked how and when we can use Pandas categorical variables. We saw how we can go from continuous to categorical data, add structure to the categorical type and benefit from it in various ways.
Have you had a situation where you could use a categorical type? What is your favourite use case for a categorical type? Will you use this type in the future? Leave a comment below and let's discuss it a little!
Happy coding!