Dealing with text data using regular expressions. How Pandas can .extract information.
Posted on Sun 05 May 2019 in Data Science
What are regular expressions and why are they useful?
Working with text data can be laborious when you don't know where to start. Usually, one will need to search for regularities in text, especially when the output needs to be presented in a tabular format.
That's one of the reasons why Python ships with re as one of its standard modules. re helps you work with text data and search for patterns in it. Using regex (the common short name for regular expressions) can help you immensely with text analysis. Moreover, Pandas has a few methods that make it even easier.
In today's post we are going to:
- See basic and advanced uses of re.search, with a thorough explanation
- Use .str.extract to show how Pandas works with regular expressions
- Provide useful resources to learn more and speed up your work, especially in text analysis
Regular expressions - a gentle introduction
The URL of this blog post is: https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions
Let's say that we are particularly interested in extracting both the domain name and the title of the post, as separate strings.
This is the code that does the job for the domain name:
::python
import re
# Our search string
search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'
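# Regular expression (a raw string, so the backslashes reach the regex engine untouched)
regex = r'\w+(\.)\w+'
# Domain name: group(0) is the full text of the match
domain_name = re.search(regex, search_string).group(0)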
print(domain_name)
Output:
filipgeppert.com
A few things to highlight:
- \w is a shortcut for any word character (a letter, a digit or an underscore)
- + means that the preceding element must occur one or more times in a row
- (\.) matches a literal dot. A dot is a special character that normally matches any character, so we escape it with a backslash. The parentheses additionally capture the dot as group 1, which is not needed here
The translation of our regex is: Search for two runs of word characters that are separated by one dot.
re.search takes our regex and scans the string from left to right, looking for the first place where the pattern matches. If it finds one, it returns a match object, and if it finds none, it returns None. Only the first match is returned, even when the pattern occurs more than once. .group(0) returns the whole text of that match.
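To make the role of .group(0) concrete, here is a minimal sketch that compares it with .group(1) and with re.findall. The second input string is a made-up example and is not part of the original post.

```python
import re

search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'

# re.search stops at the first match and returns a match object
match = re.search(r'\w+(\.)\w+', search_string)

whole_match = match.group(0)  # the entire matched text
first_group = match.group(1)  # only what the parentheses captured

# re.findall returns every non-overlapping match as a list.
# The pattern has no capture group here, so the full matches are returned.
# (Hypothetical input string, used only for illustration.)
all_matches = re.findall(r'\w+\.\w+', 'filipgeppert.com and regex101.com')

print(whole_match)
print(first_group)
print(all_matches)
```

If you need every occurrence rather than the first one, re.findall or re.finditer is usually the better fit than re.search.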
Regular expressions - working with groups
For the remaining part of the URL, which is the title of this post, the regex will be the following:
::python
import re
# Our search string
search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'
# Regular expression (a raw string) with a named group called post_title
regex = r'\w\/(?P<post_title>(.+))\Z'
# Post title
post_title = re.search(regex, search_string).group('post_title')
print(post_title)
Output:
pandas-extract-text-analysis-regular-expressions
A few things have changed, so let me explain them. It is easiest to read this regex from the end:
- \Z is an anchor that matches only at the very end of the string, so the match has to run all the way to the last character
- (?P<post_title>(.+)) introduces the concept of named groups. The syntax is (?P<group_name>YOUR REGEX), so our group's name is post_title and our regex is (.+), which matches any character (except a newline) one or more times. The inner parentheses are optional
- \/ matches a forward slash. In Python a forward slash is not a special character, so the backslash is optional here, but it is harmless and some other regex flavours require it
- \w says that we want to match exactly one word character
The translation of our regex is: Find a word character followed by a forward slash, then capture everything after it up to the end of the string. Name this group post_title.
The leading \w is what keeps the pattern from matching at the two slashes in https://, because neither of them is preceded by a word character.
re.search takes our regex and looks for the first place in the string where the pattern matches. If there is a match, the returned match object stores the captured text under the group's name, much like a dictionary with the group name as a key and the matched string as a value. That is why this time we called .group('post_title') to obtain our result.
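As a small sketch of the dictionary analogy, the match object's groupdict() method returns the named groups as a real dictionary. The snippet also shows that re.search returns None when nothing matches, so the result is worth checking before calling .group. The variable names are illustrative only.

```python
import re

regex = r'\w/(?P<post_title>.+)\Z'

# A URL with a post title: the named group is filled in
match = re.search(regex, 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions')
groups = match.groupdict()

# A URL without a post title: re.search returns None
no_match = re.search(regex, 'https://filipgeppert.com')

# Guard against None before asking for a group
post_title = no_match.group('post_title') if no_match else None

print(groups)
print(post_title)
```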
.extract information with the use of Pandas
So far we've seen operations on only one string. But what if we want to apply a regex to the whole column in any DataFrame?
In this case, Pandas helps us with the .extract method. Let's see how it can be used.
Our DataFrame contains URLs of blog posts:
+-------------------------------------------------------------------------------------------------------------+
| url |
+-------------------------------------------------------------------------------------------------------------+
| https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api |
| https://filipgeppert.com/DataFrame-Pandas-Categorical-Variables.html#DataFrame-Pandas-Categorical-Variables |
| https://filipgeppert.com/DataFrame-GroupBy-agg.html#DataFrame-GroupBy-agg |
| https://filipgeppert.com/for-loop-in-pandas-apply.html#for-loop-in-pandas-apply |
| https://filipgeppert.com |
+-------------------------------------------------------------------------------------------------------------+
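To follow along, the DataFrame above could be built as shown below. This is only one possible way to construct it. The name df matches the snippets in this post, but the construction itself is an assumption and is not taken from the original code.

```python
import pandas as pd

# One column named "url", holding the five URLs from the table above
df = pd.DataFrame({
    'url': [
        'https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api',
        'https://filipgeppert.com/DataFrame-Pandas-Categorical-Variables.html#DataFrame-Pandas-Categorical-Variables',
        'https://filipgeppert.com/DataFrame-GroupBy-agg.html#DataFrame-GroupBy-agg',
        'https://filipgeppert.com/for-loop-in-pandas-apply.html#for-loop-in-pandas-apply',
        'https://filipgeppert.com',
    ]
})

print(df.shape)
```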
Let's see how we can use regex in order to extract a domain name.
::python
# Raw string keeps the backslashes intact; the named group becomes the column name
df.url.str.extract(r'https:\/\/(?P<domain_name>\w+)\.\w+', expand=True)
Output:
+-----------------+
| domain_name |
+-----------------+
| 0 filipgeppert |
| 1 filipgeppert |
| 2 filipgeppert |
| 3 filipgeppert |
| 4 filipgeppert |
+-----------------+
We used .extract, which is available through the .str accessor. We called it on the url column and passed a regular expression as its first argument. Moreover, we set expand to True, which tells Pandas to return a DataFrame with one column per capture group.
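The effect of expand is easiest to see side by side. Below is a minimal sketch on a two-row Series (a shortened, illustrative version of the url column): with a single capture group, expand=True gives a DataFrame and expand=False gives a Series.

```python
import pandas as pd

# Shortened stand-in for the url column (illustrative data)
urls = pd.Series(
    [
        'https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api',
        'https://filipgeppert.com',
    ],
    name='url',
)

pattern = r'https://(?P<domain_name>\w+)\.\w+'

as_frame = urls.str.extract(pattern, expand=True)    # DataFrame with one column
as_series = urls.str.extract(pattern, expand=False)  # Series, since there is one group

print(type(as_frame).__name__)
print(type(as_series).__name__)
```

A DataFrame is convenient when you plan to add more groups later, while a Series is handy when you want to assign the result straight to a new column.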
Similarly, we can use .extract to separate the post title.
::python
df.url.str.extract(r'\w\/(?P<post_title>.+)\Z', expand=True)
Output:
+------------------------------------------------------------------------------------+
| post_title |
+------------------------------------------------------------------------------------+
| aws-python-s3fs-api.html#aws-python-s3fs-api |
| DataFrame-Pandas-Categorical-Variables.html#DataFrame-Pandas-Categorical-Variables |
| DataFrame-GroupBy-agg.html#DataFrame-GroupBy-agg |
| for-loop-in-pandas-apply.html#for-loop-in-pandas-apply |
| NaN |
+------------------------------------------------------------------------------------+
Note that for strings where post_title was not found, Pandas returned NaN.
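Both pieces can also be pulled out in one call by putting two named groups into a single pattern. The pattern below is my own sketch, not the one used earlier in this post: it makes the title part optional and stops the title at the # sign, so rows without a title still match and get NaN in that column.

```python
import pandas as pd

# Shortened stand-in for the url column (illustrative data)
urls = pd.Series([
    'https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api',
    'https://filipgeppert.com',
])

# Two named groups; the non-capturing (?:...)? wrapper makes the title optional
pattern = r'https://(?P<domain_name>\w+)\.\w+(?:/(?P<post_title>[^#]+))?'

parts = urls.str.extract(pattern, expand=True)

# Rows where the optional group did not take part in the match hold NaN
missing_title = parts['post_title'].isna()

print(parts)
print(missing_title.tolist())
```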
Summary and useful resources
In this post we checked how Pandas and Python regular expressions can be used in your text analysis.
If you plan to work more with regexes, these are two resources that I strongly recommend:
- regex101.com - a place where you can quickly validate your regex and see its output
- Python documentation - a comprehensive guide to regular expressions, covering their syntax and use cases
Let me know in the comments if you have any texts that you would like to analyze! Hopefully, I can suggest the right tool for it! :) Finally, don't forget to sign up for the newsletter to be notified once a new post is released :)
Happy coding!