Dealing with text data using regular expressions. How Pandas can .extract information.

Posted on Sun 05 May 2019 in Data Science


What are regular expressions and why are they useful?

Working with text data can be laborious when you don't know where to start. Usually, one will need to search for regularities in text, especially when the output needs to be presented in a tabular format.

That's one of the reasons why Python ships with re as part of its standard library. The re module helps you work with text data and search for patterns in it. Using regex (a common shorthand for regular expressions) can help you immensely with text analysis. Moreover, Pandas has a few methods that make this easier.

In today's post we are going to:

  • See basic and more advanced uses of re.search, with a thorough explanation of each
  • Use .extract to show how Pandas works with regular expressions
  • Provide useful resources to learn more and speed up your work, especially in text analysis

Regular expressions - a gentle introduction

The URL of this blog post is https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions. Let's say that we want to extract both the domain name and the title of the post, as separate strings. This is the code that does the job for the domain name:

::python
import re
# Our search string
search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'
# Regular expression (a raw string, so the backslashes reach the regex engine unchanged)
regex = r'\w+(\.)\w+'
# Domain name
domain_name = re.search(regex, search_string).group(0)
print(domain_name)

Output:

filipgeppert.com

A few things to highlight:

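  • \w is a shortcut for any word character (a letter, a digit or an underscore)
  • + means one or more of the preceding token, so \w+ matches a run of one or more neighbouring word characters
  • (\.) - a dot is a special character that normally matches any character, so we escape it with a backslash to match a literal dot. The parentheses put the dot in a capturing group, which we don't need here but which does no harm

The translation of our regex is: Search for two runs of word characters that are separated by one dot.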

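re.search takes our regex and scans the string from left to right, looking for the first place where the pattern matches. If it finds one, it returns a match object; if it finds none, it returns None (so calling .group on the result would raise an AttributeError). Even when the pattern could match in several places, only the first match is returned. .group(0) returns the whole text of that match, while .group(1) would return only the part captured by the first pair of parentheses.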
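
To make the behaviour of re.search more tangible, here is a minimal sketch that inspects the match object and handles the no-match case. The variable names and the two short test strings ('no dots here' and 'a.b and c.d') are made up for illustration. re.finditer is used at the end only to show how you could collect every match instead of just the first one.

```python
import re

search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'
regex = r'\w+(\.)\w+'

# re.search returns a match object for the first match, or None
match = re.search(regex, search_string)
if match is not None:
    whole_match = match.group(0)  # the full matched text: 'filipgeppert.com'
    dot = match.group(1)          # only the first capturing group: '.'
    span = match.span()           # (start, end) positions of the match: (8, 24)

# A string without a dot does not match, so we get None back
no_match = re.search(regex, 'no dots here')

# re.finditer yields one match object per non-overlapping match
all_matches = [m.group(0) for m in re.finditer(regex, 'a.b and c.d')]
print(whole_match, dot, span, no_match, all_matches)
```

Checking for None before calling .group is a cheap habit that saves you from an AttributeError on strings that don't contain the pattern.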


Regular expressions - working with groups

For the remaining part of the URL, which is the title of this post, the regex is the following:

::python
import re
# Our search string
search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'
# Regular expression (a raw string, so the backslashes reach the regex engine unchanged)
regex = r'\w\/(?P<post_title>(.+))\Z'
# Post title
post_title = re.search(regex, search_string).group('post_title')
print(post_title)

Output:

pandas-extract-text-analysis-regular-expressions

A few things have changed, so let me explain it a little. Please start reading this regex from the end:

  • \Z is an anchor that matches only at the very end of the string, so whatever comes before it has to reach all the way to the end
  • (?P<post_title>(.+)) introduces the concept of named groups. The syntax is (?P<group_name>YOUR REGEX). So our group's name is post_title and our regex is (.+), which matches one or more of any character except a newline. The inner parentheses are optional here
  • \/ matches a forward slash. In Python a forward slash is not a special character, so the backslash is optional, but it does no harm
  • \w matches exactly one word character

The translation of our regex is: Find a word character followed by a forward slash, then match everything after it up to the end of the string. Name this group post_title. The leading \w is what makes the regex skip the two slashes in https://, because neither of them is preceded by a word character.

re.search again scans the string and returns a match object for the first match (or None if there is none). Because we named our group, the match object lets us look up the captured text by name. That is why this time we called .group('post_title') to obtain our result; .groupdict() would return all named groups as a dictionary.
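
The short sketch below shows what the match object holds for a named group, and what happens for a URL that has no post title. The variable names are chosen for illustration, and the regex is written without the optional backslash and inner parentheses.

```python
import re

search_string = 'https://filipgeppert.com/pandas-extract-text-analysis-regular-expressions'
regex = r'\w/(?P<post_title>.+)\Z'

match = re.search(regex, search_string)

# The whole match also includes the word character and the slash
whole_match = match.group(0)

# Named groups can be read one by one or all at once as a dictionary
post_title = match.group('post_title')
groups = match.groupdict()

# A bare domain has no word character followed by a slash, so nothing matches
bare_domain = re.search(regex, 'https://filipgeppert.com')

print(whole_match)
print(groups)
print(bare_domain)
```

Note that .group(0) starts with 'm/', the last letter of the domain plus the slash, which is exactly why we wrap only the part we care about in a group.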


.extract information with the use of Pandas

So far we've seen operations on only one string. But what if we want to apply a regex to the whole column in any DataFrame?

In this case, Pandas helps us with the .str.extract method. Let's see how it can be used. Our DataFrame contains URLs of blog posts:

+-------------------------------------------------------------------------------------------------------------+
|                                                     url                                                     |
+-------------------------------------------------------------------------------------------------------------+
| https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api                                       |
| https://filipgeppert.com/DataFrame-Pandas-Categorical-Variables.html#DataFrame-Pandas-Categorical-Variables |
| https://filipgeppert.com/DataFrame-GroupBy-agg.html#DataFrame-GroupBy-agg                                   |
| https://filipgeppert.com/for-loop-in-pandas-apply.html#for-loop-in-pandas-apply                             |
| https://filipgeppert.com                                                                                    |
+-------------------------------------------------------------------------------------------------------------+

Let's see how we can use a regex to extract the domain name.

::python
df.url.str.extract(r'https:\/\/(?P<domain_name>\w+)\.\w+', expand=True)

Output:

+-----------------+
|   domain_name   |
+-----------------+
| 0  filipgeppert |
| 1  filipgeppert |
| 2  filipgeppert |
| 3  filipgeppert |
| 4  filipgeppert |
+-----------------+

We used .extract, which is a method of the .str accessor. We called it on the url column and passed a regular expression as its first parameter. The name of our group became the name of the output column. Moreover, we set expand to True, which means that we want our output to be a DataFrame.
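
If you want to run this yourself, here is a self-contained sketch. The post does not show how df was built, so constructing it from a plain list of URLs is an assumption. The sketch also contrasts expand=True with expand=False, which for a single group returns a Series instead of a DataFrame.

```python
import pandas as pd

# Assumed construction of the DataFrame shown in the table above
df = pd.DataFrame({'url': [
    'https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api',
    'https://filipgeppert.com/DataFrame-Pandas-Categorical-Variables.html#DataFrame-Pandas-Categorical-Variables',
    'https://filipgeppert.com/DataFrame-GroupBy-agg.html#DataFrame-GroupBy-agg',
    'https://filipgeppert.com/for-loop-in-pandas-apply.html#for-loop-in-pandas-apply',
    'https://filipgeppert.com',
]})

pattern = r'https://(?P<domain_name>\w+)\.\w+'

# expand=True always returns a DataFrame, one column per group
as_frame = df['url'].str.extract(pattern, expand=True)

# expand=False with a single group returns a Series
as_series = df['url'].str.extract(pattern, expand=False)

print(as_frame)
print(type(as_series))
```

Keeping expand=True is handy when you later add more groups to the pattern, because the result type stays the same.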

Similarly, we can use .extract to separate the post title.

::python
df.url.str.extract(r'\w\/(?P<post_title>.+)\Z', expand=True)

Output:

+------------------------------------------------------------------------------------+
|                                      post_title                                    |
+------------------------------------------------------------------------------------+
| aws-python-s3fs-api.html#aws-python-s3fs-api                                       |
| DataFrame-Pandas-Categorical-Variables.html#DataFrame-Pandas-Categorical-Variables |
| DataFrame-GroupBy-agg.html#DataFrame-GroupBy-agg                                   |
| for-loop-in-pandas-apply.html#for-loop-in-pandas-apply                             |
| NaN                                                                                |
+------------------------------------------------------------------------------------+

Note that for strings where post_title was not found, Pandas returned NaN.
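
What you do with those NaN values depends on your analysis. As a rough sketch, you could flag them, replace them with a placeholder, or drop the rows. The two-row DataFrame and the '(no post title)' placeholder below are made up for illustration.

```python
import pandas as pd

# A shortened, assumed version of the DataFrame used above
df = pd.DataFrame({'url': [
    'https://filipgeppert.com/aws-python-s3fs-api.html#aws-python-s3fs-api',
    'https://filipgeppert.com',
]})

titles = df['url'].str.extract(r'\w/(?P<post_title>.+)\Z', expand=True)

# Boolean mask that is True where the pattern did not match
missing = titles['post_title'].isna()

# Option 1: replace missing titles with a placeholder string
filled = titles['post_title'].fillna('(no post title)')

# Option 2: keep only the rows where a title was found
only_posts = df[~missing]

print(missing.tolist())
print(filled.tolist())
print(len(only_posts))
```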


Summary and useful resources

In this post we checked how Pandas and Python regular expressions can be used in your text analysis.

If you plan to work more with regexes, these are two resources that I strongly recommend:

  • regex101.com - a place where you can quickly validate your regex and see its output
  • Python documentation - a comprehensive guide to regular expressions, explaining their syntax and use cases

Let me know in the comments if you have any texts that you would like to analyze! Hopefully, I can suggest the right tool for it! :) Finally, don't forget to sign up for the newsletter to be notified when a new post is released :)

Happy coding!