Python - Pattern Matching with Regular Expressions

In this article, we'll take a look at what Regular Expressions are and
what they are used for. To do that, we'll consider the following scenario:

  1. Assume we have written a web-scraper.
  2. We have extracted the contact pages from a few websites and created a giant text file with the collected contact page texts.
  3. We need to find all phone numbers, mobile/cell numbers and email addresses in this text.

Patterns

When we see 9876678998 in a piece of text, we immediately think it is a
phone number. The same goes for +91 9876 678998. But 9,87,66,78,998 is definitely
not a phone number.

Of course, there are a few more variations on how people write phone numbers, but
the gist is, there are different patterns for writing a phone number.

What if we could represent this pattern, or any pattern, with a syntax?
Regular Expressions, or regex, are a syntax for representing such patterns.

Regular Expressions

When editing text, we press Ctrl+f to find something. The usual flow is:

  1. Ctrl+f
  2. Input the thing that we want to find.
  3. The editor will find instances of that thing (a string) and we can jump around them or edit them as we like.

Think of Regular Expressions (regex) as Ctrl+f on steroids. We tell our program
what a thing would look like, instead of actually typing it in, and it finds
the strings that match that description.

Most editors already do a subset of this in their implementation of Ctrl+f.
For example, when we search for ani in a text file, we could get results for

  • Ani (case-insensitive)
  • ani (matched entire word)
  • animal (partly matched)
  • Animal (case insensitive, partly matched)

So to summarize, regex matches patterns in strings.

Why use Regular Expressions at all?

Let's take an imperative approach to deduce if a string is an Indian cell phone
number or not, by writing a function that does just that.

def isCellNumber(txt):
  # must have 10 characters
  if len(txt) != 10:
    return False
  # all chars must be digits
  for i in range(0, 10):
    if not txt[i].isdecimal():
      return False
  return True 

All this simple function does is check whether the argument has 10 characters and
whether all those characters are digits. If these conditions are met, it returns
True, otherwise False.

We can call this function as isCellNumber('9876543210') and it would return
True. Call it like isCellNumber('I am inevitable') and it returns False.

But, as you might be thinking, this is not the only way we write cell phone numbers:

98765 43210
+919876543210
+91 98765 43210
(+91) 9876543210

If we start writing if statements for all possible variations, the program
becomes a bit too long. This is where the beauty of regex comes in. We get a
clear one-liner and a rich set of functions to operate on the result set.
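
As a rough sketch of such a one-liner, the pattern below accepts the plain 10-digit form plus the four variations listed above. The names CELL_NUMBER and is_cell_number, and the exact pattern, are assumptions made for illustration and not a complete validator for Indian numbers. It uses a few constructs (a non-capturing group (?:...), alternation with |, ?, \s and {5}), only some of which are covered in this article.

```python
import re

# Hypothetical pattern: an optional "+91" or "(+91)" prefix, optional whitespace,
# then two runs of five digits with optional whitespace between them.
CELL_NUMBER = re.compile(r'(?:\(\+91\)|\+91)?\s?\d{5}\s?\d{5}')

def is_cell_number(txt):
    """Return True if the whole string looks like one of the variations."""
    return CELL_NUMBER.fullmatch(txt) is not None

samples = ['9876543210', '98765 43210', '+919876543210',
           '+91 98765 43210', '(+91) 9876543210', 'I am inevitable']
for s in samples:
    print(s, is_cell_number(s))
```

Every sample except the last one should print True. fullmatch() is used here so that the whole string has to fit the pattern, which mirrors what isCellNumber() checks.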

Let's get started

  • Open up the Python interpreter and let's get cracking.

    $ python3
    >>>
    
  • All regex functions are in the re module. So let's use it.

    >>> import re
    
  • Now let's look at the regex for a 10 digit string.

    >>> cellNumRegex = re.compile(r'\d\d\d\d\d\d\d\d\d\d')
    
  • \d means "a digit character". The cellNumRegex variable is now a Regex object.
    The r in r'\d\d\d\d\d\d\d\d\d\d' tells Python to treat the string as a
    raw string, which means backslashes are kept as they are: a \n or a \t in it is
    not turned into a <new-line> or a <tab> character.

  • The search() method scans a string and returns a Match object for the first
    place where the Regex matches, or None if it doesn't match anywhere.

  • Let's take a closer look, by invoking it like so

    >>> mo = cellNumRegex.search('My number is 9876543210.')
    
  • Note that I passed in a full piece of text rather than just 9876543210. With our
    old isCellNumber() function, we'd have to write line and word parsing logic to
    check whether some part of the text is a cell number. This is where Regex really shines.

  • Now, to see the text that matched, we call the group() method on the returned Match object

    >>> print(mo.group())
    9876543210
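
Outside the interpreter, the same steps can be put in a script. The sketch below also handles the case where search() finds nothing and returns None. The helper name find_cell_number is made up for this example.

```python
import re

cell_num_regex = re.compile(r'\d\d\d\d\d\d\d\d\d\d')

def find_cell_number(text):
    """Return the first run of 10 digits in text, or None if there is none."""
    mo = cell_num_regex.search(text)
    # search() returns None when nothing matches, so check before calling group()
    if mo is None:
        return None
    return mo.group()

print(find_cell_number('My number is 9876543210.'))  # 9876543210
print(find_cell_number('I am inevitable'))           # None

# A raw string keeps the backslash: r'\n' is two characters, '\n' is one.
print(len(r'\n'), len('\n'))  # 2 1
```

Calling group() directly on a None result would raise an AttributeError, which is why the check comes first.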
    

Grouping

A common way to write landline numbers is, for example, 0413 987654. We
can make a pattern for this as (\d{4})\s(\d{6}). By now, dear reader, you
should know how to check whether that pattern works.

  • \d means a digit. \d{4} means a string of 4 digits.
  • (\d{4}) means grouping a string of 4 digits
  • \s is a whitespace character, such as a space
  • (\d{6}) means grouping a string of 6 digits

Let's see this in action, with the following text variable.

>>> text = '\nMy phone number is +919876543210\nAlso reach me at 8765432109\nPh: 0413 987654\n'
>>> ph_reg = re.compile(r'(\d{4})\s(\d{6})')
>>> res = ph_reg.search(text)
>>> res.group()
'0413 987654'
>>> res.groups()
('0413', '987654')
>>> res.group(1)
'0413'
>>> res.group(2)
'987654'
>>> ph_reg.findall(text)
[('0413', '987654')]
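
  • The group() method returns the entire matched string.
  • The groups() method returns a tuple of the groupings.
  • To access the nth part of the grouping, use group(n).
  • The findall() method returns every match in the text. When the pattern has groups, each match is a tuple of the groupings.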
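
As a small sketch of how these methods might be used in a script, the hypothetical helper split_landline below unpacks the two groups into separate names. The sample strings are made up for the example.

```python
import re

ph_reg = re.compile(r'(\d{4})\s(\d{6})')

def split_landline(text):
    """Return (area_code, number) for the first landline found, or None."""
    res = ph_reg.search(text)
    if res is None:
        return None
    # groups() returns one string per parenthesised group, in order
    area_code, number = res.groups()
    return area_code, number

print(split_landline('Ph: 0413 987654'))  # ('0413', '987654')
print(split_landline('no number here'))   # None

# findall() gives one tuple of groups per match
print(ph_reg.findall('0413 987654 or 0422 123456'))
```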

Optional items

Some people would write their phone number as 0413987654, without the space.
So we could say the space is optional. An optional item is marked by putting a ? after it, meaning it may appear once or not at all.

So, our ph_reg pattern now becomes

>>> ph_reg = re.compile(r'(\d{4})\s?(\d{6})')
>>> ph_reg.findall(text)
[('9198', '765432'), ('8765', '432109'), ('0413', '987654')]

We see that in a way this is similar to \d{10}, but if the string has a space,
we catch that condition too.
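
To check what the ? allows, here is a short sketch that tries the pattern against three made-up samples. It uses fullmatch() so that the whole sample has to fit the pattern.

```python
import re

ph_reg = re.compile(r'(\d{4})\s?(\d{6})')

# One space, no space, and two spaces between the parts
for sample in ['0413 987654', '0413987654', '0413  987654']:
    mo = ph_reg.fullmatch(sample)
    print(repr(sample), mo.groups() if mo else None)
```

The first two samples match and give the same groups. The third does not, because ? allows at most one whitespace character.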

Replace

The sub() method replaces all occurrences of a pattern with a replacement string.

>>> ph_reg.sub(r'**********', text)
'\nMy phone number is +**********10\nAlso reach me at **********\nPh: **********\n'
>>> ph_reg.sub(r'\1 \2', text)
'\nMy phone number is +9198 76543210\nAlso reach me at 8765 432109\nPh: 0413 987654\n'

In the replacement string, \1 and \2 stand for the text matched by the 1st and 2nd groups, so \1 \2 puts a space between them.
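
The same idea works with any separator. The sketch below, run on a shortened sample text assumed for this example, rewrites every number with a hyphen and then uses the count argument of sub() to replace only the first match.

```python
import re

ph_reg = re.compile(r'(\d{4})\s?(\d{6})')
text = 'Also reach me at 8765432109\nPh: 0413 987654\n'

# \1 and \2 refer back to the first and second groups of each match
normalized = ph_reg.sub(r'\1-\2', text)
print(normalized)

# count=1 limits the replacement to the first match only
first_only = ph_reg.sub('**********', text, count=1)
print(first_only)
```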

Words and matching 1 or more

An email address is usually of the form xxx@yyy.zzz. The pattern \w matches any letter,
numeric digit or underscore.

>>> emails = 'foo@bar.com is an email. So is foo@nic.edu. Mails by @foobar.edu'
>>> email_reg = re.compile(r'\w@\w.\w')
>>> email_reg.findall(emails)
['o@bar', 'o@nic']
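
Each \w matches exactly one character, and an unescaped . matches any character,
so we only get fragments of the addresses.

  • \w+ means match a string with 1 or more word characters. We also write the dot as \. so that it matches only a literal dot.

    >>> email_reg = re.compile(r'\w+@\w+\.\w+')
    >>> email_reg.findall(emails)
    ['foo@bar.com', 'foo@nic.edu']
    
  • \w* means match a string with 0 or more word characters

    >>> em_reg = re.compile(r'\w*@\w+\.\w+')
    >>> em_reg.findall(emails)
    ['foo@bar.com', 'foo@nic.edu', '@foobar.edu']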
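
One detail worth checking in these email patterns is the dot. In a regex, a bare . matches any character, while \. matches only a literal dot. The sketch below compares the two on a made-up sample string. The names loose and strict are just labels for this example.

```python
import re

loose = re.compile(r'\w+@\w+.\w+')    # . matches any character
strict = re.compile(r'\w+@\w+\.\w+')  # \. matches only a literal dot

sample = 'foo@bar.com and foo@bar com'
print(loose.findall(sample))   # ['foo@bar.com', 'foo@bar com']
print(strict.findall(sample))  # ['foo@bar.com']
```

The loose pattern accepts 'foo@bar com' because the . is happy to match the space.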
    

Say we want just emails made up of lowercase letters. \w would also get us digits and
underscores. For that we can use a character range, [a-z].

>>> em_reg = re.compile(r'[a-z]+@[a-z]+\.[a-z]+')
>>> em_reg.findall(emails)
['foo@bar.com', 'foo@nic.edu']
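
Note that [a-z] covers lowercase letters only. As a small sketch, using a made-up emails string with one capitalised address, the re.IGNORECASE flag can be passed to re.compile() to accept uppercase letters as well.

```python
import re

emails = 'foo@bar.com is an email. So is Foo@Nic.edu.'

lower_only = re.compile(r'[a-z]+@[a-z]+\.[a-z]+')
any_case = re.compile(r'[a-z]+@[a-z]+\.[a-z]+', re.IGNORECASE)

print(lower_only.findall(emails))  # ['foo@bar.com']
print(any_case.findall(emails))    # ['foo@bar.com', 'Foo@Nic.edu']
```

Writing the range as [A-Za-z] would be another way to get the same result for these samples.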

Closing and credits

These notes were prepared from the session of the same name,
taken by our Gokulnath. Automate the Boring Stuff with Python by Al Sweigart is our study/reference material.