Python - Pattern Matching with Regular Expressions
In this article, we'll be taking a look at what Regular Expressions are and
what their use-case is. For this purpose, we'll consider the following scenario
- Assume we have written a web-scraper.
- We have extracted the contact pages from a few websites and created a giant text file with the collected contact page texts.
- We need to find all phone numbers, mobile/cell numbers and email addresses in this text.
Patterns
When we see 9876678998 on a piece of text, we immediately think, this is a
phone number. The same goes for +91 9876 678998. But 9,87,66,78,998 is definitely
not a phone number.
Of course, there are a few more variations on how people write phone numbers, but
the gist is, there are different patterns for writing a phone number.
What if we could represent this pattern or any pattern with a syntax?
Regular Expressions, or regex, is a syntax for representing such patterns.
Regular Expressions
When editing text, we press Ctrl+f to find something. The usual flow is
- Press Ctrl+f
- Input the thing that we want to find.
- The editor will find instances of that thing (string) and we can jump around them or edit them as we like.
Think of Regular Expressions (regex) as Ctrl+f on steroids. We tell our program
what a thing would look like, instead of actually typing it in, and it finds
strings that match that description.
Most editors already do a subset of this in their implementation of Ctrl+f.
For example, when we search for ani in a text file, we could get results for
- Ani (case-insensitive)
- ani (matched entire word)
- animal (partly matched)
- Animal (case-insensitive, partly matched)
So to summarize, regex matches patterns in strings.
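The kinds of hits listed above can be reproduced with Python's re module, which is introduced further down. The snippet below is only a minimal sketch: the sample sentence and the variable names are made up for illustration, and \b (a word boundary) is a construct this article does not otherwise cover.

```python
import re

# Hypothetical sample text, made up for this illustration.
sample = "Ani drew an animal. Animal shelters love ani."

# Case-sensitive search: only the lowercase occurrences of "ani" are found.
exact_hits = re.findall(r'ani', sample)

# re.IGNORECASE also matches "Ani" and the start of "Animal".
any_case_hits = re.findall(r'ani', sample, re.IGNORECASE)

# \b marks a word boundary, so this only matches "ani" as a whole word.
whole_word_hits = re.findall(r'\bani\b', sample, re.IGNORECASE)

print(exact_hits)
print(any_case_hits)
print(whole_word_hits)
```

The same search term gives different result sets depending on how the pattern is described, which is the idea behind regex.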
Why use Regular Expressions at all?
Let's take an imperative approach to deduce if a string is an Indian cell phone
number or not, by writing a function that does just that.
def isCellNumber(txt):
    # must have 10 characters
    if len(txt) != 10:
        return False
    # all chars must be digits
    for i in range(0, 10):
        if not txt[i].isdecimal():
            return False
    return True
All this simple function does is check whether the argument has 10 characters and
whether all those characters are digits. If these conditions are met, it returns True,
otherwise False.
We can call this function as isCellNumber('9876543210') and it would return True.
Call it like isCellNumber('I am inevitable') and it returns False.
But, you might be thinking, this is not the only way we write cell phone numbers.
98765 43210
+919876543210
+91 98765 43210
(+91) 9876543210
If we start to write if statements for all possible variations, the program
becomes a bit too long. This is where the beauty of regex comes in. We get a
clear one-liner and a rich set of functions to operate on the result set.
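As a rough sketch of such a one-liner, the pattern below is one assumed way to cover just the five formats listed above. The names cell_number_pattern and is_cell_number are illustrative, and the pattern uses a few constructs (escaping with \, alternation with |, non-capturing groups (?:...)) that go slightly beyond what this article covers.

```python
import re

# Assumed pattern covering only the variations listed above:
#   (?: ... )?   an optional, non-capturing prefix
#   \(\+91\)     a literal "(+91)";  \+91 is a literal "+91"
#   \s?          an optional whitespace character
#   \d{5}        exactly five digits
cell_number_pattern = re.compile(r'(?:(?:\(\+91\)|\+91)\s?)?\d{5}\s?\d{5}')


def is_cell_number(txt):
    """Return True if the whole string looks like one of the listed formats."""
    # fullmatch() succeeds only when the pattern covers the entire string.
    return cell_number_pattern.fullmatch(txt) is not None


for candidate in ['9876543210', '98765 43210', '+919876543210',
                  '+91 98765 43210', '(+91) 9876543210', 'I am inevitable']:
    print(repr(candidate), is_cell_number(candidate))
```

Real-world phone numbers come in more shapes than this, so treat the pattern as a starting point rather than a complete validator.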
Let's get started
- Open up the Python interpreter and let's get cracking.
  $ python3
  >>>
- All regex functions are in the re module. So let's use it.
  >>> import re
- Now let's look at the regex for a 10 digit string.
  >>> cellNumRegex = re.compile(r'\d\d\d\d\d\d\d\d\d\d')
- \d means "a digit character". The cellNumRegex variable is now a Regex object.
  The r in r'\d\d\d\d\d\d\d\d\d\d' tells Python to consider the string as a raw string. This basically means that if there is, for example, a \n or a \t, it is not treated as a <new-line> or a <tab> character.
- The search() method returns a Match object if the Regex matches somewhere in a string, or None if it doesn't.
- Let's take a closer look, by invoking it like so
  >>> mo = cellNumRegex.search('My number is 9876543210.')
- You might note, I passed on a full piece of text rather than just 9876543210. In our old isCellNumber() function, we'd have to write line+word parsing logic to check if a sequence is a cell number or not. This is where Regex really shines.
- Now to see what the Match object matched, we use the group() method
  >>> print(mo.group())
  9876543210
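The steps above can be put together in a small script. This is a minimal sketch: the helper name find_cell_number is made up for illustration, and the check for None is added because calling group() on a failed search would raise an error.

```python
import re

cell_num_regex = re.compile(r'\d\d\d\d\d\d\d\d\d\d')


def find_cell_number(text):
    """Return the first 10 consecutive digits found in text, or None."""
    mo = cell_num_regex.search(text)
    # search() returns None when nothing matches, so check before calling group().
    if mo is None:
        return None
    return mo.group()


print(find_cell_number('My number is 9876543210.'))
print(find_cell_number('I am inevitable'))

# Raw strings: r'\n' is two characters (backslash, n), while '\n' is one newline.
print(len(r'\n'), len('\n'))
```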
Grouping
A common way to write landline numbers is, for example, 0413 987654. We
can make a pattern for this as (\d{4})\s(\d{6}). By now, you, the dear reader,
should know how to check if that pattern works or not.
- \d means a digit.
- \d{4} means a string of 4 digits.
- (\d{4}) means grouping a string of 4 digits
- \s is a whitespace character (a space, in our case)
- (\d{6}) means grouping a string of 6 digits
Let's see this in action with a text variable that holds a few lines containing phone numbers.
>>> ph_reg = re.compile(r'(\d{4})\s(\d{6})')
>>> res = ph_reg.search(text)
>>> res.group()
'0413 987654'
>>> res.groups()
('0413', '987654')
>>> res.group(1)
'0413'
>>> res.group(2)
'987654'
>>> ph_reg.findall(text)
[('0413', '987654')]
- The group() method returns the entire matched string.
- The groups() method returns a tuple of the groupings.
- To access the nth part of the grouping, use group(n).
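The three methods above can be tried in a self-contained script. Note the assumptions: the text value below is reconstructed from the outputs shown in this article rather than quoted from it, and split_landline, area_code and number are illustrative names.

```python
import re

# Assumption: sample text reconstructed from the outputs shown in this article.
text = ('\nMy phone number is +919876543210'
        '\nAlso reach me at 8765432109'
        '\nPh: 0413 987654\n')

ph_reg = re.compile(r'(\d{4})\s(\d{6})')


def split_landline(source):
    """Return (area_code, number) for the first match, or None if there is none."""
    res = ph_reg.search(source)
    if res is None:
        return None
    # groups() gives one tuple entry per pair of parentheses in the pattern.
    area_code, number = res.groups()
    return area_code, number


res = ph_reg.search(text)
print(res.group())    # the entire matched string
print(res.group(1))   # the first group only
print(split_landline(text))
```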
Optional items
Some people would write their phone number as 0413987654, without the space.
So we could say, the space is optional. This is represented with a ?.
So, our ph_reg pattern now becomes
>>> ph_reg = re.compile(r'(\d{4})\s?(\d{6})')
>>> ph_reg.findall(text)
[('9198', '765432'), ('8765', '432109'), ('0413', '987654')]
We see that in a way this is similar to \d{10}, but if the string has a space,
we catch that condition too.
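To see what "optional" means in practice, here is a small sketch that runs the same pattern against three hypothetical inputs. The helper name match_landline and the inputs are made up for illustration.

```python
import re

ph_reg = re.compile(r'(\d{4})\s?(\d{6})')


def match_landline(candidate):
    """Return the matched text, or None when the pattern does not match."""
    mo = ph_reg.search(candidate)
    return mo.group() if mo else None


# ? allows zero or one whitespace character, but not two.
for candidate in ['0413 987654', '0413987654', '0413  987654']:
    print(repr(candidate), match_landline(candidate))
```

Because the space is optional, the pattern also matches any 10 digits inside a longer run of digits, which is why +919876543210 shows up as ('9198', '765432') in the findall() output above.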
Replace
The sub() method replaces all occurrences of a pattern with a substitute string.
>>> ph_reg.sub(r'**********', text)
'\nMy phone number is +**********10\nAlso reach me at **********\nPh: **********\n'
>>> ph_reg.sub(r'\1 \2', text)
'\nMy phone number is +9198 76543210\nAlso reach me at 8765 432109\nPh: 0413 987654\n'
The \1 and \2 in the replacement refer to the 1st and 2nd groups of the match,
so \1 \2 rewrites each number with a space between the two groups.
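Group references can also be mixed with fixed text in the replacement. The sketch below uses a hypothetical one-line input (line) to show full masking, re-spacing and partial masking side by side.

```python
import re

ph_reg = re.compile(r'(\d{4})\s?(\d{6})')

# Hypothetical input line, made up for this illustration.
line = 'Ph: 0413987654'

# Replace the whole match with a fixed string.
masked = ph_reg.sub('**********', line)

# Re-insert both groups with a space between them.
spaced = ph_reg.sub(r'\1 \2', line)

# Keep the first group and hide the second one.
partly_masked = ph_reg.sub(r'\1 ******', line)

print(masked)
print(spaced)
print(partly_masked)
```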
Words and matching 1 or more
An email is usually of the form xxx@yyy.zzz. The character \w means any letter,
numeric digit or underscore.
>>> emails = 'foo@bar.com is an email. So is foo@nic.edu. Mails by @foobar.edu'
>>> email_reg = re.compile(r'\w@\w.\w')
>>> email_reg.findall(emails)
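['o@bar', 'o@nic']
Each \w matches exactly one character and the unescaped . matches any single
character, which is why we only get fragments such as o@bar.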
- \w+ means match a string of 1 or more word characters. The dot is escaped as \. here so that it only matches a literal dot.
  >>> email_reg = re.compile(r'\w+@\w+\.\w+')
  >>> email_reg.findall(emails)
  ['foo@bar.com', 'foo@nic.edu']
- \w* means match a string of 0 or more word characters
  >>> em_reg = re.compile(r'\w*@\w+\.\w+')
  >>> em_reg.findall(emails)
  ['foo@bar.com', 'foo@nic.edu', '@foobar.edu']
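The difference between + and * is easiest to see on the part before the @. This is a minimal sketch with made-up sample strings, and it uses fullmatch() so that each sample has to match as a whole.

```python
import re

# Hypothetical samples, made up for this illustration.
samples = ['@bar.com', 'a@bar.com', 'foo@bar.com']

# + needs at least one word character before the @.
plus_hits = [s for s in samples if re.fullmatch(r'\w+@\w+\.\w+', s)]

# * also accepts nothing at all before the @.
star_hits = [s for s in samples if re.fullmatch(r'\w*@\w+\.\w+', s)]

print(plus_hits)
print(star_hits)
```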
Say we want just emails made up of letters only. \w would also get us digits and
underscores. For that we can use a range, [a-z].
>>> em_reg = re.compile(r'[a-z]+@[a-z]+\.[a-z]+')
>>> em_reg.findall(emails)
['foo@bar.com', 'foo@nic.edu']
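One thing to keep in mind is that [a-z] covers lowercase letters only. The sketch below, on a made-up string, compares it with the same pattern compiled with re.IGNORECASE. Neither is a complete email validator.

```python
import re

# Hypothetical sample text, made up for this illustration.
mixed = 'foo@bar.com, Foo_1@bar.com and FOO@BAR.COM'

lower_only = re.compile(r'[a-z]+@[a-z]+\.[a-z]+')
any_case = re.compile(r'[a-z]+@[a-z]+\.[a-z]+', re.IGNORECASE)

# Foo_1@bar.com is skipped by both: the character right before @ is a digit.
lower_hits = lower_only.findall(mixed)
any_case_hits = any_case.findall(mixed)

print(lower_hits)
print(any_case_hits)
```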
Closing and credits
This article is a set of notes prepared from the session of the same name,
taken by our Gokulnath. Automate the Boring Stuff with Python by Al Sweigart is our study/reference material.