
Python regex tutorial

Regular expressions, which describe patterns in strings, are among the most useful tools in any programming language. They let us write short code that does a lot. For example, to identify a phone number in a document you might be tempted to write a multi-line script with a loop and many conditions, when a single line of code built on a regular expression would do the job.

Regular expressions help us with 4 main tasks:

Finding data that interests us within a large amount of information.
Validating user input in forms, such as an email address or a phone number.
Replacing an expression, e.g., in swear filters.
Normalizing the format of data obtained from unreliable sources, for example, dates that users enter into a form.

So if you want to improve your programming skills and work less (much less), you really should learn regular expressions. Let's get started!

The "re" Python module

To work with regular expressions in Python we will start by importing the module re (short for regular expression).

import re

The re.search() method

The re.search() method finds an expression in a string.

For example, let's find the expression "regex" inside the string "regex are awesome":

if re.search("regex", "regex are awesome"):
    print("regex exists")
else:
    print("no regex")

The result:

regex exists

As simple as that!

To understand how the method works, let's see the following examples:

print(re.search("regex", "all the regexes are awesome"))

The result:

<re.Match object; span=(8, 13), match='regex'>

The method found a match for the expression starting at index 8 and ending just before index 13.
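The Match object carries more than the printed summary. Here is a minimal sketch of reading it with the standard group(), start(), end() and span() methods (the variable name match is just our choice):

```python
import re

match = re.search("regex", "all the regexes are awesome")

if match:
    print(match.group())  # the matched text: regex
    print(match.start())  # index where the match begins: 8
    print(match.end())    # index just past the match: 13
    print(match.span())   # both as a tuple: (8, 13)
```

Because re.search() returns None when nothing matches, it is safest to check the result before calling any of these methods.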

But what happens when the method doesn't find a match?

print(re.search("regex", "all the ... are awesome"))

The result:

None

The method re.search() finds the first match in the string and then stops searching:

print(re.search("regex", "all the regexes are awesome including the regex [a-z]"))

The result:

<re.Match object; span=(8, 13), match='regex'>

To find all the matches in a string use the method re.findall(), as we'll see in the following section.

What happens if we change even a single character in the expression? To test this let's change the first letter of the expression from lower case to capital: "regex" -> "Regex".

if re.search("Regex", "regex are awesome"):
    print("expression exists")
else:
    print("no match")

The regular expression engine can't find a match:

no match

We can also use a regular expression (=regex) to find a match regardless of the case. For example, by using the following expression:

r"[Rr]egex"

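The pattern is written as a Python raw string: it is enclosed in quotes and preceded by "r", which stops Python from interpreting backslashes before the regex engine sees them.

Our regex matches either "R" or "r" followed by the text "egex". Only the first letter may differ in case.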

Let's give it a try:

if re.search(r"[Rr]egex", "regex are awesome"):
    print("found a match")
else:
    print("no match")

The result:

found a match
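If the whole word should match regardless of case, and not only its first letter, the re module also offers the re.IGNORECASE flag. A small sketch comparing the two approaches, with sample strings made up for illustration:

```python
import re

# Sample strings invented for this illustration.
samples = ["regex are awesome", "Regex are awesome", "REGEX are awesome"]

# [Rr]egex only tolerates a different case in the first letter.
bracket_hits = [s for s in samples if re.search(r"[Rr]egex", s)]

# re.IGNORECASE makes every letter in the pattern case-insensitive.
flag_hits = [s for s in samples if re.search(r"regex", s, flags=re.IGNORECASE)]

print(len(bracket_hits))  # 2
print(len(flag_hits))     # 3
```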

The regular expression is made up of characters, each of which represents part of the pattern we want to capture. We will explain the special characters that make up a regular expression shortly, but first let's see another Python method for working with regular expressions.

The re.findall() method

The re.findall() method is one of the most commonly used of Python's regex methods. While re.search(), which we saw in the previous section, finds only the first match, re.findall() returns a list of all the matches for the expression.

In the following example, we try to find all the people's names in the string:

text = 'George is 21 yrs. old and Kate is 19'

We know that English names start with a capital letter followed by at least one lowercase letter. The following expression can identify the pattern:

'[A-Z][a-z]+'

Square brackets define a set or range of characters, any one of which may match:

[A-Z] to find a capital letter.
[a-z]+ to search for at least one lowercase letter. The plus sign (+) matches one or more occurrences of the expression that it follows.

The following code finds all the names in the sentence:

# "text" is used instead of "str" to avoid shadowing Python's built-in str type
text = 'George is 21 yrs. old and Kate is 19'

# re.findall() returns a list of all the matching names
names = re.findall(r'[A-Z][a-z]+', text)
print(names)

The result:

['George', 'Kate']
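re.findall() gives us only the matched strings. When the positions matter too, the related re.finditer() function yields one Match object per match. A short sketch, assuming the same sample sentence (the names text and found are ours):

```python
import re

text = 'George is 21 yrs. old and Kate is 19'

# Each Match object knows both the matched text and where it starts.
found = [(m.group(), m.start()) for m in re.finditer(r'[A-Z][a-z]+', text)]
print(found)  # [('George', 0), ('Kate', 26)]
```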

Range of characters

To find a match for a range of characters, we use square brackets and hyphens to separate the starting character from the ending one. For example, [A-Z] matches any uppercase letter.

Let's see some of the most used ranges:

The expression	Matches
[ab]	a or b
[abc]	a, b or c
[a-c]	the range of characters a to c, equivalent to [abc]
[a-d]	the range of lower case characters a to d
[A-Z]	all the uppercase letters
[a-z]	all the lowercase letters
[A-Za-z]	all the letters, uppercase and lowercase
[A-Za-z0-9_]	all the letters and digits, plus the underscore
[0-9]	all the digits

For example, to match the range of letters a to d or the range of digits 2-5 use the expression:

[a-d2-5]
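To see the combined range at work, here is a quick sketch on a made-up sample string. Each match is a single character that falls in either range:

```python
import re

# Sample string invented for this illustration.
sample = "a1 b2 c5 d6 e3 z9"

matches = re.findall(r"[a-d2-5]", sample)
print(matches)  # ['a', 'b', '2', 'c', '5', 'd', '3']
```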

So far, so good!

To find the complement of a set, add the caret symbol (^) right after the opening bracket. For example, the regex:

[^a-d]

matches any character that is not in the range a-d.

Let's see more examples of regular expressions that are complementary sets:

The expression	Matches
[^A-Z]	any character that is not an uppercase letter
[^a-z]	any character that is not a lowercase letter
[^A-Za-z]	any character that is not a Latin letter
[^0-9]	any character that is not a digit
[^A-Za-z0-9]	any character that is not a Latin letter or a digit
[^a-d]	any character that is not in the range a-d
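A complementary set is handy both for finding the "odd" characters and for stripping them out. A minimal sketch on an invented sample string (the names code, leftovers and digits_only are ours):

```python
import re

# Sample string invented for this illustration.
code = "AB-12 cd_34!"

# Everything that is NOT a Latin letter or a digit.
leftovers = re.findall(r"[^A-Za-z0-9]", code)
print(leftovers)  # ['-', ' ', '_', '!']

# Remove everything that is NOT a digit.
digits_only = re.sub(r"[^0-9]", "", code)
print(digits_only)  # 1234
```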

Quantifiers

Let us return for a moment to the example we have already seen:

text = 'George is 21 yrs. old and Kate is 19'

As we already know, the method re.findall() returns a list of all the matches to the expression that it can find:

names = re.findall(r'[A-Z][a-z]+', text)

We can see that the expression:

r'[A-Z][a-z]+'

contains a range of characters and also the plus sign (+). The plus sign has a special meaning in a regex: the expression that comes before it has to appear at least once.

The plus sign (+) is an example of a quantifier, a special symbol that specifies how many times a character or group of characters must appear in the input for a match.

The following table shows the quantifiers:

The quantifier	Description
+	matches one or more times
*	matches zero or more times
?	matches zero or one time
{3}	matches exactly 3 times
{n}	matches exactly n times
{n,}	matches at least n times
{n,m}	matches from n to m times
{,m}	matches at most m times
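The table is easier to remember once you see the quantifiers side by side. The sketch below tries several of them on made-up samples, using re.fullmatch(), which succeeds only when the whole string fits the pattern. Note that the braces are written without spaces:

```python
import re

# Made-up samples: "a", then zero to three "b"s, then "c".
samples = ["ac", "abc", "abbc", "abbbc"]

patterns = [r"ab+c", r"ab*c", r"ab?c", r"ab{2}c", r"ab{2,}c", r"ab{1,2}c"]

# For every pattern, collect the samples that it matches completely.
results = {
    pattern: [s for s in samples if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

For instance, ab?c accepts only "ac" and "abc", while ab{2,}c accepts "abbc" and "abbbc".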

Special characters

So far we have seen how to denote a range of characters, but there is an even shorter syntax for some common ranges:

Character	Description
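.	a dot (.) matches any character except a newline
\d	matches any digit. Equivalent to [0-9] for ASCII text (in Python 3 string patterns it also matches other Unicode digits unless the re.ASCII flag is set)
\w	matches any word character: a letter, a digit, or the underscore. Equivalent to [A-Za-z0-9_] for ASCII text (in Python 3 string patterns it also matches non-ASCII letters unless the re.ASCII flag is set)
\s	matches any whitespace character, including space, tab, newline, carriage return, etc.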

The special characters are extremely useful. For example, to find all the ages in the following string:

text = "George is 21 yrs. old and Kate is 19"
ages = re.findall(r"\d+", text)
print(ages)

The result:

['21', '19']

The following code snippet builds a list of dictionaries containing the people's names and ages:

text = "George is 21 yrs. old and Kate is 19"

ages = re.findall(r"\d+", text)
names = re.findall(r'[A-Z][a-z]+', text)

# Pair each name with the age found at the same position
people = []
for name, age in zip(names, ages):
    people.append({'name': name, 'age': age})
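Two separate searches only line up if every name has exactly one age. As an alternative sketch, a single pattern with two pairs of parentheses (capturing groups, covered later in this tutorial) makes re.findall() return the name and the age together as tuples. It assumes the text always follows the "Name is NN" shape:

```python
import re

text = "George is 21 yrs. old and Kate is 19"

# With two capturing groups, re.findall() returns a list of (name, age) tuples.
pairs = re.findall(r"([A-Z][a-z]+) is (\d+)", text)

# Convert each age from a string to an integer while building the dictionaries.
people = [{"name": name, "age": int(age)} for name, age in pairs]
print(people)  # [{'name': 'George', 'age': 21}, {'name': 'Kate', 'age': 19}]
```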

Now that we know that these shorthand ranges are denoted by lowercase letters, it is time to learn that their complementary sets are denoted by the corresponding uppercase letters. For example, since \d stands for the digits 0-9, its complementary set is written \D.

Character	Description
\D	any character that is not a digit, equivalent to [^0-9] for ASCII text
\W	any character that is not a word character, equivalent to [^A-Za-z0-9_] for ASCII text
\S	any non white space character
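Here is a small sketch of the three complementary shorthands on an invented sample string (the variable names are ours):

```python
import re

# Sample string invented for this illustration; \t is a tab character.
text = "tel: 088-9123 4100\t(home)"

tokens = re.findall(r"\S+", text)   # runs of non-whitespace characters
digits = re.sub(r"\D", "", text)    # drop every non-digit
words = re.sub(r"\W+", " ", text)   # turn runs of non-word characters into one space

print(tokens)  # ['tel:', '088-9123', '4100', '(home)']
print(digits)  # 08891234100
print(words)
```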

Start of string and end of string anchors

Regex anchors don't match any character. Instead, they match a position before, after, or between characters.

Anchor	description
^	matches the start of a string
$	matches the end of a string
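\b	matches a word boundary: the position between a word character (\w) and a non-word character, or the edge of the string

The following code matches the first word of a string: one or more word characters anchored at the start of the string and ending at a word boundary: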

print(re.findall(r"^\w+\b", "Everything you look for"))

The result:

['Everything']
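The difference a word boundary makes is easiest to see by comparison. A minimal sketch on a made-up sentence:

```python
import re

# Sample string invented for this illustration.
text = "cat concat cat. category"

whole_words = re.findall(r"\bcat\b", text)  # "cat" only as a separate word
anywhere = re.findall(r"cat", text)         # "cat" even inside other words

print(len(whole_words))  # 2
print(len(anywhere))     # 4

starts_with_cat = bool(re.search(r"^cat", text))
ends_with_cat = bool(re.search(r"cat$", text))
print(starts_with_cat)  # True
print(ends_with_cat)    # False
```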

Escaping the special characters

What happens when we want to match a literal dot (.) or $, even though they have a special meaning? In such cases, we precede the character with a backslash (\), which removes its special meaning.

Character	description
\.	simply a dot
\$	simply a $ sign
\^	simply a ^ (caret) sign
\\	simply a backslash
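A short sketch of escaping in practice, on made-up price strings. It also shows re.escape(), a standard helper that adds the backslashes for us when the text to find comes from a variable:

```python
import re

# Sample string invented for this illustration.
price = "Total: $9.99 (was $19.99)"

# \$ and \. match a literal dollar sign and a literal dot.
amounts = re.findall(r"\$\d+\.\d+", price)
print(amounts)  # ['$9.99', '$19.99']

# Unescaped, the dot matches any character, so "9x99" slips through.
unescaped = re.findall("9.99", "9.99 and 9x99")
# re.escape() turns "9.99" into a pattern that matches it literally.
escaped = re.findall(re.escape("9.99"), "9.99 and 9x99")

print(unescaped)  # ['9.99', '9x99']
print(escaped)    # ['9.99']
```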

Python regex replacement

To replace an expression use the re.sub() method.

The following code replaces everything that is not a digit with an empty string:

only_numbers = re.sub(r'\D','','088-9123-4100')
print(only_numbers)

The result:

08891234100
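re.sub() is also a good fit for cleaning up inconsistently formatted input. The sketch below, on invented data, normalizes date separators and delimiters in two passes, and then shows the optional count argument, which limits how many replacements are made:

```python
import re

# Messy input invented for this illustration.
raw = "2019/09/28,  2019.10.01 ;2019-11-15"

# Pass 1: turn every "/" or "." separator into a hyphen.
dates = re.sub(r"[/.]", "-", raw)

# Pass 2: collapse each delimiter and its surrounding spaces into ", ".
dates = re.sub(r"\s*[,;]\s*", ", ", dates)
print(dates)  # 2019-09-28, 2019-10-01, 2019-11-15

# count=1 replaces only the first match.
first_only = re.sub(r"\d", "#", "a1b2c3", count=1)
print(first_only)  # a#b2c3
```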

Lazy expressions

Regex quantifiers are greedy by default: they match as much text as possible, which can produce a broader match than intended.

For example, the following string mentions 2 people, and I would like to replace each person's color with "[@!?]" using the following code:

text = "the <span>green</span> man told the <span>blue</span> man a secret"
# The forward slash has no special meaning in Python regexes, so it needs no backslash
res = re.sub(r"<span>.+</span>", '[@!?]', text)
print(res)

The output:

the [@!?] man a secret

Instead of replacing each span individually, the expression replaced all of the text from the opening tag of the first span to the closing tag of the last span, because the greedy quantifier matches as much as it can. To avoid this problem, we add a question mark "?" right after the quantifier "+", which makes the quantifier match as few characters as possible.

Let's rewrite the expression by putting a question mark (?) right after the quantifier "+" to make the expression "lazy" instead of "greedy":

text = "the <span>green</span> man told the <span>blue</span> man a secret"
# "+?" is the lazy version of "+": it stops at the first closing tag it can reach
res = re.sub(r"<span>.+?</span>", '[@!?]', text)
print(res)

The result:

the [@!?] man told the [@!?] man a secret
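To see exactly what each version of the quantifier grabs, we can capture the text between the tags instead of replacing it. A small sketch (the variable names are ours):

```python
import re

html = "the <span>green</span> man told the <span>blue</span> man a secret"

# Greedy: one long match that runs from the first opening tag to the last closing tag.
greedy = re.findall(r"<span>(.+)</span>", html)

# Lazy: one short match per pair of tags.
lazy = re.findall(r"<span>(.+?)</span>", html)

print(greedy)  # ['green</span> man told the <span>blue']
print(lazy)    # ['green', 'blue']
```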

Now that we know how and why to make lazy expressions, let's move on to matching alternatives.

To match one of several options, separate them with a pipe (|).

For example, the following expression matches any one of the image extensions png, gif, or jpg:

r"(png|gif|jpg)"

Let's see the code in action:

filename = "car.jpg"

# \. requires a literal dot before the extension, so a name like "mypng" is rejected
if re.search(r"\.(png|gif|jp(e)?g)$", filename):
    print('%s is a valid file name' % filename)
else:
    print('%s is not a valid file name' % filename)

The result:

car.jpg is a valid file name

The expression jp(e)?g matches both jpg and jpeg, because the question mark makes the "e" optional.
The anchor $ requires the match to sit at the end of the string.
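Here is a sketch that runs an extension check over several file names at once. The file names are hypothetical, and two details are our own additions: a literal dot (\.) in front of the options, and the re.IGNORECASE flag so that upper-case extensions are accepted too. The (?:...) form groups the options without capturing them:

```python
import re

# Hypothetical file names used only for illustration.
filenames = ["car.jpg", "logo.PNG", "photo.jpeg", "notes.txt", "jpg"]

image_pattern = r"\.(?:png|gif|jpe?g)$"

images = [
    name for name in filenames
    if re.search(image_pattern, name, flags=re.IGNORECASE)
]
print(images)  # ['car.jpg', 'logo.PNG', 'photo.jpeg']
```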

Capturing groups and back references

When an expression is enclosed in parentheses, the text it matches is captured and can be referenced later. Groups are numbered by the position of their opening parenthesis, from left to right: the first is referred to as \1, the second as \2, and so on.

In the following example we convert an American date (09-28-2019) to its European counterpart by swapping the month and the day:

euro_date = re.sub(
    r"(\d{2})-(\d{2})-(\d{4})",
    r"\2-\1-\3",
    "09-28-2019"
)
print(euro_date)

The result:

28-09-2019
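Captured groups can also be read directly from a Match object, and they can be given names, which keeps longer patterns readable. A minimal sketch using the same date; the group names month, day and year and the ISO-style output are our own choices:

```python
import re

us_date = "09-28-2019"

# Numbered groups can be read from the Match object.
match = re.search(r"(\d{2})-(\d{2})-(\d{4})", us_date)
if match:
    month, day, year = match.groups()
    print(match.group(0))  # the whole match: 09-28-2019
    print(match.group(1))  # the first group: 09

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
iso_date = re.sub(
    r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})",
    r"\g<year>-\g<month>-\g<day>",
    us_date,
)
print(iso_date)  # 2019-09-28
```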

Finding matches in lists

To search for matching strings in a list, we call re.search() on each item, here inside a list comprehension.

In the following list we'll find those strings that start with an "f":

foods = ["fish", "popcorn", "chips", "falafel", "humus", "pizza"]

First, we'll write the expression:

r"^f\w+"

The expression matches an "f" at the start of the string, followed by at least one word character:

foods = ["fish", "popcorn", "chips", "falafel", "humus", "pizza"]
f_foods = [item for item in foods if re.search(r"^f\w+",item)]
print(f_foods)

The result:

['fish', 'falafel']
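When the same expression is applied to many items, it can be compiled once with re.compile() and reused. A sketch under that assumption, with a second pattern of our own for foods that end in "s":

```python
import re

foods = ["fish", "popcorn", "chips", "falafel", "humus", "pizza"]

# Compile each pattern once, then reuse it for every item in the list.
starts_with_f = re.compile(r"^f\w+")
ends_with_s = re.compile(r"s$")

f_foods = [item for item in foods if starts_with_f.search(item)]
s_foods = [item for item in foods if ends_with_s.search(item)]

print(f_foods)  # ['fish', 'falafel']
print(s_foods)  # ['chips', 'humus']
```

A compiled pattern offers the same search(), findall() and sub() methods as the re module itself, without the pattern argument.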

How to proceed from here?

Regular expressions can help you in many situations, so you should keep learning them. Some sources that have helped me in particular are:
