Python regex tutorial
Regular expressions, which identify patterns in strings, are among the most important tools in any programming language. They let us write short code that does a lot. For example, to identify a phone number in a document you might be tempted to write a multi-line script that uses a loop with many conditions, but you can often do the same job with a single line of code that uses a regular expression.
Regular expressions help us in performing 4 main tasks:
- In finding data that interests us within a large amount of information.
- In validating user inputs to forms such as an email address or a phone number.
- In replacing an expression, e.g., swear filters.
- In adjusting the format of data obtained from unreliable sources. For example, formatting dates that users enter into a form.
So if you want to improve your programming skills, and work less (much less) then you really should learn the subject of regular expressions. Let's get started!
The "re" Python module
To work with regular expressions in Python we will start by importing the module re (short for regular expression).
import re
The re.search() method
The re.search() method finds an expression in a string.
For example, let's find the expression "regex" inside the string "regex are awesome":
if re.search("regex", "regex are awesome"):
    print("regex exists")
else:
    print("no regex")
The result:
regex exists
As simple as that!
To understand how the method works, let's see the following examples:
print(re.search("regex", "all the regexes are awesome"))
The result:
<re.Match object; span=(8, 13), match='regex'>
- The method found a match for the expression that starts at index 8 and ends just before index 13.
But what happens when the method doesn't find a match?
print(re.search("regex", "all the ... are awesome"))
The result:
None
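Since re.search() returns either a match object or None, a common way to use it is to keep the result in a variable and check it before reading the details. The following is a minimal sketch of that idea, reusing the string from the example above:

```python
import re

match = re.search("regex", "all the regexes are awesome")

if match is not None:
    matched_text = match.group()  # the text that matched
    start, end = match.span()     # start index (inclusive) and end index (exclusive)
    print(matched_text, start, end)
else:
    print("no match")
```

Checking for None first matters because calling match.group() on a failed search raises an AttributeError.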
The method re.search() finds the first match in the string and then stops searching:
print(re.search("regex", "all the regexes are awesome including the regex [a-z]"))
The result:
<re.Match object; span=(8, 13), match='regex'>
To find all the matches in a string use the method re.findall(), as we'll see in the following section.
What happens if we change even a single character in the expression? To test this let's change the first letter of the expression from lower case to capital: "regex" -> "Regex".
if re.search("Regex", "regex are awesome"):
    print("expression exists")
else:
    print("no match")
The regular expression engine can't find a match:
no match
We can also write a regular expression (regex for short) that finds a match regardless of the case of the first letter. For example, by using the following expression:
r"[Rr]egex"
- In Python, a regular expression is usually written as a raw string: it is enclosed in quotes and preceded by "r", so that backslashes reach the regex engine unchanged.
Our regex matches either "R" or "r" followed by the text "egex".
Let's give it a try:
if re.search(r"[Rr]egex", "regex are awesome"):
    print("found a match")
else:
    print("no match")
The result:
found a match
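The expression [Rr]egex only handles the case of the first letter. If the whole word should match in any case, one option is the re.IGNORECASE flag of the re module. Here is a short sketch; the sample strings are made up for the illustration:

```python
import re

samples = ["regex are awesome", "Regex are awesome", "REGEX are awesome"]

# re.IGNORECASE makes every letter in the pattern match both cases
found = [bool(re.search(r"regex", text, flags=re.IGNORECASE)) for text in samples]
print(found)

# without the flag, [Rr]egex does not match the all-caps version
print(re.search(r"[Rr]egex", "REGEX are awesome"))
```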
The regular expression is made up of characters, each of which represents part of the pattern we want to capture. We will explain the special characters that make up the regular expression shortly, but first let's see another Python method for working with regular expressions.
The re.findall method
The re.findall() method is one of the most commonly used of Python's methods for finding matches. While the re.search() method, which we saw in the previous section, finds only the first match, re.findall() returns a list that includes all the matches to the expression.
In the following example, we try to find all the people's names in the string:
text = 'George is 21 yrs. old and Kate is 19'
We know that English names start with a capital letter followed by at least one lowercase letter. The following expression can identify the pattern:
'[A-Z][a-z]+'
Square brackets define a character class, which matches a single character from a set or range of characters:
- [A-Z] to find a capital letter.
- [a-z]+ to search for at least one lowercase letter. The plus sign (+) matches one or more occurrences of the expression that it follows.
The following code finds all the names in the sentence:
# "text" is used as the variable name to avoid shadowing Python's built-in str
text = 'George is 21 yrs. old and Kate is 19'
# re.findall() returns a list of all the matching names
names = re.findall(r'[A-Z][a-z]+', text)
print(names)
The result:
['George', 'Kate']
Range of characters
To find a match for a range of characters, we use square brackets and hyphens to separate the starting character from the ending one. For example, [A-Z] matches any uppercase letter.
Let's see some of the most used ranges:
| Expression | Matches |
|---|---|
| [ab] | a or b |
| [abc] | a, b or c |
| [a-c] | the range of characters a to c |
| [a-d] | the range of lowercase characters a to d |
| [A-Z] | all the uppercase letters |
| [a-z] | all the lowercase letters |
| [A-Za-z] | all the letters, uppercase and lowercase |
| [A-Za-z0-9_] | all the letters, the digits, and the underscore |
| [0-9] | all the digits |
For example, to match the range of letters a to d or the range of digits 2-5 use the expression:
[a-d2-5]
So far, so good!
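To see the combined range in action, here is a small sketch that collects every character matched by [a-d2-5]. The sample string is made up for the illustration:

```python
import re

sample = "a1 b2 c3 d4 e5 f6"

# each match is a single character that is either a-d or 2-5
matches = re.findall(r"[a-d2-5]", sample)
print(matches)
```

The letters e and f and the digits 1 and 6 fall outside both ranges, so they are skipped, and so are the spaces.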
To find the complement of a set add the caret symbol (^) right after the opening bracket. For example, the regex:
[^a-d]
- Matches any character that is not in the range a-d.
Let's see other examples of regular expressions that are complementary sets:
| Expression | Matches |
|---|---|
| [^A-Z] | any character that is not an uppercase letter |
| [^a-z] | any character that is not a lowercase letter |
| [^A-Za-z] | any character that is not a Latin letter |
| [^0-9] | anything that is not a digit |
| [^A-Za-z0-9] | anything that is not a Latin letter or a digit |
| [^a-d] | any character that is not in the range a-d |
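A complementary set is handy both for finding what lies between the characters we care about and for stripping unwanted characters. The following sketch shows both uses on a made-up sample string:

```python
import re

sample = "Room 42, Floor 7"

# runs of characters that are not digits
non_digits = re.findall(r"[^0-9]+", sample)
print(non_digits)

# remove everything that is not a Latin letter or a digit
compact = re.sub(r"[^A-Za-z0-9]", "", sample)
print(compact)
```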
Quantifiers
Let us return for a moment to the example we have already seen:
text = 'George is 21 yrs. old and Kate is 19'
As we already know, the method re.findall() returns a list of all the matches to the expression that it can find:
names = re.findall(r'[A-Z][a-z]+', text)
We can see that the expression:
r'[A-Z][a-z]+'
Contains a range of characters and also the plus sign (+). The plus sign has a special meaning in a regex since it means that the expression that comes before it has to appear at least once.
The plus sign (+) is an example of a quantifier. Quantifiers are special signs that specify how many times a character or group of characters must appear in the input for a match to be found.
The following table shows the quantifiers:
| Quantifier | Description |
|---|---|
| + | matches one or more times |
| * | matches zero or more times |
| ? | matches zero or 1 time |
| {3} | matches exactly 3 times |
| {n} | matches exactly n times |
| {n,} | matches at least n times |
| {n,m} | matches from n to m times |
| {,m} | matches at most m times |

Note that Python does not allow spaces inside the braces: a{2, 3} is treated as literal text, not as a quantifier.
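One way to get a feel for the quantifiers is to test each of them against strings of growing length. The sketch below does this with re.fullmatch(), which succeeds only when the pattern covers the entire string. The sample strings and the results dictionary are made up for the illustration:

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]
patterns = [r"a+", r"a*", r"a?", r"a{3}", r"a{2,}", r"a{2,3}", r"a{,2}"]

# for each quantifier, list the samples that it matches in full
results = {}
for pattern in patterns:
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])
```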
Special characters
So far we have seen how to denote a range of characters, but there is even a shorter syntax to denote special ranges:
| Character | Description |
|---|---|
| . | a dot (.) matches any character except a newline |
| \d | matches any digit. For ASCII text, equivalent to [0-9] |
| \w | matches any word character: a letter, a digit, or the underscore. For ASCII text, equivalent to [A-Za-z0-9_] |
| \s | matches any white space character, including space, tab, carriage return, etc. |
The special characters are extremely useful. For example, to find all the ages in the following string:
text = "George is 21 yrs. old and Kate is 19"
ages = re.findall(r"\d+", text)
print(ages)
The result:
['21', '19']
The following code snippet builds a list of dictionaries containing the people's names and ages:
text = "George is 21 yrs. old and Kate is 19"
ages = re.findall(r"\d+", text)
names = re.findall(r'[A-Z][a-z]+', text)
people = []
# pair each name with the age found at the same position
for name, age in zip(names, ages):
    people.append({'name': name, 'age': age})
Now that we know that special ranges are denoted by lowercase letters, it is time to learn that the complementary sets are indicated by the corresponding uppercase letters. For example, if \d indicates the range of digits 0-9, then \D specifies the complementary set.
| Character | Description |
|---|---|
| \D | any character that is not a digit. For ASCII text, equivalent to [^0-9] |
| \W | any character that is not a word character. For ASCII text, equivalent to [^A-Za-z0-9_] |
| \S | any non white space character |
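To compare the three complementary sets side by side, the following sketch runs each of them on the same made-up string:

```python
import re

sample = "Kate: 19 yrs."

non_digits = re.findall(r"\D+", sample)  # runs of characters that are not digits
non_words = re.findall(r"\W+", sample)   # runs of characters that are not word characters
non_spaces = re.findall(r"\S+", sample)  # runs of characters that are not white space

print(non_digits)
print(non_words)
print(non_spaces)
```

Note that \S+ is a quick way to split a sentence into space-separated chunks, punctuation included.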
Start of string and end of string anchors
Regex anchors don't match any character. Instead, they match a position before, after, or between characters.
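| Anchor | Description |
|---|---|
| ^ | matches the start of a string |
| $ | matches the end of a string |
| \b | matches a word boundary: the position between a word character and a non-word character, or the start or end of the string |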
The following code matches the first word of a string: the word has to start at the beginning of the string (^) and end at a word boundary (\b):
print(re.findall(r"^\w+\b", "Everything you look for"))
The result:
['Everything']
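Anchors are also useful for matching whole words and for validating that an entire input fits a pattern. The sketch below shows both. The sample strings and the helper name is_five_digits are made up for the illustration:

```python
import re

# \b on both sides finds "cat" only when it stands as a whole word
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole_words)

def is_five_digits(value):
    """Hypothetical validator: True if the input consists of exactly 5 digits."""
    # ^ and $ together force the pattern to cover the input from start to end
    return re.search(r"^\d{5}$", value) is not None

print(is_five_digits("12345"))
print(is_five_digits("123456"))
```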
Escaping the special characters
What happens when we want to find a match for a literal dot or $ (even though they have a special meaning)? In such cases, we put a backslash (\) before the character, which removes its special meaning.
| Character | Description |
|---|---|
| \. | simply a dot |
| \$ | simply a $ sign |
| \^ | simply a ^ (caret) sign |
| \\ | simply a backslash |
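Here is a short sketch of escaping in practice, using a made-up string with prices. It also uses re.escape(), a function of the re module that adds the backslashes for us when the text to search for is not known in advance:

```python
import re

price_text = "Total: $3.50 (was $4.00)"

# \$ and \. match a literal dollar sign and a literal dot
prices = re.findall(r"\$\d+\.\d{2}", price_text)
print(prices)

# an unescaped dot matches any character, so "3.50" also matches "3x50"
print(bool(re.search("3.50", "3x50")))

# re.escape() escapes the special characters in a string
escaped = re.escape("3.50")
print(bool(re.search(escaped, "3x50")))
```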
Python regex replacement
To replace an expression use the re.sub() method.
The following code replaces everything that is not a digit with an empty string:
only_numbers = re.sub(r'\D', '', '088-9123-4100')
print(only_numbers)
The result:
08891234100
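The same method can serve as the simple word filter mentioned at the start of the tutorial. The sketch below is only an illustration: the comment text and the filtered word are made up. It also shows the optional count argument of re.sub(), which limits the number of replacements:

```python
import re

comment = "This darn printer is a darned mess"

# \b keeps the filter from touching longer words that merely contain "darn"
clean = re.sub(r"\bdarn\b", "****", comment)
print(clean)

# count=3 replaces only the first 3 matches
masked = re.sub(r"\d", "#", "088-9123-4100", count=3)
print(masked)
```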
Lazy expressions
Regex quantifiers are greedy by default: they match as much text as possible, which can produce a broader match than intended.
For example, the following string mentions 2 people, and I would like to replace each person's color with "[@!?]" using the following code:
text = "the <span>green</span> man told the <span>blue</span> man a secret"
# the forward slash has no special meaning in Python regex, so it needs no backslash
res = re.sub(r"<span>.+</span>", '[@!?]', text)
print(res)
The output:
the [@!?] man a secret
Instead of replacing each span individually, the expression replaced all of the text from the opening tag of the first span to the closing tag of the last span, because of the greedy behavior of the quantifier. To avoid this problem, we add a question mark "?" right after the quantifier "+", which makes the quantifier match as few characters as possible.
Let's rewrite the expression with a question mark (?) right after the quantifier "+" to make it "lazy" instead of "greedy":
text = "the <span>green</span> man told the <span>blue</span> man a secret"
# +? is the lazy version of +: it stops at the first closing tag it can reach
res = re.sub(r"<span>.+?</span>", '[@!?]', text)
print(res)
The result:
the [@!?] man told the [@!?] man a secret
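The difference between the two behaviors is easy to see when we list the matches instead of replacing them. The following sketch runs the greedy and the lazy expression on the same string with re.findall():

```python
import re

html = "the <span>green</span> man told the <span>blue</span> man a secret"

# greedy: one long match from the first opening tag to the last closing tag
greedy = re.findall(r"<span>.+</span>", html)
print(greedy)

# lazy: one short match per span
lazy = re.findall(r"<span>.+?</span>", html)
print(lazy)
```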
Now that we know how and why to make lazy expressions, let's move on to matching different options.
To match options, separate them by using a pipe (|).
For example, the following expression matches file names with image extensions - png, gif, or jpg:
r"(png|gif|jpg)"
Let's see the code in action:
filename = "car.jpg"
if re.search(r"(png|gif|jp(e)?g)$", filename):
    print('%s is a valid file name' % filename)
else:
    print('%s is not a valid file name' % filename)
The result:
car.jpg is a valid file name
- The expression jp(e)?g matches both jpg and jpeg
- The special character $ looks for the match at the end of the string
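Note that the expression above only checks how the string ends, so it also accepts a name such as "carpng" that has no dot at all. A stricter variant, sketched below, requires a literal dot before the extension. The helper name has_image_extension is made up for the illustration, and (?:...) is a group that combines the options without capturing them:

```python
import re

def has_image_extension(filename):
    """Hypothetical helper: True if the name ends with .png, .gif, .jpg or .jpeg."""
    # \. matches a literal dot; jpe?g matches both jpg and jpeg
    return re.search(r"\.(?:png|gif|jpe?g)$", filename) is not None

for name in ["car.jpg", "car.jpeg", "car.pdf", "carpng"]:
    print(name, has_image_extension(name))
```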
Capturing groups and back references
When an expression is enclosed in parentheses, the text it matches is captured and can be back referenced later. The group that starts with the first opening parenthesis from the left is referred to as \1, the second as \2, and so on.
In the following example we format an American date (09-28-2019) to its European counterpart by putting the month at the second position instead of the first:
euro_date = re.sub(
    r"(\d{2})-(\d{2})-(\d{4})",
    r"\2-\1-\3",
    "09-28-2019"
)
print(euro_date)
The result:
28-09-2019
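Captured groups can also be read directly from a match object, without doing a replacement. Here is a minimal sketch that uses the same date expression with re.search(). The surrounding sentence is made up for the illustration:

```python
import re

match = re.search(r"(\d{2})-(\d{2})-(\d{4})", "Delivered on 09-28-2019")

if match:
    whole_date = match.group(0)         # group 0 is the entire match
    month, day, year = match.groups()   # groups 1, 2 and 3 as a tuple
    print(whole_date)
    print(month, day, year)
```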
Finding matches in lists
To search for matching strings in a list, we call the function re.search() on each item, for example inside a loop or a list comprehension.
In the following list we'll find those strings that start with an "f":
foods = ["fish", "popcorn", "chips", "falafel", "humus", "pizza"]
First, we'll write the expression:
r"^f\w+"
- The expression starts with an "f" followed by at least a single word character
foods = ["fish", "popcorn", "chips", "falafel", "humus", "pizza"]
f_foods = [item for item in foods if re.search(r"^f\w+",item)]
print(f_foods)
The result:
['fish', 'falafel']
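When the same expression is applied to many strings, it can be compiled once with re.compile() and then reused. The sketch below is one possible way to write the example above; the variable name starts_with_f is made up for the illustration:

```python
import re

# compile the expression once and reuse the resulting pattern object
starts_with_f = re.compile(r"^f\w+")

foods = ["fish", "popcorn", "chips", "falafel", "humus", "pizza"]
f_foods = [item for item in foods if starts_with_f.search(item)]
print(f_foods)
```

The compiled pattern offers the same operations we have seen so far, such as search(), findall() and sub(), as methods of the pattern object.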
How to proceed from here?
Regular expressions can help you in many situations, so you should keep learning them. Some sources that have helped me in particular are:
- An extensive tutorial on regular expressions
- Regex cheat sheet
- Online tester for regular expressions