Regular Expressions in Python

Data is often described as one of the most valuable assets in the world, largely because companies such as Google and other technology firms build their businesses on it. For any large organization or individual holding a large volume of data, finding the required information can take a long time without the right tools.

For instance, manually finding every patient whose name begins with ‘A’ in a list of a million patients could take days. With regular expressions in Python, you can do it efficiently. So what are regular expressions?

Regular Expressions in Python

A regular expression is a string of ordinary characters and special symbols that describes a search pattern. You define a pattern, search the data for text that matches it, and then use the matches however you need.

Python provides the ‘re’ module (short for ‘regular expressions’), with functions such as compile(), search(), match(), split(), sub(), and findall() that help find information in the data.

The compile() method

Let us look at a program to understand the compile() method.

#Python program to show how the compile() method works in Regex

#import the module
import re
patt = re.compile(r'c\w\w')
sentence = 'map cap bat pet cat'
outcome = patt.search(sentence)
print(outcome.group())

Output:

cap

We imported the ‘re’ module to use regular expressions in the above program.
We created a variable called ‘patt’ and assigned it the pattern object returned by re.compile(). The regular expression here is ‘c\w\w’: the letter ‘c’ followed by two word characters, which in this sentence matches three-letter words that begin with ‘c’. The ‘r’ prefix makes the pattern a raw string, so Python passes backslash sequences such as ‘\w’ to the regex engine unchanged.
We searched for this pattern in a string called ‘sentence’ that has five three-letter words separated by spaces. The code patt.search(sentence) performs the search, and we assigned its result to another variable called outcome.
Since search() returns a match object, we call outcome.group() to get the matched text. You can see that it printed only ‘cap’, the first word that begins with ‘c’, and ignored ‘cat’. This approach is helpful when you need only the first occurrence of a pattern.
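One reason to compile a pattern is that the same pattern object can be reused on many strings. The sketch below illustrates that idea with the pattern from the program above; the list of sample sentences is made up for illustration.

```python
import re

# Compile once, then reuse the same pattern object on several strings.
patt = re.compile(r'c\w\w')

# Hypothetical sample inputs, chosen only for illustration.
sentences = ['map cap bat', 'pet cat', 'dog hen']

first_matches = []
for sentence in sentences:
    outcome = patt.search(sentence)
    # search() returns None when nothing matches, so check before calling group().
    first_matches.append(outcome.group() if outcome else None)

print(first_matches)  # ['cap', 'cat', None]
```

The last sentence contains no ‘c’, so its entry is None rather than a matched word.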

The search() method

Let us look at a program to understand the usage of the search() method.

#Python program to use the search() method in regular expressions

import re #import the re module

sentence = 'map cap bat pet cat'
outcome = re.search(r'c\w\w',sentence)

#print the result only if the pattern finds a match
if outcome:
  print(outcome.group())

Output:

cap

Unlike the compile() example, here we pass the pattern directly to re.search(), followed by the string we want to search. In our program, that string is in the variable ‘sentence’.
We then used an ‘if’ statement to print the output only when search() finds a match, because it returns None otherwise.
The search() method also returned only the first occurrence of a word that matches the pattern. It printed only ‘cap’ in the output and ignored ‘cat’.
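The match object returned by search() carries more than the matched text. As a small sketch using the same sentence, the span() method reports where the match was found; the variable names matched_text and position are only illustrative.

```python
import re

sentence = 'map cap bat pet cat'
outcome = re.search(r'c\w\w', sentence)

if outcome:
    matched_text = outcome.group()  # the text that matched
    position = outcome.span()       # (start index, end index) of the match
    print(matched_text, position)   # cap (4, 7)
```

The end index is exclusive, so sentence[4:7] gives back ‘cap’.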

The match() method

The match() method returns a match object only if the pattern is found at the beginning of the string; otherwise it returns None.

#Python program to show the match() method in regular expressions

#import the regular expressions module
import re
sentence = 'cap map bat pet cat'
outcome = re.match(r'c\w\w',sentence)
print(outcome.group())

Output:

cap

In the program above, match() found the pattern in the string variable ‘sentence’ only because it occurs at the very beginning of the string. If you replace the first word in the sentence with a word that begins with a letter other than ‘c’, match() returns None, and calling outcome.group() then raises an AttributeError, so check the result before using it.
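To make the difference between match() and search() concrete, here is a minimal sketch that runs the same pattern on two sentences. The helper name first_word_match is hypothetical and exists only for this illustration.

```python
import re

def first_word_match(sentence):
    """Hypothetical helper: return the text matched at the start, or None."""
    outcome = re.match(r'c\w\w', sentence)
    # Guard against None so that group() is never called on a failed match.
    return outcome.group() if outcome else None

at_start = first_word_match('cap map bat pet cat')
not_at_start = first_word_match('map cap bat pet cat')

# search() scans the whole string, so it still finds 'cap' in the second sentence.
found_anywhere = re.search(r'c\w\w', 'map cap bat pet cat').group()

print(at_start)        # cap
print(not_at_start)    # None
print(found_anywhere)  # cap
```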

The split() method

Using the split() method, you can split a string wherever the pattern matches and get the parts back as elements of a list.

#Python program to show the split() method

#importing the regular expressions module
import re

sentence = 'Python is an amazing programming language'
outcome = re.split(r'\W+',sentence)
print(outcome)

Output:

['Python', 'is', 'an', 'amazing', 'programming', 'language']

In the program, we used the split() method to separate each word into an element of a list, with the regular expression ‘\W+’. While ‘\w’ matches word characters (letters, digits, and the underscore), the uppercase ‘\W’ matches any character that is not a word character. The ‘+’ sign after ‘\W’ means one or more occurrences, so a run of consecutive non-word characters is treated as a single separator.
In the program above, since a space is a non-word character, split() breaks the sentence at each space and returns the words as list elements.
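A few variations may help here. The sketch below, which uses made-up sample strings, splits on commas and semicolons, limits the number of splits with the maxsplit argument, and shows that trailing punctuation leaves an empty string at the end of the list.

```python
import re

# Split on a comma or semicolon, ignoring spaces around the separator.
record = 'apples, oranges;bananas ,  grapes'
parts = re.split(r'\s*[,;]\s*', record)
print(parts)  # ['apples', 'oranges', 'bananas', 'grapes']

# maxsplit stops after the given number of splits; the rest stays in one piece.
sentence = 'Python is an amazing programming language'
limited = re.split(r'\W+', sentence, maxsplit=2)
print(limited)  # ['Python', 'is', 'an amazing programming language']

# A separator at the very end produces a trailing empty string.
with_punctuation = re.split(r'\W+', 'Hello, world!')
print(with_punctuation)  # ['Hello', 'world', '']
```

If the empty string is unwanted, one option is to filter the list afterwards, for example with a list comprehension that keeps only non-empty parts.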

The sub() method

We can use the sub() method to replace every match of a pattern in a string with a new string. Let us look at a program that replaces text using the sub() method.

#Python program to replace the string using sub() method

#import the regular expression module
import re

sentence = 'Apple is good company'
outcome = re.sub(r'good','best',sentence)
print(outcome)

Output:

Apple is best company

In the program above, we replaced the word ‘good’ with ‘best’ using the sub() method.
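The first argument of sub() is a full regular expression, not just a fixed word. As a rough sketch with a made-up sentence, the examples below mask every digit, replace only the first match using the count argument, and reorder captured groups with backreferences such as \1 and \2.

```python
import re

# Hypothetical sample text for illustration.
sentence = 'Call 555-1234 or 555-9876 today'

# Replace every digit with '#'.
masked = re.sub(r'\d', '#', sentence)
print(masked)      # Call ###-#### or ###-#### today

# count=1 replaces only the first match of the pattern.
first_only = re.sub(r'\d+', 'N', sentence, count=1)
print(first_only)  # Call N-1234 or 555-9876 today

# \1 and \2 in the replacement refer to the first and second captured groups.
swapped = re.sub(r'(\d{3})-(\d{4})', r'\2-\1', sentence)
print(swapped)     # Call 1234-555 or 9876-555 today
```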

Sequence Characters of Regular Expressions in Python

Before looking at other methods like findall(), you need to understand the sequence characters of regular expressions. The table below lists each sequence character and what it matches.

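Character	Description
\d	Matches any decimal digit, such as 0-9
\D	Matches any character that is not a digit
\s	Matches any whitespace character (space, tab, newline)
\S	Matches any non-whitespace character
\w	Matches any word character (letter, digit, or underscore)
\W	Matches any character that is not a word character
\b	Matches the empty position at a word boundary (the start or end of a word)
\A	Matches only at the beginning of the string
\Z	Matches only at the end of the string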
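The short sketch below tries several of these sequence characters on made-up sample strings. It is only an illustration of the table, and the variable names are arbitrary.

```python
import re

# Hypothetical sample text for illustration.
text = 'Order 66 shipped on day_3'

digits = re.findall(r'\d', text)   # each single digit
words = re.findall(r'\w+', text)   # runs of word characters; '_' counts as one
gaps = re.findall(r'\s', text)     # each whitespace character
print(digits)     # ['6', '6', '3']
print(words)      # ['Order', '66', 'shipped', 'on', 'day_3']
print(len(gaps))  # 4

# \A and \Z anchor the pattern to the start and end of the whole string.
starts_with_order = bool(re.search(r'\AOrder', text))
ends_with_digit = bool(re.search(r'\d\Z', text))
print(starts_with_order, ends_with_digit)  # True True

# \b matches at word boundaries, so 'cat' inside 'concat' is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
any_position = re.findall(r'cat', 'cat concat cat.')
print(len(whole_words), len(any_position))  # 2 3
```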

The findall() method

Using the findall() method, you can scan the string from left to right and get a list of all the non-overlapping matches of your pattern. Let us look at an example program now.

#Python program to find the usage of findall() method

#import the regular expressions module
import re

sentence = 'baseball is the best game for any boy in this world'
outcome = re.findall(r'b[\w]*',sentence)

print("Please find below the matches we've found")
for every_word in outcome:
  print(every_word)

Output:

Please find below the matches we've found
baseball
best
boy

You can see in the program above that we provided the regular expression ‘b[\w]*’. Here the leading ‘b’ matches the literal letter ‘b’. The ‘\w’ refers to any word character, and since we have written it as [\w]*, we are referring to zero or more word characters.
It means that we are looking for the letter ‘b’ followed by zero or more word characters. Note that the pattern is not anchored to the start of a word: it matches from a ‘b’ anywhere, so in a word such as ‘cabin’ it would return ‘bin’. To match only words that begin with ‘b’, add a word boundary, as in ‘\bb\w*’.
In the example sentence, three words begin with the letter ‘b’, and no other word contains a ‘b’. The findall() method finds all three words, as shown in the output above.
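To see why a word boundary matters for "begins with" patterns, here is a small sketch on a made-up sentence in which ‘b’ also appears in the middle of words. The pattern ‘\bb\w*’ is one possible way to anchor the match to the start of a word.

```python
import re

# Hypothetical sentence where 'b' also occurs inside words.
sentence = 'the cabin by the bay is a hobby'

# Without a boundary, matching starts at any 'b', even mid-word.
unanchored = re.findall(r'b[\w]*', sentence)
print(unanchored)  # ['bin', 'by', 'bay', 'bby']

# \b requires the 'b' to sit at the start of a word.
anchored = re.findall(r'\bb\w*', sentence)
print(anchored)    # ['by', 'bay']
```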

To understand this concept better, let us look at one more program that finds all the words that begin with a digit.

#Python program to find all the words that begin with a digit

#import the regular expressions module
import re

sentence = 'The salary cycle is usually calculated on 15th and 30th of every month'
outcome = re.findall(r'\d[\w]*',sentence)

for every_word in outcome:
  print(every_word)

Output:

15th
30th

In the above program, we provided the regular expression ‘\d[\w]*’. Here \d refers to any digit, and [\w]* refers to zero or more word characters. It means we are looking for text that begins with a digit, followed by any number of letters, digits, or underscores. You can see in the output that it correctly displayed 15th and 30th, which match our required pattern.

Quantifiers in Regular Expressions

You may have noticed that we have already used the quantifier ‘*’ in the programs above to refer to zero or more occurrences of its preceding regular expression. Let us look at the quantifiers available in Python.

Quantifier	Description
+	One or more occurrences of the preceding regular expression
*	Zero or more occurrences of the preceding regular expression
?	Zero or one occurrence of the preceding regular expression
{m}	Exactly m occurrences of the preceding regular expression
{m,n}	From m to n occurrences; if omitted, m defaults to 0 and n to no upper limit

The above quantifiers can be used in any regular expression statement based on your requirements.

Let us look at a program that can retrieve only the phone number from the string.

#Python program to find the phone number from a string.

#import the regular expressions module
import re

sentence = 'Sachin 9876543210'
outcome = re.search(r'\d+',sentence)

print(outcome.group())

Output:

9876543210

In the above program, we defined a regular expression that will look for one or more occurrences of digits using ‘\d+‘. You can notice in the output that it printed the entire mobile number as per our requirement.
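The counted quantifiers {m} and {m,n} and the optional quantifier ‘?’ can be sketched in the same way. The sample strings below are made up, and the word boundaries (\b) are added so that a shorter count does not match part of a longer number.

```python
import re

# Hypothetical sample text for illustration.
sentence = 'Sachin 9876543210 ext 42, pin 560001'

# {10} matches exactly ten digits in a row.
ten_digit = re.findall(r'\b\d{10}\b', sentence)
print(ten_digit)   # ['9876543210']

# {2,6} matches between two and six digits in a row.
two_to_six = re.findall(r'\b\d{2,6}\b', sentence)
print(two_to_six)  # ['42', '560001']

# ? makes the preceding 'u' optional.
spellings = re.findall(r'colou?r', 'color and colour')
print(spellings)   # ['color', 'colour']
```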

Special Characters of Regular Expressions in Python

We have already looked at a special character ‘\‘ in the above programs. In the table below, let us look at all the special characters used for regular expressions in Python.

Special Character	Description
\	Escapes a special character so that it is matched literally; also introduces a special sequence such as \d
.	Matches any character except a newline
^	Matches the beginning of the string
$	Matches the end of the string (or just before a newline at the very end)
[…]	Matches any one character from a set. For example, [7a-c] matches ‘7’, ‘a’, ‘b’, or ‘c’
[^…]	Matches any one character not in the set. For example, [^a-c5] matches any character except ‘a’, ‘b’, ‘c’, or ‘5’
(…)	Matches the regular expression inside the parentheses and captures the result as a group
R|S	Matches either regex R or regex S

Let us look at an example that checks whether a string begins with ‘apple’.

#Python program to check whether a string begins with 'apple'

#import regular expressions module
import re

sentence = 'apple a day keeps doctor away'
outcome = re.search(r'^apple',sentence)
if outcome:
  print("sentence starts with apple")
else:
  print("String does not begin with apple")

Output:

sentence starts with apple

You can notice in the program above that we used the special character ‘^‘ to denote the beginning of the string. Since ‘apple‘ is mentioned in the regular expression after the special character, it searches for the occurrence of ‘apple‘ at the beginning of the string.
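The other special characters in the table can be tried in the same way. The following is a minimal sketch on made-up strings covering ‘$’, alternation, character sets, and capturing groups; names such as key_value are only illustrative.

```python
import re

sentence = 'apple a day keeps doctor away'

# $ anchors the pattern to the end of the string.
ends_with_away = bool(re.search(r'away$', sentence))
print(ends_with_away)  # True

# R|S matches either alternative.
fruit = re.search(r'apple|orange', 'I like orange juice').group()
print(fruit)           # orange

# [...] matches one character from the set; [^...] matches one character outside it.
vowels = re.findall(r'[aeiou]', 'doctor')
non_vowels = re.findall(r'[^aeiou]', 'doctor')
print(vowels)          # ['o', 'o']
print(non_vowels)      # ['d', 'c', 't', 'r']

# (...) captures parts of the match; group(1) and group(2) return them.
key_value = re.search(r'(\w+)=(\d+)', 'retries=3')
print(key_value.group(1), key_value.group(2))  # retries 3
```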

Similarly, you can tweak the quantifiers, special characters, sequence characters and all the methods mentioned above to find the desired pattern in any string.

Regular Expression on Files in Python

We will now see how to use Regular Expressions on Files in Python. Let us first create a text file with some text in it. We will then create a program to find the email addresses from the text in the file.

Save the below text as findemails.txt in your current directory.

This file is a sample file that contains an email address
samplefile@gmail.com
Let us add one more email address as asusboy@gmail.com

Let us now create a Python program that can fetch only the emails from the above-mentioned text file.

#Python program to find emails from a text file

#import regular expressions module
import re

#open the file for reading; the with block closes it automatically
with open('findemails.txt','r') as f:
  #find emails in each line using for loop
  for every_line in f:
    outcome = re.findall(r'\S+@\S+',every_line)
    #display only if the results are available
    if outcome:
      print(outcome)

Output:

['samplefile@gmail.com']
['asusboy@gmail.com']

In the above program, we opened the text file ‘findemails.txt’ in read mode and read it line by line. On each iteration, we assigned the list returned by findall() to the variable outcome and printed it only when it was not empty.

The regular expression we used is ‘\S+@\S+’, which matches one or more non-whitespace characters before and after the symbol ‘@’. It retrieves any whitespace-delimited string that contains ‘@’, which is a loose approximation of an email address rather than a strict check.
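Because ‘\S+’ accepts any non-whitespace character, punctuation that touches an address is captured along with it. The sketch below shows this on a slightly modified copy of the sample text (a comma is added after the second address) and compares it with a somewhat stricter pattern. The stricter pattern is an assumption made for this illustration, not a complete email validator, and the file is written to a temporary folder so the example is self-contained.

```python
import os
import re
import tempfile

# Modified sample text: note the comma right after the second address.
sample_text = (
    'This file is a sample file that contains an email address\n'
    'samplefile@gmail.com\n'
    'Let us add one more email address as asusboy@gmail.com, please.\n'
)

loose = re.compile(r'\S+@\S+')
# Illustrative stricter pattern: name, '@', then dot-separated domain parts.
strict = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

loose_emails = []
strict_emails = []

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'findemails.txt')
    with open(path, 'w') as f:
        f.write(sample_text)

    # Read the file line by line and collect matches from both patterns.
    with open(path, 'r') as f:
        for every_line in f:
            loose_emails.extend(loose.findall(every_line))
            strict_emails.extend(strict.findall(every_line))

print(loose_emails)   # ['samplefile@gmail.com', 'asusboy@gmail.com,']
print(strict_emails)  # ['samplefile@gmail.com', 'asusboy@gmail.com']
```

The loose pattern keeps the trailing comma, while the stricter one stops at the end of the domain.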

You can use regular expressions for more complex use cases in your projects. Although they look tricky at first glance, you can master them with practice. Please bookmark this post if you want to revisit it while learning regular expressions in Python. We hope you have understood this concept, and let us know your views in the comments section below.

About The Author

Derin
