
Python RegEx or Regular Expressions

July 19th, 2019

A regular expression (RegEx) is a sequence of characters that describes a search pattern. It lets you find or match other strings using a compact, specialized syntax.

Importing RegEx

In Python, there is a built-in module called “re” which helps you work with regular expressions. Import it as follows:

import re

Specifying patterns with RegEx

Regular expressions are specified using ordinary characters together with metacharacters such as ^ and $, which carry a special meaning. Below is a list of metacharacters that are commonly used in RegEx.

Metacharacters in RegEx

^ (Caret)

The caret symbol ^ is used to check whether a string begins with a specific character or characters. For example, ^ca indicates that a string must begin with the characters ‘ca’ in order to be considered a match. Strings such as cat, car and candle are examples of a successful match.
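As a small illustration of the caret, the sketch below filters a word list with ^ca. The word list and variable names are made up for this example; re.search() is used so that the anchor, not the function, is what pins the match to the start of the string.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["cat", "car", "candle", "scar", "dog"]

# re.search() scans the whole string, so the ^ anchor is what forces
# the match to begin at the first character.
starts_with_ca = [w for w in words if re.search(r"^ca", w)]
print(starts_with_ca)  # ['cat', 'car', 'candle']
```

Note that scar is rejected even though it contains ‘ca’, because the ‘ca’ is not at the start.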

$ (Dollar)

The dollar symbol $ checks to see if a string ends with a specific character(s). For instance, e$ implies that a string should end with the character ‘e’ to be considered as a successful match. Strings like dare, imagine and glee are a few examples that are considered a successful match.
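The same kind of check can be sketched for the dollar sign. The word list here is an assumption made for the example, not part of the original text.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["dare", "imagine", "glee", "eat", "red"]

# e$ only matches when the 'e' is the last character of the string.
ends_with_e = [w for w in words if re.search(r"e$", w)]
print(ends_with_e)  # ['dare', 'imagine', 'glee']
```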

. (Dot)

A dot matches any single character except a newline. Two dots match any two characters, five dots match any five characters, and so forth. For instance, when a whole string is tested against “h..t”, the string must begin with the letter ‘h’, followed by any two characters (as two dots are given), and end with the letter ‘t’ to be a successful match. The words hint, hurt and hoot are all examples of a match.
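A minimal sketch of the dot in action follows. It uses re.fullmatch(), which requires the entire string to fit the pattern; the word list is hypothetical.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["hint", "hurt", "hoot", "hat", "heart"]

# h..t needs exactly four characters: 'h', any two characters, then 't'.
four_letter = [w for w in words if re.fullmatch(r"h..t", w)]
print(four_letter)  # ['hint', 'hurt', 'hoot']
```

Here hat is too short and heart is too long, so neither fits the two-dot pattern.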

* (Asterisk)

An asterisk indicates zero or more occurrences of the character immediately to the left of the * sign. For instance, if the pattern is specified as “ca*r”, searching strings such as car, cr, caaar and scar would return a match. However, something like cair would not return a match since it has an ‘i’ in between ‘a’ and ‘r’. So any string that has zero or more occurrences of ‘a’ between c and r would be considered a match in this example.
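The asterisk example can be checked with a short sketch like the one below. It assumes re.search(), which looks for the pattern anywhere in the string; that is why scar counts as a match.

```python
import re

# The words from the paragraph above, plus the failing case 'cair'.
words = ["car", "cr", "caaar", "scar", "cair"]

# ca*r: 'c', then zero or more 'a', then 'r'.
star_matches = [w for w in words if re.search(r"ca*r", w)]
print(star_matches)  # ['car', 'cr', 'caaar', 'scar']

# Inside 'scar', only the 'car' part is what actually matched.
matched_part = re.search(r"ca*r", "scar").group()
print(matched_part)  # car
```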

+ (Plus)

The plus symbol indicates one or more occurrences of the character immediately to the left of the + sign. For instance, if the pattern is specified as “ca+r”, searching strings such as car, caaaaar and scar would return a match. However, cr would not return a match since at least one ‘a’ is required, and cair would not return a match since it has an ‘i’ in between ‘a’ and ‘r’. So any string that has one or more ‘a’s between c and r would be considered a match in this example.
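To make the difference between * and + concrete, the following sketch runs both quantifiers over the same hypothetical word list. The only word they disagree on is cr, which has no ‘a’ at all.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["car", "caaaaar", "scar", "cr", "cair"]

# ca+r requires at least one 'a' between 'c' and 'r'.
plus_matches = [w for w in words if re.search(r"ca+r", w)]
# ca*r also accepts no 'a' at all.
star_matches = [w for w in words if re.search(r"ca*r", w)]

print(plus_matches)  # ['car', 'caaaaar', 'scar']
print(star_matches)  # ['car', 'caaaaar', 'scar', 'cr']
```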

? (Question mark)

The question mark indicates zero or one occurrence of the character immediately to the left of the ? sign. For instance, if the pattern is specified as “ca?r”, searching strings such as cr, car and scar would return a match. However, something like cair would not return a match since it has an ‘i’ in between ‘a’ and ‘r’. So any string that has zero or one ‘a’ between c and r would be considered a match in this example.
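A quick sketch of the question mark, again with an assumed word list. The word caar is included to show that two ‘a’s are one too many for this quantifier.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["cr", "car", "scar", "caar", "cair"]

# ca?r: 'c', then at most one 'a', then 'r'.
optional_matches = [w for w in words if re.search(r"ca?r", w)]
print(optional_matches)  # ['cr', 'car', 'scar']
```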

\ (Backslash)

The backslash introduces a special sequence. It is also used to escape a metacharacter so that it is matched literally; for example, \. matches an actual dot.

Character Description
\A Matches only at the start of the string.
\b Matches the empty position at the beginning or end of a word (a word boundary).
\B Matches a position that is not at the beginning or end of a word.
\d Matches any digit (0-9).
\D Matches any character that is not a digit.
\s Matches any whitespace character (space, tab, newline and so on).
\S Matches any character that is not whitespace.
\w Matches any word character: letters, digits and the underscore.
\W Matches any character that is not a word character.
\Z Matches only at the end of the string.
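The sequences in the table can be tried out with a sketch along these lines. The sample sentence is invented for the example; patterns are written as raw strings (r"...") so that Python passes the backslashes through to the re module unchanged.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66 shipped on Monday"

digits = re.findall(r"\d+", text)         # runs of digits
chunks = re.findall(r"\S+", text)         # runs of non-whitespace
whole_word = re.findall(r"\bon\b", text)  # 'on' as a separate word
inside_word = re.findall(r"\Bon\B", text) # 'on' inside a word (Monday)
at_start = re.search(r"\AOrder", text) is not None
at_end = re.search(r"Monday\Z", text) is not None

print(digits)       # ['66']
print(chunks)       # ['Order', '66', 'shipped', 'on', 'Monday']
print(whole_word)   # ['on']
print(inside_word)  # ['on']
print(at_start, at_end)  # True True
```

Both whole_word and inside_word contain ‘on’, but they come from different places in the text: the first from the standalone word, the second from the middle of Monday.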

| (Vertical bar)

The vertical bar is used for alternation (OR), where a string needs to match at least one of several alternative patterns. For example, for the string “Let’s eat”, the pattern walk|eat returns a match since the string contains “eat”. In this example, the string needs to contain either “walk” or “eat” to return a match.
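The alternation example can be sketched as follows. The second pattern, walk|run, is an assumption added here to show the failing case.

```python
import re

sentence = "Let's eat"

# Either alternative is enough for a match.
found = re.search(r"walk|eat", sentence)
print(found.group())  # eat

# Neither 'walk' nor 'run' occurs, so search() returns None.
missing = re.search(r"walk|run", sentence)
print(missing)  # None
```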

{} (Curly brackets)

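Curly brackets specify how many times the character to their left must occur. For instance, ad{3} specifies that an ‘a’ must be followed by exactly three ‘d’ characters, as in addd, to return a match. A range can also be given: d{2,4} means between two and four occurrences, and d{3,} means at least three.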
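The sketch below shows the repetition counts under the assumption that the whole string is tested with re.fullmatch(). The {3,} and {2,4} forms are standard re syntax; the test strings are made up for the example.

```python
import re

# Exactly three 'd' characters after the 'a'.
exact_ok = re.fullmatch(r"ad{3}", "addd") is not None
exact_short = re.fullmatch(r"ad{3}", "add") is not None

# {3,} means three or more; {2,4} means between two and four.
at_least = re.fullmatch(r"ad{3,}", "addddd") is not None
in_range = re.fullmatch(r"ad{2,4}", "adddd") is not None
out_of_range = re.fullmatch(r"ad{2,4}", "ad") is not None

print(exact_ok, exact_short)             # True False
print(at_least, in_range, out_of_range)  # True True False
```

With re.search() instead of re.fullmatch(), ad{3} would also be found inside a longer run such as adddd, because the first four characters already satisfy the pattern.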

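() (Parentheses)

Parentheses are used for grouping part of a regular expression and capturing the text it matches. A quantifier placed after a group applies to the whole group, and the captured text can be retrieved from the match object afterwards.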
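Since the text gives no example for parentheses, here is a minimal sketch of both uses. The date string and the variable names are assumptions for illustration only.

```python
import re

# Capturing: each pair of parentheses stores the text it matched.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Published on 2019-07-19")
print(m.group(0))  # 2019-07-19  (the whole match)
print(m.group(1))  # 2019        (first group)
print(m.groups())  # ('2019', '07', '19')

# Grouping: the + applies to the whole group 'ab', not just to 'b'.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None
print(repeated)  # True
```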

[] (Square brackets)

Square brackets are used to specify a set of characters, any one of which may appear at that position for the string to be considered a match. The following shows a number of sets that can be used.

Set Description
[a-e] Matches any single character from a to e, which is equivalent to [abcde]. For instance, the string “be a good girl” has 4 matches in this example (b, e, a and d).
[0-14] Matches the range 0-1 or the digit 4, which is equivalent to [014]. It does not mean the numbers 0 to 14.
[anb] Matches any one of the listed characters. For example, [anb] returns a match if the string contains ‘a’, ‘n’ or ‘b’.
[123] Matches any one of the listed digits. For instance, [123] returns a match if a string has ‘1’, ‘2’ or ‘3’.
[^aed] A caret ^ placed first inside a set negates it. For instance, [^aed] matches any character except ‘a’, ‘e’ or ‘d’.
[^0-9] A negated range. For example, [^0-9] matches any character that is not one of the digits 0 to 9.
[0-4][2-5] Two sets in a row match two consecutive characters. For instance, [0-4][2-5] matches a digit from 0 to 4 followed by a digit from 2 to 5, such as 02, 13 or 45, but not 06 or 10.
[a-z][A-Z] Two sets in a row again: a lowercase letter followed immediately by an uppercase letter, as in the “lC” of camelCase. To match any single letter of either case, use one set, [a-zA-Z].
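A few of the sets above can be verified with re.findall(), as in this sketch. Apart from “be a good girl”, the test strings are hypothetical and were picked to show which characters each set accepts.

```python
import re

range_hits = re.findall(r"[a-e]", "be a good girl")
print(range_hits)  # ['b', 'e', 'a', 'd']

# [0-14] is the range 0-1 plus the digit 4, not the numbers 0 to 14.
odd_range = re.findall(r"[0-14]", "2014")
print(odd_range)  # ['0', '1', '4']

non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)  # ['a', 'b']

# Two sets in a row match two consecutive characters.
pairs = re.findall(r"[0-4][2-5]", "02 45 46 10 33")
print(pairs)  # ['02', '45', '33']

lower_then_upper = re.findall(r"[a-z][A-Z]", "camelCase iPhone")
print(lower_then_upper)  # ['lC', 'iP']
```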

RegEx Functions

match() function

The match() function checks whether a RegEx pattern matches at the beginning of a string. The first parameter of the match() function is the pattern, which specifies the regular expression to be matched. The second parameter is the string to be tested against the pattern. The third, optional parameter specifies flags that modify matching, such as re.IGNORECASE; several flags can be combined with the bitwise OR operator (|).
In the following example, the pattern specifies that in order to match, a string should begin with the letters ‘be’ and end with the letter ‘r’. The caret ^ denotes the beginning of a string while the dollar sign $ denotes the end of a string. The three dots in between stand for the three characters that must be present between ‘be’ and ‘r’. As such, for a string to match in this example, it must contain exactly 6 characters, beginning with ‘be’ and ending with ‘r’. Since the test string meets these criteria, the program prints “Search successful.” as shown.

import re
pattern = '^be...r$'
test_string = 'better'
result = re.match(pattern, test_string)
if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

Output

Search successful.
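The optional flags parameter is not exercised in the example above, so here is a hedged sketch of how it might be used. re.IGNORECASE and re.MULTILINE are real flags in the re module; the test strings are invented, and the last call uses re.search() because re.match() only ever tries the very start of the string, even in multiline mode.

```python
import re

pattern = r"^be...r$"

# Without flags the match is case-sensitive.
plain = re.match(pattern, "BETTER")
print(plain)  # None

# re.IGNORECASE makes letters match regardless of case.
ignore_case = re.match(pattern, "BETTER", re.IGNORECASE)
print(ignore_case.group())  # BETTER

# Flags are combined with |. re.MULTILINE lets ^ and $ also match
# at the start and end of each line.
combined = re.search(pattern, "first line\nBETTER",
                     re.IGNORECASE | re.MULTILINE)
print(combined.group())  # BETTER
```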

findall() function

The findall() function returns a list of all the matches found in a string. The matches appear in the list in the order in which they are found. If there is no match, an empty list is returned.

import re
test_string = "The weather is nice "
x = re.findall("he", test_string)
print(x)
test_string2 = "The weather is nice "
y = re.findall("hn", test_string2)
print(y)

Output

['he', 'he']
[]

search() function

The search() function checks a string for a match. If there is a match, it returns a match object; otherwise it returns None. If there is more than one match, only the first match found is returned. Note that although it is similar to the match() function, the search() function scans the whole string to look for a match. On the other hand, the match() function looks for a match only at the beginning of the string.

import re
test_string = "Mary wants a cup of tea and Jake wants a cup of coffee "
x = re.search("cup", test_string)
y = re.match("cup", test_string)
if x:
    print("Success search:", x.group())
else:
    print("No match found for x")
if y:
    print("Success match:", y.group())
else:
    print("No match found for y")

Output

Success search: cup
No match found for y

split() function

The split() function splits a string at every match of a pattern and returns the pieces as a list. For instance, \s, which denotes whitespace, can be specified as the pattern used to split a string.

import re
test_str = "The sky is clear"
x = re.split(r"\s", test_str)  # raw string keeps \s intact for the re module
print(x)

Output

['The', 'sky', 'is', 'clear']

sub() function

The sub() function replaces every match of a pattern in a string with a replacement string. In the following example, each whitespace character is replaced with a dot.

import re
test_str = "The last dew drop"
x = re.sub(r"\s", ".", test_str)  # raw string keeps \s intact for the re module
print(x)

Output

The.last.dew.drop