Day 14 – Regular Expressions
Regular Expressions (RE), also known as regex, are commonly used to search for and replace strings that match certain patterns.
Regular Expression Syntax
String processing often involves finding strings that follow complex rules. Regular expressions are a small language for describing those rules, which makes it possible to match, search for, and replace text.
In other words, a regular expression is code that describes a text pattern.
Anchors
Anchors describe the boundaries of a string. <span>^</span> and <span>$</span> match the beginning and the end of a string, respectively.
For example, <span>^abc</span> matches strings that start with “abc”, <span>abc$</span> matches strings that end with “abc”, <span>abc</span> matches “abc” anywhere in a string, and <span>^abc$</span> matches only the exact string “abc”.
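The four anchor cases can be tried out with a minimal sketch like the one below. It uses Python's re module (covered later in this article); the sample strings and the helper name matching are made up for illustration.

```python
import re

# Sample strings chosen for illustration only.
samples = ['abcdef', 'xyzabc', 'xxabcxx', 'abc']

def matching(pattern):
    """Return the samples in which re.search() finds the pattern."""
    return [s for s in samples if re.search(pattern, s)]

print(matching(r'^abc'))   # strings that start with "abc"
print(matching(r'abc$'))   # strings that end with "abc"
print(matching(r'abc'))    # strings that contain "abc" anywhere
print(matching(r'^abc$'))  # only the exact string "abc"
```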
Metacharacters
Metacharacters are special characters that stand for a set of characters or for a position in the text. Common metacharacters include:
- • <span>.</span> matches any character except a newline
- • <span>\w</span> matches letters, digits, and underscores; in Python 3 this includes Unicode letters such as Chinese characters by default
- • <span>\s</span> matches whitespace characters, including spaces, tabs, newlines, etc.
- • <span>\d</span> matches digits
- • <span>\b</span> matches word boundaries
- • <span>^</span> matches the beginning of a string
- • <span>$</span> matches the end of a string
<span>^\d</span> means the string must start with a digit.
<span>\d$</span> means the string must end with a digit.
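As a rough illustration of the metacharacters listed above, the following sketch applies each of them to one made-up sample string with Python's re module (introduced later in this article).

```python
import re

text = 'Room 42, floor_3'  # made-up sample text

digits = re.findall(r'\d', text)          # every single digit
words = re.findall(r'\w+', text)          # runs of letters, digits, underscores
blanks = re.findall(r'\s', text)          # every whitespace character
whole_word = re.search(r'\b42\b', text)   # "42" delimited by word boundaries
starts_with_digit = re.search(r'^\d', text)
ends_with_digit = re.search(r'\d$', text)

print(digits, words, blanks)
print(whole_word is not None)
print(starts_with_digit is not None, ends_with_digit is not None)
```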
Quantifiers
Quantifiers specify how many times the preceding character (or group) must occur.
- • <span>?</span> matches the preceding character 0 or 1 times; for example, <span>colou?r</span> can match “color” or “colour”.
- • <span>*</span> matches the preceding character 0 or more times; for example, <span>go*gle</span> can match “ggle”, “gogle”, “google”, …
- • <span>+</span> matches the preceding character 1 or more times; for example, <span>go+gle</span> can match “gogle”, “google”, “gooogle”, …
- • <span>{n}</span> matches the preceding character exactly n times; for example, <span>go{2}gle</span> matches only “google”.
- • <span>{n,}</span> matches the preceding character at least n times; for example, <span>go{2,}gle</span> can match “google”, “gooogle”, …
- • <span>{n,m}</span> matches the preceding character at least n times but not more than m times; for example, <span>go{2,4}gle</span> can match “google”, “gooogle”, “goooogle”.
<span>\d{3}</span> matches exactly 3 digits.
<span>\s+</span> matches one or more whitespace characters (spaces, tabs, newlines, etc.).
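The quantifier examples above can be checked with a short sketch such as this one. It assumes re.fullmatch(), which succeeds only when the pattern covers the entire string; the candidate list and the helper name full_matches are illustrative.

```python
import re

candidates = ['ggle', 'gogle', 'google', 'gooogle', 'goooogle', 'gooooogle']

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for pattern in (r'go?gle', r'go*gle', r'go+gle',
                r'go{2}gle', r'go{2,}gle', r'go{2,4}gle'):
    print(pattern, full_matches(pattern))

# \d{3} needs exactly three digits; \s+ accepts any run of whitespace.
spaced = re.fullmatch(r'\d{3}\s+\d{3}', '010 \t 123')
print(spaced is not None)
```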
Character Classes
The predefined metacharacters make it easy to find digits and letters, but to match a custom set of characters you need a character class.
A character class is a set of characters enclosed in square brackets <span>[]</span>; it matches any single character in the set, which can be given as individual characters or as ranges.
For example, [aeiou] matches any lowercase vowel, [0-9] matches any digit, and [a-zA-Z] matches any ASCII letter.
Going further, <span>[0-9a-zA-Z_]</span> matches a single digit, letter, or underscore;
<span>[0-9a-zA-Z_]+</span> matches a string of one or more digits, letters, or underscores, such as ‘a100’, ‘0_Z’, ‘Py3000’, etc.;
<span>[a-zA-Z_][0-9a-zA-Z_]*</span> matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is the form of a valid (ASCII) Python variable name;
<span>[a-zA-Z_][0-9a-zA-Z_]{0,19}</span> limits the name more precisely to 1-20 characters (1 leading character plus up to 19 more). The quantifier must be written without spaces: Python treats <span>{0, 19}</span> as literal text, not as a repetition count.
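A minimal sketch of the length-limited name check, assuming re.fullmatch() so that the whole string has to fit the pattern. The helper name is_short_identifier is hypothetical, and the check is ASCII-only and does not exclude Python keywords.

```python
import re

# ASCII-only name of 1 to 20 characters, as described above.
IDENTIFIER = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}'

def is_short_identifier(name):
    """True if the whole string is a 1-20 character ASCII identifier."""
    return re.fullmatch(IDENTIFIER, name) is not None

for name in ['a100', '_tmp', 'Py3000', '0_Z', 'x' * 21]:
    print(repr(name), is_short_identifier(name))
```

Note that ‘0_Z’ consists only of allowed characters but is rejected here because it starts with a digit.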
Exclusion of Characters
If you want to match any character other than certain characters, put <span>^</span> immediately after the opening bracket of the character class to negate it.
For example, [^aeiou] matches any single character that is not a lowercase vowel.
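A small sketch of a negated character class; the sample inputs are made up. It also shows that the negated class matches anything not listed, not just consonants.

```python
import re

word = 'regular'
non_vowels = re.findall(r'[^aeiou]', word)
print(non_vowels)

# A negated class matches *anything* that is not listed, including
# uppercase letters, digits, and spaces.
others = re.findall(r'[^aeiou]', 'A e 1')
print(others)
```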
Choice of Characters
If you want to match one of several alternatives, separate them with <span>|</span>.
For example, <span>(^\d{17})(\d|X|x)$</span> matches the format of an 18-character ID number: 17 digits followed by a final character that is a digit, X, or x.
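The ID pattern can be exercised as in the sketch below. The helper name looks_like_id and the sample values are invented for illustration; the pattern only checks the format and does not validate any checksum.

```python
import re

ID_PATTERN = r'(^\d{17})(\d|X|x)$'

def looks_like_id(text):
    """Format check only: 17 digits followed by a digit, 'X', or 'x'."""
    return re.search(ID_PATTERN, text) is not None

# Made-up values, not real ID numbers.
for value in ['123456789012345678', '12345678901234567X',
              '12345678901234567Y', '1234567890']:
    print(value, looks_like_id(value))

# The two groups give the 17-digit body and the final character.
body, last = re.search(ID_PATTERN, '12345678901234567x').groups()
print(body, last)
```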
Escape Characters
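If you want to match a metacharacter itself, you need to escape it with <span>\</span>.
For example, <span>\.</span> matches a literal period, whereas an unescaped <span>.</span> matches any character. The backslash also works in the other direction: it gives the ordinary letters in <span>\d</span> (digits) and <span>\s</span> (whitespace) their special meaning.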
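The difference between an escaped and an unescaped period can be seen in a sketch like this one. The sample strings are made up, and re.escape() is a standard-library helper that this article does not otherwise cover.

```python
import re

address = '127.0.0.1'
lookalike = '127a0b0c1'   # same shape, but letters instead of periods

unescaped = r'127.0.0.1'      # each "." matches any character
escaped = r'127\.0\.0\.1'     # each "\." matches only a literal period

print(bool(re.fullmatch(unescaped, lookalike)))
print(bool(re.fullmatch(escaped, lookalike)))
print(bool(re.fullmatch(escaped, address)))

# re.escape() builds an escaped pattern from arbitrary literal text.
literal = '1+1=2'
print(bool(re.fullmatch(literal, literal)))             # "+" acts as a quantifier
print(bool(re.fullmatch(re.escape(literal), literal)))  # "+" is taken literally
```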
Grouping
Parentheses () serve two purposes: they create groups (sub-expressions) that can be quantified as a unit, and they capture the text matched by the group so that it can be extracted.
For example, <span>(\.[0-9]{1,3}){3}</span> repeats the group <span>\.[0-9]{1,3}</span> three times; when a group is repeated, only the text of its last repetition is captured.
Likewise, (thir|four)th matches “thirth” or “fourth”; if the parentheses are removed, thir|fourth matches “thir” or “fourth”.
Using Regular Expressions in Python
In Python, a regular expression is written as a pattern string.
For example, a regular expression that matches a non-letter character can be turned into a pattern string simply by putting it in quotes:
'[^a-zA-Z]'
However, a regular expression that matches words starting with the letter m cannot simply be wrapped in quotes, as follows:
'\bm\w*\b'
This is incorrect, because in a normal string literal Python interprets <span>\b</span> as a backspace character before the regular expression engine ever sees it. Each <span>\</span> has to be escaped as <span>\\</span>, as follows:
'\\bm\\w*\\b'
Because patterns tend to contain many backslashes, in practice it is better to write the pattern as a raw string by adding <span>r</span> before the opening quote, as follows:
r'\bm\w*\b'
For convenience, we will use raw strings from now on.
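The effect of the raw-string prefix can be observed directly. The sketch below compares the string lengths and then runs both spellings against a made-up sentence.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character);
# in a raw string it stays a backslash followed by "b" (two characters).
print(len('\b'), len(r'\b'), len('\\b'))
print(r'\bm\w*\b' == '\\bm\\w*\\b')

text = 'my mom bakes muffins'
print(re.findall(r'\bm\w*\b', text))  # word boundary, "m", word characters
print(re.findall('\bm', text))        # looks for a literal backspace before "m"
```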
Using the re Module to Implement Regular Expression Operations
The re module in Python provides support for regular expressions, including pattern matching, replacement, splitting, and other operations.
You can either call the module-level functions of the re module directly, or first compile the pattern string into a regular expression object and then call the corresponding methods of that object.
Before using the re module, import it with an import statement (modules themselves are introduced later).
import re
Matching Strings
To match strings, you can use functions provided by the re module such as match(), search(), and findall().
match() Method
The match() method tries to match the pattern at the beginning of the string only. If the match succeeds, it returns a Match object; otherwise, it returns None. Even if several substrings of the string match the regular expression, re.match() returns at most one result: the match that starts at the first character.
The syntax format is as follows:
re.match(pattern, string, [flags])
- • pattern: The pattern string of the regular expression, converted from the regular expression to be matched
- • string: The string to be matched
- • flags: Optional parameter to control the matching method of the regular expression. Common flags are:
  - • A or ASCII: Only perform ASCII matching for \w, \W, \b, \B, \d, \D, \s, \S;
  - • I or IGNORECASE: Ignore case;
  - • M or MULTILINE: Multiline mode, changes the behavior of ‘^’ and ‘$’ to match the beginning and end of the string, as well as the beginning and end of each line;
  - • S or DOTALL: The dot (.) matches all characters, including newline characters;
  - • X or VERBOSE: Verbose mode, allows the use of comments, spaces, and newlines to improve readability.
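The effect of several of these flags can be sketched as follows. The two-line sample text and the phone pattern are made up for illustration, and re.fullmatch() is used for the last check so that the whole string must match.

```python
import re

text = 'first line\nSecond line'

# MULTILINE: "^" also matches right after each newline.
print(re.findall(r'^\w+', text))
print(re.findall(r'^\w+', text, re.MULTILINE))

# DOTALL: "." also matches the newline between the two lines.
print(re.search(r'line.Second', text))
print(re.search(r'line.Second', text, re.DOTALL))

# Flags are combined with the bitwise OR operator.
print(re.findall(r'^second', text, re.IGNORECASE | re.MULTILINE))

# VERBOSE: whitespace and "#" comments inside the pattern are ignored.
phone = r'''
    (\d{3})     # area code
    -           # separator
    (\d{3,8})   # local number
'''
print(re.fullmatch(phone, '010-12345', re.VERBOSE).groups())
```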
For example, to check case-insensitively whether a string starts with “mr_”, you can use the following code:
import re
pattern=r'mr_\w+'
string='Mr_John_Doe mr_jane_smith'
match=re.match(pattern, string, re.IGNORECASE)
print(match)
string='Project Mr_John_Doe mr_jane_smith done'
match=re.match(pattern, string, re.IGNORECASE)
print(match)
Output:
<re.Match object; span=(0, 11), match='Mr_John_Doe'>
None
As can be seen, the first string begins with “Mr_John_Doe”, which matches “mr_” case-insensitively, so the match succeeds. The second string begins with “Project ”, so match() returns None even though “Mr_John_Doe” appears later in it.
The flags listed above can be used individually or combined with the bitwise OR operator <span>|</span>. For example, <span>re.IGNORECASE | re.MULTILINE</span> enables both case-insensitive matching and multiline mode.
The Match object holds the position of the match and the matched data:
- • To get the starting position of the match, use the start() method of the Match object.
- • To get the ending position of the match, use the end() method of the Match object.
- • To get the start and end positions as a tuple, use the span() method of the Match object.
- • To get the original string that was passed in for matching, use the string attribute of the Match object.
- • To get the matched text, use the group() method of the Match object.
import re
pattern=r'mr_\w+'
string='Mr_John_Doe mr_jane_smith'
match=re.match(pattern, string, re.IGNORECASE)
print('Start position of matched value:', match.start())
print('End position of matched value:', match.end())
print('Tuple of matched position:', match.span())
print('Matched string:', match.string)
print('Matched data:', match.group())
Output:
Start position of matched value: 0
End position of matched value: 11
Tuple of matched position: (0, 11)
Matched string: Mr_John_Doe mr_jane_smith
Matched data: Mr_John_Doe
search() Method
The search() method scans the string from left to right and looks for the pattern at any position. If a match is found, it returns a Match object; otherwise, it returns None. Like match(), it returns only the first match.
The syntax format is as follows:
re.search(pattern, string, [flags])
- • pattern: The pattern string of the regular expression, converted from the regular expression to be matched
- • string: The string to be matched
- • flags: Optional parameter to control the matching method of the regular expression, same as in the match() method.
For example, to check case-insensitively whether a string contains a substring starting with “mr_”, you can use the following code:
import re
pattern=r'mr_\w+'
string='Mr_John_Doe mr_jane_smith'
match=re.search(pattern, string, re.IGNORECASE)
print(match)
string='Project Mr_John_Doe mr_jane_smith done'
match=re.search(pattern, string, re.IGNORECASE)
print(match)
Output:
<re.Match object; span=(0, 11), match='Mr_John_Doe'>
<re.Match object; span=(8, 19), match='Mr_John_Doe'>
As can be seen, the search() method finds a substring starting with “mr_” at any position in the string, but it returns only the first one.
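To make the difference concrete, the sketch below runs one pattern through match(), search(), and fullmatch() on the same made-up string. re.fullmatch() is part of the re module but is not otherwise covered in this article; it succeeds only if the pattern covers the entire string.

```python
import re

pattern = r'\d+'
text = 'order 66 shipped'   # made-up sample

at_start = re.match(pattern, text)    # pattern must match at position 0
anywhere = re.search(pattern, text)   # first match at any position
whole = re.fullmatch(pattern, text)   # pattern must cover the entire string

print(at_start)
# Check for None before calling methods on the result.
print(anywhere.group() if anywhere else None)
print(whole)
print(re.fullmatch(pattern, '66'))
```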
findall() Method
The findall() method finds all substrings in the string that match the regular expression and returns a list. If no matching substrings are found, it returns an empty list.
The syntax format is as follows:
re.findall(pattern, string, [flags])
- • pattern: The pattern string of the regular expression, converted from the regular expression to be matched
- • string: The string to be matched
- • flags: Optional parameter to control the matching method of the regular expression, same as in the match() method.
For example, to search for all substrings starting with “mr_” in a string without case sensitivity, you can use the following code:
import re
pattern=r'mr_\w+'
string=' Mr_John_Doe mr_jane_smith'
matches=re.findall(pattern, string, re.I)
print(matches)
string='Project Mr_John_Doe mr_jane_smith done'
matches=re.findall(pattern, string) # Case sensitive
print(matches)
Output:
['Mr_John_Doe', 'mr_jane_smith']
['mr_jane_smith']
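One detail worth knowing is that the shape of the findall() result depends on the groups in the pattern. The sketch below shows this, together with re.finditer(), a related function not covered above that yields Match objects instead of strings.

```python
import re

text = 'Mr_John_Doe mr_jane_smith'

whole = re.findall(r'mr_\w+', text, re.I)       # no group: whole matches
names = re.findall(r'mr_(\w+)', text, re.I)     # one group: that group's text
pairs = re.findall(r'(mr)_(\w+)', text, re.I)   # several groups: tuples

# finditer() gives access to the position of every match.
spans = [(m.group(), m.span()) for m in re.finditer(r'mr_\w+', text, re.I)]

print(whole)
print(names)
print(pairs)
print(spans)
```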
Grouping
In addition to checking whether a string matches, regular expressions can extract substrings. The parts to be extracted are marked with () as groups.
<span>^(\d{3})-(\d{3,8})$</span> defines two groups, so the area code and the local number can be extracted directly from the matched string.
If groups are defined in the regular expression, you can use the group(num) or groups() methods of the Match object to extract the substrings:
- • The group(num) method returns the string matched by the num-th group (numbered from 1); if num is 0, it returns the entire matched string.
- • The groups() method returns a tuple containing the strings matched by all the groups.
import re
pattern=r'^(\d{3})-(\d{3,8})$'
string='010-12345'
match=re.match(pattern, string)
print(match.group(1)) # Output area code 010
print(match.group(2)) # Output local number 12345
print(match.groups()) # Output ('010', '12345')
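Because match() returns None when the string does not fit the pattern, calling group() directly on the result can fail. A minimal sketch of a guard, with the helper name split_phone invented for illustration:

```python
import re

PHONE = r'^(\d{3})-(\d{3,8})$'

def split_phone(text):
    """Return (area_code, local_number), or None if the text does not match."""
    match = re.match(PHONE, text)
    if match is None:   # calling .groups() on None would raise AttributeError
        return None
    return match.groups()

print(split_phone('010-12345'))
print(split_phone('010 12345'))
```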
Replacing Strings
To replace strings, you can use the methods provided by the re module such as sub(), subn(), etc.
sub() Method
The sub() method is used to replace all substrings in the string that match the regular expression, returning the modified string.
The syntax format is as follows:
re.sub(pattern, repl, string, [count], [flags])
- • pattern: The pattern string of the regular expression, converted from the regular expression to be matched
- • repl: The replacement string, which can be a string or a function
- • string: The string to be matched
- • count: Optional parameter to specify the maximum number of replacements; the default is 0, indicating all replacements.
- • flags: Optional parameter to control the matching method of the regular expression, same as in the match() method.
For example, to hide phone numbers in a string, you can use the following code:
import re
pattern=r'-\d{8}'
string='My phone number is 010-12345678'
new_string=re.sub(pattern, '-*******', string)
print(new_string) # Output: My phone number is 010-*******
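Since repl can also be a function, the replacement text can be computed from each match. The sketch below assumes a hypothetical helper named mask and made-up phone numbers; it also shows the count parameter.

```python
import re

def mask(match):
    """Keep the hyphen and replace every digit of the match with '*'."""
    digits = match.group()[1:]
    return '-' + '*' * len(digits)

text = 'Call 010-12345678 or 021-87654321'  # made-up numbers

masked_all = re.sub(r'-\d{8}', mask, text)             # replace every match
masked_first = re.sub(r'-\d{8}', mask, text, count=1)  # replace only the first
print(masked_all)
print(masked_first)
```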
subn() Method
The subn() method is similar to the sub() method, but it returns a tuple containing the modified string and the number of replacements made.
import re
pattern=r'-\d{8}'
string='My phone number is 010-12345678'
new_string, count=re.subn(pattern, '-*******', string)
print(new_string) # Output: My phone number is 010-*******
print(count) # Output: 1
Splitting Strings
To split strings, you can use the split() method provided by the re module. The split() method splits the string based on the substrings matched by the regular expression and returns a list.
The syntax format is as follows:
re.split(pattern, string, [maxsplit], [flags])
- • pattern: The pattern string of the regular expression, converted from the regular expression to be matched
- • string: The string to be matched
- • maxsplit: Optional parameter to specify the maximum number of splits; the default is 0, indicating all splits.
- • flags: Optional parameter to control the matching method of the regular expression, same as in the match() method.
For example, to extract the names from a message that mentions several friends with @, you can use the following code. Because the pattern contains a group, the text captured by the group is also included in the result list:
import re
pattern = r'@(\w+)'
string = '@shiqi @wztxy @Iris @Arlan'
list1 = re.split(pattern, string)
print(list1) # Output ['', 'shiqi', ' ', 'wztxy', ' ', 'Iris', ' ', 'Arlan', '']
for item in list1:
    if item.strip():  # skip the empty strings and the spaces between mentions
        print(item) # Output shiqi wztxy Iris Arlan (one per line)
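A more typical use of split() is to cut a string at delimiters, as in the following sketch with made-up input. For the @-mention case, findall() with a group is arguably simpler because it returns the names directly.

```python
import re

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r'[,;\s]+', 'a, b;c  d')
# maxsplit=1 stops after the first split.
first_only = re.split(r'[,;\s]+', 'a, b;c  d', maxsplit=1)
print(parts)
print(first_only)

# Alternative for the @-mention example: collect the group contents.
names = re.findall(r'@(\w+)', '@shiqi @wztxy @Iris @Arlan')
print(names)
```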
Compilation
If you want to reuse the same regular expression, you can compile it into a regular expression object and then use the methods of that object for string processing.
The compile() function compiles a pattern string into a regular expression (Pattern) object, which provides methods such as match(), search(), findall(), sub(), and split().
The syntax format is as follows:
re.compile(pattern [,flags])
- • pattern: The pattern string of the regular expression, converted from the regular expression to be matched
- • flags: Optional parameter to control the matching method of the regular expression, same as in the match() method.
import re
pattern = r'(\d{3})-(\d{3,8})'
string = '010-12345'
pattern_obj = re.compile(pattern)
match = pattern_obj.match(string)
print(match.group(1)) # Output area code 010
print(match.group(2)) # Output local number 12345
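Compiling pays off when the same pattern is applied many times. The sketch below reuses one compiled object across several made-up lines; the name PHONE_RE is illustrative.

```python
import re

PHONE_RE = re.compile(r'(\d{3})-(\d{3,8})')

lines = ['Office: 010-12345', 'no number here', 'Home: 021-7654321']  # made-up data

found = []
for line in lines:
    match = PHONE_RE.search(line)   # same compiled pattern for every line
    if match:
        found.append(match.groups())
print(found)

# The Pattern object offers the same operations as the module-level functions.
print(PHONE_RE.findall('010-12345 and 021-7654321'))
print(PHONE_RE.sub(r'\1-****', 'Office: 010-12345'))  # \1 refers to group 1
```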