In Python programming, handling strings is an extremely common task, and regular expressions are a powerful tool for processing them efficiently. The built-in re module (short for regular expression) provides a complete set of interfaces for regular expression operations such as matching, searching, replacing, and splitting strings. Whether for data cleaning, log analysis, form validation, or text extraction, the re library is one of the most frequently used tools in the standard library.
I. Core Concepts of the re Library: Basics of Regular Expressions
Before learning the re library, it is essential to understand the essence of regular expressions — they are a special syntax used to describe string patterns, defining the “rules for matching strings” through specific character combinations. All functionalities of the re library are fundamentally based on this syntax to parse and process target strings.
1. Common Metacharacters: Building Blocks for Constructing Matching Rules
Metacharacters are the core components of regular expressions; they have special matching meanings and form the basis for constructing complex matching rules. Here are the most commonly used metacharacters in the re library and their functions:
- . (dot): Matches any single character except for a newline character (\n). For example, a.b can match aab, acb, but cannot match a\nb.
- ^ (caret): Matches the beginning position of a string. For example, ^hello can only match strings that start with “hello”, such as “hello world”, and cannot match “world hello”.
- $ (dollar sign): Matches the end position of a string (or the position just before a newline at the very end of the string). For example, world$ can only match strings that end with “world”, such as “hello world”, and cannot match “world hello”.
- * (asterisk): Matches the preceding character or subexpression 0 or more times. For example, ab* can match a, ab, abb, abbb, etc.
- + (plus): Matches the preceding character or subexpression 1 or more times (the difference from * is that it must match at least once). For example, ab+ can match ab, abb, but cannot match a.
- ? (question mark): Matches the preceding character or subexpression 0 or 1 time, and can also be used to change greedy matching to non-greedy matching. For example, ab? can match a or ab.
- {m} (curly braces): Matches the preceding character or subexpression exactly m times. For example, a{3} can match aaa, but cannot match aa or aaaa.
- {m,n}: Matches the preceding character or subexpression at least m times and at most n times (m≤n). For example, a{2,4} can match aa, aaa, aaaa.
- [] (square brackets): Defines a character set, matching any one character in the set. For example, [abc] can match a, b, or c; [0-9] matches any ASCII digit (the same as \d when the re.ASCII flag is set); [a-zA-Z] matches any uppercase or lowercase ASCII letter.
- | (pipe): Represents “or” logic, matching either the left or right subexpression. For example, abc|def can match abc or def.
- () (parentheses): Defines a subexpression (grouping) used to extract matching results or change the order of operations. For example, (ab)+ can match ab, abab, etc., and can extract ab through grouping.
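The examples in the list above can be checked with a short script. This is a minimal sketch that uses re.fullmatch(), which succeeds only when the entire string fits the pattern; the variable names samples and results are illustrative and not part of the re library.

```python
import re

# (pattern, candidate string) pairs taken from the list above
samples = [
    (r"a.b", "acb"),       # . matches any one character...
    (r"a.b", "a\nb"),      # ...except a newline
    (r"ab*", "a"),         # * allows zero repetitions of b
    (r"ab+", "a"),         # + needs at least one b
    (r"ab?", "ab"),        # ? allows zero or one b
    (r"a{3}", "aaaa"),     # {3} means exactly three
    (r"a{2,4}", "aaa"),    # {2,4} means two to four
    (r"[abc]", "b"),       # one character from the set
    (r"abc|def", "def"),   # either alternative
    (r"(ab)+", "abab"),    # the group repeats as a unit
]

# True when the whole candidate string fits the pattern
results = {(p, s): re.fullmatch(p, s) is not None for p, s in samples}

for (p, s), ok in results.items():
    print(f"{p!r:12} vs {s!r:8} -> {ok}")
```

Only three of the pairs are expected to fail: a.b against a string containing a newline, ab+ against a, and a{3} against aaaa.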
2. Predefined Character Sets: Simplifying Common Matching Rules
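To simplify common character matching scenarios, the re library provides predefined character sets, which are shorthand for frequently used character classes. Pay attention to case: the uppercase form usually means “not”. The bracket equivalents below hold for ASCII text or when the re.ASCII flag is set; by default, str patterns in Python 3 also match the corresponding Unicode characters (for example, \d also matches digits from other scripts, and \w also matches non-ASCII letters):
- \d: Matches any digit; equivalent to [0-9] in ASCII mode;
- \D: Matches any non-digit; equivalent to [^0-9] in ASCII mode;
- \w: Matches any letter, digit, or underscore (i.e., “word character”); equivalent to [a-zA-Z0-9_] in ASCII mode;
- \W: Matches any non-word character; equivalent to [^a-zA-Z0-9_] in ASCII mode;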
- \s: Matches any whitespace character (space, tab \t, newline \n, etc.);
- \S: Matches any non-whitespace character;
- \b: Matches a word boundary (i.e., the junction between a word and a non-word character), for example, \bhello\b can match “hello”, but cannot match “helloworld” or “hello_world”.
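The following sketch illustrates the predefined sets and shows how the re.ASCII flag narrows \d and \w to the bracket forms. The sample strings are made up for illustration; \u0663 is the Arabic-Indic digit three, chosen only as an example of a non-ASCII digit.

```python
import re

sample = "id_7 and \u0663"  # ends with a non-ASCII digit

unicode_digits = re.findall(r"\d", sample)          # default: Unicode-aware
ascii_digits = re.findall(r"\d", sample, re.ASCII)  # restricted to [0-9]
ascii_words = re.findall(r"\w+", sample, re.ASCII)  # [a-zA-Z0-9_] runs only

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['7']
print(ascii_words)          # ['id_7', 'and']

# \b needs a word/non-word junction; the underscore counts as a word character.
whole_word = re.search(r"\bhello\b", "say hello there")
inside_word = re.search(r"\bhello\b", "hello_world")
print(whole_word is not None, inside_word is not None)  # True False
```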
II. Core Functions of the re Library: Full Process from Matching to Processing
The re library provides a series of functions that cover the entire process from “matching validation” to “result extraction” and “string modification”. Here are the 6 most commonly used core functions; mastering them will enable you to handle most string processing scenarios.
1. re.match(): Match from the Beginning of the String
Function: Matches only from the beginning of the string; if the beginning does not satisfy the regular rule, it returns None; if the match is successful, it returns a Match object (containing the matching result).
Syntax: re.match(pattern, string, flags=0)
- pattern: Regular expression rule (in string form);
- string: The target string to be matched;
- flags: Matching mode (optional, such as re.IGNORECASE to ignore case, re.MULTILINE for multiline mode).
Example
import re
# Match strings starting with "hello"
result = re.match(r"hello", "hello world")
print(result) # Output: <re.Match object; span=(0, 5), match='hello'>
print(result.group()) # Output: hello (extracting the matching result through group())
# Case where the beginning does not match
result2 = re.match(r"hello", "world hello")
print(result2) # Output: None
Note: re.match() only matches the beginning; even if the subsequent part of the string meets the rule, as long as the beginning does not match, it will return None. This is the core difference from re.search().
2. re.search(): Match at Any Position in the String
Function: Searches the entire string for the first substring that conforms to the regular rule, regardless of whether the position is at the beginning; if found, it returns a Match object, otherwise it returns None.
Syntax: re.search(pattern, string, flags=0) (parameters are the same as re.match()).
Example
import re
# Find "world" at any position in the string
result = re.search(r"world", "hello world")
print(result) # Output: <re.Match object; span=(6, 11), match='world'>
print(result.span()) # Output: (6, 11) (getting the start and end index of the matching substring through span())
# Case insensitive matching
result2 = re.search(r"HELLO", "hello world", re.IGNORECASE)
print(result2.group()) # Output: hello
Applicable scenario: When you need to “find whether a substring exists in a string”, prefer using re.search() rather than re.match().
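To make the difference concrete, the sketch below runs re.match() and re.search() on the same string. It also shows re.fullmatch(), a related standard-library function not covered above, as one possible choice when the entire input must fit the pattern (for example, in form validation).

```python
import re

text = "world hello"

at_start = re.match(r"hello", text)    # only tries index 0
anywhere = re.search(r"hello", text)   # scans the whole string

print(at_start)          # None
print(anywhere.span())   # (6, 11)

# match() accepts a matching prefix; fullmatch() requires the whole string to fit.
prefix_only = re.match(r"\d+", "123abc")
whole_input = re.fullmatch(r"\d+", "123abc")
print(prefix_only.group(), whole_input)  # 123 None
```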
3. re.findall(): Extract All Matching Substrings
Function: Searches the entire string for all substrings that conform to the regular rule, returning a list (if no matches, returns an empty list). Unlike match() and search(), findall() directly returns the matching results, not a Match object.
Syntax: re.findall(pattern, string, flags=0).
Example
import re
# Extract all numbers from the string
text = "User A: 100 points, User B: 95 points, User C: 88 points"
scores = re.findall(r"\d+", text)
print(scores) # Output: ['100', '95', '88'] (note that the return is a list of strings, type conversion may be needed)
# Extract matching results with grouping (returns a list of tuples)
users_scores = re.findall(r"User (\w): (\d+) points", text)
print(users_scores) # Output: [('A', '100'), ('B', '95'), ('C', '88')]
Key detail: If the regular expression contains groups (()), findall() will prioritize extracting the content within the groups: if there is 1 group, it returns a list of strings; if there are multiple groups, it returns a list of tuples (each tuple corresponds to a set of matching group results).
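The effect of groups on the return value can be seen side by side in this sketch. The date strings are invented sample data; (?:...) is the standard non-capturing group syntax, which groups without changing what findall() returns.

```python
import re

text = "2023-01-15, 2024-12-31"

no_group = re.findall(r"\d{4}-\d{2}-\d{2}", text)
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", text)
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
non_capturing = re.findall(r"(?:\d{4})-\d{2}-\d{2}", text)

print(no_group)       # ['2023-01-15', '2024-12-31']
print(one_group)      # ['2023', '2024']
print(many_groups)    # [('2023', '01', '15'), ('2024', '12', '31')]
print(non_capturing)  # ['2023-01-15', '2024-12-31']
```

If the full match is needed together with the groups, re.finditer() yields Match objects instead of plain strings.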
4. re.sub(): Replace Matching Substrings
Function: Searches the string for all substrings that conform to the regular rule and replaces them with specified content, returning the new string after replacement (the original string remains unchanged).
Syntax: re.sub(pattern, repl, string, count=0, flags=0)
- repl: The content after replacement (can be a string or a function);
- count: Maximum number of replacements (default 0, indicating all matches will be replaced).
Example: Basic Replacement
import re
# Replace "apple" with "banana" in the string
text = "I like to eat apples, apples are sweet"
new_text = re.sub(r"apple", "banana", text)
print(new_text) # Output: I like to eat bananas, bananas are sweet
# Only replace the first match
new_text2 = re.sub(r"apple", "banana", text, count=1)
print(new_text2) # Output: I like to eat bananas, apples are sweet
Example: Dynamic Replacement Using a Function
import re
# Add 1 to the numbers in the string (e.g., "5"→"6", "10"→"11")
def add_one(match):
    num = int(match.group()) # Get the matched number string and convert to integer
    return str(num + 1)
text = "Version 1: V5, Version 2: V10"
new_text = re.sub(r"\d+", add_one, text)
print(new_text) # Output: Version 2: V6, Version 3: V11 (every number is incremented, including those after "Version")
Applicable scenario: Data cleaning (e.g., removing special characters), format unification (e.g., standardizing date formats), content desensitization (e.g., hiding 4 digits in a phone number), etc.
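The three scenarios just mentioned might look like the sketch below. The function names (mask_phone, normalize_date, strip_special), the 11-digit phone format, and the YYYY/M/D date format are assumptions made for illustration, not rules defined by the re library.

```python
import re

def mask_phone(text):
    """Hide the middle 4 digits of an 11-digit number (assumed format)."""
    # \1 and \2 refer back to the first and second groups
    return re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)

def normalize_date(text):
    """Rewrite YYYY/M/D as zero-padded YYYY-MM-DD (assumed input format)."""
    def fix(m):
        return f"{m.group(1)}-{int(m.group(2)):02d}-{int(m.group(3)):02d}"
    return re.sub(r"(\d{4})/(\d{1,2})/(\d{1,2})", fix, text)

def strip_special(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

print(mask_phone("Call 13812345678 now"))   # Call 138****5678 now
print(normalize_date("Due 2024/3/5"))       # Due 2024-03-05
print(strip_special("Hello, world! #1"))    # Hello world 1
```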
5. re.split(): Split Strings by Matching Rules
Function: Uses substrings that conform to the regular rule as “delimiters” to split the target string into a list, returning the split list.
Syntax: re.split(pattern, string, maxsplit=0, flags=0)
- maxsplit: Maximum number of splits (default 0, indicating all possible positions will be split).
Example
import re
# Split by any whitespace character (space, tab, newline)
text = "hello world\tpython\njava"
result = re.split(r"\s+", text)
print(result) # Output: ['hello', 'world', 'python', 'java']
# Split by comma or semicolon, and only split once
text2 = "a,b;c,d"
result2 = re.split(r"[,;]", text2, maxsplit=1)
print(result2) # Output: ['a', 'b;c,d']
Advantage: Python’s built-in str.split() accepts only a single fixed delimiter string (or, with no argument, splits on runs of whitespace), whereas re.split() supports “flexible delimiter rules”, such as “splitting by comma or semicolon” or “splitting by a comma with optional spaces around it”.
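A short comparison, using invented sample strings. The last call relies on a documented re.split() behavior not mentioned above: when the pattern contains a capturing group, the matched delimiters are kept in the result list.

```python
import re

line = "a, b;c ,d"

fixed = line.split(",")                    # one fixed delimiter only
flexible = re.split(r"\s*[,;]\s*", line)   # comma or semicolon, spaces optional
print(fixed)      # ['a', ' b;c ', 'd']
print(flexible)   # ['a', 'b', 'c', 'd']

# A capturing group around the delimiter keeps the delimiters in the output.
with_delims = re.split(r"([,;])", "a,b;c")
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```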
6. re.compile(): Compile Regular Expressions to Improve Efficiency
Function: Compiles the regular expression rule (pattern) into a Pattern object, which can be reused to call match(), search(), findall(), etc.
Syntax: re.compile(pattern, flags=0).
Example
import re
# Compile regular expression (matching email)
email_pattern = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.[a-zA-Z]+")
# Reuse the compiled Pattern object
text1 = "My email is alice@example.com" # placeholder address
text2 = "Please contact support@example.org for help" # placeholder address
print(email_pattern.findall(text1)) # Output: ['alice@example.com']
print(email_pattern.findall(text2)) # Output: ['support@example.org']
Core value: When the same regular rule is used many times (e.g., processing 1000 pieces of data that all need to match emails), a compiled Pattern object lets you define the rule once, give it a name, and skip the per-call pattern lookup. The speed gain is usually modest rather than dramatic, because module-level functions such as re.search() already cache recently compiled patterns internally. If the regular rule is only used once, there is no need to compile.
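A typical reuse pattern is to compile once at module level and apply the object inside a loop. In this sketch the name EMAIL_RE and the records list are illustrative, and the addresses are placeholders on reserved example domains.

```python
import re

# Compiled once, reused for every record
EMAIL_RE = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.[a-zA-Z]+")

records = [
    "alice@example.com",
    "no email here",
    "contact: bob_1@example.org",
]

found = [EMAIL_RE.findall(record) for record in records]
print(found)  # [['alice@example.com'], [], ['bob_1@example.org']]

# Pattern objects offer the same methods as the module-level functions.
first_hit = EMAIL_RE.search(records[2])
print(first_hit.group())  # bob_1@example.org
```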
III. Match Object: The “Key” to Parsing Matching Results
When re.match() or re.search() matches successfully, a Match object is returned, which contains detailed information about the matching result. The commonly used methods of the Match object are as follows:
- group(num=0): Extracts the matching result. num=0 (default) indicates the entire matched substring; num=1,2,… indicates extracting the nth group (content within ()); if num does not exist, an IndexError is raised.
- groups(default=None): Returns all group matching results as a tuple; if a group does not match, it returns the default value.
- span(num=0): Returns the start and end index of the matching result (or specified group) in the original string, formatted as (start, end) (left closed, right open interval).
- start(num=0): Returns the starting index of the matching result (or specified group);
- end(num=0): Returns the ending index of the matching result (or specified group).
Example
import re
# Match phone number (group 1: carrier number segment, group 2: middle 4 digits, group 3: last 4 digits)
result = re.search(r"(\d{3})-(\d{4})-(\d{4})", "My phone number is 138-1234-5678")
print(result.group()) # Output: 138-1234-5678 (entire matching result)
print(result.group(1)) # Output: 138 (1st group)
print(result.group(2, 3)) # Output: ('1234', '5678') (2nd and 3rd groups)
print(result.span()) # Output: (19, 32) (index of the entire matching result)
print(result.start(2)) # Output: 23 (starting index of the 2nd group)
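The groups(default=...) behavior and the IndexError case are easiest to see with an optional group that does not take part in the match. A minimal sketch, assuming a made-up input in which the last 4 digits are missing:

```python
import re

# The third group is optional, so it may not participate in the match.
m = re.search(r"(\d{3})-(\d{4})(?:-(\d{4}))?", "ext 138-1234")

print(m.group())              # 138-1234
print(m.groups())             # ('138', '1234', None)
print(m.groups(default=""))   # ('138', '1234', '')
print(m.span(), m.span(1))    # (4, 12) (4, 7)
print(m.span(3))              # (-1, -1) for a group that did not match

try:
    m.group(4)                # the pattern defines only 3 groups
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```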
IV. Common Pitfalls and Tips for Avoiding Mistakes in the re Library
Handling Escape Characters
In regular expressions, \ is an escape character (e.g., \d represents a digit), while in Python strings, \ is also an escape character (e.g., \n represents a newline). To avoid errors caused by double escaping, it is recommended to use raw strings (prefix the string with r) to define regular rules, such as r"\d+" (instead of the doubly escaped "\\d+").
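The \b escape is a good illustration of why raw strings matter: in a normal Python string it is a backspace character, so the regex engine never sees a word boundary. A small sketch with made-up sample text:

```python
import re

# Both spellings produce the same three-character pattern: \ d +
same_pattern = "\\d+" == r"\d+"
print(same_pattern)  # True

# Without r, "\b" is a backspace (\x08), not the word-boundary escape.
no_raw = re.search("\bword\b", "a word here")
with_raw = re.search(r"\bword\b", "a word here")
print(no_raw)            # None
print(with_raw.group())  # word

# Matching one literal backslash takes two backslashes in a raw pattern.
backslashes = re.findall(r"\\", r"C:\temp\new")
print(len(backslashes))  # 2
```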
Greedy Matching vs. Non-Greedy Matching
*, +, ?, {m,n} default to “greedy matching” (i.e., matching the longest possible substring). If you need “non-greedy matching” (i.e., matching the shortest possible substring), add ? directly after the quantifier: *?, +?, ??, {m,n}?.
For example
import re
text = "aabbaa"
print(re.findall(r"a.*a", text)) # Output: ['aabbaa'] (greedy matching, from the first a to the last a)
print(re.findall(r"a.*?a", text)) # Output: ['aa', 'aa'] (non-greedy matching, from the first a to the nearest a)
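A more practical illustration is extracting tags from markup-like text, where greedy matching swallows everything between the first < and the last >. The sample strings below are invented for this sketch.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy_tags = re.findall(r"<.+>", html)   # runs to the last >
lazy_tags = re.findall(r"<.+?>", html)    # stops at the nearest >
print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']

# The same ? suffix works on {m,n}.
greedy_digits = re.findall(r"\d{2,4}", "123456")
lazy_digits = re.findall(r"\d{2,4}?", "123456")
print(greedy_digits)  # ['1234', '56']
print(lazy_digits)    # ['12', '34', '56']
```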
Using Multiline Mode
By default, ^ only matches the beginning of the entire string, and $ only matches the end of the entire string. If you need to match “the beginning/end of each line”, you need to enable re.MULTILINE (or re.M) mode.
For example
import re
text = "line1: hello\nline2: world"
# Without enabling multiline mode: only matches "line1" at the beginning of the entire string
print(re.findall(r"^line\d+", text)) # Output: ['line1']
# Enabling multiline mode: matches "line1" and "line2" at the beginning of each line
print(re.findall(r"^line\d+", text, re.MULTILINE)) # Output: ['line1', 'line2']
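The same flag affects $ as well. The sketch below reuses the two-line sample text and also shows one point that is easy to miss: re.match() still only tries the very start of the string, even with re.MULTILINE, so re.search() with ^ is needed to find a later line.

```python
import re

text = "line1: hello\nline2: world"

last_word_default = re.findall(r"\w+$", text)
last_word_multiline = re.findall(r"\w+$", text, re.M)
print(last_word_default)    # ['world']
print(last_word_multiline)  # ['hello', 'world']

# re.match() ignores later line starts, even in multiline mode.
match_line2 = re.match(r"line2", text, re.M)
search_line2 = re.search(r"^line2", text, re.M)
print(match_line2)           # None
print(search_line2.span())   # (13, 18)
```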
That concludes today’s content; I hope it helps you!