Strings are among the most commonly used data types in Python, yet many programmers rely on only a small subset of the available string operations. This article introduces 20 high-frequency string techniques. Mastering them can make everyday code shorter, safer, and often faster.
1. Basic String Search and Replace (5 Operations)
1. find() vs index() — Finding the Position of a Substring
These two methods appear to have the same functionality, but there are key differences.
text = "Python is awesome, Python is powerful"
# find(): returns index if found, -1 if not found (does not raise an error)
pos1 = text.find("Python")
print(pos1) # Output: 0
pos2 = text.find("Java")
print(pos2) # Output: -1 (not found, returns -1)
# index(): returns index if found, raises an error if not found
pos3 = text.index("Python")
print(pos3) # Output: 0
pos4 = text.index("Java")
# Raises: ValueError: substring not found
# Find the position of the second occurrence
second_pos = text.find("Python", 1) # Start searching from position 1
print(second_pos) # Output: 19
Key Difference:
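- find() returns -1 if not found, so no exception handling is needed
- index() raises ValueError if not found, so the call must be wrapped in exception handling
Best Practice: Use find() when a missing substring is an expected case, and compare the result against -1 explicitly (-1 is truthy, so "if text.find(x):" is a bug). Use index() when a missing substring should be treated as an error. If you only need to know whether the substring is present, use the in operator.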
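The "second occurrence" trick above generalizes to finding every occurrence. The following is a minimal sketch; the helper name find_all is an illustrative assumption, not part of the standard library.

```python
def find_all(text, sub):
    """Yield the start index of every non-overlapping occurrence of sub."""
    if not sub:
        return  # an empty substring would match at every position
    start = text.find(sub)
    while start != -1:
        yield start
        # Continue searching right after the current match
        start = text.find(sub, start + len(sub))

text = "Python is awesome, Python is powerful"
print(list(find_all(text, "Python")))  # [0, 19]
print("Java" in text)                  # False
```

Because find() returns -1 instead of raising, the loop needs no try/except.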
2. replace() — Replacing Substrings
text = "hello world, hello python"
# Basic replacement: replace all matches
result1 = text.replace("hello", "hi")
print(result1)
# Output: hi world, hi python
# Replace a specified number of times: only replace the first n occurrences
result2 = text.replace("hello", "hi", 1) # Only replace the 1st occurrence
print(result2)
# Output: hi world, hello python
# Case-sensitive
text2 = "Hello world, hello python"
result3 = text2.replace("hello", "hi")
print(result3)
# Output: Hello world, hi python (the first H is not replaced)
Performance Pitfall:
# ❌ Slower when many different characters are replaced: one full pass per replace() call
text = "abc" * 1000000
for old, new in (("a", "x"), ("b", "y"), ("c", "z")):
    text = text.replace(old, new)
# ✅ Single pass: map all characters at once with translate()
text = "abc" * 1000000
result = text.translate(str.maketrans("abc", "xyz"))
# For a single replacement, replace() is already one pass and is the simpler choice
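Since replace() is case-sensitive, a case-insensitive replacement needs the re module. This is a small sketch; the function name replace_ignore_case is an assumption made for illustration.

```python
import re

def replace_ignore_case(text, old, new):
    """Replace every occurrence of old with new, ignoring case."""
    # re.escape() makes sure old is treated as literal text, not as a pattern;
    # the lambda keeps backslashes in new from being read as group references
    return re.sub(re.escape(old), lambda m: new, text, flags=re.IGNORECASE)

print(replace_ignore_case("Hello world, hello python", "hello", "hi"))
# hi world, hi python
```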
3. count() — Counting the Occurrences of a Substring
text = "the quick brown fox jumps over the lazy dog"
# Basic counting
count1 = text.count("the")
print(count1) # Output: 2
# Count in a specified range (from index 5 up to, but not including, index 40)
count2 = text.count("the", 5, 40)
print(count2) # Output: 1
# Count the frequency of different characters
stats = {}
for char in text:
    if char != ' ':
        stats[char] = stats.get(char, 0) + 1
print(stats)
# Output: {'t': 2, 'h': 2, 'e': 3, ...}
# More efficient way: use Counter
from collections import Counter
char_count = Counter(text.replace(" ", ""))
print(char_count.most_common(3)) # Output the 3 most common characters
Production Use Case:
# Count the occurrences of keywords in logs
log_text = """
ERROR: Database connection failed
WARNING: Memory usage high
ERROR: Timeout error
INFO: Server restarted
ERROR: Authentication failed
"""
error_count = log_text.count("ERROR")
warning_count = log_text.count("WARNING")
print(f"Error count: {error_count}, Warning count: {warning_count}")
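Note that count("ERROR") counts the substring anywhere in the text, including inside a message. If only the level at the start of each line should be counted, one possible approach (a sketch that assumes the "LEVEL: message" layout shown above) is to split each line once and tally the prefixes:

```python
from collections import Counter

log_text = """
ERROR: Database connection failed
WARNING: Memory usage high
ERROR: Timeout error
INFO: Server restarted
ERROR: Authentication failed
"""

# Take the text before the first ":" on each non-empty line as the level
levels = Counter(
    line.partition(":")[0].strip()
    for line in log_text.splitlines()
    if line.strip()
)
print(levels["ERROR"], levels["WARNING"], levels["INFO"])  # 3 1 1
```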
4. startswith() and endswith() — Prefix and Suffix Checking
filename = "document.pdf"
url = "https://www.example.com"
# Check suffix
if filename.endswith((".pdf", ".doc", ".docx")):
    print("This is a document file")
# Check prefix
if url.startswith(("http://", "https://")):
    print("This is a URL")
# Practical application: file filtering
import os
def get_python_files(directory):
    """Get all Python files in the directory"""
    python_files = []
    for file in os.listdir(directory):
        if file.endswith('.py'):
            python_files.append(file)
    return python_files
# More Pythonic way
def get_python_files_v2(directory):
    """More concise version using a list comprehension"""
    return [f for f in os.listdir(directory) if f.endswith('.py')]
Performance Comparison:
# ❌ Verbose: one method call per suffix
if filename.endswith('.pdf') or filename.endswith('.doc'):
    pass
# ✅ Good practice: a single call with a tuple of suffixes (shorter, and avoids repeated calls)
if filename.endswith(('.pdf', '.doc')):
    pass
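Both startswith() and endswith() are case-sensitive, so "REPORT.PDF" would not pass the checks above. A minimal sketch of a case-insensitive variant follows; is_document is a hypothetical helper name.

```python
def is_document(filename):
    """Return True if filename has a document extension, ignoring case."""
    return filename.lower().endswith((".pdf", ".doc", ".docx"))

print(is_document("REPORT.PDF"))  # True
print(is_document("notes.txt"))   # False
```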
5. strip() / lstrip() / rstrip() — Removing Whitespace
text = " hello world \n"
# strip(): removes whitespace from both ends
result1 = text.strip()
print(f"'{result1}'") # Output: 'hello world'
# lstrip(): removes whitespace from the left end only
result2 = text.lstrip()
print(f"'{result2}'") # Output: 'hello world \n'
# rstrip(): removes whitespace from the right end only
result3 = text.rstrip()
print(f"'{result3}'") # Output: ' hello world'
# ⚠️ Key Pitfall: does not only remove one space!
text2 = "---hello---"
print(text2.strip("-")) # Output: hello (all consecutive - are removed)
# Custom characters to remove
text3 = "xxxhelloyyy"
print(text3.strip("xy")) # Output: hello
print(text3.strip("xyhel")) # Output: o (removes any character contained)
# Practical application: cleaning CSV data
csv_line = " 张三 , 25 , 北京 \n"
fields = [f.strip() for f in csv_line.split(',')]
print(fields)
# Output: ['张三', '25', '北京']
# Handling user input
user_input = input("Please enter your name:").strip()
# Automatically remove excess whitespace to avoid data inconsistency
Common Errors:
# ❌ Error: strip removes a set of characters, not the string itself
text = "hello"
print(text.strip("lo")) # Output: he (every trailing 'l' and 'o' is removed, not just the suffix "lo")
# ✅ Correct practice: to remove an exact suffix, check for it and slice it off
if text.endswith("lo"):
    text = text[:-2]
print(text) # Output: hel
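The same pitfall applies to prefixes with lstrip(). Python 3.9 added str.removeprefix() and str.removesuffix() for exactly this purpose; the sketch below shows them next to a hand-written fallback (remove_prefix is an illustrative name) for older versions.

```python
import sys

def remove_prefix(text, prefix):
    """Remove prefix from text once, if present (works on any Python 3)."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text

print(remove_prefix("lollipop", "lo"))  # llipop
print("lollipop".lstrip("lo"))          # ipop (strips the character set {'l', 'o'})

if sys.version_info >= (3, 9):
    # Built-in equivalents, available since Python 3.9
    print("lollipop".removeprefix("lo"))  # llipop
    print("hello".removesuffix("lo"))     # hel
```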
2. Advanced Splitting and Joining (4 Operations)
6. split() and rsplit() — The Art of Splitting Strings
# Basic splitting
text = "apple,banana,cherry,date"
parts1 = text.split(",")
print(parts1)
# Output: ['apple', 'banana', 'cherry', 'date']
# Limit the number of splits
parts2 = text.split(",", 2) # Only split 2 times
print(parts2)
# Output: ['apple', 'banana', 'cherry,date']
# rsplit(): splits from the right
parts3 = text.rsplit(",", 2) # Split 2 times from the right
print(parts3)
# Output: ['apple,banana', 'cherry', 'date']
# Split with multiple delimiters (using regex)
import re
text2 = "apple, banana; cherry: date"
parts4 = re.split(r'[,;:]', text2)
print(parts4)
# Output: ['apple', ' banana', ' cherry', ' date']
# Practical application 1: parsing URLs
url = "https://www.example.com/path/to/resource?key=value&foo=bar"
protocol, rest = url.split("://", 1)
domain, rest = rest.split("/", 1)
path, query = rest.split("?", 1)
print(f"Protocol: {protocol}, Domain: {domain}, Path: {path}, Query: {query}")
# Practical application 2: parsing CSV lines
csv_line = 'John,"Smith, Jr.",30,New York'
# Simple split will fail, need to use csv module
import csv
reader = csv.reader([csv_line])
fields = next(reader)
print(fields)
# Output: ['John', 'Smith, Jr.', '30', 'New York']
Performance Comparison:
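# ❌ Less efficient: splits the entire string even though only the 3rd element is needed
text = "a:b:c:d:e"
parts = text.split(":")
result = parts[2] # Get the 3rd element
# ✅ More efficient on long strings: stop splitting once the needed element is available
result = text.split(":", 3)[2]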
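One detail worth knowing when splitting on whitespace: split() with no argument and split(" ") behave differently. The short example below illustrates this with a made-up input line.

```python
line = "  alpha   beta  gamma "

# No argument: split on runs of any whitespace and drop empty strings
print(line.split())     # ['alpha', 'beta', 'gamma']
# Explicit " ": every single space is a delimiter, so empty strings appear
print(line.split(" "))  # ['', '', 'alpha', '', '', 'beta', '', 'gamma', '']
```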
7. join() — Joining Strings
# Basic joining
words = ["hello", "world", "python"]
result1 = " ".join(words)
print(result1) # Output: hello world python
# Joining numbers (need to convert)
numbers = [1, 2, 3, 4, 5]
result2 = "-".join(str(n) for n in numbers)
print(result2) # Output: 1-2-3-4-5
# Practical application 1: generating SQL IN statement
ids = [1, 2, 3, 4, 5]
sql = f"SELECT * FROM users WHERE id IN ({','.join(map(str, ids))})"
print(sql)
# Practical application 2: generating URL path
path_parts = ["api", "v1", "users", "123"]
path = "/" + "/".join(path_parts)
print(path) # Output: /api/v1/users/123
# Practical application 3: generating CSV line
data = ["张三", 25, "北京", "zhangsan@example.com"]
csv_line = ",".join(map(str, data))
print(csv_line)
# ⚠️ Performance Pitfall: avoid building long strings with + in a loop
# ❌ Bad practice (each + may copy the string built so far, up to O(n²) overall)
result = ""
for word in words:
    result = result + " " + word
# ✅ Good practice (one-time join, O(n) complexity)
result = " ".join(words)
Large Scale Data Comparison:
import time
# Generate 10,000 strings
data = ["word"] * 10000
# Using + to join
start = time.perf_counter()
result = ""
for word in data:
    result += word + ","
time1 = time.perf_counter() - start
# Using join
start = time.perf_counter()
result = ",".join(data)
time2 = time.perf_counter() - start
print(f"+ method: {time1:.4f}s, join method: {time2:.4f}s")
# join is typically faster; the exact ratio depends on the Python version and the data size
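A caution about the SQL IN example above: joining values directly into a query is only safe when the values are trusted integers. For anything else, a common approach is to use join() for the placeholders only and let the database driver bind the values. The sketch below uses the standard-library sqlite3 module with an in-memory database; the users table and its contents are assumptions made purely for illustration.

```python
import sqlite3

ids = [1, 2, 3, 4, 5]

# In-memory database with a hypothetical users table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 8)],
)

# join() builds only the placeholder list "?,?,?,?,?"; the values are bound separately
placeholders = ",".join("?" for _ in ids)
sql = f"SELECT id FROM users WHERE id IN ({placeholders}) ORDER BY id"
rows = conn.execute(sql, ids).fetchall()
print(sql)
print([row[0] for row in rows])  # [1, 2, 3, 4, 5]
conn.close()
```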
8. partition() and rpartition() — Three-Way Split
# partition(): splits into three parts at the first delimiter
text = "name=John;age=30;city=NYC"
head, sep, tail = text.partition(";")
print(f"Before: {head}, Separator: {sep}, After: {tail}")
# Output: Before: name=John, Separator: ;, After: age=30;city=NYC
# Practical application: parsing key=value format
def parse_key_value(text):
    key, sep, value = text.partition("=")
    return key.strip(), (value.strip() if sep else None)
result = parse_key_value("timeout = 3000")
print(result) # Output: ('timeout', '3000')
# rpartition(): splits from the right
head, sep, tail = text.rpartition(";")
print(f"Before: {head}, Separator: {sep}, After: {tail}")
# Output: Before: name=John;age=30, Separator: ;, After: city=NYC
# Practical application: getting file extension
def get_file_info(filename):
    name, sep, ext = filename.rpartition(".")
    if not sep: # No dot: rpartition() returns ('', '', filename)
        return filename, ""
    return name, ext
print(get_file_info("document.pdf")) # Output: ('document', 'pdf')
print(get_file_info("archive.tar.gz")) # Output: ('archive.tar', 'gz')
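Combining split() and partition() gives a compact parser for the "name=John;age=30;city=NYC" format used above. This is a sketch under the assumption that keys and values never contain the separators themselves; parse_pairs is an illustrative name.

```python
def parse_pairs(text, item_sep=";", kv_sep="="):
    """Parse 'k1=v1;k2=v2' into a dict; items without kv_sep map to None."""
    result = {}
    for item in text.split(item_sep):
        if not item.strip():
            continue  # skip empty items such as a trailing separator
        key, sep, value = item.partition(kv_sep)
        result[key.strip()] = value.strip() if sep else None
    return result

print(parse_pairs("name=John;age=30;city=NYC"))
# {'name': 'John', 'age': '30', 'city': 'NYC'}
```

Using partition() rather than split("=") keeps any later "=" characters inside the value intact.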
3. Formatting and Conversion (5 Operations)
9. format() and f-string — The Evolution of String Formatting
name = "张三"
age = 25
salary = 15000.5
# Method 1: % formatting (legacy style, still supported)
result1 = "Name: %s, Age: %d, Salary: %.2f" % (name, age, salary)
# Method 2: format() method (good compatibility)
result2 = "Name: {}, Age: {}, Salary: {:.2f}".format(name, age, salary)
# Method 3: f-string (Python 3.6+, recommended)
result3 = f"Name: {name}, Age: {age}, Salary: {salary:.2f}"
print(result3)
# Output: Name: 张三, Age: 25, Salary: 15000.50
# Powerful feature of f-string: can directly execute expressions
print(f"Next year's salary: {salary * 1.1:.2f}") # Output: Next year's salary: 16500.55
# Alignment and padding
numbers = [1, 12, 123, 1234]
for num in numbers:
    print(f"Number: {num:>5}")
# Output:
# Number:     1
# Number:    12
# Number:   123
# Number:  1234
# Base conversion
num = 255
print(f"Decimal: {num}, Hexadecimal: {num:x}, Binary: {num:b}")
# Output: Decimal: 255, Hexadecimal: ff, Binary: 11111111
# Percentage format
rate = 0.8567
print(f"Completion: {rate:.2%}") # Output: Completion: 85.67%
# Number separator (Python 3.6+)
large_num = 1234567890
print(f"Large number: {large_num:,}") # Output: Large number: 1,234,567,890
Performance Comparison:
import time
name = "Python"
age = 10
# Compare the performance of three methods
iterations = 1000000
# % formatting
start = time.perf_counter()
for _ in range(iterations):
    result = "%s is %d years old" % (name, age)
time1 = time.perf_counter() - start
# format() method
start = time.perf_counter()
for _ in range(iterations):
    result = "{} is {} years old".format(name, age)
time2 = time.perf_counter() - start
# f-string
start = time.perf_counter()
for _ in range(iterations):
    result = f"{name} is {age} years old"
time3 = time.perf_counter() - start
print(f"% formatting: {time1:.3f}s")
print(f"format(): {time2:.3f}s")
print(f"f-string: {time3:.3f}s")
# Typical result on CPython: f-string is the fastest and format() the slowest; exact numbers vary by version
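Hand-rolled timing loops are sensitive to noise. The standard-library timeit module is designed for this kind of micro-benchmark; the following sketch repeats each measurement and keeps the best run. The lambdas add a small, equal call overhead to each candidate, so treat the numbers as relative rather than absolute.

```python
import timeit

name = "Python"
age = 10

candidates = {
    "% formatting": lambda: "%s is %d years old" % (name, age),
    "format()": lambda: "{} is {} years old".format(name, age),
    "f-string": lambda: f"{name} is {age} years old",
}

timings = {}
for label, func in candidates.items():
    # repeat() returns one total per run; the minimum is the least noisy
    timings[label] = min(timeit.repeat(func, number=50_000, repeat=3))

for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{label:<14}{seconds:.4f}s")
```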
10. upper() / lower() / title() / swapcase() — Case Conversion
text = "Hello World Python"
# All uppercase
print(text.upper()) # Output: HELLO WORLD PYTHON
# All lowercase
print(text.lower()) # Output: hello world python
# Title case (first letter capitalized)
print(text.title()) # Output: Hello World Python
# Swap case
print(text.swapcase()) # Output: hELLO wORLD pYTHON
# capitalize(): first letter capitalized, others lowercase
print(text.capitalize()) # Output: Hello world python
# Practical application 1: normalizing user input
user_email = input("Please enter your email:").strip().lower()
# Prevent issues caused by case differences
# Practical application 2: generating URL slug
def slugify(text):
    """Convert text to a URL-safe format"""
    return text.lower().replace(" ", "-")
print(slugify("Hello World Python")) # Output: hello-world-python
# Practical application 3: checking password complexity
def check_password_strength(password):
    has_upper = any(c.isupper() for c in password)
    has_lower = any(c.islower() for c in password)
    has_digit = any(c.isdigit() for c in password)
    return len(password) >= 8 and has_upper and has_lower and has_digit
print(check_password_strength("Secure123")) # Output: True
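The slugify() above only handles spaces, so punctuation and repeated spaces would leak into the URL. One possible refinement is sketched below; it assumes ASCII-only slugs are acceptable, which will not suit every site.

```python
import re

def slugify(text):
    """Lowercase text and join its alphanumeric runs with hyphens."""
    # Every run of characters other than a-z and 0-9 becomes a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")

print(slugify("Hello, World!  Python 3"))  # hello-world-python-3
```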
11. isdigit() / isalpha() / isalnum() — Character Validation
# Check if all are digits
print("12345".isdigit()) # Output: True
print("123a5".isdigit()) # Output: False
# Check if all are letters
print("hello".isalpha()) # Output: True
print("hello123".isalpha()) # Output: False
# Check if all are letters or digits
print("hello123".isalnum()) # Output: True
print("hello-123".isalnum()) # Output: False
# Check if all are spaces
print(" ".isspace()) # Output: True
# Check if valid identifier (variable name)
print("var_name".isidentifier()) # Output: True
print("123var".isidentifier()) # Output: False
# Check if all uppercase/lowercase
print("HELLO".isupper()) # Output: True
print("hello".islower()) # Output: True
# Practical application 1: validating user input
def validate_username(username):
    if len(username) < 3 or len(username) > 20:
        return False, "Username length must be between 3-20 characters"
    if not username[0].isalpha():
        return False, "Username must start with a letter"
    if not username.replace("_", "").isalnum():
        return False, "Username can only contain letters, numbers, and underscores"
    return True, "Username is valid"
print(validate_username("user_123")) # Output: (True, 'Username is valid')
print(validate_username("123user")) # Output: (False, 'Username must start with a letter')
# Practical application 2: data type recognition
def detect_type(value_str):
    """Recognize the data type represented by the string"""
    if value_str.isdigit():
        return "Integer"
    elif value_str.isalpha():
        return "String"
    elif value_str.isalnum():
        return "Mixed type"
    else:
        return "Other"
print(detect_type("123")) # Output: Integer
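isdigit() has two limits worth knowing: it rejects signs and decimal points, and it accepts some characters, such as superscript digits, that int() cannot parse. When the goal is "can this be converted to a number", attempting the conversion is usually more reliable. A minimal sketch, with detect_number_type as an illustrative name:

```python
def detect_number_type(value_str):
    """Classify a string as 'Integer', 'Float', or 'String' by trying to convert it."""
    try:
        int(value_str)
        return "Integer"
    except ValueError:
        pass
    try:
        float(value_str)
        return "Float"
    except ValueError:
        return "String"

print("-5".isdigit(), detect_number_type("-5"))      # False Integer
print("3.14".isdigit(), detect_number_type("3.14"))  # False Float
print("²".isdigit(), detect_number_type("²"))        # True String
```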
12. zfill() and center() — Padding and Centering
# zfill(): pads with 0 on the left
num_str = "123"
print(num_str.zfill(5)) # Output: 00123
# Practical application 1: generating order number
def generate_order_id(order_num):
    return f"ORD{order_num:0>6d}"
print(generate_order_id(123)) # Output: ORD000123
# center(): centers (pads on both sides)
text = "Python"
print(text.center(15)) # Output: "     Python    "
print(text.center(15, "*")) # Output: "*****Python****"
# ljust() and rjust(): left and right align
print(text.ljust(15, "-")) # Output: Python---------
print(text.rjust(15, "-")) # Output: ---------Python
# Practical application 2: printing tables
def print_table(rows):
    """Print aligned table"""
    for row in rows:
        print("|".join(cell.center(15) for cell in row))
rows = [
    ["Name", "Age", "City"],
    ["张三", "25", "北京"],
    ["李四", "30", "上海"],
]
print_table(rows)
4. Regular Expressions and Advanced Operations (6 Operations)
13. Basics of Regular Expressions — match() / search() / findall()
import re
# match(): matches from the beginning
text = "Python 3.9"
if re.match(r"Python", text):
    print("Match successful")
# search(): searches throughout the text
if re.search(r"\d+\.\d+", text):
    print("Version number found")
# findall(): finds all matches
emails = "contact us at support@example.com or sales@example.org"
found = re.findall(r"\b[\w.-]+@[\w.-]+\.\w+\b", emails)
print(found)
# Output: ['support@example.com', 'sales@example.org']
# Extracting grouped content
text = "Price: $99.99, Tax: $7.50"
matches = re.findall(r"\$(\d+\.\d+)", text)
print(matches)
# Output: ['99.99', '7.50']
# Practical application 1: extracting phone numbers
def extract_phone_numbers(text):
    """Extract phone numbers from text"""
    pattern = r"\b(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})\b"
    return re.findall(pattern, text)
text = "Call me at 123-456-7890 or (098) 765 4321"
print(extract_phone_numbers(text))
# Practical application 2: extracting URLs
def extract_urls(text):
    """Extract all URLs from text"""
    pattern = r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
    return re.findall(pattern, text)
text = "Visit https://www.example.com or http://test.org for more info"
print(extract_urls(text))
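findall() returns only strings (or tuples of groups). When the position of each match or named groups are needed, finditer() yields full match objects instead. The sample text and pattern below are made up for illustration.

```python
import re

text = "Python 3.9 was released in 2020, Python 3.12 in 2023"
version_pattern = re.compile(r"Python (?P<major>\d+)\.(?P<minor>\d+)")

results = []
for match in version_pattern.finditer(text):
    # match["name"] reads a named group; start() gives the match position
    major, minor = int(match["major"]), int(match["minor"])
    results.append((match.start(), match.group(0), (major, minor)))
    print(match.start(), match.group(0), (major, minor))
# 0 Python 3.9 (3, 9)
# 33 Python 3.12 (3, 12)
```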
14. sub() and subn() — Regular Replacement
import re
# sub(): replaces all matches
text = "The price is $99.99 and tax is $7.50"
result = re.sub(r"\$(\d+\.\d+)", r"¥\1*7", text)
print(result)
# Output: The price is ¥99.99*7 and tax is ¥7.50*7
# subn(): replaces and returns the number of replacements
text = "apple, apple, apple"
result, count = re.subn(r"apple", "orange", text)
print(f"Replaced {count} times")
print(result)
# Using a function for dynamic replacement
def replace_func(match):
    """Increase price by 10%"""
    price = float(match.group(1))
    return f"${price * 1.1:.2f}"
text = "Item 1: $100, Item 2: $50"
result = re.sub(r"\$(\d+(?:\.\d+)?)", replace_func, text)
print(result)
# Output: Item 1: $110.00, Item 2: $55.00
# Practical application 1: date format conversion
def convert_date_format(text):
    """Convert 2024-01-15 to 15/01/2024"""
    pattern = r"(\d{4})-(\d{2})-(\d{2})"
    return re.sub(pattern, r"\3/\2/\1", text)
print(convert_date_format("Today is 2024-01-15"))
# Output: Today is 15/01/2024
# Practical application 2: removing HTML tags
def remove_html_tags(text):
    """Extract plain text from HTML"""
    return re.sub(r"<[^>]+>", "", text)
html = "<p>Hello <b>World</b></p>"
print(remove_html_tags(html))
# Output: Hello World
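The regex approach to removing tags works for simple fragments but breaks when an attribute value contains ">" and leaves entities such as &amp;amp; undecoded. For less predictable input, the standard-library html.parser module is one alternative. The sketch below is a minimal text extractor, not a full sanitizer; TextExtractor and strip_tags are illustrative names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; arrives as &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags("<p>Hello <b>World</b></p>"))             # Hello World
print(strip_tags('<a title="a > b">Tom &amp; Jerry</a>'))  # Tom & Jerry
```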
15. compile() — Precompiling Regular Expressions (Performance Optimization)
import re
# ❌ Slower in a hot loop: every re.match() call looks the pattern up in re's internal cache
def validate_emails_slow(emails):
    return [re.match(r"^[\w\.-]+@[\w\.-]+\.\w+$", e) is not None for e in emails]
# ✅ Good practice (compile once, reuse the pattern object)
email_pattern = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")
def validate_emails_fast(emails):
    return [email_pattern.match(e) is not None for e in emails]
# Performance comparison
import time
emails = ["user@example.com"] * 100000
start = time.perf_counter()
validate_emails_slow(emails)
time1 = time.perf_counter() - start
start = time.perf_counter()
validate_emails_fast(emails)
time2 = time.perf_counter() - start
print(f"Uncompiled: {time1:.4f}s, Compiled: {time2:.4f}s")
# The precompiled version is usually somewhat faster because it skips the cache lookup; measure on your own workload
# Practical application: creating a validator class
class Validator:
    """Validator using precompiled regular expressions"""
    EMAIL_PATTERN = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")
    PHONE_PATTERN = re.compile(r"^\d{10,11}$")
    URL_PATTERN = re.compile(r"^https?://")
    @classmethod
    def is_valid_email(cls, email):
        return cls.EMAIL_PATTERN.match(email) is not None
    @classmethod
    def is_valid_phone(cls, phone):
        return cls.PHONE_PATTERN.match(phone) is not None
    @classmethod
    def is_valid_url(cls, url):
        return cls.URL_PATTERN.match(url) is not None
print(Validator.is_valid_email("user@example.com")) # True
print(Validator.is_valid_phone("13800138000")) # True
print(Validator.is_valid_url("https://example.com")) # True
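A subtle point about the patterns above: "$" also matches just before a trailing newline, so a value read from a file with its newline still attached passes match() with "^...$". When the whole string must match, fullmatch() is stricter. A short demonstration, using a sample phone number:

```python
import re

PHONE_PATTERN = re.compile(r"\d{10,11}")

candidate = "13800138000\n"  # e.g. a line read from a file, newline not stripped

# True: $ also matches before a final newline
print(re.match(r"^\d{10,11}$", candidate) is not None)
# False: the newline is not a digit
print(PHONE_PATTERN.fullmatch(candidate) is not None)
# True once the input is stripped
print(PHONE_PATTERN.fullmatch(candidate.strip()) is not None)
```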
16. translate() — Efficient Character Replacement
# Create a translation table
translation_table = str.maketrans("aeiou", "12345")
text = "hello world"
result = text.translate(translation_table)
print(result)
# Output: h2ll4 w4rld
# Remove specified characters
delete_table = str.maketrans("", "", "aeiou")
text = "hello world"
result = text.translate(delete_table)
print(result)
# Output: hll wrld
# Practical application 1: removing punctuation
import string
text = "Hello, World! How are you?"
remove_punctuation = str.maketrans("", "", string.punctuation)
result = text.translate(remove_punctuation)
print(result)
# Output: Hello World How are you
# Practical application 2: numbers to Chinese
chinese_map = str.maketrans("0123456789", "零一二三四五六七八九")
text = "My phone is 13800138000"
result = text.translate(chinese_map)
print(result)
# Output: My phone is 一三八零零一三八零零零
# Performance comparison: translate vs replace
import time
text = "hello world" * 10000
iterations = 10000
# Method 1: using replace
start = time.perf_counter()
for _ in range(iterations):
    result = text.replace("o", "0").replace("e", "3")
time1 = time.perf_counter() - start
# Method 2: using translate
trans_table = str.maketrans("oe", "03")
start = time.perf_counter()
for _ in range(iterations):
    result = text.translate(trans_table)
time2 = time.perf_counter() - start
print(f"replace method: {time1:.4f}s, translate method: {time2:.4f}s")
# Relative speed depends on the input; translate tends to pay off as the number of characters being mapped grows
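str.maketrans() also accepts a single dictionary, which allows one character to map to a longer string or to None (deletion). The table below is an illustrative example only; for real HTML escaping the standard library already provides html.escape().

```python
escape_table = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "\t": None,  # None deletes the character
})

escaped = "a<b & c>d\t!".translate(escape_table)
print(escaped)  # a&lt;b &amp; c&gt;d!
```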
17. expandtabs() — Handling Tabs
# Convert tabs to spaces
text = "name\tage\tcity\nJohn\t25\tNYC"
print(text.expandtabs(15))
# Output aligned table
# Practical application: handling indentation in log files
log_text = "Error:\t\tConnection failed\nWarning:\t\tMemory high"
formatted = log_text.expandtabs(20)
print(formatted)
# Use a custom tab size (tab stops every 10 columns)
text = "Line1\tColumn1\nLine2\tColumn2"
print(text.expandtabs(10))
18. encode() and decode() — Character Encoding Conversion
# Encoding: string → bytes
text = "Hello 世界 🌍"
# Encode to UTF-8
encoded_utf8 = text.encode("utf-8")
print(encoded_utf8)
# Output: b'Hello \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x8c\x8d'
# Encode to GB2312 (Simplified Chinese)
encoded_gb = text.encode("gb2312", errors="ignore")
print(encoded_gb)
# Decoding: bytes → string
decoded = encoded_utf8.decode("utf-8")
print(decoded)
# Output: Hello 世界 🌍
# Handling encoding errors
text = "测试"
try:
    # Attempt to encode with ASCII (will fail)
    encoded = text.encode("ascii")
except UnicodeEncodeError as e:
    print(f"Encoding error: {e}")
# Using error handling strategies
# 'strict': raises an error for unencodable characters (default)
# 'ignore': ignores unencodable characters
# 'replace': replaces unencodable characters with ?
# 'xmlcharrefreplace': replaces with XML character references
text = "Hello 世界"
print(text.encode("ascii", errors="ignore"))
# Output: b'Hello '
print(text.encode("ascii", errors="replace"))
# Output: b'Hello ??' (one ? per unencodable character)
print(text.encode("ascii", errors="xmlcharrefreplace"))
# Output: b'Hello &#19990;&#30028;'
# Practical application 1: handling file encoding issues
def safe_read_file(filepath):
    """Read a text file, trying several candidate encodings in order"""
    # utf-8 is tried first because it is strict; gbk is a superset of gb2312, so one entry covers both
    encodings = ["utf-8", "gbk"]
    for encoding in encodings:
        try:
            with open(filepath, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError("Unable to read file, encoding unknown")
# Practical application 2: handling network data
import json
json_str = '{"name":"张三","age":25}'
json_bytes = json_str.encode("utf-8")
decoded_str = json_bytes.decode("utf-8")
data = json.loads(decoded_str)
print(data)
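The same try-in-order idea works for raw bytes received from a network or a database. The sketch below also reports which encoding succeeded; the function name and the default encoding list are assumptions, and guessing an encoding this way can never be fully reliable.

```python
def decode_with_fallback(data, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first encoding that decodes data cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep the readable parts and mark undecodable bytes with U+FFFD
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"

print(decode_with_fallback("世界".encode("utf-8")))  # ('世界', 'utf-8')
print(decode_with_fallback("世界".encode("gbk")))    # ('世界', 'gbk')
```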
19. Advanced Usage of ljust() / rjust() / center()
# Basic usage
text = "Python"
print(text.ljust(15, "-")) # Output: Python---------
print(text.rjust(15, "-")) # Output: ---------Python
print(text.center(15, "-")) # Output: -----Python----
# Practical application 1: creating a progress bar
def progress_bar(percent, width=20):
    """Create a text progress bar"""
    filled = int(width * percent / 100)
    bar = "█" * filled + "░" * (width - filled)
    return f"[{bar}] {percent}%"
for i in range(0, 101, 10):
    print(progress_bar(i))
# Practical application 2: aligned output (like a table)
def print_aligned_table(data):
    """Print aligned table"""
    # Calculate the maximum width of each column
    max_widths = [max(len(str(row[i])) for row in data)
                  for i in range(len(data[0]))]
    for row in data:
        aligned_row = [str(cell).ljust(width)
                       for cell, width in zip(row, max_widths)]
        print(" | ".join(aligned_row))
data = [
    ["Name", "Age", "City"],
    ["张三", "25", "北京"],
    ["李四的昵称", "30", "上海"],
]
print_aligned_table(data)
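One caveat with the table above: ljust() pads by character count, but most terminals draw CJK characters two columns wide, so rows that mix Chinese and Latin text can still look misaligned. A possible workaround, sketched here with the standard-library unicodedata module, is to pad by approximate display width (the helper names are illustrative, and the result still depends on the terminal font):

```python
import unicodedata

def display_width(text):
    """Approximate terminal width: wide and fullwidth characters count as 2 columns."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in text)

def ljust_display(text, width, fillchar=" "):
    """Like str.ljust(), but pads according to display width."""
    return text + fillchar * max(0, width - display_width(text))

rows = [["Name", "City"], ["张三", "北京"], ["李四的昵称", "上海"]]
widths = [max(display_width(row[i]) for row in rows) for i in range(len(rows[0]))]
for row in rows:
    print(" | ".join(ljust_display(cell, w) for cell, w in zip(row, widths)))
```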
20. casefold() — Aggressive Case Folding
# casefold(): more aggressive lowercase conversion
# Suitable for international characters and different languages
text = "ß" # German letter
print(text.lower()) # Output: ß (unchanged)
print(text.casefold()) # Output: ss (converted to two s)
# Practical application 1: case-insensitive string comparison
def case_insensitive_compare(str1, str2):
    """Case-insensitive comparison (including international characters)"""
    return str1.casefold() == str2.casefold()
print(case_insensitive_compare("Straße", "STRASSE")) # Output: True
print(case_insensitive_compare("hello", "HELLO")) # Output: True
# Practical application 2: search functionality
def search_case_insensitive(text, query):
    """Case-insensitive search"""
    return query.casefold() in text.casefold()
print(search_case_insensitive("Hello World", "hello")) # Output: True
print(search_case_insensitive("Naïve", "naive")) # Output: False (casefold() does not remove accents)
# Performance comparison: casefold vs lower
import time
text = ("Hello World Python " * 1000).casefold()
query = "world"
iterations = 100000
# Using lower()
start = time.perf_counter()
for _ in range(iterations):
    query.lower() in text
time1 = time.perf_counter() - start
# Using casefold()
start = time.perf_counter()
for _ in range(iterations):
    query.casefold() in text
time2 = time.perf_counter() - start
print(f"lower(): {time1:.4f}s, casefold(): {time2:.4f}s")
# Both conversions are linear in the string length, so expect similar timings
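casefold() handles case differences only; it leaves accented letters such as "ï" unchanged. If a search should also ignore accents, one common approach is to decompose the text with unicodedata.normalize() and drop the combining marks. The sketch below assumes that discarding accents is acceptable for the application, which is not true for every language.

```python
import unicodedata

def normalize_for_search(text):
    """Casefold text and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_accent_insensitive(text, query):
    return normalize_for_search(query) in normalize_for_search(text)

print(search_accent_insensitive("Naïve", "naive"))     # True
print(search_accent_insensitive("Straße", "STRASSE"))  # True
```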
5. Comprehensive Practice: Complete Data Processing Workflow
Comprehensive Case 1: Parsing and Validating User Data
import re
# Compiled once at module level and reused for every row
EMAIL_PATTERN = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")
def parse_and_validate_user_data(csv_data):
    """Parse and validate CSV formatted user data
    Input format:
    name,email,phone,age
    张三,zhangsan@example.com,13800138000,25
    李四,lisi@example.com,15900139000,30
    """
    lines = csv_data.strip().split("\n")
    headers = [h.strip() for h in lines[0].split(",")]
    users = []
    errors = []
    for i, line in enumerate(lines[1:], start=2):
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != len(headers):
            errors.append(f"Line {i}: Field count mismatch")
            continue
        user = dict(zip(headers, fields))
        # Validate email
        if not EMAIL_PATTERN.match(user["email"]):
            errors.append(f"Line {i}: Invalid email format - {user['email']}")
            continue
        # Validate phone
        if not user["phone"].isdigit() or len(user["phone"]) != 11:
            errors.append(f"Line {i}: Invalid phone format - {user['phone']}")
            continue
        # Validate age
        try:
            age = int(user["age"])
            if not 18 <= age <= 100:
                errors.append(f"Line {i}: Age must be between 18-100")
                continue
        except ValueError:
            errors.append(f"Line {i}: Age must be a number - {user['age']}")
            continue
        user["age"] = age
        users.append(user)
    return {
        "valid_users": users,
        "errors": errors,
        "summary": f"Success: {len(users)} records, Failed: {len(errors)} records"
    }
# Usage example
csv_data = """
name,email,phone,age
张三,zhangsan@example.com,13800138000,25
李四,invalid-email,15900139000,30
王五,wangwu@example.com,159001390,35
赵六,zhaoliu@example.com,18600136000,120
"""
result = parse_and_validate_user_data(csv_data)
print(result["summary"])
for error in result["errors"]:
    print(f" ❌ {error}")
for user in result["valid_users"]:
    print(f" ✅ {user['name']} - {user['email']}")
Comprehensive Case 2: Log Analysis and Statistics
import re
from collections import Counter
def analyze_log_file(log_text):
    """Analyze log files and extract key information
    Log format:
    [2024-01-15 10:30:45] INFO: Server started
    [2024-01-15 10:30:50] ERROR: Connection failed
    """
    # Define log pattern
    log_pattern = re.compile(
        r"\[(?P<timestamp>.*?)\]\s+(?P<level>\w+):\s+(?P<message>.*)"
    )
    logs = []
    level_count = Counter()
    for line in log_text.strip().split("\n"):
        match = log_pattern.match(line)
        if not match:
            continue
        log_entry = match.groupdict()
        logs.append(log_entry)
        level_count[log_entry["level"]] += 1
    # Find error messages
    errors = [log for log in logs if log["level"] == "ERROR"]
    # Statistics
    return {
        "total_logs": len(logs),
        "level_distribution": dict(level_count),
        "errors": errors,
        "error_count": len(errors),
        "error_types": Counter(e["message"].split(":")[0] for e in errors)
    }
# Usage example
log_text = """
[2024-01-15 10:30:45] INFO: Server started
[2024-01-15 10:30:50] ERROR: Connection failed
[2024-01-15 10:31:00] WARNING: Memory usage high
[2024-01-15 10:31:05] ERROR: Connection failed
[2024-01-15 10:31:10] INFO: Request processed
"""
result = analyze_log_file(log_text)
print(f"Total logs: {result['total_logs']}")
print(f"Log level distribution: {result['level_distribution']}")
print(f"Error count: {result['error_count']}")
print(f"Error types: {result['error_types']}")
Comprehensive Case 3: URL Parsing and Cleaning
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs
def analyze_urls(url_list):
    """Analyze and clean a list of URLs"""
    url_pattern = re.compile(
        r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
    )
    valid_urls = []
    domains = Counter()
    for url in url_list:
        # Skip entries that do not look like a URL
        url = url.strip()
        if not url_pattern.match(url):
            continue
        # Parse URL
        parsed = urlparse(url)
        domain = parsed.netloc.replace("www.", "")
        domains[domain] += 1
        # Parse query parameters
        params = parse_qs(parsed.query)
        valid_urls.append({
            "url": url,
            "domain": domain,
            "path": parsed.path,
            "params": params
        })
    return {
        "total_urls": len(valid_urls),
        "unique_domains": len(domains),
        "top_domains": domains.most_common(5),
        "urls": valid_urls
    }
# Usage example
urls = [
    "https://www.example.com/path?key=value",
    "http://test.org/api/users?id=123&type=admin",
    "invalid-url",
    "https://github.com/repository"
]
result = analyze_urls(urls)
print(f"Valid URLs: {result['total_urls']}")
print(f"Unique domains: {result['unique_domains']}")
print(f"Top domains: {result['top_domains']}")
6. Performance Optimization Summary
Scenario 1: Large Scale String Concatenation
# ❌ Bad (worst-case time complexity O(n²))
result = ""
for i in range(10000):
    result += f"Item {i}, "
# ✅ Good (time complexity O(n))
result = ", ".join(f"Item {i}" for i in range(10000))
# The gap widens as the number of pieces grows; the exact speedup depends on the interpreter
Scenario 2: Multiple Replacement Operations
# ❌ Bad (one full pass over the string per character)
text = "abcdefg" * 100000
for char in "abcdefg":
    text = text.replace(char, "x")
# ✅ Good (using translate, one traversal)
text = "abcdefg" * 100000
trans = str.maketrans("abcdefg", "xxxxxxx")
text = text.translate(trans)
# The gain grows with the number of characters being replaced; measure before optimizing
Scenario 3: Frequent Regular Matching
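# ❌ Slower (the pattern is looked up in re's cache on every call)
import re
emails = ["user@example.com", "invalid-email"] # Sample input
for email in emails:
    if re.match(r"^[\w\.-]+@[\w\.-]+\.\w+$", email):
        pass
# ✅ Good (precompile)
pattern = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")
for email in emails:
    if pattern.match(email):
        pass
# Performance improvement: usually modest, and most noticeable in tight loops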
7. Quick Reference for 20 Operations
| Index | Operation | Usage | Complexity | Commonality |
|---|---|---|---|---|
| 1 | find() / index() | Finding Substrings | O(n*m) | ⭐⭐⭐⭐⭐ |
| 2 | replace() | Replacing Substrings | O(n*m) | ⭐⭐⭐⭐⭐ |
| 3 | count() | Counting Occurrences | O(n) | ⭐⭐⭐⭐ |
| 4 | startswith/endswith | Prefix and Suffix Checking | O(m) | ⭐⭐⭐⭐⭐ |
| 5 | strip() | Removing Whitespace | O(n) | ⭐⭐⭐⭐⭐ |
| 6 | split() | Splitting Strings | O(n) | ⭐⭐⭐⭐⭐ |
| 7 | join() | Joining Strings | O(n) | ⭐⭐⭐⭐⭐ |
| 8 | partition() | Three-Way Split | O(n) | ⭐⭐⭐ |
| 9 | format / f-string | String Formatting | O(n) | ⭐⭐⭐⭐⭐ |
| 10 | upper/lower/title | Case Conversion | O(n) | ⭐⭐⭐⭐ |
| 11 | isdigit/isalpha | Character Validation | O(n) | ⭐⭐⭐⭐ |
| 12 | zfill / center | Padding and Centering | O(n) | ⭐⭐⭐ |
| 13 | match / search / findall | Regular Matching | O(n*m) | ⭐⭐⭐⭐⭐ |
| 14 | sub / subn | Regular Replacement | O(n*m) | ⭐⭐⭐⭐⭐ |
| 15 | compile | Precompiled Regex | O(m) | ⭐⭐⭐⭐ |
| 16 | translate | Character Mapping | O(n) | ⭐⭐⭐ |
| 17 | expandtabs | Tab Handling | O(n) | ⭐ |
| 18 | encode / decode | Encoding Conversion | O(n) | ⭐⭐⭐⭐ |
| 19 | ljust / rjust / center | Alignment and Padding | O(n) | ⭐⭐⭐ |
| 20 | casefold | Aggressive Lowercase | O(n) | ⭐⭐ |
8. Best Practice Recommendations
✅ Do These Things
- Use f-string — The latest, fastest, and most readable
- Use join() for concatenation — Never use + to concatenate multiple strings
- Precompile regex — Must precompile for frequent matches
- Use strip() — Clean user input data
- Choose appropriate validation methods — isdigit, isalpha, etc.
- Use translate — Most efficient for large-scale character replacements
- Standardize encoding — Preferably use UTF-8
- Validate input — Always validate external input
Summary
These 20 string operations cover the large majority of everyday string-processing scenarios in Python. The key is to understand:
- Basic Operations (1-5): are the foundation of all string processing
- Efficient Operations (6-7): join and split are key to performance
- Validation Operations (11): ensure data quality
- Regular Expressions (13-15): powerful tools for handling complex matches
- Performance Optimization (translate, compile): essential for handling large-scale data