In computer science, string matching is an important area of research. We often need to find the position of a pattern string within a text string. While there are various methods to achieve this, today we will focus on an efficient algorithm—the KMP (Knuth-Morris-Pratt) algorithm.
Introduction to KMP Algorithm
The KMP algorithm was proposed by Donald Knuth, Vaughan Pratt, and James H. Morris in 1977. Its main advantage is that it can complete the matching in O(n + m) time complexity, where n is the length of the text string and m is the length of the pattern string. This is much faster than the traditional brute-force search method, which typically requires O(n * m) time complexity.
Principle of KMP Algorithm
The KMP algorithm avoids unnecessary comparisons by preprocessing the pattern string. When a character mismatch is found, it does not backtrack to the text string but instead uses the information from the partial match to skip some characters, thus improving efficiency.
1. Prefix Table (LPS Array)
To achieve this, we first need to construct a prefix table, also known as the LPS (Longest Prefix which is also Suffix) array. This array records the length of the longest equal prefix and suffix for each substring before a given position. For example, for the pattern “ABABC”, its LPS array is [0, 0, 1, 2, 0].
- LPS[0] = 0, because there are no characters.
- LPS[1] = 0, because “A” has no matching prefix and suffix.
- LPS[2] = 1, because “AB” has a matching prefix and suffix “A”.
- LPS[3] = 2, because “ABA” has a matching prefix and suffix “AB”.
- LPS[4] = 0, because “ABAB” has no matching prefix and suffix.
2. Matching Process
Once we have constructed the LPS array, we can begin the actual string matching. During the comparison process, if a mismatch occurs, we will use the LPS array to determine the next position to compare, rather than simply moving one position to the right. This greatly reduces the number of repeated comparisons and improves efficiency.
KMP Algorithm Code Example
Below is a piece of code implementing the KMP algorithm and its related functions in C:
#include <stdio.h>
#include <string.h>
// Function declarations
void computeLPSArray(char* pattern, int M, int* lps);
void KMPSearch(char* text, char* pattern);
int main() {
char text[] = "ABABDABACDABABCABAB";
char pattern[] = "ABABC";
KMPSearch(text, pattern);
return 0;
}
// Build LPS array
void computeLPSArray(char* pattern, int M, int* lps) {
int len = 0; // Length of the current longest prefix
lps[0] = 0; // The first element is always zero
int i = 1;
while (i < M) {
if (pattern[i] == pattern[len]) {
len++;
lps[i] = len;
i++;
} else { // On mismatch
if (len != 0) {
len = lps[len - 1]; // Use the previously computed value
} else {
lps[i] = 0;
i++;
}
}
}
}
// KMP search function
void KMPSearch(char* text, char* pattern) {
int M = strlen(pattern);
int N = strlen(text);
// Create lps[] and fill it
int lps[M];
computeLPSArray(pattern, M, lps);
int i = 0; // Text pointer
int j = 0; // Pattern pointer
while (N - i >= M - j) {
if (pattern[j] == text[i]) {
i++;
j++;
}
if (j == M) {
printf("Pattern '%s' found at index %d\n", pattern, i - j);
j = lps[j - 1];
} else if (i < N && pattern[j] != text[i]) {
if (j != 0)
j = lps[j - 1];
else
i++;
}
}
}
How to Run the Code
Copy the above code into your C development environment and compile it. You will see the program output the position where the pattern is found, for example:
Pattern 'ABABC' found at index 10
Conclusion
The KMP algorithm is an efficient and practical method for solving pattern matching problems in strings. By constructing and utilizing the LPS array, it significantly reduces repeated comparisons and improves performance. In practical applications, this technique is widely used in text editors, search engines, and other scenarios that require fast substring searching. If you are interested in string processing, delving deeper into KMP and other related techniques will be very beneficial!