My Take on the KMP Algorithm
My Take on the KMP Algorithm
- The Knuth-Morris-Pratt (KMP) algorithm is an advanced string matching algorithm that efficiently finds all occurrences of a pattern string
pat
within a text stringtext
. - It avoids redundant comparisons by leveraging a Longest Prefix-Suffix (LPS) array
- LPS: Longest Proper Prefix which is also a Suffix.
- Before understanding LPS, let us look what are proper prefixes? . They are prefix that excludes the full string itself.
Understanding proper prefixes using an example for string abc
1
2
3
4
- String s = "abc"
- Prefixes: `["", "a", "ab", "abc"]`
- Proper Prefixes: `["", "a", "ab"]` (does not include full string "abc")
- Suffixes: `["", "c", "bc", "abc"]`
LPS Array
Each value
lps[i]
stores the length of the longest proper prefix inpat(0...i)
that is also a suffix inpat(0...i)
.For Example
1
2
3
4
5
Consider pat = "abcdabca"
index => 0 1 2 3 4 5 6 7
pat => a b c d a b c a
lps => [ 0 0 0 0 1 2 3 1 ]
But, what do the lps values actually mean?
- Say
lps[5]
.lps[5]
says “I have the longest proper prefix inpat(0...5)
that is also a suffix inpat(0...5)
” pat(0...5)
is pattern from index0
to5
.pat(0...5) = "acbdab"
.- Let us write the proper prefixes and suffixes of this string to understand clearly.
1
2
3
4
5
pat(0...5) : "abcdab"
X
Proper Prefixes : ["", "a", "ab","abc","abcd","abcda"]
X
Suffixes : ["", "b", "ab", "dab", "cdab", "bcdab", "abcdab"]
- Clearly, the prefix
"ab"
(marked with an X) is the longest proper prefix that is also a suffix! - I hope that makes much more sense now, doesn’t it?
- Next step is computing this
LPS
array
Computing the LPS Array
Process
- Initialize
lps[0] = 0
,len = 0
, andi = 1
(start from the second character). - Traverse until
i < pat.length
:- If
pat[len] == pat[i]
(characters match):- Increment
len
(len++
) and assignlps[i] = len
. - Move to the next character (
i++
).
- Increment
- If
pat[len] != pat[i]
(characters mismatch):- If
len == 0
, setlps[i] = 0
and incrementi
. - Otherwise, update
len = lps[len - 1]
and recheck.
- If
- If
Example for computing LPS Array
- Consider pat = “abcdabcad”
- Let us find the
lps
array
RECOMMENDATION: If you are trying to understand this properly, it is recommended that you do this parallelly on pen and paper as well.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
pat = “abcdabcad”
Remember, we are trying to match characters at `len` and `i` at each iteration
Start
-------------------------------------------
i=1, len=0
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 ]
character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`
-------------------------------------------
i=2, len=0
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 ]
character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`
-------------------------------------------
i=3, len=0
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 ]
character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`
-------------------------------------------
i=4, len=0
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 ]
character match => len++, lps[i] = len, then i++
-------------------------------------------
i=5, len=1
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 ]
character match => len++, lps[i] = len, then i++
-------------------------------------------
i=6, len=2
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 2 ]
character match => len++, lps[i] = len, then i++
-------------------------------------------
i=7, len=3
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 2 3 ]
character mismatch => len!=0 => set len = lps[len - 1]
-------------------------------------------
i=7, len=0
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 2 3 ]
character match => len++, lps[i] = len, then i++
-------------------------------------------
i=8, len=1
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 2 3 1 ]
character mismatch => len!=0 => set len = lps[len - 1]
-------------------------------------------
i=8, len=0
len
index => 0 1 2 3 4 5 6 7 8
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 2 3 1 ]
character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`
-------------------------------------------
i=9, len=1
len
index => 0 1 2 3 4 5 6 7 8 9
i
pat => a b c d a b c a d
lps => [ 0 0 0 0 1 2 3 1 0 ]
condition violated => (i < pat.length()) => loop breaks
-------------------------------------------
END
-------------------------------------------
Final state
lps => [0, 0, 0, 0, 1, 2, 3, 1, 0]
Pattern Matching Using LPS
- Observe closely that the pattern matching code is strikingly similar to computing the LPS array.
Process
- Initialize pointers
i = 0
(text) andj = 0
(pattern). - Traverse the text while
i < text.length
:- If
text[i] == pat[j]
(match):- Increment both
i
andj
. - If
j == pat.length
(pattern found):- Record the occurrence start index:
i - j
. - Reset
j = lps[j - 1]
for further matches.
- Record the occurrence start index:
- Increment both
- If
text[i] != pat[j]
(mismatch):- If
j == 0
, incrementi
(move text pointer). - Otherwise, reset
j = lps[j - 1]
(shift pattern pointer).
- If
- If
- Dry-run an example(like we did earlier while computing
lps
array) and observe how the characters are being matched.
Complexity Analysis
- Time Complexity: O(n + m)
- O(m): For computing the LPS array.
- O(n): For searching the pattern in the text.
- where,
n
is the text length andm
is the pattern length.-
- Space Complexity: O(m)
- LPS array has size equal to pattern length, i.e.
m
.
- LPS array has size equal to pattern length, i.e.
Problem: Search Pattern (KMP-Algorithm)
Given two strings, one is a text string txt
and the other is a pattern string pat
. The task is to print the indexes of all the occurrences of the pattern string in the text string. Use 0-based
indexing while returning the indices and return an empty list in case of no occurrences of pattern.
Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
class Solution {
// find all occurrences of pattern `pat` in text `txt`
ArrayList<Integer> search(String pat, String txt) {
int txtLen = txt.length(); // text length
int patLen = pat.length(); // pattern length
// if the pattern is longer than the text, we can never find it in `text`
if(txtLen < patLen){
return new ArrayList<>();
}
int[] lps = new int[patLen]; // Array to store the LPS (Longest Prefix Suffix) values
// compute the LPS array for the pattern
computeLPS(pat, lps);
// list to store indices of occurences
ArrayList<Integer> res = new ArrayList<Integer>();
// Call match function to find all occurrences of pattern in the text
match(pat, txt, lps, res);
return res; // return the answer list
}
// pattern matching using LPS array
private void match(String pat, String txt, int[] lps, ArrayList<Integer> res){
int txtLen = txt.length();
int patLen = pat.length();
int i = 0; // Pointer for the text
int j = 0; // Pointer for the pattern
// traverse the text until all characters are checked
while(i < txtLen){
// character match, increment both `i` and `j`
if(txt.charAt(i) == pat.charAt(j)){
i++;
j++;
// if entire pattern is matched, add the start index `i-j` to result
if(j == patLen){
res.add(i - j);
j = lps[j - 1]; // use LPS array to skip unnecessary comparisons
}
}
else{
// character mismatch, check if the `len` is at the start
if(j == 0){
i++; // if `len` at the start of the pattern, move the `i` forward
}
else{
j = lps[j - 1]; // otherwise, move the `len` based on the LPS array
}
}
}
}
// compute the LPS array for the given pattern
private void computeLPS(String pat, int[] lps){
int patLen = pat.length();
int len = 0;
int i = 1; // start checking from the second character
lps[0] = 0; // the first element of LPS array is always `0`
// traverse the pattern and fill the LPS array
while(i < patLen){
// characters match, extend the prefix-suffix length
if(pat.charAt(i) == pat.charAt(len)){
len++; // Increase the length of current matching prefix
lps[i] = len; // Update the LPS array with the new length
i++; // Move to the next character in the pattern
}
else{
// characters mismatch
if(len != 0){
// use the LPS array to skip and try matching with a smaller prefix
len = lps[len - 1];
}
else{
lps[i] = 0; // if no match, set LPS value to `0` and move to the next character
i++;
}
}
}
}
}
More Resources
- Additional resources to learn about the KMP Algorithm:
- Few questions to practice KMP algorithm are:
This post is licensed under CC BY 4.0 by the author.