Post

My Take on the KMP Algorithm

My Take on the KMP Algorithm
  • The Knuth-Morris-Pratt (KMP) algorithm is an advanced string matching algorithm that efficiently finds all occurrences of a pattern string pat within a text string text.
  • It avoids redundant comparisons by leveraging a Longest Prefix-Suffix (LPS) array
  • LPS: Longest Proper Prefix which is also a Suffix.
  • Before understanding LPS, let us look what are proper prefixes? . They are prefix that excludes the full string itself.

Understanding proper prefixes using an example for string abc

1
2
3
4
- String s = "abc"
- Prefixes: `["", "a", "ab", "abc"]`  
- Proper Prefixes: `["", "a", "ab"]`  (does not include full string "abc")
- Suffixes: `["", "c", "bc", "abc"]`

LPS Array

  • Each value lps[i] stores the length of the longest proper prefix in pat(0...i) that is also a suffix in pat(0...i).

  • For Example

1
2
3
4
5
Consider pat = "abcdabca"

index =>   0 1 2 3 4 5 6 7
pat   =>   a b c d a b c a     
lps   => [ 0 0 0 0 1 2 3 1 ]

But, what do the lps values actually mean?

  • Say lps[5]. lps[5] says “I have the longest proper prefix in pat(0...5) that is also a suffix in pat(0...5)
  • pat(0...5) is pattern from index 0 to 5. pat(0...5) = "acbdab".
  • Let us write the proper prefixes and suffixes of this string to understand clearly.
1
2
3
4
5
pat(0...5) :  "abcdab"
                              X
Proper Prefixes : ["", "a", "ab","abc","abcd","abcda"]
                              X
Suffixes        : ["", "b", "ab", "dab", "cdab", "bcdab", "abcdab"]
  • Clearly, the prefix "ab"(marked with an X) is the longest proper prefix that is also a suffix!
  • I hope that makes much more sense now, doesn’t it?
  • Next step is computing this LPS array

Computing the LPS Array

Process

  • Initialize lps[0] = 0, len = 0, and i = 1 (start from the second character).
  • Traverse until i < pat.length:
    • If pat[len] == pat[i] (characters match):
      • Increment len (len++) and assign lps[i] = len.
      • Move to the next character (i++).
    • If pat[len] != pat[i] (characters mismatch):
      • If len == 0, set lps[i] = 0 and increment i.
      • Otherwise, update len = lps[len - 1] and recheck.

Example for computing LPS Array

  • Consider pat = “abcdabcad”
  • Let us find the lps array

RECOMMENDATION: If you are trying to understand this properly, it is recommended that you do this parallelly on pen and paper as well.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
pat = “abcdabcad”
Remember, we are trying to match characters at `len` and `i` at each iteration

Start
-------------------------------------------

i=1, len=0

          len
index =>   0   1   2   3   4   5   6   7  8
               i
pat   =>   a   b   c   d   a   b   c   a  d   

lps   => [ 0                               ]

character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`

-------------------------------------------

i=2, len=0

          len
index =>   0   1   2   3   4   5   6   7  8
                   i
pat   =>   a   b   c   d   a   b   c   a  d     

lps   => [ 0   0                           ]

character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`

-------------------------------------------

i=3, len=0

          len
index =>   0   1   2   3   4   5   6   7  8
                       i
pat   =>   a   b   c   d   a   b   c   a  d     

lps   => [ 0   0   0                       ]


character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`

-------------------------------------------

i=4, len=0

          len
index =>   0   1   2   3   4   5   6   7  8
                           i
pat   =>   a   b   c   d   a   b   c   a  d  

lps   => [ 0   0   0   0                   ]


character match => len++, lps[i] = len, then i++

-------------------------------------------

i=5, len=1

              len
index =>   0   1   2   3   4   5   6   7  8
                               i
pat   =>   a   b   c   d   a   b   c   a  d  

lps   => [ 0   0   0   0   1               ]


character match => len++, lps[i] = len, then i++

-------------------------------------------

i=6, len=2

                  len
index =>   0   1   2   3   4   5   6   7  8
                                   i
pat   =>   a   b   c   d   a   b   c   a  d  

lps   => [ 0   0   0   0   1   2           ]


character match => len++, lps[i] = len, then i++


-------------------------------------------

i=7, len=3

                      len
index =>   0   1   2   3   4   5   6   7   8
                                       i
pat   =>   a   b   c   d   a   b   c   a   d  

lps   => [ 0   0   0   0   1   2   3        ]


character mismatch => len!=0 => set len = lps[len - 1]

-------------------------------------------

i=7, len=0

          len
index =>   0   1   2   3   4   5   6   7  8
                                       i
pat   =>   a   b   c   d   a   b   c   a  d    

lps   => [ 0   0   0   0   1   2   3       ]


character match => len++, lps[i] = len, then i++


-------------------------------------------

i=8, len=1

              len
index =>   0   1   2   3   4   5   6   7  8
                                          i
pat   =>   a   b   c   d   a   b   c   a  d   

lps   => [ 0   0   0   0   1   2   3   1   ]


character mismatch => len!=0 => set len = lps[len - 1]

-------------------------------------------

i=8, len=0

          len
index =>   0   1   2   3   4   5   6   7  8
                                          i
pat   =>   a   b   c   d   a   b   c   a  d   

lps   => [ 0   0   0   0   1   2   3   1   ]


character mismatch and (len == 0) => set `lps[i] = 0`, increment `i`

-------------------------------------------

i=9, len=1

          len
index =>   0   1   2   3   4   5   6   7  8  9
                                             i
pat   =>   a   b   c   d   a   b   c   a  d   

lps   => [ 0   0   0   0   1   2   3   1  0 ]

condition violated => (i < pat.length()) => loop breaks

-------------------------------------------
END
-------------------------------------------
Final state

lps => [0, 0, 0, 0, 1, 2, 3, 1, 0]

Pattern Matching Using LPS

  • Observe closely that the pattern matching code is strikingly similar to computing the LPS array.

Process

  • Initialize pointers i = 0 (text) and j = 0 (pattern).
  • Traverse the text while i < text.length:
    • If text[i] == pat[j] (match):
      • Increment both i and j.
      • If j == pat.length (pattern found):
        • Record the occurrence start index: i - j.
        • Reset j = lps[j - 1] for further matches.
    • If text[i] != pat[j] (mismatch):
      • If j == 0, increment i (move text pointer).
      • Otherwise, reset j = lps[j - 1] (shift pattern pointer).
  • Dry-run an example(like we did earlier while computing lps array) and observe how the characters are being matched.

Complexity Analysis

  • Time Complexity: O(n + m)
    • O(m): For computing the LPS array.
    • O(n): For searching the pattern in the text.
    • where, n is the text length and m is the pattern length.-
  • Space Complexity: O(m)
    • LPS array has size equal to pattern length, i.e. m.

Problem: Search Pattern (KMP-Algorithm)

Given two strings, one is a text string txt and the other is a pattern string pat. The task is to print the indexes of all the occurrences of the pattern string in the text string. Use 0-based indexing while returning the indices and return an empty list in case of no occurrences of pattern.

Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
class Solution {
    // find all occurrences of pattern `pat` in text `txt`
    ArrayList<Integer> search(String pat, String txt) {
        int txtLen = txt.length();  // text length
        int patLen = pat.length();  // pattern length
        
        // if the pattern is longer than the text, we can never find it in `text`
        if(txtLen < patLen){
            return new ArrayList<>();
        }
        
        int[] lps = new int[patLen];  // Array to store the LPS (Longest Prefix Suffix) values
        
        // compute the LPS array for the pattern
        computeLPS(pat, lps);

        // list to store indices of occurences
        ArrayList<Integer> res = new ArrayList<Integer>();  
        
        // Call match function to find all occurrences of pattern in the text
        match(pat, txt, lps, res);
        
        return res;  // return the answer list 
   }
    
    // pattern matching using LPS array
    private void match(String pat, String txt, int[] lps, ArrayList<Integer> res){
        int txtLen = txt.length(); 
        int patLen = pat.length();  
        
        int i = 0;  // Pointer for the text
        int j = 0;  // Pointer for the pattern
        
        // traverse the text until all characters are checked
        while(i < txtLen){
            
            // character match, increment both `i` and `j`
            if(txt.charAt(i) == pat.charAt(j)){
                i++;
                j++;
                
                // if entire pattern is matched, add the start index `i-j` to result
                if(j == patLen){
                    res.add(i - j); 
                    j = lps[j - 1];  // use LPS array to skip unnecessary comparisons
                }
            }
            else{
                // character mismatch, check if the `len` is at the start
                if(j == 0){
                    i++;  // if `len` at the start of the pattern, move the `i` forward
                }
                else{
                    j = lps[j - 1];  // otherwise, move the `len` based on the LPS array
                }
            }
        }
    }
    
    // compute the LPS array for the given pattern
    private void computeLPS(String pat, int[] lps){
        int patLen = pat.length();  
        int len = 0;
        int i = 1;  // start checking from the second character
        
        lps[0] = 0;  // the first element of LPS array is always `0`
        
        // traverse the pattern and fill the LPS array
        while(i < patLen){
            
            // characters match, extend the prefix-suffix length
            if(pat.charAt(i) == pat.charAt(len)){
                len++;         // Increase the length of current matching prefix
                lps[i] = len;  // Update the LPS array with the new length
                i++;           // Move to the next character in the pattern
            }
            else{
                // characters mismatch
                if(len != 0){
                    // use the LPS array to skip and try matching with a smaller prefix
                    len = lps[len - 1];
                }
                else{
                    lps[i] = 0;  // if no match, set LPS value to `0` and move to the next character
                    i++;
                }
            }
        }
    }
}

More Resources

This post is licensed under CC BY 4.0 by the author.