|
| 1 | +# Knuth-Morris-Pratt (KMP) Algorithm |
| 2 | +KMP is a linear-time string matching algorithm. The problem is to find |
| 3 | +all occurrences of a pattern string as a substring of a text. |
| 4 | +The naive approach to string matching |
| 5 | +loops over all indices of the text and reports the indices |
| 6 | +where the pattern p matches the substring starting at that index. |
| 7 | + |
| 8 | +That is, it finds every offset idx such that pattern[0 ... m - 1] = text[idx ... idx + m - 1]. |
| 9 | + |
| 10 | +The worst case of this approach is O(m*(n-m+1)), where m is |p| and n is |text|. |
| 11 | + |
| 12 | +The main drawback of the naive approach is that it handles overlaps |
| 13 | +poorly. It can go deep into the inner loop while checking |
| 14 | +whether the substring and the pattern match, and when it hits a mismatch |
| 15 | +it starts over from the next text index, thereby redoing some of its |
| 16 | +comparisons. |
| 17 | + |
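As a point of comparison, the naive approach described above could be sketched in Python roughly as follows. The function name `naive_search` is hypothetical and only used for illustration; the sketch assumes 0-based offsets.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every offset where pattern occurs in text (naive scan)."""
    n, m = len(text), len(pattern)
    matches = []
    for idx in range(n - m + 1):  # n - m + 1 candidate offsets
        j = 0
        # Up to m character comparisons for each candidate offset.
        while j < m and text[idx + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(idx)
        # On a mismatch, everything learned about text[idx ... idx + j - 1]
        # is thrown away and the next offset starts again from j = 0.
    return matches


print(naive_search("abababa", "aba"))  # [0, 2, 4]
```

The nested loops are where the O(m*(n-m+1)) bound comes from: a text such as "aaaa...a" with a pattern such as "aa...ab" forces almost m comparisons at every offset.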
| 18 | +The KMP algorithm solves this problem with some clever preprocessing, |
| 19 | +thereby reaching linear-time performance. The preprocessing simply |
| 20 | +builds an array of values computed by a prefix function; these values describe how the pattern matches against shifts of itself. |
| 21 | +We use this array to avoid the worst case of the naive approach by reusing comparisons that were already performed. |
| 22 | + |
| 23 | +The prefix-function(i) is the length of the longest proper prefix of p that is also a suffix of p[0 ... i]. The whole idea of finding these substrings of the |
| 24 | +pattern that are both prefixes and suffixes is that they determine the index in the pattern from which we should continue after a mismatch, hence |
| 25 | +avoiding having to restart at the beginning of the pattern and only one index further in the text each time we hit a |
| 26 | +character mismatch. |
| 27 | + |
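To make the definition concrete, here is a small Python sketch that computes the prefix values directly from the definition, assuming 0-based indexing and proper prefixes (shorter than the substring itself). It is deliberately brute force and far slower than the linear-time routine KMP uses; the name `prefix_function_by_definition` is hypothetical.

```python
def prefix_function_by_definition(p: str) -> list[int]:
    """For each i, length of the longest proper prefix of p[0..i]
    that is also a suffix of p[0..i]. Brute force, for illustration only."""
    result = []
    for i in range(len(p)):
        s = p[: i + 1]
        best = 0
        for length in range(1, len(s)):  # proper prefixes only
            if s[:length] == s[-length:]:
                best = length
        result.append(best)
    return result


print(prefix_function_by_definition("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For example, the value 3 at index 4 says that "aba" is both a prefix and a suffix of "ababa", so after matching "ababa" and then hitting a mismatch, the first three pattern characters are already known to match.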
| 28 | +KMP runs in O(n + m). Note that KMP pays off mainly when there are many overlapping parts, since it is only in such |
| 29 | +situations that the prefix-suffix array helps. However, the worst-case linear running time is guaranteed, meaning |
| 30 | +the KMP algorithm is useful in the general case as well. |
| 31 | + |
| 32 | +## Pseudocode |
| 33 | +Here t is the text string and p is the pattern; both are indexed from 0. |
| 34 | + |
| 35 | +    KMP-Matcher(t, p): |
| 36 | +        n = len(t) |
| 37 | +        m = len(p) |
| 38 | +        prefix-arr = Compute-Prefix-Function(p) |
| 39 | +        q = 0    // number of pattern characters matched so far |
| 40 | +        for i = 0 to n - 1: |
| 41 | +            while q > 0 and p[q] != t[i]: |
| 42 | +                q = prefix-arr[q - 1]    // fall back to the next shorter border |
| 43 | +            if p[q] == t[i]: |
| 44 | +                q++ |
| 45 | +            if q == m: |
| 46 | +                pattern occurs at index i - m + 1 |
| 47 | +                q = prefix-arr[q - 1]    // keep going to find overlapping matches |
| 48 | + |
| 49 | +    Compute-Prefix-Function(p): |
| 50 | +        m = len(p) |
| 51 | +        new arr[0 ... m-1] |
| 52 | +        arr[0] = 0 |
| 53 | +        k = 0    // length of the current prefix that is also a suffix |
| 54 | +        for i = 1 to m - 1: |
| 55 | +            while k > 0 and p[k] != p[i]: |
| 56 | +                k = arr[k - 1] |
| 57 | +            if p[k] == p[i]: |
| 58 | +                k++ |
| 59 | +            arr[i] = k |
| 60 | + |
| 61 | +        return arr |
| 62 | + |
| 63 | +CLRS[p. 1006]. |
| 64 | + |
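A runnable Python sketch of the two routines might look like the following. It assumes 0-based indexing, so the fallback reads the entry for the last matched character (`prefix[q - 1]`). The function names and the choice to return no matches for an empty pattern are assumptions made here for illustration, not part of the pseudocode.

```python
def compute_prefix_function(p: str) -> list[int]:
    """arr[i] = length of the longest proper prefix of p[0..i]
    that is also a suffix of p[0..i]."""
    m = len(p)
    arr = [0] * m
    k = 0  # length of the current prefix that is also a suffix
    for i in range(1, m):
        while k > 0 and p[k] != p[i]:
            k = arr[k - 1]
        if p[k] == p[i]:
            k += 1
        arr[i] = k
    return arr


def kmp_search(t: str, p: str) -> list[int]:
    """Return the start index of every occurrence of p in t."""
    if not p:
        return []  # assumption: an empty pattern reports no matches
    prefix = compute_prefix_function(p)
    matches = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(t):
        while q > 0 and p[q] != ch:
            q = prefix[q - 1]  # reuse what is already known to match
        if p[q] == ch:
            q += 1
        if q == len(p):
            matches.append(i - len(p) + 1)
            q = prefix[q - 1]  # allow overlapping occurrences
    return matches


print(kmp_search("abababa", "aba"))  # [0, 2, 4]
```

The text index `i` only ever moves forward, and `q` can decrease at most as many times as it has increased, which is the intuition behind the O(n + m) bound.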
| 65 | + |
| 66 | + |