Algorithm Processing time Matching timeKnuth-Morris-Pratt O(m) O(n) Algorithm• The Knuth-Morris-Pratt (KMP) string searching algorithm differs from the brute-force algorithm by keeping track of information gained from previous comparisons. • Afailure function (f) is computed that indicates how much of the last comparison can be reused if it fails. • Specifically, f is defined to be the longest prefix of the pattern P[0,..,j] that is also a suffix of P[1,..,j] - Note: not a suffix of P[0,..,j] • Example: - value of the KMP failure function: j 0 1 2 3 4 5 P[j] a b a b a c f(j) 0 0 1 2 3 0 • This shows how much of the beginning of the string matches up to the portion immediately preceding a failed comparison. - if the comparison fails at (4), we know the a,b in positions 2,3 is identical to positions 0,1 Pros1. Its searching complexity is O(m+n) which is faster than brute force and Rabin-Karp 2. It’s fairly easy to implement Cons1. It needs additional space and time – O(m) for pre-processing 2. It can be optimized a bit (Knuth-Morris-Pratt) Final WordsObviously this algorithm is quite useful because it improves in some very elegant manner the brute force matching. In the other hand you must know that there are faster string searching algorithms like the Boyer-Moore algorithm. However the Morris-Pratt algorithm can be quite useful in many cases, so understanding its principles can be very handy. ## Worked example of the search algorithmTo illustrate the algorithm's details, we work through a (relatively artificial) run of the algorithm. At any given time, the algorithm is in a state determined by two integers,
We proceed by comparing successive characters of
We quickly obtain a nearly complete match
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of
Once again we immediately hit upon a match
This time we are able to complete the match, whose first character is ## [edit]Description of and pseudocode for the search algorithmThe above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table
## [edit]Efficiency of the search algorithmAssuming the prior existence of the table Using this fact, we will show that the loop can execute at most Thus the loop executes at most Here is another way to think about the runtime: Let us say we begin to match W and S at position i and p, if W exists as a substring of S at p, then W[0 through m] == S[p through p+m]. Upon success, that is, the word and the text matched at the positions(W[i] == S[p+i]), we increase i by 1 (i++). Upon failure, that is, the word and the text does not match at the positions(W[i] != S[p+i]), the text pointer is kept still, while the word pointer roll-back a certain amount(i = T[i], where T is the jump table) And we attempt to match W[T[i]] with S[p+i]. The maximum number of roll-back of i is bounded by i, that is to say, for any failure, we can only roll-back as much as we have progressed up to the failure. Then it is clear the runtime is 2k. ## [edit]"Partial match" table (also known as "failure function")The goal of the table is to allow the algorithm not to match any character of We want to be able to look up, for each position in ## [edit]Worked example of the table-building algorithmWe consider the example of Continuing to Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so We pass to the subsequent Considering now the next character, Finally, we see that the next character in the ongoing segment starting at Therefore we compile the following table:
Other example more interesting and complex:
## [edit]Description of pseudocode for the table-building algorithmThe example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.
## [edit]Efficiency of the table-building algorithmThe complexity of the table algorithm is ## [edit]Efficiency of the KMP algorithmSince the two portions of the algorithm have, respectively, complexities of These complexities are the same, no matter how many repetitive patterns are in ## [edit]VariantsA real-time version of KMP can be implemented using a separate failure function table for each character in the alphabet. If a mismatch occurs on character in the text, the failure function table for character is consulted for the index in the pattern at which the mismatch took place. This will return the length of the longest substring ending at matching a prefix of the pattern, with the added condition that the character after the prefix is . With this restriction, character in the text need not be checked again in the next phase, and so only a constant number of operations are executed between the processing of each index of the text. This satisfies the real-time computing restriction. The Booth algorithm uses a modified version of the KMP preprocessing function to find the lexicographically minimal string rotation. The failure function is progressively calculated as the string is rotated. |

Algorithms > String Matching >