Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Musser-Nishanov generic sequence-matching algorithm(s) #25

Open
wants to merge 110 commits into
base: develop
Choose a base branch
from

Conversation

jeremy-murphy
Copy link
Contributor

@jeremy-murphy jeremy-murphy commented Sep 11, 2016

Introduction

In 1997, David R. Musser and Gor V. Nishanov wrote a hitherto unpublished paper and accompanying code, A Fast Generic Sequence Matching Algorithm.

It struck me as odd that this algorithm had not been widely adopted, as Musser's introsort had been.
So I contacted the authors and with their permission am attempting to bring it to a wider audience.

I provide the abstract from their paper here for completeness:

A string matching—and more generally, sequence matching—algorithm is presented
that has a linear worst-case computing time bound, a low worst-case bound on the
number of comparisons (2n), and sublinear average-case behavior that is better than
that of the fastest versions of the Boyer-Moore algorithm. The algorithm retains its
efficiency advantages in a wide variety of sequence matching problems of practical
interest, including traditional string matching; large-alphabet problems (as in Unicode
strings); and small-alphabet, long-pattern problems (as in DNA searches). Since it is
expressed as a generic algorithm for searching in sequences over an arbitrary type T , it
is well suited for use in generic software libraries such as the C ++ Standard Template
Library. The algorithm was obtained by adding to the Knuth-Morris-Pratt algorithm
one of the pattern-shifting techniques from the Boyer-Moore algorithm, with provision
for use of hashing in this technique. In situations in which a hash function or random
access to the sequences is not available, the algorithm falls back to an optimized version
of the Knuth-Morris-Pratt algorithm.

In the benchmarks that I have run, it is faster than Boyer-Moore, but equivalent to Boyer-Moore-Horspool, on 8-bit text.

On '2-bit' text (DNA sequence) however, the custom search traits boosts the performance to 5-10x faster than the nearest rival. I have demonstrated this in a new search test, no. 5.

What is also interesting is that a simpler, fallback algorithm is provided for corpus iterators that are not random access.

Design

I have not changed the fundamental algorithms in a material way, they should be recognizable from the original code apart from some simplifications.

What I have done personally is to restructure the interface to match the existing Boost.Algorithm searchers. The Musser-Nishanov search strategy is complicated by having two algorithms, accelerated linear (AL) and hashed accelerated linear (HAL), that are mostly but not entirely independent. If the corpus iterator is not random-access or the (static) suffix size is zero, then AL is selected at compile time. Otherwise, a HAL search object is created, but this does not guarantee that HAL will be used. If the pattern size (known at run time) is smaller than the suffix size, then the HAL search object must fall back to AL.

The user-facing API is a two-part musser_nishanov class that does the initial static choice between AL and HAL. Since HAL uses the same data structures as AL plus one more it is easy to implement HAL in terms of AL. This justified making AL a distinct class that is inherited publicly by the non-random-access musser_nishanov class, and inherited privately by the HAL class so that it could be used as the fallback algorithm. The HAL class currently is the musser_nishanov class for random access iterators, as opposed to being a distinct class.

The following is a crude UML diagram of the class hierarchy, showing the private inheritance in brown and public inheritance in blue, following Doxygen style.
image

Since the HAL search object makes its final decision about algorithm at run time, I had to represent this choice somehow. The simplest, though not necessarily most efficient, way is with bind() and function<>. It also chooses a null searcher if the pattern is empty. This method of storing which
algorithm to call might be causing a slight performance penalty, but it requires more detailed tests.

Selected Benchmarks

I whittled the benchmark output down for the purpose of displaying here. These values should be taken with a grain of 5-10% salt: Musser-Nishanov is not consistently faster than Boyer-Moore-Horspool for example.

8-bit random characters

This is an edited search_test2 output:

Corpus  is 2756252 entries long
---- Middle -----
Pattern is 105 entries long
                       std::search 0.4848 seconds     100%        484781
                boyer_moore_search 0.1171 seconds   24.16%        117117
       boyer_moore_horspool_search 0.1076 seconds    22.2%        107602
            musser_nishanov_search 0.1033 seconds   21.31%        103296
------ End ------
Pattern is 43 entries long
                       std::search 0.6206 seconds     100%        620587
                boyer_moore_search 0.2075 seconds   33.44%        207515
       boyer_moore_horspool_search 0.1742 seconds   28.07%        174218
            musser_nishanov_search 0.1686 seconds   27.17%        168633
--- Not found ---
Pattern is 91 entries long
                       std::search  1.028 seconds     100%       1028228
                boyer_moore_search 0.1971 seconds   19.17%        197079
       boyer_moore_horspool_search 0.1825 seconds   17.75%        182502
            musser_nishanov_search 0.1702 seconds   16.56%        170235

DNA sequences

And this is an edited search_test5 output, which is '2-bit' (DNA) sequence data, for which Musser-Nishanov has a few customized search traits, but I included just one here. The number --- here --- is the pattern size, search is timed for finding all matches (which may be zero -- this benchmark needs some work).
Just to be clear, there is no musser_nishanov_dna class, that is just shorthand for specialization.

Corpus  is 997532 entries long
--- 10 ---
matches: 2
                       std::search  2.239 seconds     100%       2239480
                boyer_moore object 0.6368 seconds   28.43%        636794
       boyer_moore_horspool object  0.664 seconds   29.65%        663994
            musser_nishanov object 0.7033 seconds    31.4%        703286
        musser_nishanov_dna object 0.3932 seconds   17.56%        393151
--- 20 ---
matches: 0
                       std::search   1.99 seconds     100%       1989855
                boyer_moore object 0.9167 seconds   46.07%        916750
       boyer_moore_horspool object  1.006 seconds   50.54%       1005750
            musser_nishanov object 0.9499 seconds   47.74%        949912
        musser_nishanov_dna object 0.1749 seconds   8.791%        174933
--- 40 ---
matches: 0
                       std::search  1.881 seconds     100%       1880573
                boyer_moore object 0.7201 seconds   38.29%        720095
       boyer_moore_horspool object 0.8477 seconds   45.08%        847738
            musser_nishanov object 0.8151 seconds   43.34%        815128
        musser_nishanov_dna object 0.09294 seconds  4.942%         92943
--- 150 ---
matches: 11
                       std::search  1.983 seconds     100%       1982931
                boyer_moore object 0.6318 seconds   31.86%        631754
       boyer_moore_horspool object 0.6806 seconds   34.32%        680592
            musser_nishanov object 0.6594 seconds   33.25%        659393
        musser_nishanov_dna object 0.04976 seconds  2.509%         49759

Caveats (or, what is unfinished)

I did not want to let perfect be the enemy of good, but more importantly I wanted to get a conversation about this code underway as soon as possible, so I have opened this PR well before everything is complete.

  • Documentation.
  • 'wide char' benchmark.
  • Non-random-access corpus benchmark.
  • Worst-case benchmarks.
  • The benchmark suite for search test 5 needs to be either trimmed or better utilized.

Distinct function to read the corpus because it is a multi-line file.
The pattern files are multi-line too, but we're treating them as one per
line.
@zamazan4ik
Copy link
Contributor

@mclow can you check this PR please?

@jeremy-murphy
Copy link
Contributor Author

@zamazan4ik it's OK, I'll call for Marshall's attention when it's finished. The list of caveats/unfinished things is real; I'll tick them off or remove them when they're done or remove them completely if I change my mind about them. No need to keep drawing his attention until then, but I'm always ready to engage in a conversation about what is currently here.

@mclow
Copy link
Collaborator

mclow commented Dec 15, 2020

Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.

@jeremy-murphy
Copy link
Contributor Author

Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.

Some of the commits might be that old, I only just pushed for the first time in ages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants