Musser-Nishanov generic sequence-matching algorithm(s) #25
Distinct function to read the corpus because it is a multi-line file. The pattern files are multi-line too, but we're treating them as one per line.
It's very slow; the default search trait is much better.
@mclow can you check this PR please?
@zamazan4ik it's OK, I'll call for Marshall's attention when it's finished. The list of caveats/unfinished things is real; I'll tick them off when they're done, or remove them completely if I change my mind about them. No need to keep drawing his attention until then, but I'm always ready to engage in a conversation about what is currently here.
Very, very odd. I just got an email about three new commits. But when I come here, I see they were made 16 days ago.
Some of the commits might be that old; I only just pushed for the first time in ages.
# Conflicts: # test/search_test1.cpp
This fixes a bug with using the DNA type_traits in C++11, which maybe worked in C++14.
Also use std::begin/end in place of boost::begin/end.
Introduction
In 1997, David R. Musser and Gor V. Nishanov wrote a hitherto unpublished paper and accompanying code, A Fast Generic Sequence Matching Algorithm.
It struck me as odd that this algorithm had not been widely adopted, as Musser's introsort had been.
So I contacted the authors and with their permission am attempting to bring it to a wider audience.
I provide the abstract from their paper here for completeness:
In the benchmarks that I have run, it is faster than Boyer-Moore, but equivalent to Boyer-Moore-Horspool, on 8-bit text.
On '2-bit' text (DNA sequence), however, the custom search traits boost performance to 5-10x faster than the nearest rival. I have demonstrated this in a new search test, no. 5.
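To illustrate why a custom trait can help on DNA: with a one-character hash, a four-letter alphabet uses only 4 distinct values, whereas hashing several bases together gives the skip table many more. The following is a minimal sketch under that assumption. `dna_search_trait_sketch`, its member names, and the choice of four bases per hash are all hypothetical and are not taken from this PR or from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical trait for illustration only; not the interface used in this PR.
struct dna_search_trait_sketch {
    static constexpr std::size_t suffix_size = 4;       // assumed: bases hashed together
    static constexpr std::size_t hash_range_max = 256;  // 4 bases * 2 bits = 8 bits

    // Map a nucleotide to a 2-bit code; any other character maps to 0.
    static unsigned code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  return 0;
        }
    }

    // Hash the suffix_size characters that end at `last` (inclusive).
    template <typename RandomAccessIter>
    static unsigned hash(RandomAccessIter last) {
        unsigned h = 0;
        for (int k = static_cast<int>(suffix_size) - 1; k >= 0; --k)
            h = (h << 2) | code(*(last - k));
        return h;
    }
};

// Small usage check: the hash always stays below hash_range_max.
inline void dna_trait_demo() {
    const char window[] = "GATTACA";
    assert(dna_search_trait_sketch::hash(window + 6) <
           dna_search_trait_sketch::hash_range_max);
}
```

With this packing, every four-base window maps to one of 256 values instead of one of 4, which is the kind of extra discrimination a hashed skip table can exploit.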
What is also interesting is that a simpler, fallback algorithm is provided for corpus iterators that are not random access.
Design
I have not changed the fundamental algorithms in a material way; they should be recognizable from the original code apart from some simplifications.
What I have done personally is to restructure the interface to match the existing Boost.Algorithm searchers. The Musser-Nishanov search strategy is complicated by having two algorithms, accelerated linear (AL) and hashed accelerated linear (HAL), that are mostly but not entirely independent. If the corpus iterator is not random-access or the (static) suffix size is zero, then AL is selected at compile time. Otherwise, a HAL search object is created, but this does not guarantee that HAL will be used. If the pattern size (known at run time) is smaller than the suffix size, then the HAL search object must fall back to AL.
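The two-stage decision described above can be sketched as a compile-time predicate plus a run-time check. This is only an illustration of the rule as stated in the text; `is_random_access`, `creates_hal_object`, and `hal_falls_back_to_al` are hypothetical names that do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical helper: true when Iter is a random-access iterator.
template <typename Iter>
struct is_random_access
    : std::is_base_of<std::random_access_iterator_tag,
                      typename std::iterator_traits<Iter>::iterator_category> {};

// Compile-time step: a HAL search object is created only for a random-access
// corpus iterator combined with a non-zero static suffix size; otherwise AL.
template <typename CorpusIter, std::size_t SuffixSize>
struct creates_hal_object
    : std::integral_constant<bool,
                             is_random_access<CorpusIter>::value && (SuffixSize != 0)> {};

// Run-time step: even a HAL search object must fall back to AL when the
// pattern is shorter than the suffix size.
inline bool hal_falls_back_to_al(std::size_t pattern_size, std::size_t suffix_size) {
    return pattern_size < suffix_size;
}

static_assert(creates_hal_object<std::vector<char>::const_iterator, 2>::value,
              "random access + non-zero suffix size: HAL object");
static_assert(!creates_hal_object<std::list<char>::const_iterator, 2>::value,
              "bidirectional corpus iterator: AL");
static_assert(!creates_hal_object<std::vector<char>::const_iterator, 0>::value,
              "zero suffix size: AL");

// Small usage check of the run-time rule.
inline void selection_demo() {
    assert(hal_falls_back_to_al(1, 2));   // pattern shorter than suffix: AL
    assert(!hal_falls_back_to_al(2, 2));  // pattern long enough: HAL
}
```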
The user-facing API is a two-part `musser_nishanov` class that does the initial static choice between AL and HAL. Since HAL uses the same data structures as AL plus one more, it is easy to implement HAL in terms of AL. This justified making AL a distinct class that is inherited publicly by the non-random-access `musser_nishanov` class, and inherited privately by the HAL class so that it could be used as the fallback algorithm. The HAL class currently is the `musser_nishanov` class for random access iterators, as opposed to being a distinct class.
The following is a crude UML diagram of the class hierarchy, showing the private inheritance in brown and public inheritance in blue, following Doxygen style.
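In code, the inheritance arrangement described above might look roughly like the sketch below. Only the name `musser_nishanov` comes from the text; `accelerated_linear`, `is_random_access`, the template parameters, and the `name()` members are assumptions made for illustration, and the zero-suffix-size case is left out for brevity.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical helper: true when Iter is a random-access iterator.
template <typename Iter>
struct is_random_access
    : std::is_base_of<std::random_access_iterator_tag,
                      typename std::iterator_traits<Iter>::iterator_category> {};

// Hypothetical stand-in for the AL class; the real one holds the skip tables.
template <typename PatIter>
class accelerated_linear {
public:
    const char* name() const { return "AL"; }
};

// Primary template (non-random-access corpus): AL is inherited publicly,
// so the AL interface is the user-facing interface.
template <typename PatIter, typename CorpusIter, typename Enable = void>
class musser_nishanov : public accelerated_linear<PatIter> {};

// Random-access corpus: this specialization is the HAL class. AL is inherited
// privately and is reachable only as an internal fallback.
template <typename PatIter, typename CorpusIter>
class musser_nishanov<PatIter, CorpusIter,
                      typename std::enable_if<is_random_access<CorpusIter>::value>::type>
    : private accelerated_linear<PatIter> {
public:
    const char* name() const { return "HAL"; }
    const char* fallback_name() const { return accelerated_linear<PatIter>::name(); }
};

// Small usage check: the corpus iterator category picks the class layout.
inline void hierarchy_demo() {
    musser_nishanov<const char*, std::list<char>::const_iterator> over_list;
    musser_nishanov<const char*, std::vector<char>::const_iterator> over_vector;
    assert(std::string(over_list.name()) == "AL");
    assert(std::string(over_vector.name()) == "HAL");
}
```

Private inheritance keeps the AL members out of the HAL class's public interface while still letting HAL reuse AL's data structures.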
Since the HAL search object makes its final decision about algorithm at run time, I had to represent this choice somehow. The simplest, though not necessarily most efficient, way is with `bind()` and `function<>`. It also chooses a null searcher if the pattern is empty. This method of storing which algorithm to call might be causing a slight performance penalty, but it requires more detailed tests.
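A minimal sketch of this kind of run-time dispatch, using `std::bind` and `std::function`, is shown below. The class name `hal_searcher_sketch`, the suffix size of 4, and the use of `std::string::find` as a stand-in for all three searchers are assumptions for illustration; the real searchers operate on iterator ranges and are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical sketch: the algorithm is picked once, at construction, and
// stored in a std::function so that each search is a single indirect call.
class hal_searcher_sketch {
public:
    static constexpr std::size_t suffix_size = 4;  // assumed value

    explicit hal_searcher_sketch(const std::string& pattern) : pattern_(pattern) {
        using std::placeholders::_1;
        if (pattern_.empty()) {
            chosen_ = "null";
            search_ = std::bind(&hal_searcher_sketch::null_search, this, _1);
        } else if (pattern_.size() < suffix_size) {
            chosen_ = "AL";   // pattern too short for hashing: fall back
            search_ = std::bind(&hal_searcher_sketch::al_search, this, _1);
        } else {
            chosen_ = "HAL";
            search_ = std::bind(&hal_searcher_sketch::hal_search, this, _1);
        }
    }

    // The stored callable binds `this`, so copying is disabled in this sketch.
    hal_searcher_sketch(const hal_searcher_sketch&) = delete;
    hal_searcher_sketch& operator=(const hal_searcher_sketch&) = delete;

    // Returns the offset of the first match, or std::string::npos.
    std::size_t operator()(const std::string& corpus) const { return search_(corpus); }
    const char* chosen() const { return chosen_; }

private:
    // Assumption: an empty pattern is reported as matching at offset 0.
    std::size_t null_search(const std::string&) const { return 0; }
    // Stand-ins only: both simply call std::string::find.
    std::size_t al_search(const std::string& corpus) const { return corpus.find(pattern_); }
    std::size_t hal_search(const std::string& corpus) const { return corpus.find(pattern_); }

    std::string pattern_;
    const char* chosen_;
    std::function<std::size_t(const std::string&)> search_;
};

// Small usage check: a short pattern selects the AL fallback.
inline void dispatch_demo() {
    hal_searcher_sketch s("abc");
    assert(std::string(s.chosen()) == "AL");
}
```

The indirect call through `std::function` on every search is one plausible source of the slight penalty mentioned above; a plain member-function pointer or a small `switch` would be alternatives to measure against.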
Selected Benchmarks
I whittled the benchmark output down for display here. These values should be taken with a grain of 5-10% salt: Musser-Nishanov is not consistently faster than Boyer-Moore-Horspool, for example.
8-bit random characters
This is an edited `search_test2` output:
DNA sequences
And this is an edited `search_test5` output, which is '2-bit' (DNA) sequence data, for which Musser-Nishanov has a few customized search traits, but I included just one here. The number `--- here ---` is the pattern size; the search is timed for finding all matches (which may be zero; this benchmark needs some work).
Just to be clear, there is no `musser_nishanov_dna` class; that is just shorthand for the specialization.
Caveats (or, what is unfinished)
I did not want to let perfect be the enemy of good, but more importantly I wanted to get a conversation about this code underway as soon as possible, so I have opened this PR well before everything is complete.