Musser-Nishanov generic sequence-matching algorithm(s) #25

Open
wants to merge 110 commits into
base: develop
110 commits
80af2da
Add DNA corpus and complete DNA test pattern set.
jeremy-murphy Sep 2, 2016
5df3227
Reorganize patterns into files by size.
jeremy-murphy Sep 2, 2016
1de242a
DNA search test.
jeremy-murphy Sep 2, 2016
36e8870
Remove some comments.
jeremy-murphy Sep 3, 2016
2a67f57
Add Musser-Nishanov search algorithm.
jeremy-murphy Sep 3, 2016
ba11299
Add HAL search and dna[234] variations to search_test_5.
jeremy-murphy Sep 3, 2016
ab26b58
Simplify test running slightly with a typedef.
jeremy-murphy Sep 3, 2016
c6dc97d
Remove redundant pattern_size variable.
jeremy-murphy Sep 4, 2016
fbde8e4
Make compute_skip more debug friendly.
jeremy-murphy Sep 4, 2016
753d55c
Deal with empty patterns correctly.
jeremy-murphy Sep 4, 2016
7be81d7
Fix index type and comment and what still needs doing.
jeremy-murphy Sep 4, 2016
b3dad73
Return unsigned value of the same size as char type from hash function.
jeremy-murphy Sep 4, 2016
2d4eaa3
Add musser-nishanov-HAL to search test 1.
jeremy-murphy Sep 4, 2016
68b10ce
Remove this->, it seems a bit strange.
jeremy-murphy Sep 4, 2016
016ed5b
Add musser-nishanov-HAL to search test 2.
jeremy-murphy Sep 4, 2016
3fb9d7d
Include Boost assertion headers.
jeremy-murphy Sep 4, 2016
7230137
Simplify test to just find all matches of pattern in corpus.
jeremy-murphy Sep 7, 2016
4914d9e
Move HAL and AL into detail directory.
jeremy-murphy Sep 7, 2016
479b858
Most of skeleton of musser_nishanov search class.
jeremy-murphy Sep 9, 2016
0f47bc6
Bind and assign the right search algorithm to search member function.
jeremy-murphy Sep 10, 2016
12265cd
Return something from AL/HAL.
jeremy-murphy Sep 10, 2016
11a2d7a
Add HAL initialization on first use; fill in operator()s.
jeremy-murphy Sep 10, 2016
77506ad
Static assert that corpus and pattern iterator value types are same.
jeremy-murphy Sep 10, 2016
55348dd
Use base_of instead of same in light of C++17 contiguous iterator.
jeremy-murphy Sep 10, 2016
58e4af7
compute_next and compute_skip.
jeremy-murphy Sep 10, 2016
46d8b1b
Remove template argument from constructor.
jeremy-murphy Sep 10, 2016
c74c313
Split searcher class on corpus iterator category.
jeremy-murphy Sep 10, 2016
44065bf
Add AL stub.
jeremy-murphy Sep 10, 2016
624e77b
Add AL and tweak to the Boost interface; rename next to next_.
jeremy-murphy Sep 10, 2016
cd4caf6
Test for empty pattern in AL and move j variable inside loop.
jeremy-murphy Sep 10, 2016
94856ed
Remove template and iterator category enforcement from operator().
jeremy-murphy Sep 10, 2016
4e9b743
Evaluate Trait::suffix_size statically.
jeremy-murphy Sep 10, 2016
44f6277
Remove HAL include.
jeremy-murphy Sep 10, 2016
ae68b05
Put the HAL code in.
jeremy-murphy Sep 10, 2016
91fd4c5
Pass search object by reference to search member function.
jeremy-murphy Sep 10, 2016
7dbcde6
Move j variable inside loop.
jeremy-murphy Sep 10, 2016
68dc830
Assertion on k helps me remember logic of algorithm.
jeremy-murphy Sep 10, 2016
d9ebd0a
Update headers.
jeremy-murphy Sep 10, 2016
d8d79e6
Handle empty pattern in HAL.
jeremy-murphy Sep 10, 2016
125b8c1
Pass this as a pointer, not a reference.
jeremy-murphy Sep 10, 2016
851a491
Remove some explicit std::.
jeremy-murphy Sep 10, 2016
41b2991
Search function for empty pattern.
jeremy-murphy Sep 10, 2016
1f2f843
Add search function interface for Boost API.
jeremy-murphy Sep 10, 2016
43e9478
Whoops, this make_pair needs std::.
jeremy-murphy Sep 10, 2016
ce837fc
Update test1 to use new search code.
jeremy-murphy Sep 10, 2016
f262883
Update search_test2 to use new code.
jeremy-murphy Sep 10, 2016
2138175
Update search_test5.
jeremy-murphy Sep 10, 2016
7b8f0b3
Check bidirectional iterators too in search_test1.
jeremy-murphy Sep 10, 2016
599ed67
Remove previous implementation files.
jeremy-murphy Sep 10, 2016
2b5f330
Keep the "not implemented" exception as a warning/reminder to myself.
jeremy-murphy Sep 10, 2016
dcec570
Split AL functionality into separate class.
jeremy-murphy Sep 10, 2016
0429c34
Make AL function public on accelerated linear; inherit privately.
jeremy-murphy Sep 10, 2016
aaf01d4
musser_nishanov inherits accelerated_linear as a / implemented in ter…
jeremy-murphy Sep 10, 2016
94f18ce
Ramp up search_test2 to 500 reps.
jeremy-murphy Sep 10, 2016
bb4ee6d
Comments, copyright, remove old debugging.
jeremy-murphy Sep 10, 2016
e829755
Copyright notices.
jeremy-murphy Sep 10, 2016
2f705f8
Move test for termination edge case slightly earlier.
jeremy-murphy Sep 11, 2016
b23b79e
Accelerated Linear does not actually use the Trait template.
jeremy-murphy Sep 11, 2016
1b69fc9
Simplify return type of unspecialized search trait template.
jeremy-murphy Sep 11, 2016
8ed94de
Add missing type trait include.
jeremy-murphy Sep 11, 2016
20eab2a
Add some assertions to make preconditions clear.
jeremy-murphy Sep 11, 2016
9e110c6
Replace dna2 with dna5 in the benchmarks.
jeremy-murphy Sep 11, 2016
42633e0
Qualify use of next().
jeremy-murphy Sep 13, 2016
b454eaf
Use private inheritance for HAL-->AL.
jeremy-murphy Sep 13, 2016
0e75455
Fall back to AL if pattern length is one.
jeremy-murphy Sep 14, 2016
76e74da
Break HAL out as a separate class.
jeremy-murphy Sep 26, 2016
cbde8fc
Overloads and internal API changes to allow const search object.
jeremy-murphy Sep 27, 2016
4db5af9
Dave -> David.
jeremy-murphy Sep 27, 2016
2e28f35
Minor simplification to variable initialization of p1.
jeremy-murphy Mar 6, 2017
0df5e13
Reorganize accelerated_linear constructor.
jeremy-murphy May 18, 2017
1760acf
Remove default value for CorpusIter on accelerated_linear.
jeremy-murphy May 18, 2017
7ec2275
operator() overload for Range.
jeremy-murphy May 18, 2017
573af82
musser_nishanov_search() overloads for Range.
jeremy-murphy May 18, 2017
1a35890
make_musser_nishanov() overloads for pattern, corpus and search trait.
jeremy-murphy May 18, 2017
bb323a5
Add Musser-Nishanov to search_test4.
jeremy-murphy May 18, 2017
b2515e8
Add Musser-Nishanov to search_test3.
jeremy-murphy May 18, 2017
048f3a2
Make operator() publicly visible in hashed_accelerated_linear.
jeremy-murphy May 18, 2017
66e21ca
Add an experimental search_trait<std::string> using Boost.Functional/…
jeremy-murphy May 18, 2017
f4fede3
Started to write Musser-Nishanov documentation (based on Boyer-Moore).
jeremy-murphy May 27, 2017
0ecc3dd
Update main docs slightly.
jeremy-murphy Sep 30, 2017
363e160
Add assert on k_pattern_length to accelerated_linear::compute_next.
jeremy-murphy Sep 30, 2017
a293f6b
Merge branch 'develop' into musser-nishanov-search
jeremy-murphy Oct 17, 2017
fe91d33
Merge branch 'develop' into musser-nishanov-search
jeremy-murphy Jan 13, 2018
5f9bd61
Merge branch 'mn_docs' into musser-nishanov-search
jeremy-murphy Jan 28, 2018
333c52c
A short unsigned search trait class.
jeremy-murphy May 18, 2017
e17a191
Make some updates to the documentation.
jeremy-murphy Feb 1, 2018
688d571
[Musser-Nishanov] Remove the short unsigned search trait specialization.
jeremy-murphy Feb 3, 2018
6260b4d
Merge branch 'develop' into musser-nishanov-search
jeremy-murphy Nov 28, 2020
27ff8f3
Substantial refactor and update to C++14; use Boost.Variant2.
jeremy-murphy Dec 15, 2020
423513f
Merge branch 'develop' into musser-nishanov-search
jeremy-murphy Dec 15, 2020
c988c9e
Add error reporting to test4.
jeremy-murphy Dec 15, 2020
5d0dcec
Merge branch 'develop' into musser-nishanov-search
Aug 28, 2024
279299f
Minor non-functional improvements: debugging and performance
jeremy-murphy Aug 29, 2024
1de025a
hashed_accelerated_linear: Simplify interface
jeremy-murphy Aug 30, 2024
35d0f48
Simplify, whitespace, etc
jeremy-murphy Aug 30, 2024
e205291
Replace k with corpus_first
jeremy-murphy Aug 30, 2024
a9e91c2
Test for single-char search and empty pattern in empty haystack
jeremy-murphy Aug 30, 2024
d73c8e1
Update copyright years
jeremy-murphy Aug 31, 2024
3b06d2c
Replace MPL with MP11 and C++11 type_traits
jeremy-murphy Sep 5, 2024
872e83a
More auto
jeremy-murphy Sep 5, 2024
1e8658b
Less foo
jeremy-murphy Sep 5, 2024
34485f5
Move hashable predicate to detail namespace
jeremy-murphy Sep 5, 2024
dd44402
Remove redundant <vector>
jeremy-murphy Sep 5, 2024
f101f67
Use std::is_same
jeremy-murphy Sep 5, 2024
0dbc5db
Replace C++14 generic lambda with hand-written class
jeremy-murphy Sep 5, 2024
ba9228b
Change all/remaining make_pair(a, b) to {a, b}
jeremy-murphy Sep 11, 2024
780c36a
Remove superfluous boost:: and boost::algorithm:: ns qualification
jeremy-murphy Sep 11, 2024
5602930
std::find is fine
jeremy-murphy Sep 11, 2024
bace96b
Prefer pattern_length and next_.size() over distance(next_)
jeremy-murphy Sep 11, 2024
e5977de
Remove unnecessary recalculation of the result's last iterator
jeremy-murphy Sep 12, 2024
1 change: 1 addition & 0 deletions doc/algorithm.qbk
@@ -44,6 +44,7 @@ Thanks to all the people who have reviewed this library and made suggestions for
[include boyer_moore.qbk]
[include boyer_moore_horspool.qbk]
[include knuth_morris_pratt.qbk]
[include musser_nishanov.qbk]
[endsect]


105 changes: 105 additions & 0 deletions doc/musser_nishanov.qbk
@@ -0,0 +1,105 @@
[/ QuickBook Document version 1.5 ]

[section:MusserNishanov Musser-Nishanov Search]

[/license

Copyright (c) 2017 Jeremy W. Murphy

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt)
]


[heading Overview]

The header file 'musser_nishanov.hpp' contains an implementation of the Musser-Nishanov algorithm for sequence matching.

This algorithm was designed to be a generic sequence matching algorithm, extensible to any element type or character frequency by specialization of a traits class.
The algorithm and original code were written in 1997 by David Musser and Gor Nishanov.
Their paper is available from arXiv ([@https://arxiv.org/abs/0810.0264]) or from David Musser's homepage at Rensselaer Polytechnic Institute ([@http://www.cs.rpi.edu/~musser/gp/gensearch1.pdf]).

It is based on Knuth-Morris-Pratt (KMP) with the addition of a hash-coded form of the skip loop from Boyer-Moore. It has the same worst-case bound of 2['n] comparisons as KMP, where ['n] is the length of the corpus. The average-case performance appears to be slightly better than Boyer-Moore-Horspool (which is in turn slightly better than Boyer-Moore).
It also features a fallback algorithm for when the corpus does not provide random-access iterators.
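
The KMP part of the algorithm relies on a precomputed "next" table. The following is a minimal sketch of that table, modelled on `compute_next` in `accelerated_linear.hpp` but simplified to take a `std::string`; the free function and its signature are assumptions made for illustration, not part of the library interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the KMP "next" table for a non-empty pattern.
// next[j] is the pattern index to resume comparing from after a mismatch
// at index j, or -1 when the corpus position should simply advance.
std::vector<std::ptrdiff_t> compute_next(const std::string& pattern)
{
    assert(!pattern.empty());
    const std::ptrdiff_t m = static_cast<std::ptrdiff_t>(pattern.size());
    std::vector<std::ptrdiff_t> next;
    next.reserve(pattern.size());
    next.push_back(-1);
    std::ptrdiff_t j = 0, t = -1;
    while (j < m - 1)
    {
        // Fall back through shorter borders until one can be extended.
        while (t >= 0 && pattern[j] != pattern[t])
            t = next[t];
        ++j;
        ++t;
        // If the characters agree, a mismatch at j would also fail at t,
        // so reuse next[t] instead of t.
        next.push_back(pattern[j] == pattern[t] ? next[t] : t);
    }
    return next;
}
```

For the pattern `abab` this yields the table `-1, 0, -1, 0`.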

However, as with the other accelerated search algorithms, and unlike `std::search`, it cannot be used with a comparison predicate.

Nomenclature: I refer to the sequence being searched for as the "pattern", and the sequence being searched in as the "corpus".

[heading Interface]

For flexibility, the algorithm has two interfaces: an object-based interface and a procedural one. The object-based interface builds the tables in the constructor and uses `operator()` to perform the search. The procedural interface builds the tables and performs the search in one step. If you are going to search for the same pattern in multiple corpora, use the object interface so that the tables are built only once.

Here is the object interface:
``
template <typename patIter>
class musser_nishanov {
public:
musser_nishanov ( patIter first, patIter last );
~musser_nishanov ();

template <typename corpusIter>
corpusIter operator () ( corpusIter corpus_first, corpusIter corpus_last );
};
``

and here is the corresponding procedural interface:

``
template <typename patIter, typename corpusIter>
corpusIter musser_nishanov_search (
corpusIter corpus_first, corpusIter corpus_last,
patIter pat_first, patIter pat_last );
``

Each of the functions is passed two pairs of iterators. The first two define the corpus and the second two define the pattern. Note that the two pairs need not be of the same type, but they do need to "point" at the same type. In other words, `patIter::value_type` and `corpusIter::value_type` need to be the same type.

The return value of the function is an iterator pointing to the start of the pattern in the corpus. If the pattern is not found, it returns the end of the corpus (`corpus_last`).
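
To illustrate the calling convention, here is a sketch that builds one search object and applies it to several corpora. `searcher_sketch` and `find_in_each` are hypothetical names, and the stand-in simply delegates to `std::search`; it only mimics the shape of the object interface described above and is not the Musser-Nishanov implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in with the same shape as the object interface:
// construct from a pattern, then call operator() on a corpus.
// It delegates to std::search rather than using precomputed tables.
template <typename PatIter>
class searcher_sketch
{
    PatIter pat_first_, pat_last_;

public:
    searcher_sketch(PatIter first, PatIter last)
        : pat_first_(first), pat_last_(last) {}

    template <typename CorpusIter>
    CorpusIter operator()(CorpusIter corpus_first, CorpusIter corpus_last) const
    {
        return std::search(corpus_first, corpus_last, pat_first_, pat_last_);
    }
};

// Searches every corpus with a single search object and reports the offset
// of the first match in each, or -1 when the result equals the corpus end.
std::vector<std::ptrdiff_t> find_in_each(const std::vector<std::string>& corpora,
                                         const std::string& pattern)
{
    searcher_sketch<std::string::const_iterator> search(pattern.begin(), pattern.end());
    std::vector<std::ptrdiff_t> offsets;
    for (const std::string& corpus : corpora)
    {
        std::string::const_iterator hit = search(corpus.begin(), corpus.end());
        offsets.push_back(hit == corpus.end() ? -1 : hit - corpus.begin());
    }
    return offsets;
}
```

The pattern is passed once to the constructor, and "not found" is detected by comparing the result with the end of the corpus.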

[heading Performance]

The algorithm has the same 2['n] worst-case bound on the number of comparisons as
KMP. Similar to Boyer-Moore, the actual number of comparisons in practice will typically be much fewer, due to the skip loop feature, which provides the most dramatic effect on performance.

Execution time on UTF-8 string matching is comparable to Boyer-Moore-Horspool.
A traits class is provided for matching DNA sequences that optimizes the skip loop such that it searches 10x faster than it otherwise would. This traits class must be specified manually, so it is unfair to compare this result to another search algorithm without similar specialization.

Performance on types other than char has also been tested.
When the element is an unsigned short, performance with the default search traits is equivalent to std::search, whereas the Boyer-Moore family of algorithms tend to be 5-6x slower.


[heading Memory Use]

The algorithm allocates two internal tables. The first one is proportional to the length of the pattern; the second one has one entry for each member of the "alphabet" in the pattern. For (8-bit) character types, this table contains 256 entries.
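
To give a feel for the size of the second table, the sketch below builds a Horspool-style skip table for 8-bit characters. This is an assumption about the general shape of such a table, chosen because it is simple; `make_skip_table` is a hypothetical helper and the table built by the implementation may be computed differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Horspool-style skip table for 8-bit characters: one entry for each of the
// 256 possible character values. Illustrative only.
std::array<std::size_t, 256> make_skip_table(const std::string& pattern)
{
    assert(!pattern.empty());
    const std::size_t m = pattern.size();
    std::array<std::size_t, 256> skip;
    // Characters that do not occur in the pattern allow a full-length shift.
    skip.fill(m);
    // Every character except the last records its distance from the end.
    for (std::size_t j = 0; j + 1 < m; ++j)
        skip[static_cast<unsigned char>(pattern[j])] = m - 1 - j;
    return skip;
}
```

The table has a fixed size of 256 entries regardless of the pattern length, whereas the first table grows with the pattern.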

[heading Complexity]

The worst-case performance to find a pattern in the corpus is ['O(N)] (linear) time; that is, proportional to the length of the corpus being searched. In general, the search is sub-linear; not every entry in the corpus need be checked.

[heading Exception Safety]

Both the object-based and procedural versions of the Musser-Nishanov algorithm take their parameters by value and do not use any information other than what is passed in. Therefore, both interfaces provide the strong exception guarantee.

[heading Notes]

* When using the object-based interface, the pattern must remain unchanged during the searches; i.e., from the time the object is constructed until the final call to `operator()` returns.

* The pattern must be supplied through random-access iterators. The corpus need not be: when the corpus does not provide random-access iterators, the fallback algorithm is used.

[heading Customization points]

The Musser-Nishanov object takes a traits template parameter which enables the caller to customize how one of the precomputed tables is stored. This table, called the skip table, contains (logically) one entry for every possible value that the pattern can contain. When searching 8-bit character data, this table contains 256 elements. The traits class defines the table to be used.

The default traits class uses a `boost::array` for small 'alphabets' and a `tr1::unordered_map` for larger ones. The array-based skip table gives excellent performance, but could be prohibitively large when the 'alphabet' of elements to be searched grows. The unordered_map based version only grows as the number of unique elements in the pattern, but makes many more heap allocations, and gives slower lookup performance.

To use a different skip table, you should define your own skip table object and your own traits class, and use them to instantiate the Musser-Nishanov object. The interface to these objects is described TBD.
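
As a rough illustration, a user-defined traits class might follow the same shape as the `search_trait` specializations in `detail/mn_traits.hpp`, which provide `hash_range_max`, `suffix_size` and a static `hash` function. The class below is hypothetical, and how a traits class is passed to the search object is not shown here.

```cpp
#include <cassert>

// Hypothetical traits class with the same members as the search_trait
// specializations: hash_range_max, suffix_size and hash().
// It hashes the last two elements of a window of lower-case ASCII text.
struct search_trait_lower2_sketch
{
    enum { hash_range_max = 1024 };
    enum { suffix_size = 2 };

    // i points at the last element of the window; i - 1 must be valid.
    template <class RandomAccessIterator>
    static unsigned int hash(RandomAccessIterator i)
    {
        // The low 5 bits of each character distinguish 'a'..'z'.
        return ((*(i - 1) & 31u) | ((*i & 31u) << 5)) & (hash_range_max - 1);
    }
};
```

The hash combines two 5-bit values, so every result is below `hash_range_max`; for the window `ab` it evaluates to 65.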


[endsect]

[/ File musser_nishanov.qbk
Copyright 2017 Jeremy Murphy
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt).
]
147 changes: 147 additions & 0 deletions include/boost/algorithm/searching/accelerated_linear.hpp
@@ -0,0 +1,147 @@
/*
Copyright (c) David R. Musser & Gor V. Nishanov 1997.
Copyright (c) Jeremy W. Murphy 2024.

Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

For more information, see http://www.boost.org
*/

#include <boost/assert.hpp>
#include <boost/config.hpp>
#include <boost/next_prior.hpp>
#include <boost/range.hpp>
#include <boost/static_assert.hpp>
#include <boost/type_traits/is_base_of.hpp>
#include <boost/type_traits/is_same.hpp>

#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

namespace boost { namespace algorithm {

/**
* @brief Accelerated Linear search.
*
* Accelerated Linear (AL) search by Musser & Nishanov.
*
*/
template <typename PatIter, typename CorpusIter>
class accelerated_linear
{
BOOST_STATIC_ASSERT (( boost::is_same<
typename std::iterator_traits<PatIter>::value_type,
typename std::iterator_traits<CorpusIter>::value_type>::value ));
public:
typedef typename std::iterator_traits<PatIter>::difference_type pattern_difference_type;
typedef typename std::iterator_traits<CorpusIter>::difference_type corpus_difference_type;

protected:
PatIter pat_first, pat_last;
std::vector<corpus_difference_type> next_;
pattern_difference_type pattern_length;

private:
void compute_next()
{
BOOST_ASSERT(pattern_length > 0);
pattern_difference_type j = 0, t = -1;
next_.reserve(pattern_length);
next_.push_back(-1);
while (j < pattern_length - 1)
{
while (t >= 0 && pat_first[j] != pat_first[t])
t = next_[t];
j++;
t++;
next_.push_back(pat_first[j] == pat_first[t] ? next_[t] : t);
}
}

public:
accelerated_linear(PatIter pat_first, PatIter pat_last)
: pat_first(pat_first), pat_last(pat_last),
pattern_length(std::distance(pat_first, pat_last))
{
if (pattern_length > 0)
compute_next();
}

std::pair<CorpusIter, CorpusIter>
operator()(CorpusIter corpus_first, CorpusIter corpus_last) const
{
BOOST_ASSERT(std::distance(pat_first, pat_last) == pattern_length);
BOOST_ASSERT(pattern_length == next_.size());

if (pat_first == pat_last)
return {corpus_first, corpus_first};

if (pattern_length == 1)
{
auto result_first = std::find(corpus_first, corpus_last, *pat_first);
auto result_last = result_first == corpus_last ? corpus_last : boost::next(result_first);
return {result_first, result_last};
}

PatIter p1 = pat_first;
++p1;

while (corpus_first != corpus_last)
{
corpus_first = std::find(corpus_first, corpus_last, *pat_first);
if (corpus_first == corpus_last)
return {corpus_last, corpus_last};
CorpusIter hold = corpus_first;
if (++corpus_first == corpus_last)
return {corpus_last, corpus_last};
PatIter p = p1;
pattern_difference_type j = 1;
while (*corpus_first == *p)
{
++corpus_first;
if (++p == pat_last)
return {hold, corpus_first};
if (corpus_first == corpus_last)
return {corpus_last, corpus_last};
++j;
}

for (;;)
{
j = next_[j];
if (j < 0)
{
++corpus_first;
break;
}
if (j == 0)
break;
p = pat_first + j;
while (*corpus_first == *p)
{
corpus_first++;
p++;
j++;
if (p == pat_last)
{
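// Walk hold forward until it is pattern_length behind corpus_first,
// i.e. at the start of the match (works for forward iterators).
CorpusIter successor = hold;
std::advance(successor, pattern_length);
while (successor != corpus_first)
++successor, ++hold; // TODO: Change to for loop?
return {hold, successor};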
}
if (corpus_first == corpus_last)
return {corpus_last, corpus_last};
}
}
}
return {corpus_last, corpus_last};
}

template <typename Range>
std::pair<CorpusIter, CorpusIter> operator()(Range const &corpus) const
{
return (*this)(boost::begin(corpus), boost::end(corpus));
}
};

}}
116 changes: 116 additions & 0 deletions include/boost/algorithm/searching/detail/mn_traits.hpp
@@ -0,0 +1,116 @@
/*
Copyright (c) David R. Musser & Gor V. Nishanov 1997.
Copyright (c) Jeremy W. Murphy 2024.

Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

For more information, see http://www.boost.org
*/

#ifndef BOOST_ALGORITHM_SEARCH_DETAIL_MN_TRAITS
#define BOOST_ALGORITHM_SEARCH_DETAIL_MN_TRAITS

#include <boost/functional/hash.hpp>

#include <string>

namespace boost { namespace algorithm {

template <class T>
struct search_trait {
enum {hash_range_max = 0};
enum {suffix_size = 0};
template <class RandomAccessIterator>
inline static
T hash(RandomAccessIterator) {
return 0;
}
};


template <> struct search_trait<char> {
enum {hash_range_max = 256};
enum {suffix_size = 1};
template <class RandomAccessIterator>
inline static
char unsigned hash(RandomAccessIterator i) {
return *i;
}
};

template <> struct search_trait<char signed> {
enum {hash_range_max = 256};
enum {suffix_size = 1};
template <class RandomAccessIterator>
inline static
char unsigned hash(RandomAccessIterator i) {
return *i;
}
};

template <> struct search_trait<char unsigned> {
enum {hash_range_max = 256};
enum {suffix_size = 1};
template <class RandomAccessIterator>
inline static
char unsigned hash(RandomAccessIterator i) {
return *i;
}
};

// NOTE: This std::string specialization is experimental.
// It simply fills the gap that would otherwise be here.
template <> struct search_trait<std::string> {
enum {hash_range_max = 256};
enum {suffix_size = 1};
template <class RandomAccessIterator>
inline static
int hash(RandomAccessIterator i) {
static boost::hash<std::string> string_hash;
return string_hash(*i) % hash_range_max;
}
};

struct search_trait_dna2 {
enum {hash_range_max = 64};
enum {suffix_size = 2};
template <class RAI>
inline static unsigned int hash(RAI i) {
return (*(i-1) + ((*i) << 3)) & 63;
}
};

struct search_trait_dna3 {
enum {hash_range_max = 512};
enum {suffix_size = 3};
template <class RAI>
inline static unsigned short int hash(RAI i) {
return (*(i-2) + (*(i-1) << 3) + ((*i) << 6)) & 511;
}
};

struct search_trait_dna4 {
enum {hash_range_max = 256};
enum {suffix_size = 4};
template <class RAI>
inline static unsigned int hash(RAI i) {
return (*(i-3) + (*(i-2) << 2) + (*(i-1) << 4)
+ ((*i) << 6)) & 255;
}
};

struct search_trait_dna5 {
enum {hash_range_max = 256};
enum {suffix_size = 5};
template <class RAI>
inline static unsigned int hash(RAI i) {
return (*(i-4) + (*(i-3) << 2) + (*(i-2) << 4)
+ (*(i-1) << 6) + ((*i) << 8)) & 255;
}
};

}} // namespace boost::algorithm

#endif
