boostorg · jeremy-murphy · Sep 2, 2016 · Sep 2, 2016 · Sep 2, 2016 · Sep 3, 2016
diff --git a/doc/algorithm.qbk b/doc/algorithm.qbk
@@ -44,6 +44,7 @@ Thanks to all the people who have reviewed this library and made suggestions for
 [include boyer_moore.qbk]
 [include boyer_moore_horspool.qbk]
 [include knuth_morris_pratt.qbk]
+[include musser_nishanov.qbk]
 [endsect]
 
 

diff --git a/doc/musser_nishanov.qbk b/doc/musser_nishanov.qbk
@@ -0,0 +1,105 @@
+[/ QuickBook Document version 1.5 ]
+
+[section:MusserNishanov Musser-Nishanov Search]
+
+[/license
+
+Copyright (c) 2017 Jeremy W. Murphy
+
+Distributed under the Boost Software License, Version 1.0.
+(See accompanying file LICENSE_1_0.txt or copy at
+http://www.boost.org/LICENSE_1_0.txt)
+]
+
+
+[heading Overview]
+
+The header file 'musser_nishanov.hpp' contains an implementation of the Musser-Nishanov algorithm for sequence matching. 
+
+This algorithm was designed to be a generic sequence matching algorithm, extensible to any element type or character frequency by specialization of a traits class.
+The algorithm and original code were written in 1997 by David Musser and Gor Nishanov.
+Their paper is available from arXiv ([@https://arxiv.org/abs/0810.0264]) or from Dave Musser's homepage at Rensselaer Polytechnic Institute ([@http://www.cs.rpi.edu/~musser/gp/gensearch1.pdf]).
+
+It is based on Knuth-Morris-Pratt (KMP) with the addition of a hash-coded form of the skip loop from Boyer-Moore. It has the same worst-case bound of 2n on the number of comparisons as KMP. The average-case performance appears to be slightly better than Boyer-Moore-Horspool (which is in turn slightly better than Boyer-Moore).
+It also features a fallback algorithm for when the corpus does not provide random-access iterators.
+
+However, as with the other accelerated search algorithms, and unlike `std::search`, it cannot be used with a comparison predicate.
+
+Nomenclature: I refer to the sequence being searched for as the "pattern", and the sequence being searched in as the "corpus".
+
+[heading Interface]
+
+For flexibility, the algorithm has two interfaces: an object-based interface and a procedural one. The object-based interface builds the tables in the constructor and uses operator () to perform the search. The procedural interface builds the tables and does the search in one step. If you are going to search for the same pattern in multiple corpora, use the object interface so that the tables are built only once.
+
+Here is the object interface:
+``
+template <typename patIter>
+class musser_nishanov {
+public:
+    musser_nishanov ( patIter first, patIter last );
+    ~musser_nishanov ();
+
+    template <typename corpusIter>
+    corpusIter operator () ( corpusIter corpus_first, corpusIter corpus_last );
+    };
+``
+
+and here is the corresponding procedural interface:
+
+``
+template <typename patIter, typename corpusIter>
+corpusIter musser_nishanov_search (
+        corpusIter corpus_first, corpusIter corpus_last, 
+        patIter pat_first, patIter pat_last );
+``
+
+Each of the functions is passed two pairs of iterators. The first two define the corpus and the second two define the pattern. Note that the two pairs need not be of the same type, but they do need to "point" at the same type. In other words, `patIter::value_type` and `corpusIter::value_type` need to be the same type.
+
+The return value of the function is an iterator pointing to the start of the pattern in the corpus. If the pattern is not found, it returns the end of the corpus (`corpus_last`).
+
+[heading Performance]
+
+The algorithm has the same 2['n] worst-case bound on the number of comparisons as
+KMP. Similar to Boyer-Moore, the actual number of comparisons in practice will typically be much fewer, due to the skip loop feature, which provides the most dramatic effect on performance.
+
+Execution time on UTF-8 string matching is comparable to Boyer-Moore-Horspool.
+A traits class is provided for matching DNA sequences that optimizes the skip loop such that it searches 10x faster than it otherwise would. This traits class must be specified manually, so it is unfair to compare this result to another search algorithm without similar specialization.
+
+Performance on types other than char has also been tested.
+When the element type is `unsigned short`, performance with the default search traits is equivalent to `std::search`, whereas the Boyer-Moore family of algorithms tends to be 5-6x slower.
+
+
+[heading Memory Use]
+
+The algorithm allocates two internal tables. The first one is proportional to the length of the pattern; the second one has one entry for each member of the "alphabet" in the pattern. For (8-bit) character types, this table contains 256 entries.
+
+[heading Complexity]
+
+The worst-case performance to find a pattern in the corpus is ['O(N)] (linear) time; that is, proportional to the length of the corpus being searched. In general, the search is sub-linear; not every entry in the corpus need be checked.
+
+[heading Exception Safety]
+
+Both the object-oriented and procedural versions of the Musser-Nishanov algorithm take their parameters by value and do not use any information other than what is passed in. Therefore, both interfaces provide the strong exception guarantee.
+
+[heading Notes]
+
+* When using the object-based interface, the pattern must remain unchanged during the searches; i.e., from the time the object is constructed until the final call to operator () returns.
+
+* The Musser-Nishanov algorithm requires random-access iterators for the pattern. The corpus need not provide random-access iterators; when it does not, the fallback algorithm is used.
+
+[heading Customization points] 
+
+The Musser-Nishanov object takes a traits template parameter which enables the caller to customize how one of the precomputed tables is stored. This table, called the skip table, contains (logically) one entry for every possible value that the pattern can contain. When searching 8-bit character data, this table contains 256 elements. The traits class defines the table to be used.
+
+The default traits class uses a `boost::array` for small 'alphabets' and a `tr1::unordered_map` for larger ones.  The array-based skip table gives excellent performance, but could be prohibitively large when the 'alphabet' of elements to be searched grows. The unordered_map based version only grows as the number of unique elements in the pattern, but makes many more heap allocations, and gives slower lookup performance. 
+
+To use a different skip table, you should define your own skip table object and your own traits class, and use them to instantiate the Musser-Nishanov object. The interface to these objects is described TBD.
+
+
+[endsect]
+
+[/ File musser_nishanov.qbk
+Copyright 2017 Jeremy Murphy
+Distributed under the Boost Software License, Version 1.0.
+(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt).
+]
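
To illustrate the skip table described in the documentation above (one entry per possible 8-bit value, so 256 entries for character data), here is a minimal sketch of a Boyer-Moore-Horspool style skip loop. It is not the Musser-Nishanov implementation, which combines a hash-coded skip loop with KMP. The function name `skip_search` and its index-based return value are assumptions made for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the library: returns the index of the
// first occurrence of `pattern` in `corpus`, or corpus.size() if absent
// (mirroring the documented "returns corpus_last" convention).
std::size_t skip_search(const std::string& corpus, const std::string& pattern)
{
    const std::size_t n = corpus.size();
    const std::size_t m = pattern.size();
    if (m == 0) return 0;   // an empty pattern matches at the start
    if (m > n) return n;    // the pattern cannot fit in the corpus

    // Skip table: one entry per possible 8-bit value, 256 entries in total.
    // Characters that do not occur in the pattern allow a full shift of m.
    std::array<std::size_t, 256> skip;
    skip.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        skip[static_cast<unsigned char>(pattern[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        if (corpus.compare(pos, m, pattern) == 0)
            return pos;
        // Shift by the table entry for the corpus character aligned with
        // the last pattern position.
        pos += skip[static_cast<unsigned char>(corpus[pos + m - 1])];
    }
    return n;
}
```

The table is what makes the search sub-linear in practice: whenever the aligned corpus character does not occur in the pattern, the search advances by the full pattern length without examining the characters in between.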
diff --git a/include/boost/algorithm/searching/accelerated_linear.hpp b/include/boost/algorithm/searching/accelerated_linear.hpp
@@ -0,0 +1,147 @@
+/*
+   Copyright (c) David R. Musser & Gor V. Nishanov 1997.
+   Copyright (c) Jeremy W. Murphy 2024.
+
+   Distributed under the Boost Software License, Version 1.0. (See accompanying
+   file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+
+    For more information, see http://www.boost.org
+*/
+
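+#include <boost/assert.hpp>           // BOOST_ASSERT
+#include <boost/config.hpp>
+#include <boost/next_prior.hpp>       // boost::next
+#include <boost/range.hpp>
+#include <boost/static_assert.hpp>
+#include <boost/type_traits/is_base_of.hpp>
+#include <boost/type_traits/is_same.hpp>
+
+#include <algorithm>                  // std::find
+#include <iterator>
+#include <utility>                    // std::pair
+#include <vector>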
+
+namespace boost { namespace algorithm {
+
+/**
+ * @brief Accelerated Linear search.
+ *
+ * Accelerated Linear (AL) search by Musser & Nishanov.
+ *
+ */
+template <typename PatIter, typename CorpusIter>
+class accelerated_linear
+{
+    BOOST_STATIC_ASSERT (( boost::is_same<
+    typename std::iterator_traits<PatIter>::value_type,
+    typename std::iterator_traits<CorpusIter>::value_type>::value ));
+public:
+    typedef typename std::iterator_traits<PatIter>::difference_type pattern_difference_type;
+    typedef typename std::iterator_traits<CorpusIter>::difference_type corpus_difference_type;
+
+protected:
+    PatIter pat_first, pat_last;
+    std::vector<corpus_difference_type> next_;
+    pattern_difference_type pattern_length;
+
+private:
+    void compute_next()
+    {
+        BOOST_ASSERT(pattern_length > 0);
+        pattern_difference_type j = 0, t = -1;
+        next_.reserve(pattern_length);
+        next_.push_back(-1);
+        while (j < pattern_length - 1)
+        {
+            while (t >= 0 && pat_first[j] != pat_first[t])
+                t = next_[t];
+            j++;
+            t++;
+            next_.push_back(pat_first[j] == pat_first[t] ? next_[t] : t);
+        }
+    }
+
+public:
+    accelerated_linear(PatIter pat_first, PatIter pat_last)
+      : pat_first(pat_first), pat_last(pat_last),
+        pattern_length(std::distance(pat_first, pat_last))
+    {
+        if (pattern_length > 0)
+            compute_next();
+    }
+
+    std::pair<CorpusIter, CorpusIter>
+    operator()(CorpusIter corpus_first, CorpusIter corpus_last) const
+    {
+        BOOST_ASSERT(std::distance(pat_first, pat_last) == pattern_length);
+        BOOST_ASSERT(pattern_length == next_.size());
+
+        if (pat_first == pat_last)
+            return {corpus_first, corpus_first};
+
+        if (pattern_length == 1)
+        {
+            auto result_first = std::find(corpus_first, corpus_last, *pat_first);
+            auto result_last = result_first == corpus_last ? corpus_last : boost::next(result_first);
+            return {result_first, result_last};
+        }
+
+        PatIter p1 = pat_first;
+        ++p1;
+
+        while (corpus_first != corpus_last)
+        {
+            corpus_first = std::find(corpus_first, corpus_last, *pat_first);
+            if (corpus_first == corpus_last)
+                return {corpus_last, corpus_last};
+            CorpusIter hold = corpus_first;
+            if (++corpus_first == corpus_last)
+                return {corpus_last, corpus_last};
+            PatIter p = p1;
+            pattern_difference_type j = 1;
+            while (*corpus_first == *p)
+            {
+                ++corpus_first;
+                if (++p == pat_last)
+                    return {hold, corpus_first};
+                if (corpus_first == corpus_last)
+                    return {corpus_last, corpus_last};
+                ++j;
+            }
+
+            for (;;)
+            {
+                j = next_[j];
+                if (j < 0)
+                {
+                    ++corpus_first;
+                    break;
+                }
+                if (j == 0)
+                    break;
+                p = pat_first + j;
+                while (*corpus_first == *p)
+                {
+                    corpus_first++;
+                    p++;
+                    j++;
+                    if (p == pat_last)
+                    {
+                        CorpusIter successor = hold;
+                        std::advance(successor, pattern_length);
+                        while (successor != corpus_first)
+                            ++successor, ++hold; // TODO: Change to for loop?
+                        return {hold, successor};
+                    }
+                    if (corpus_first == corpus_last)
+                        return {corpus_last, corpus_last};
+                }
+            }
+        }
+        return {corpus_last, corpus_last};
+    }
+
+    template <typename Range>
+    std::pair<CorpusIter, CorpusIter> operator()(Range const &corpus) const
+    {
+        return (*this)(boost::begin(corpus), boost::end(corpus));
+    }
+};
+
+}}
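
The `next_` table built by `compute_next()` above can be easier to follow in isolation. The following is a minimal standalone sketch of the same loop over a `std::string`; the free function `kmp_next` is a hypothetical name introduced here for illustration and is not part of the patch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical standalone version of compute_next() from the patch above.
// next[j] is the pattern index to resume from after a mismatch at index j;
// -1 means "advance the corpus by one and restart at pattern index 0".
std::vector<std::ptrdiff_t> kmp_next(const std::string& pat)
{
    std::vector<std::ptrdiff_t> next;
    const std::ptrdiff_t m = static_cast<std::ptrdiff_t>(pat.size());
    if (m == 0)
        return next;  // the patch skips table construction for empty patterns

    std::ptrdiff_t j = 0, t = -1;
    next.reserve(pat.size());
    next.push_back(-1);
    while (j < m - 1) {
        // Fall back through shorter borders until pat[j] matches pat[t].
        while (t >= 0 && pat[j] != pat[t])
            t = next[t];
        ++j;
        ++t;
        // If the characters at j and t are equal, a mismatch at j would also
        // be a mismatch at t, so reuse next[t] directly.
        next.push_back(pat[j] == pat[t] ? next[t] : t);
    }
    return next;
}
```

For the pattern "abab" this yields -1, 0, -1, 0. In `operator()`, a negative entry advances the corpus iterator and restarts the scan, while an entry of 0 restarts the scan at the current corpus position.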
diff --git a/include/boost/algorithm/searching/detail/mn_traits.hpp b/include/boost/algorithm/searching/detail/mn_traits.hpp
@@ -0,0 +1,116 @@
+/* 
+   Copyright (c) David R. Musser & Gor V. Nishanov 1997.
+   Copyright (c) Jeremy W. Murphy 2024.
+
+   Distributed under the Boost Software License, Version 1.0. (See accompanying
+   file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+
+    For more information, see http://www.boost.org
+*/
+
+#ifndef BOOST_ALGORITHM_SEARCH_DETAIL_MN_TRAITS
+#define BOOST_ALGORITHM_SEARCH_DETAIL_MN_TRAITS
+
+#include <boost/functional/hash.hpp>
+
+#include <string>
+
+namespace boost { namespace algorithm {
+
+template <class T>
+struct search_trait {
+    enum {hash_range_max = 0};
+    enum {suffix_size = 0};
+    template <class RandomAccessIterator>
+    inline static
+    T hash(RandomAccessIterator) {
+        return 0;
+    }
+};
+
+
+template <> struct search_trait<char> {
+    enum {hash_range_max = 256};
+    enum {suffix_size = 1};
+    template <class RandomAccessIterator>
+    inline static 
+    char unsigned hash(RandomAccessIterator i) {
+        return *i;
+    }
+};
+
+template <> struct search_trait<char signed> {
+    enum {hash_range_max = 256};
+    enum {suffix_size = 1};
+    template <class RandomAccessIterator>
+    inline static 
+    char unsigned hash(RandomAccessIterator i) {
+        return *i;              
+    }
+};
+
+template <> struct search_trait<char unsigned> {
+    enum {hash_range_max = 256};
+    enum {suffix_size = 1};
+    template <class RandomAccessIterator>
+    inline static 
+    char unsigned hash(RandomAccessIterator i) {
+        return *i;              
+    }
+};
+
+// NOTE: This std::string specialization is experimental.
+// It simply fills the gap that would otherwise be here.
+template <> struct search_trait<std::string> {
+    enum {hash_range_max = 256};
+    enum {suffix_size = 1};
+    template <class RandomAccessIterator>
+    inline static
+    int hash(RandomAccessIterator i) {
+        static boost::hash<std::string> string_hash;
+        return string_hash(*i) % hash_range_max;
+    }
+};
+
+struct search_trait_dna2 {
+    enum {hash_range_max = 64};
+    enum {suffix_size = 2};
+    template <class RAI>
+    inline static unsigned int hash(RAI i) {
+        return (*(i-1) + ((*i) << 3)) & 63;
+    }
+};
+
+struct search_trait_dna3 {
+    enum {hash_range_max = 512};
+    enum {suffix_size = 3};
+    template <class RAI>
+    inline static unsigned short int hash(RAI i) {
+        return (*(i-2) + (*(i-1) << 3) + ((*i) << 6)) & 511;
+    }
+};
+
+struct search_trait_dna4 {
+    enum {hash_range_max = 256};
+    enum {suffix_size = 4};
+    template <class RAI>
+    inline static unsigned int hash(RAI i) {
+        return (*(i-3) + (*(i-2) << 2) + (*(i-1) << 4)
+        + ((*i) << 6)) & 255;
+    }
+};
+
+struct search_trait_dna5 {
+    enum {hash_range_max = 256};
+    enum {suffix_size = 5};
+    template <class RAI>
+    inline static unsigned int hash(RAI i) {
+        return (*(i-4) + (*(i-3) << 2) + (*(i-2) << 4)
+        + (*(i-1) << 6) + ((*i) << 8)) & 255;
+    }
+};
+
+}} // namespace boost::algorithm
+
+#endif
+
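
As a quick check of how the DNA traits hash a pattern suffix, the sketch below copies `search_trait_dna2` from the header above and hashes every two-character suffix over the alphabet "ACGT". The helper `dna2_hashes` is hypothetical, and the sketch assumes an ASCII character encoding.

```cpp
#include <set>
#include <string>

// Copied verbatim from mn_traits.hpp above so the sketch is self-contained.
struct search_trait_dna2 {
    enum {hash_range_max = 64};
    enum {suffix_size = 2};
    template <class RAI>
    inline static unsigned int hash(RAI i) {
        return (*(i-1) + ((*i) << 3)) & 63;
    }
};

// Hypothetical helper: collects the hash of every 2-character suffix
// that can be formed from the DNA alphabet.
std::set<unsigned int> dna2_hashes()
{
    const std::string alphabet = "ACGT";
    std::set<unsigned int> seen;
    for (char a : alphabet) {
        for (char b : alphabet) {
            const std::string suffix{a, b};
            // hash() expects an iterator to the LAST element of the suffix
            // and reads backwards from it.
            seen.insert(search_trait_dna2::hash(suffix.begin() + 1));
        }
    }
    return seen;
}
```

Under ASCII, the 16 possible suffixes map to 16 distinct values below `hash_range_max`, so the skip table can distinguish every two-character suffix rather than only the last character.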