diff --git a/CHANGELOG.md b/CHANGELOG.md index adf33f7..2630ba0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,31 @@ All notable changes to this project will be documented in this file. +## [0.5.4] - 2026-04-11 + +### Changed +- Switched Tantivy dependency from git (commit 51f340f) to crates.io release 0.26.0 +- Extended internal FieldDef tuple from 4 to 6 elements (added fast, tokenizer) + +### Added +- **Bytes field type**: `Schema.add_bytes_field/3` for binary data storage and retrieval +- **Custom tokenizers**: Per-field tokenizer option for text fields (`default`, `raw`, `en_stem`, `whitespace`) +- **Fast fields**: `fast: true` option on numeric/bool/text fields for columnar storage +- **Count collector**: `Searcher.count/3` for lightweight document counting without retrieval +- **Regex queries**: `Searcher.search_regex/4` for programmatic regex pattern matching on text fields +- **MoreLikeThis queries**: `Searcher.search_more_like_this/3` for finding similar documents by term distribution +- **Sort by field value**: `Searcher.search_query_sorted/5` for sorting results by fast field instead of BM25 score +- **Aggregations**: Full aggregation framework with JSON pass-through NIF + - `Searcher.aggregate/5` for executing aggregations over search results + - `Muninn.Aggregation` builder DSL with `new/0`, `add/3`, `sub/3` + - `Muninn.Aggregation.Bucket` — terms, range, histogram, filter bucket aggregations + - `Muninn.Aggregation.Metric` — avg, sum, min, max, stats, count, cardinality, percentiles + +### Tantivy 0.26.0 Highlights (since previous git pin) +- **Bugfixes**: Fixed phrase query prefixed with `*`, vint buffer overflow during index creation, integer overflow in `ExpUnrolledLinkedList` for large datasets, integer overflow in segment sorting and merge policy truncation, merging of intermediate aggregation results, deduplicate doc counts in term aggregation for multi-valued fields, lenient elastic range queries with trailing closing parentheses +- 
**Features**: Filter aggregation, composite aggregation, include/exclude filtering for term aggregations, regex support in query parser, TermQuery fallback for non-indexed fast fields, fast field support for Bytes values, natural-order-with-none-highest in TopDocs ordering, stemming behind feature flag +- **Performance**: High cardinality aggregation speed improvements, saturated posting list optimization, lazy scorers, union performance improvements, seek_danger for efficient intersections + ## [0.5.3] - 2026-02-16 ### Changed diff --git a/README.md b/README.md index eb169eb..58904ab 100644 --- a/README.md +++ b/README.md @@ -37,15 +37,20 @@ This library embodies that spirit: it flies through your documents, indexes what - **Fast**: Rust-powered search via native NIFs - **Full-text search**: Text indexing with customizable tokenization -- **Multiple field types**: text, u64, i64, f64, bool -- **Flexible schemas**: Define stored and indexed fields -- **Advanced queries**: Field-specific search, boolean operators, phrase matching, range queries +- **Multiple field types**: text, u64, i64, f64, bool, bytes +- **Custom tokenizers**: Per-field tokenizer support (`default`, `raw`, `en_stem`, `whitespace`) +- **Flexible schemas**: Define stored, indexed, and fast fields +- **Advanced queries**: Field-specific search, boolean operators, phrase matching, range queries, regex - **Range queries**: Numeric range filtering with flexible boundaries - **Fuzzy matching**: Error-tolerant search with Levenshtein distance for handling typos +- **MoreLikeThis**: Find similar documents by term distribution +- **Aggregations**: Terms, range, histogram buckets + avg, sum, stats, cardinality metrics with nesting +- **Sort by field**: Order results by fast field value instead of relevance score +- **Count queries**: Lightweight document counting without retrieval - **Highlighting**: HTML snippets with highlighted matching words - **Autocomplete**: Prefix search for typeahead functionality 
(with fuzzy support) - **Thread-safe**: Concurrent index operations supported -- **Production-ready**: Comprehensive error handling and 175+ tests +- **Production-ready**: Comprehensive error handling and 229+ tests ## Installation @@ -61,7 +66,7 @@ end **Requirements:** - Elixir ~> 1.18 -- Rust ~> 1.85 (for compilation, Tantivy 0.25 requires Edition 2024) +- Rust ~> 1.92 (for compilation, Tantivy 0.26 + Rustler 0.37.2 require Rust 1.91+) ## Quick Start @@ -71,9 +76,11 @@ end alias Muninn.Schema schema = Schema.new() - |> Schema.add_text_field("title", stored: true, indexed: true) + |> Schema.add_text_field("title", stored: true, indexed: true, tokenizer: "en_stem") |> Schema.add_text_field("body", stored: true, indexed: true) - |> Schema.add_u64_field("views", stored: true, indexed: true) + |> Schema.add_text_field("category", stored: true, tokenizer: "raw", fast: true) + |> Schema.add_u64_field("views", stored: true, indexed: true, fast: true) + |> Schema.add_f64_field("price", stored: true, fast: true) |> Schema.add_bool_field("published", stored: true, indexed: true) ``` @@ -295,6 +302,135 @@ Handle spelling errors and typos automatically using Levenshtein distance: - **Distance=2**: ~5-50x slower than exact search (use for suggestions only) - Transposition cost enabled by default (more intuitive for users) +### Regex Search + +Search with regular expressions on text fields: + +```elixir +# Programmatic regex query +{:ok, results} = Searcher.search_regex(searcher, "title", "elix.*", limit: 10) + +# Also supported via query parser syntax +{:ok, results} = Searcher.search_query(searcher, "/elix.*/", ["title"]) +``` + +### MoreLikeThis (Find Similar Documents) + +Find documents similar to a reference document by analyzing term distributions: + +```elixir +{:ok, results} = Searcher.search_more_like_this( + searcher, + %{"title" => "Elixir programming", "body" => "Functional programming with Elixir"}, + min_doc_freq: 1, + min_term_freq: 1, + max_query_terms: 25, + 
limit: 5 +) +``` + +### Count Queries + +Efficiently count matching documents without retrieving them: + +```elixir +{:ok, count} = Searcher.count(searcher, "elixir AND phoenix", ["title", "body"]) +# Returns: {:ok, 42} +``` + +### Sort by Field Value + +Sort results by a fast field instead of relevance score: + +```elixir +# Sort by price ascending +{:ok, results} = Searcher.search_query_sorted( + searcher, + "category:electronics", + ["title"], + "price" +) + +# Sort by views descending +{:ok, results} = Searcher.search_query_sorted( + searcher, + "*", + ["title"], + "views", + reverse: true, + limit: 20 +) + +# Results include sort_value instead of score: +# %{"sort_value" => 5000, "doc" => %{"title" => "Popular Item", ...}} +``` + +> **Note:** Sort fields must be numeric (u64, i64, f64) with `fast: true` in the schema. + +### Aggregations + +Compute analytics over search results using the aggregation framework: + +```elixir +alias Muninn.Aggregation +alias Muninn.Aggregation.{Bucket, Metric} + +# Simple metric aggregation +aggs = Aggregation.new() + |> Aggregation.add("avg_price", Metric.avg("price")) + |> Aggregation.add("price_stats", Metric.stats("price")) + +{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) +# results["avg_price"]["value"] => 381.66 +# results["price_stats"] => %{"count" => 6, "min" => 15.0, "max" => 999.0, ...} + +# Terms aggregation (group by category) +aggs = Aggregation.new() + |> Aggregation.add("by_category", Bucket.terms("category", size: 10)) + +{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) +# results["by_category"]["buckets"] => [ +# %{"key" => "electronics", "doc_count" => 3}, +# %{"key" => "clothing", "doc_count" => 2}, +# ... 
+# ] + +# Nested aggregation (stats per category) +aggs = Aggregation.new() + |> Aggregation.add("by_category", + Bucket.terms("category", size: 10) + |> Aggregation.sub("price_stats", Metric.stats("price")) + ) + +# Range buckets +aggs = Aggregation.new() + |> Aggregation.add("price_ranges", + Bucket.range("price", [ + %{"to" => 50.0}, + %{"from" => 50.0, "to" => 500.0}, + %{"from" => 500.0} + ]) + ) + +# Histogram +aggs = Aggregation.new() + |> Aggregation.add("price_hist", Bucket.histogram("price", 100.0)) + +# Scoped to a query (only aggregate matching docs) +{:ok, results} = Searcher.aggregate( + searcher, + "category:electronics", + ["title", "category"], + aggs +) +``` + +> **Note:** Aggregated fields must have `fast: true` in the schema. For text field aggregation (e.g., terms), use `tokenizer: "raw"` with `fast: true`. + +**Available Bucket Aggregations:** `Bucket.terms/2`, `Bucket.range/2`, `Bucket.histogram/3`, `Bucket.filter/1` + +**Available Metric Aggregations:** `Metric.avg/1`, `Metric.sum/1`, `Metric.min/1`, `Metric.max/1`, `Metric.stats/1`, `Metric.count/1`, `Metric.cardinality/2`, `Metric.percentiles/2` + ## Field Types | Type | Description | Example Use Case | @@ -304,12 +440,15 @@ Handle spelling errors and typos automatically using Levenshtein distance: | `i64` | Signed 64-bit integers | Scores, offsets, differences | | `f64` | 64-bit floating point | Prices, ratings, coordinates | | `bool` | Boolean values | Flags, states (published, active) | +| `bytes` | Arbitrary binary data | Embeddings, serialized data, hashes | **Field Options:** - `stored: true/false` - Store the original value (retrievable in search results) - `indexed: true/false` - Index the field for searching/filtering +- `fast: true/false` - Enable columnar storage (required for sorting and aggregations) +- `tokenizer: "name"` - Tokenizer for text fields: `"default"`, `"raw"`, `"en_stem"`, `"whitespace"` -**Defaults:** `stored: false`, `indexed: true` +**Defaults:** `stored: 
false`, `indexed: true`, `fast: false`, `tokenizer: nil` (uses `"default"`) ## Examples @@ -320,6 +459,7 @@ See the `examples/` directory for complete working examples: - `highlighting_demo.exs` - Highlighted snippets and prefix search - `range_functions_demo.exs` - Range queries (QueryParser vs dedicated functions) - `fuzzy_search_demo.exs` - Fuzzy matching for typo tolerance +- `aggregation_demo.exs` - Aggregations, sorting, and analytics - `complete_search_demo.exs` - Full feature showcase - `comparison_demo.exs` - Side-by-side comparison of search methods @@ -332,49 +472,35 @@ mix run examples/complete_search_demo.exs ### Core Modules -- `Muninn.Schema` - Define index schema with field types +- `Muninn.Schema` - Define index schema with field types and options - `Muninn.Index` - Create and open indices - `Muninn.IndexWriter` - Add, update documents, commit/rollback - `Muninn.IndexReader` - Read access to index -- `Muninn.Searcher` - Execute search queries +- `Muninn.Searcher` - Execute search queries, sorting, counting, and aggregations - `Muninn.Query` - Build search queries +- `Muninn.Aggregation` - Builder DSL for aggregation requests +- `Muninn.Aggregation.Bucket` - Bucket aggregation builders (terms, range, histogram, filter) +- `Muninn.Aggregation.Metric` - Metric aggregation builders (avg, sum, min, max, stats, etc.) 
### Search Methods -**Basic Term Search** - Simple, direct term matching: -```elixir -query = Query.term("field", "value") -Searcher.search(searcher, query, limit: 10) -``` - -**Query Parser** - Natural syntax with boolean operators: -```elixir -Searcher.search_query(searcher, "field:value AND other", ["field", "other"]) -``` - -**With Snippets** - Highlighted search results: -```elixir -Searcher.search_with_snippets(searcher, query, search_fields, snippet_fields, opts) -``` - -**Prefix Search** - Autocomplete functionality: -```elixir -Searcher.search_prefix(searcher, "field", "prefix", limit: 10) -``` - -**Range Queries** - Numeric filtering with flexible boundaries: -```elixir -Searcher.search_range_u64(searcher, "views", 100, 1000, inclusive: :both) -Searcher.search_range_i64(searcher, "temperature", -10, 30) -Searcher.search_range_f64(searcher, "price", 10.0, 100.0) -``` - -**Fuzzy Search** - Error-tolerant matching with Levenshtein distance: -```elixir -Searcher.search_fuzzy(searcher, "title", "elixr", distance: 1) -Searcher.search_fuzzy_prefix(searcher, "author", "jse", distance: 1) -Searcher.search_fuzzy_with_snippets(searcher, "content", "elixr", ["content"]) -``` +| Method | Description | +|--------|-------------| +| `Searcher.search/3` | Term query — direct term matching | +| `Searcher.search_query/4` | Query parser — boolean operators, phrase queries, field-specific | +| `Searcher.search_with_snippets/5` | Query parser + highlighted HTML snippets | +| `Searcher.search_prefix/4` | Prefix matching for autocomplete | +| `Searcher.search_range_u64/5` | Numeric range query (u64) | +| `Searcher.search_range_i64/5` | Numeric range query (i64) | +| `Searcher.search_range_f64/5` | Numeric range query (f64) | +| `Searcher.search_fuzzy/4` | Fuzzy matching with Levenshtein distance | +| `Searcher.search_fuzzy_prefix/4` | Fuzzy prefix matching for autocomplete with typo tolerance | +| `Searcher.search_fuzzy_with_snippets/5` | Fuzzy matching + highlighted snippets | 
+| `Searcher.search_regex/4` | Regex pattern matching on text fields | +| `Searcher.search_more_like_this/3` | Find similar documents by term distribution | +| `Searcher.search_query_sorted/5` | Query with results sorted by fast field value | +| `Searcher.count/3` | Count matching documents without retrieval | +| `Searcher.aggregate/5` | Execute aggregations over matching documents | ## Architecture @@ -416,12 +542,15 @@ mix test --cover mix test test/muninn/searcher_test.exs ``` -**Test Coverage:** 175+ tests covering: -- Schema and index operations +**Test Coverage:** 229+ tests covering: +- Schema and index operations (including bytes field, custom tokenizers, fast fields) - Document CRUD operations -- All query types (term, boolean, phrase, prefix, range, fuzzy) +- All query types (term, boolean, phrase, prefix, range, fuzzy, regex, MoreLikeThis) - Fuzzy search with distance levels (0-2), transposition handling - Range queries with different numeric types and boundary options +- Sort by field value (ascending/descending) +- Count queries +- Aggregations (terms, range, histogram, stats, nested) - Snippet generation and highlighting - Concurrent operations - Edge cases and error handling @@ -439,27 +568,35 @@ View at `doc/index.html` ## Development Status -**Current:** Phase 7 Complete - Fuzzy Matching and Typo Tolerance +**Current:** Phase 8 Complete - Tantivy 0.26.0 Features **Implemented:** - Schema definition and validation - Index creation and management - Document indexing with batch operations - Basic term search -- Advanced query parser (field:value, AND/OR, phrases, ranges) +- Advanced query parser (field:value, AND/OR, phrases, ranges, regex) - Range queries for all numeric types (u64, i64, f64) -- Fuzzy search with Levenshtein distance (3 functions: fuzzy, fuzzy_prefix, fuzzy_with_snippets) +- Fuzzy search with Levenshtein distance (fuzzy, fuzzy_prefix, fuzzy_with_snippets) - Highlighted snippets for search results - Prefix search for autocomplete +- 
Regex search on text fields +- MoreLikeThis (find similar documents) +- Count queries (lightweight document counting) +- Sort by fast field value (ascending/descending) +- Aggregations (terms, range, histogram, filter buckets + all metric types) +- Custom tokenizers (default, raw, en_stem, whitespace) +- Bytes field type for binary data +- Fast fields for columnar storage - Transaction support (commit/rollback) -- Upgraded to Tantivy 0.25 +- Upgraded to Tantivy 0.26.0 (crates.io) **Roadmap:** - QueryParser integration for fuzzy syntax (`term~N`) - Advanced suggestions system ("did you mean?") -- Faceted search and aggregations -- Custom analyzers and tokenizers -- Sorting and custom scoring +- Document deletion and updates +- Date field type +- Custom scoring and boosting ## License diff --git a/examples/aggregation_demo.exs b/examples/aggregation_demo.exs new file mode 100644 index 0000000..e713690 --- /dev/null +++ b/examples/aggregation_demo.exs @@ -0,0 +1,237 @@ +# Aggregation, Sorting, and Analytics Demo +# +# Run with: mix run examples/aggregation_demo.exs + +alias Muninn.{Schema, Index, IndexWriter, IndexReader, Searcher} +alias Muninn.Aggregation +alias Muninn.Aggregation.{Bucket, Metric} + +# --- Setup: Create index with fast fields --- + +path = "/tmp/muninn_agg_demo_#{:erlang.unique_integer([:positive])}" + +schema = + Schema.new() + |> Schema.add_text_field("title", stored: true, tokenizer: "en_stem") + |> Schema.add_text_field("category", stored: true, tokenizer: "raw", fast: true) + |> Schema.add_f64_field("price", stored: true, fast: true) + |> Schema.add_u64_field("quantity", stored: true, fast: true) + |> Schema.add_u64_field("rating", stored: true, fast: true) + +{:ok, index} = Index.create(path, schema) + +# Index sample products +products = [ + %{ + "title" => "MacBook Pro", + "category" => "electronics", + "price" => 2499.0, + "quantity" => 10, + "rating" => 5 + }, + %{ + "title" => "iPhone 15", + "category" => "electronics", + "price" => 
999.0, + "quantity" => 50, + "rating" => 4 + }, + %{ + "title" => "AirPods Pro", + "category" => "electronics", + "price" => 249.0, + "quantity" => 200, + "rating" => 4 + }, + %{ + "title" => "iPad Air", + "category" => "electronics", + "price" => 599.0, + "quantity" => 30, + "rating" => 5 + }, + %{ + "title" => "Running Shoes", + "category" => "sports", + "price" => 129.0, + "quantity" => 100, + "rating" => 4 + }, + %{ + "title" => "Yoga Mat", + "category" => "sports", + "price" => 29.0, + "quantity" => 500, + "rating" => 3 + }, + %{ + "title" => "Tennis Racket", + "category" => "sports", + "price" => 199.0, + "quantity" => 40, + "rating" => 4 + }, + %{ + "title" => "Water Bottle", + "category" => "sports", + "price" => 15.0, + "quantity" => 1000, + "rating" => 3 + }, + %{ + "title" => "Elixir in Action", + "category" => "books", + "price" => 45.0, + "quantity" => 300, + "rating" => 5 + }, + %{ + "title" => "Programming Phoenix", + "category" => "books", + "price" => 40.0, + "quantity" => 250, + "rating" => 5 + }, + %{ + "title" => "The Pragmatic Programmer", + "category" => "books", + "price" => 50.0, + "quantity" => 400, + "rating" => 5 + }, + %{ + "title" => "Clean Code", + "category" => "books", + "price" => 35.0, + "quantity" => 350, + "rating" => 4 + } +] + +Enum.each(products, &IndexWriter.add_document(index, &1)) +IndexWriter.commit(index) + +{:ok, reader} = IndexReader.new(index) +{:ok, searcher} = Searcher.new(reader) + +IO.puts("=== Muninn Aggregation & Analytics Demo ===\n") +IO.puts("Indexed #{length(products)} products across 3 categories\n") + +# --- 1. Count --- +IO.puts("--- 1. Count Queries ---") +{:ok, total} = Searcher.count(searcher, "*", ["title"]) +{:ok, electronics} = Searcher.count(searcher, "category:electronics", ["title", "category"]) +{:ok, books} = Searcher.count(searcher, "category:books", ["title", "category"]) +IO.puts("Total products: #{total}") +IO.puts("Electronics: #{electronics}") +IO.puts("Books: #{books}\n") + +# --- 2. 
Sort by field --- +IO.puts("--- 2. Sort by Price (descending) ---") + +{:ok, results} = + Searcher.search_query_sorted(searcher, "*", ["title"], "price", reverse: true, limit: 5) + +for hit <- results["hits"] do + IO.puts(" $#{hit["doc"]["price"]} - #{hit["doc"]["title"]}") +end + +IO.puts("") + +# --- 3. Stats aggregation --- +IO.puts("--- 3. Price Statistics ---") +aggs = Aggregation.new() |> Aggregation.add("price_stats", Metric.stats("price")) +{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) +stats = results["price_stats"] +IO.puts(" Count: #{stats["count"]}") +IO.puts(" Min: $#{stats["min"]}") +IO.puts(" Max: $#{stats["max"]}") +IO.puts(" Avg: $#{Float.round(stats["avg"], 2)}") +IO.puts(" Sum: $#{stats["sum"]}\n") + +# --- 4. Terms aggregation --- +IO.puts("--- 4. Products by Category ---") +aggs = Aggregation.new() |> Aggregation.add("by_category", Bucket.terms("category", size: 10)) +{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + +for bucket <- results["by_category"]["buckets"] do + IO.puts(" #{bucket["key"]}: #{bucket["doc_count"]} products") +end + +IO.puts("") + +# --- 5. Nested aggregation --- +IO.puts("--- 5. Average Price per Category ---") + +aggs = + Aggregation.new() + |> Aggregation.add( + "by_category", + Bucket.terms("category", size: 10) + |> Aggregation.sub("avg_price", Metric.avg("price")) + ) + +{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + +for bucket <- results["by_category"]["buckets"] do + avg = Float.round(bucket["avg_price"]["value"], 2) + IO.puts(" #{bucket["key"]}: $#{avg} avg (#{bucket["doc_count"]} items)") +end + +IO.puts("") + +# --- 6. Range aggregation --- +IO.puts("--- 6. 
Price Range Distribution ---") + +aggs = + Aggregation.new() + |> Aggregation.add( + "price_ranges", + Bucket.range("price", [ + %{"key" => "budget", "to" => 50.0}, + %{"key" => "mid-range", "from" => 50.0, "to" => 500.0}, + %{"key" => "premium", "from" => 500.0} + ]) + ) + +{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + +for bucket <- results["price_ranges"]["buckets"] do + IO.puts(" #{bucket["key"]}: #{bucket["doc_count"]} products") +end + +IO.puts("") + +# --- 7. Regex search --- +IO.puts("--- 7. Regex Search (titles matching 'pro.*') ---") +{:ok, results} = Searcher.search_regex(searcher, "title", "pro.*", limit: 10) + +for hit <- results["hits"] do + IO.puts(" #{hit["doc"]["title"]}") +end + +IO.puts("") + +# --- 8. MoreLikeThis --- +IO.puts("--- 8. MoreLikeThis (similar to 'programming books') ---") + +{:ok, results} = + Searcher.search_more_like_this( + searcher, + %{"title" => "programming elixir phoenix books"}, + min_doc_freq: 1, + min_term_freq: 1, + limit: 5 + ) + +IO.puts(" Found #{results["total_hits"]} similar documents:") + +for hit <- results["hits"] do + IO.puts(" - #{hit["doc"]["title"]} (score: #{Float.round(hit["score"], 2)})") +end + +IO.puts("") + +# Cleanup +File.rm_rf!(path) +IO.puts("Done! (cleaned up temp index)") diff --git a/lib/muninn/aggregation.ex b/lib/muninn/aggregation.ex new file mode 100644 index 0000000..af1221d --- /dev/null +++ b/lib/muninn/aggregation.ex @@ -0,0 +1,46 @@ +defmodule Muninn.Aggregation do + @moduledoc """ + Builder for constructing aggregation requests. + + Aggregations compute analytics over search results — counting documents + per category, computing average prices, building histograms, etc. 
+ + ## Examples + + alias Muninn.Aggregation + alias Muninn.Aggregation.{Bucket, Metric} + + # Simple terms aggregation + aggs = Aggregation.new() + |> Aggregation.add("categories", Bucket.terms("category", size: 10)) + + {:ok, results} = Muninn.Searcher.aggregate(searcher, "*", ["title"], aggs) + + # Nested: stats per category + aggs = Aggregation.new() + |> Aggregation.add("categories", + Bucket.terms("category", size: 10) + |> Aggregation.sub("price_stats", Metric.stats("price")) + ) + + """ + + @type t :: map() + + @doc "Creates a new empty aggregation request." + @spec new() :: t() + def new, do: %{} + + @doc "Adds a named aggregation to the request." + @spec add(t(), String.t(), map()) :: t() + def add(aggs, name, aggregation) when is_binary(name) and is_map(aggregation) do + Map.put(aggs, name, aggregation) + end + + @doc "Adds a sub-aggregation to a bucket aggregation." + @spec sub(map(), String.t(), map()) :: map() + def sub(parent_agg, name, child_agg) do + sub_aggs = Map.get(parent_agg, "aggs", %{}) + Map.put(parent_agg, "aggs", Map.put(sub_aggs, name, child_agg)) + end +end diff --git a/lib/muninn/aggregation/bucket.ex b/lib/muninn/aggregation/bucket.ex new file mode 100644 index 0000000..8da8f98 --- /dev/null +++ b/lib/muninn/aggregation/bucket.ex @@ -0,0 +1,96 @@ +defmodule Muninn.Aggregation.Bucket do + @moduledoc """ + Bucket aggregation builders. + + Bucket aggregations group documents into buckets based on field values, + ranges, or other criteria. + """ + + @doc """ + Groups documents by field values. 
+ + ## Options + + * `:size` - Maximum number of buckets to return + * `:min_doc_count` - Minimum document count for a bucket to be included + * `:order` - Sort order for buckets (e.g., `%{"_count" => "asc"}`) + + ## Examples + + Bucket.terms("category", size: 10) + + """ + @spec terms(String.t(), keyword()) :: map() + def terms(field, opts \\ []) do + inner = %{"field" => field} + + inner = + opts + |> Enum.reduce(inner, fn + {:size, v}, acc -> Map.put(acc, "size", v) + {:order, v}, acc -> Map.put(acc, "order", v) + {:min_doc_count, v}, acc -> Map.put(acc, "min_doc_count", v) + _, acc -> acc + end) + + %{"terms" => inner} + end + + @doc """ + Groups documents by numeric ranges. + + ## Examples + + Bucket.range("price", [ + %{"to" => 50.0}, + %{"from" => 50.0, "to" => 100.0}, + %{"from" => 100.0} + ]) + + """ + @spec range(String.t(), list(map())) :: map() + def range(field, ranges) do + %{"range" => %{"field" => field, "ranges" => ranges}} + end + + @doc """ + Groups documents into fixed-width numeric buckets. + + ## Options + + * `:offset` - Bucket boundary offset + * `:min_doc_count` - Minimum document count + + ## Examples + + Bucket.histogram("price", 10.0) + + """ + @spec histogram(String.t(), number(), keyword()) :: map() + def histogram(field, interval, opts \\ []) do + inner = %{"field" => field, "interval" => interval} + + inner = + opts + |> Enum.reduce(inner, fn + {:offset, v}, acc -> Map.put(acc, "offset", v) + {:min_doc_count, v}, acc -> Map.put(acc, "min_doc_count", v) + _, acc -> acc + end) + + %{"histogram" => inner} + end + + @doc """ + Filters documents for sub-aggregations. 
+ + ## Examples + + Bucket.filter(%{"term" => %{"status" => "active"}}) + + """ + @spec filter(map()) :: map() + def filter(filter_query) do + %{"filter" => filter_query} + end +end diff --git a/lib/muninn/aggregation/metric.ex b/lib/muninn/aggregation/metric.ex new file mode 100644 index 0000000..3092b4e --- /dev/null +++ b/lib/muninn/aggregation/metric.ex @@ -0,0 +1,74 @@ +defmodule Muninn.Aggregation.Metric do + @moduledoc """ + Metric aggregation builders. + + Metric aggregations compute numeric values (averages, sums, etc.) + over document fields. + """ + + @doc "Computes the average of a numeric field." + @spec avg(String.t()) :: map() + def avg(field), do: %{"avg" => %{"field" => field}} + + @doc "Computes the sum of a numeric field." + @spec sum(String.t()) :: map() + def sum(field), do: %{"sum" => %{"field" => field}} + + @doc "Computes the minimum value of a numeric field." + @spec min(String.t()) :: map() + def min(field), do: %{"min" => %{"field" => field}} + + @doc "Computes the maximum value of a numeric field." + @spec max(String.t()) :: map() + def max(field), do: %{"max" => %{"field" => field}} + + @doc "Computes count, min, max, avg, and sum at once." + @spec stats(String.t()) :: map() + def stats(field), do: %{"stats" => %{"field" => field}} + + @doc "Counts values in a field." + @spec count(String.t()) :: map() + def count(field), do: %{"value_count" => %{"field" => field}} + + @doc """ + Computes approximate distinct count using HyperLogLog. + + ## Options + + * `:precision_threshold` - Precision/memory trade-off (default: 3000) + + """ + @spec cardinality(String.t(), keyword()) :: map() + def cardinality(field, opts \\ []) do + inner = %{"field" => field} + + inner = + case Keyword.get(opts, :precision_threshold) do + nil -> inner + threshold -> Map.put(inner, "precision_threshold", threshold) + end + + %{"cardinality" => inner} + end + + @doc """ + Computes percentiles of a numeric field. 
+ + ## Options + + * `:percents` - List of percentile values to compute (default: [1, 5, 25, 50, 75, 95, 99]) + + """ + @spec percentiles(String.t(), keyword()) :: map() + def percentiles(field, opts \\ []) do + inner = %{"field" => field} + + inner = + case Keyword.get(opts, :percents) do + nil -> inner + percents -> Map.put(inner, "percents", percents) + end + + %{"percentiles" => inner} + end +end diff --git a/lib/muninn/index.ex b/lib/muninn/index.ex index 735345c..b579a24 100644 --- a/lib/muninn/index.ex +++ b/lib/muninn/index.ex @@ -48,10 +48,11 @@ defmodule Muninn.Index do @spec create(String.t(), Schema.t()) :: {:ok, t()} | {:error, atom()} def create(path, %Schema{} = schema) do with :ok <- Schema.validate(schema) do - # Convert schema to list of tuples {name, type, stored, indexed} + # Convert schema to list of tuples {name, type, stored, indexed, fast, tokenizer} fields = Enum.map(schema.fields, fn field -> - {field.name, Atom.to_string(field.type), field.stored, field.indexed} + {field.name, Atom.to_string(field.type), field.stored, field.indexed, field.fast, + field.tokenizer || "default"} end) Native.index_create(path, fields) diff --git a/lib/muninn/native.ex b/lib/muninn/native.ex index e770169..56f61a6 100644 --- a/lib/muninn/native.ex +++ b/lib/muninn/native.ex @@ -183,4 +183,42 @@ defmodule Muninn.Native do _limit ), do: :erlang.nif_error(:nif_not_loaded) + + @doc false + def searcher_count(_searcher, _query_string, _default_fields), + do: :erlang.nif_error(:nif_not_loaded) + + @doc false + def searcher_search_regex(_searcher, _field_name, _pattern, _limit), + do: :erlang.nif_error(:nif_not_loaded) + + @doc false + def searcher_search_more_like_this( + _searcher, + _document_fields, + _min_doc_freq, + _min_term_freq, + _max_doc_freq, + _min_word_length, + _max_word_length, + _max_query_terms, + _boost_factor, + _limit + ), + do: :erlang.nif_error(:nif_not_loaded) + + @doc false + def searcher_search_query_sorted( + _searcher, + _query_string, + 
_default_fields, + _sort_field, + _reverse, + _limit + ), + do: :erlang.nif_error(:nif_not_loaded) + + @doc false + def searcher_aggregate(_searcher, _query_string, _default_fields, _aggs_json), + do: :erlang.nif_error(:nif_not_loaded) end diff --git a/lib/muninn/schema.ex b/lib/muninn/schema.ex index b3a8688..718f373 100644 --- a/lib/muninn/schema.ex +++ b/lib/muninn/schema.ex @@ -146,6 +146,29 @@ defmodule Muninn.Schema do %{schema | fields: fields ++ [field]} end + @doc """ + Adds a bytes (binary) field to the schema. + + ## Options + + * `:stored` - Whether to store the field value (default: `false`) + * `:indexed` - Whether to index the field (default: `true`) + * `:fast` - Whether to enable fast field storage (default: `false`) + + ## Examples + + iex> schema = Muninn.Schema.new() + iex> schema = Muninn.Schema.add_bytes_field(schema, "payload", stored: true) + iex> hd(schema.fields).type + :bytes + + """ + @spec add_bytes_field(t(), String.t(), keyword()) :: t() + def add_bytes_field(%__MODULE__{fields: fields} = schema, name, opts \\ []) do + field = Field.new(:bytes, name, opts) + %{schema | fields: fields ++ [field]} + end + @doc """ Validates the schema. diff --git a/lib/muninn/schema/field.ex b/lib/muninn/schema/field.ex index 665f1f9..d46e2dd 100644 --- a/lib/muninn/schema/field.ex +++ b/lib/muninn/schema/field.ex @@ -9,10 +9,12 @@ defmodule Muninn.Schema.Field do type: field_type(), name: String.t(), stored: boolean(), - indexed: boolean() + indexed: boolean(), + fast: boolean(), + tokenizer: String.t() | nil } - defstruct [:type, :name, stored: false, indexed: true] + defstruct [:type, :name, stored: false, indexed: true, fast: false, tokenizer: nil] @doc """ Creates a new field. @@ -21,6 +23,10 @@ defmodule Muninn.Schema.Field do * `:stored` - Whether to store the field value (default: `false`) * `:indexed` - Whether to index the field (default: `true`) + * `:fast` - Whether to enable fast field (columnar) storage (default: `false`). 
+ Required for sorting and aggregations on numeric fields. + * `:tokenizer` - Tokenizer to use for text fields (default: `nil`, uses `"default"`). + Built-in options: `"default"`, `"raw"`, `"en_stem"`, `"whitespace"`. """ @spec new(field_type(), String.t(), keyword()) :: t() @@ -29,7 +35,9 @@ defmodule Muninn.Schema.Field do type: type, name: name, stored: Keyword.get(opts, :stored, false), - indexed: Keyword.get(opts, :indexed, true) + indexed: Keyword.get(opts, :indexed, true), + fast: Keyword.get(opts, :fast, false), + tokenizer: Keyword.get(opts, :tokenizer, nil) } end @@ -42,7 +50,9 @@ defmodule Muninn.Schema.Field do type: Atom.to_string(field.type), name: field.name, stored: field.stored, - indexed: field.indexed + indexed: field.indexed, + fast: field.fast, + tokenizer: field.tokenizer || "default" } end end diff --git a/lib/muninn/searcher.ex b/lib/muninn/searcher.ex index 5f126a2..fa70731 100644 --- a/lib/muninn/searcher.ex +++ b/lib/muninn/searcher.ex @@ -682,4 +682,206 @@ defmodule Muninn.Searcher do ) end end + + @doc """ + Counts documents matching a query without retrieving them. + + This is more efficient than a full search when you only need the count. + + ## Parameters + + * `searcher` - The searcher to use + * `query_string` - The query string with natural syntax + * `default_fields` - List of field names to search when no field is specified + + ## Examples + + {:ok, count} = Muninn.Searcher.count(searcher, "elixir", ["title", "content"]) + + """ + @spec count(t(), String.t(), list(String.t())) :: + {:ok, non_neg_integer()} | {:error, String.t()} + def count(searcher, query_string, default_fields) + when is_binary(query_string) and is_list(default_fields) do + Native.searcher_count(searcher, query_string, default_fields) + end + + @doc """ + Performs a regex search on a specific field. + + Uses Tantivy's regex engine (based on `tantivy-fst`). Note that regex patterns + match against indexed (lowercased, tokenized) terms. 
+ + The query parser also supports `/regex/` syntax via `search_query/4`. + + ## Parameters + + * `searcher` - The searcher to use + * `field_name` - The text field to search in + * `pattern` - The regex pattern + * `opts` - Keyword list of options: + - `:limit` - Maximum number of results (default: 10) + + ## Examples + + {:ok, results} = Muninn.Searcher.search_regex(searcher, "title", "elix.*") + + """ + @spec search_regex(t(), String.t(), String.t(), keyword()) :: + {:ok, map()} | {:error, String.t()} + def search_regex(searcher, field_name, pattern, opts \\ []) + when is_binary(field_name) and is_binary(pattern) do + limit = Keyword.get(opts, :limit, 10) + Native.searcher_search_regex(searcher, field_name, pattern, limit) + end + + @doc """ + Finds documents similar to the provided document fields. + + Uses Tantivy's MoreLikeThis query to find documents with similar term distributions. + + ## Parameters + + * `searcher` - The searcher to use + * `document_fields` - A map of field name to text value representing the reference document + * `opts` - Keyword list of options: + - `:min_doc_freq` - Ignore terms appearing in fewer docs (default: 1) + - `:min_term_freq` - Ignore terms less frequent than this (default: 1) + - `:max_doc_freq` - Ignore terms in more docs than this (default: `:unlimited`) + - `:min_word_length` - Minimum word length (default: 0, no minimum) + - `:max_word_length` - Maximum word length (default: 0, no maximum) + - `:max_query_terms` - Maximum terms in the generated query (default: 25) + - `:boost_factor` - Score boost factor (default: 1.0) + - `:limit` - Maximum results (default: 10) + + ## Examples + + {:ok, results} = Muninn.Searcher.search_more_like_this( + searcher, + %{"title" => "Elixir programming", "content" => "Functional programming with Elixir"}, + min_doc_freq: 1, + min_term_freq: 1, + limit: 5 + ) + + """ + @spec search_more_like_this(t(), map(), keyword()) :: {:ok, map()} | {:error, String.t()} + def 
search_more_like_this(searcher, document_fields, opts \\ []) + when is_map(document_fields) do + min_doc_freq = Keyword.get(opts, :min_doc_freq, 1) + min_term_freq = Keyword.get(opts, :min_term_freq, 1) + + max_doc_freq = + case Keyword.get(opts, :max_doc_freq, :unlimited) do + :unlimited -> 18_446_744_073_709_551_615 + val -> val + end + + min_word_length = Keyword.get(opts, :min_word_length, 0) + max_word_length = Keyword.get(opts, :max_word_length, 0) + max_query_terms = Keyword.get(opts, :max_query_terms, 25) + boost_factor = Keyword.get(opts, :boost_factor, 1.0) + limit = Keyword.get(opts, :limit, 10) + + Native.searcher_search_more_like_this( + searcher, + document_fields, + min_doc_freq, + min_term_freq, + max_doc_freq, + min_word_length, + max_word_length, + max_query_terms, + boost_factor, + limit + ) + end + + @doc """ + Executes a search sorted by a fast field value instead of relevance score. + + Requires the sort field to be a numeric type (u64, i64, f64) with `fast: true` + in the schema. + + ## Parameters + + * `searcher` - The searcher to use + * `query_string` - The query string with natural syntax + * `default_fields` - List of field names to search when no field is specified + * `sort_field` - Name of the fast field to sort by + * `opts` - Keyword list of options: + - `:reverse` - Sort descending when `true` (default: `false` for ascending) + - `:limit` - Maximum number of results (default: 10) + + ## Returns + + Results include `"sort_value"` (the fast field value) instead of `"score"`. 
+ + ## Examples + + {:ok, results} = Muninn.Searcher.search_query_sorted( + searcher, + "*", + ["title"], + "price", + reverse: true, + limit: 10 + ) + + """ + @spec search_query_sorted(t(), String.t(), list(String.t()), String.t(), keyword()) :: + {:ok, map()} | {:error, String.t()} + def search_query_sorted(searcher, query_string, default_fields, sort_field, opts \\ []) + when is_binary(query_string) and is_list(default_fields) and is_binary(sort_field) do + limit = Keyword.get(opts, :limit, 10) + reverse = Keyword.get(opts, :reverse, false) + + Native.searcher_search_query_sorted( + searcher, + query_string, + default_fields, + sort_field, + reverse, + limit + ) + end + + @doc """ + Executes aggregations over documents matching a query. + + Uses Tantivy's aggregation framework. Aggregated fields must have `fast: true` + in the schema. + + ## Parameters + + * `searcher` - The searcher to use + * `query_string` - Query to scope which documents are aggregated (use `"*"` for all) + * `default_fields` - Default fields for the query parser + * `aggregations` - Aggregation request as a map (from builder DSL) or JSON string + * `opts` - Reserved for future options + + ## Examples + + aggs = %{ + "avg_price" => %{"avg" => %{"field" => "price"}} + } + + {:ok, results} = Muninn.Searcher.aggregate(searcher, "*", ["title"], aggs) + + """ + @spec aggregate(t(), String.t(), list(String.t()), map() | String.t(), keyword()) :: + {:ok, map()} | {:error, String.t()} + def aggregate(searcher, query_string, default_fields, aggregations, _opts \\ []) + when is_binary(query_string) and is_list(default_fields) do + aggs_json = + case aggregations do + json when is_binary(json) -> json + map when is_map(map) -> Jason.encode!(map) + end + + case Native.searcher_aggregate(searcher, query_string, default_fields, aggs_json) do + {:ok, result_json} -> {:ok, Jason.decode!(result_json)} + {:error, _} = error -> error + end + end end diff --git a/mix.exs b/mix.exs index 82bf4fb..042a8de 100644 
--- a/mix.exs +++ b/mix.exs @@ -1,7 +1,7 @@ defmodule Muninn.MixProject do use Mix.Project - @version "0.5.3" + @version "0.5.4" @source_url "https://github.com/nyo16/muninn" def project do diff --git a/native/muninn/Cargo.lock b/native/muninn/Cargo.lock index 9a94f32..a476af6 100644 --- a/native/muninn/Cargo.lock +++ b/native/muninn/Cargo.lock @@ -471,9 +471,9 @@ dependencies = [ [[package]] name = "lz4_flex" -version = "0.12.0" +version = "0.13.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ab6473172471198271ff72e9379150e9dfd70d8e533e0752a27e515b48dd375e" +checksum = "db9a0d582c2874f68138a16ce1867e0ffde6c0bb0a0df85e1f36d04146db488a" [[package]] name = "measure_time" @@ -507,10 +507,11 @@ checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" [[package]] name = "muninn" -version = "0.5.2" +version = "0.5.3" dependencies = [ "regex", "rustler", + "serde_json", "tantivy", ] @@ -532,9 +533,9 @@ dependencies = [ [[package]] name = "num-conv" -version = "0.1.0" +version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "51d515d32fb182ee37cda2ccdcb92950d6a3c2893aa280e540671c2cd0f3b1d9" +checksum = "c6673768db2d862beb9b39a78fdcb1a69439615d5794a1be50caa9bc92c81967" [[package]] name = "num-traits" @@ -569,7 +570,8 @@ dependencies = [ [[package]] name = "ownedbytes" version = "0.9.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2fbd56f7631767e61784dc43f8580f403f4475bd4aaa4da003e6295e1bab4a7e" dependencies = [ "stable_deref_trait", ] @@ -798,9 +800,9 @@ checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" [[package]] name = "sketches-ddsketch" -version = "0.3.0" +version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"c1e9a774a6c28142ac54bb25d25562e6bcf957493a184f15ad4eebccb23e410a" +checksum = "05e40b6cf54d988dc1a2223531b969c9a9e30906ad90ef64890c27b4bfbb46ea" dependencies = [ "serde", ] @@ -843,7 +845,8 @@ dependencies = [ [[package]] name = "tantivy" version = "0.26.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "778da245841522199d512d19511b041425d8cff3a8f262b4e1516fceb050289a" dependencies = [ "aho-corasick", "arc-swap", @@ -894,16 +897,18 @@ dependencies = [ [[package]] name = "tantivy-bitpacker" -version = "0.9.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.10.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4fed3d674429bcd2de5d0a6d1aa5495fed8afd9c5ecce993019caf7615f53fa4" dependencies = [ "bitpacking", ] [[package]] name = "tantivy-columnar" -version = "0.6.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c57166f5bcfd478f370ab8445afb4678dce44801fa5ce5c451aaf8595583c5dc" dependencies = [ "downcast-rs", "fastdivide", @@ -917,8 +922,9 @@ dependencies = [ [[package]] name = "tantivy-common" -version = "0.10.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bbf10915aa75da3c3b0d58b58853d2e889efbaf32d4982a4c3715dde6bba23e5" dependencies = [ "async-trait", "byteorder", @@ -940,8 +946,9 @@ dependencies = [ [[package]] name = "tantivy-query-grammar" -version = "0.25.0" -source = 
"git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.26.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dfadb8526b6da90704feb293b0701a6aae62ea14983143344be2dc5ce30f1d82" dependencies = [ "fnv", "nom", @@ -952,8 +959,9 @@ dependencies = [ [[package]] name = "tantivy-sstable" -version = "0.6.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8a2cfc3ac5164cbadc28965ffb145a8f47582a60ae5897859ad8d4316596c606" dependencies = [ "futures-util", "itertools", @@ -965,8 +973,9 @@ dependencies = [ [[package]] name = "tantivy-stacker" -version = "0.6.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6cbb051742da9d53ca9e8fff43a9b10e319338b24e2c0e15d0372df19ffeb951" dependencies = [ "murmurhash32", "tantivy-common", @@ -974,8 +983,9 @@ dependencies = [ [[package]] name = "tantivy-tokenizer-api" -version = "0.6.0" -source = "git+https://github.com/quickwit-oss/tantivy?rev=51f340f83d06680fc2e231481fa10115f3bc2b7b#51f340f83d06680fc2e231481fa10115f3bc2b7b" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eac258c2c6390673f2685813afeeafcb8c4e0ee7de8dd3fc46838dcc37263f98" dependencies = [ "serde", ] @@ -1015,30 +1025,30 @@ dependencies = [ [[package]] name = "time" -version = "0.3.44" +version = "0.3.47" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "91e7d9e3bb61134e77bde20dd4825b97c010155709965fedf0f49bb138e52a9d" +checksum = "743bd48c283afc0388f9b8827b976905fb217ad9e647fae3a379a9283c4def2c" dependencies 
= [ "deranged", "itoa", "num-conv", "powerfmt", - "serde", + "serde_core", "time-core", "time-macros", ] [[package]] name = "time-core" -version = "0.1.6" +version = "0.1.8" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "40868e7c1d2f0b8d73e4a8c7f0ff63af4f6d19be117e90bd73eb1d62cf831c6b" +checksum = "7694e1cfe791f8d31026952abf09c69ca6f6fa4e1a1229e18988f06a04a12dca" [[package]] name = "time-macros" -version = "0.2.24" +version = "0.2.27" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "30cfb0125f12d9c277f35663a0a33f8c30190f4e4574868a330595412d34ebf3" +checksum = "2e70e4c5a0e0a8a4823ad65dfe1a6930e4f4d756dcd9dd7939022b5e8c501215" dependencies = [ "num-conv", "time-core", diff --git a/native/muninn/Cargo.toml b/native/muninn/Cargo.toml index cf1d8cd..576dec1 100644 --- a/native/muninn/Cargo.toml +++ b/native/muninn/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "muninn" -version = "0.5.3" +version = "0.5.4" edition = "2021" [lib] @@ -9,8 +9,9 @@ crate-type = ["cdylib"] [dependencies] rustler = "0.37.0" -tantivy = { git = "https://github.com/quickwit-oss/tantivy", rev = "51f340f83d06680fc2e231481fa10115f3bc2b7b" } +tantivy = "0.26.0" regex = "1.11" +serde_json = "1" [features] default = ["nif_version_2_15"] diff --git a/native/muninn/src/aggregation.rs b/native/muninn/src/aggregation.rs new file mode 100644 index 0000000..89d02e1 --- /dev/null +++ b/native/muninn/src/aggregation.rs @@ -0,0 +1,56 @@ +use rustler::ResourceArc; +use tantivy::aggregation::agg_req::Aggregations; +use tantivy::aggregation::{AggContextParams, AggregationCollector}; +use tantivy::query::{AllQuery, QueryParser}; + +use crate::searcher::SearcherResource; + +/// Executes aggregations over documents matching a query. +/// Takes a JSON string for the aggregation request, returns a JSON string for results. 
+pub fn searcher_aggregate( + searcher_res: ResourceArc<SearcherResource>, + query_string: String, + default_fields: Vec<String>, + aggs_json: String, +) -> Result<String, String> { + let searcher = &searcher_res.searcher; + let schema = searcher.index().schema(); + + // Parse the query - special case "*" as AllQuery + let query: Box<dyn tantivy::query::Query> = if query_string == "*" { + Box::new(AllQuery) + } else { + let mut fields = Vec::new(); + for field_name in &default_fields { + let field = schema + .get_field(field_name) + .map_err(|_| format!("Field '{}' not found in schema", field_name))?; + fields.push(field); + } + + if fields.is_empty() { + return Err("At least one default field must be provided".to_string()); + } + + let query_parser = QueryParser::for_index(searcher.index(), fields); + query_parser + .parse_query(&query_string) + .map_err(|e| format!("Failed to parse query '{}': {}", query_string, e))? + }; + + // Parse the aggregation request from JSON + let agg_req: Aggregations = serde_json::from_str(&aggs_json) + .map_err(|e| format!("Failed to parse aggregation request: {}", e))?; + + // Create collector and execute + let context = AggContextParams::default(); + let collector = AggregationCollector::from_aggs(agg_req, context); + + let agg_result = searcher + .search(&*query, &collector) + .map_err(|e| format!("Aggregation failed: {}", e))?; + + // Serialize result to JSON + serde_json::to_string(&agg_result) + .map_err(|e| format!("Failed to serialize aggregation result: {}", e)) +} diff --git a/native/muninn/src/lib.rs b/native/muninn/src/lib.rs index 637ba39..f8f9f71 100644 --- a/native/muninn/src/lib.rs +++ b/native/muninn/src/lib.rs @@ -13,6 +13,7 @@ mod atoms { } } +mod aggregation; mod index; mod reader; mod schema; @@ -266,6 +267,86 @@ fn searcher_search_fuzzy_with_snippets<'a>( ) } +#[rustler::nif] +fn searcher_count( + searcher: rustler::ResourceArc<searcher::SearcherResource>, + query_string: String, + default_fields: Vec<String>, +) -> Result<u64, String> { + searcher::searcher_count(searcher, query_string, default_fields) +} + +#[rustler::nif]
+fn searcher_search_regex<'a>( + env: rustler::Env<'a>, + searcher: rustler::ResourceArc<searcher::SearcherResource>, + field_name: String, + pattern: String, + limit: usize, +) -> Result<rustler::Term<'a>, String> { + searcher::searcher_search_regex(env, searcher, field_name, pattern, limit) +} + +#[rustler::nif] +fn searcher_search_more_like_this<'a>( + env: rustler::Env<'a>, + searcher: rustler::ResourceArc<searcher::SearcherResource>, + document_fields: std::collections::HashMap<String, String>, + min_doc_freq: u64, + min_term_freq: usize, + max_doc_freq: u64, + min_word_length: usize, + max_word_length: usize, + max_query_terms: usize, + boost_factor: f32, + limit: usize, +) -> Result<rustler::Term<'a>, String> { + searcher::searcher_search_more_like_this( + env, + searcher, + document_fields, + min_doc_freq, + min_term_freq, + max_doc_freq, + min_word_length, + max_word_length, + max_query_terms, + boost_factor, + limit, + ) +} + +#[rustler::nif] +fn searcher_search_query_sorted<'a>( + env: rustler::Env<'a>, + searcher: rustler::ResourceArc<searcher::SearcherResource>, + query_string: String, + default_fields: Vec<String>, + sort_field: String, + reverse: bool, + limit: usize, +) -> Result<rustler::Term<'a>, String> { + searcher::searcher_search_query_sorted( + env, + searcher, + query_string, + default_fields, + sort_field, + reverse, + limit, + ) +} + +#[rustler::nif(schedule = "DirtyCpu")] +fn searcher_aggregate( + searcher: rustler::ResourceArc<searcher::SearcherResource>, + query_string: String, + default_fields: Vec<String>, + aggs_json: String, +) -> Result<String, String> { + aggregation::searcher_aggregate(searcher, query_string, default_fields, aggs_json) +} + rustler::init!("Elixir.Muninn.Native", load = on_load); fn on_load(env: rustler::Env, _info: rustler::Term) -> bool { diff --git a/native/muninn/src/schema.rs b/native/muninn/src/schema.rs index 8ec0af3..a02e97f 100644 --- a/native/muninn/src/schema.rs +++ b/native/muninn/src/schema.rs @@ -1,24 +1,44 @@ use rustler::{Env, ResourceArc}; -use tantivy::schema::{NumericOptions, Schema, SchemaBuilder, TextFieldIndexing, TextOptions}; +use tantivy::schema::{ + BytesOptions, NumericOptions, Schema,
SchemaBuilder, TextFieldIndexing, TextOptions, +}; /// Resource wrapper for Tantivy Schema pub struct SchemaResource { pub schema: Schema, } -/// Field definition from Elixir - Using tuple (name, type, stored, indexed) -pub type FieldDef = (String, String, bool, bool); +/// Field definition from Elixir - Using tuple (name, type, stored, indexed, fast, tokenizer) +pub type FieldDef = (String, String, bool, bool, bool, String); /// Schema definition from Elixir - Using list of field definitions pub type SchemaDef = Vec; +/// Known built-in tokenizers +const KNOWN_TOKENIZERS: &[&str] = &["default", "raw", "en_stem", "whitespace"]; + /// Creates a Tantivy schema from the Elixir schema definition pub fn build_schema(schema_def: SchemaDef) -> Result { let mut schema_builder = SchemaBuilder::new(); - for (name, field_type, stored, indexed) in schema_def { + for (name, field_type, stored, indexed, fast, tokenizer) in schema_def { match field_type.as_str() { "text" => { + // Validate tokenizer + let tokenizer_name = if tokenizer.is_empty() { + "default" + } else { + &tokenizer + }; + + if !KNOWN_TOKENIZERS.contains(&tokenizer_name) { + return Err(format!( + "Unknown tokenizer '{}'. 
Known tokenizers: {}", + tokenizer_name, + KNOWN_TOKENIZERS.join(", ") + )); + } + let mut text_options = TextOptions::default(); if stored { @@ -27,13 +47,17 @@ pub fn build_schema(schema_def: SchemaDef) -> Result { if indexed { let indexing = TextFieldIndexing::default() - .set_tokenizer("default") + .set_tokenizer(tokenizer_name) .set_index_option( tantivy::schema::IndexRecordOption::WithFreqsAndPositions, ); text_options = text_options.set_indexing_options(indexing); } + if fast { + text_options = text_options.set_fast(Some("raw")); + } + schema_builder.add_text_field(&name, text_options); } "u64" | "i64" | "f64" => { @@ -47,6 +71,10 @@ pub fn build_schema(schema_def: SchemaDef) -> Result { numeric_options = numeric_options.set_indexed(); } + if fast { + numeric_options = numeric_options.set_fast(); + } + match field_type.as_str() { "u64" => schema_builder.add_u64_field(&name, numeric_options), "i64" => schema_builder.add_i64_field(&name, numeric_options), @@ -65,8 +93,29 @@ pub fn build_schema(schema_def: SchemaDef) -> Result { bool_options = bool_options.set_indexed(); } + if fast { + bool_options = bool_options.set_fast(); + } + schema_builder.add_bool_field(&name, bool_options); } + "bytes" => { + let mut bytes_options = BytesOptions::default(); + + if stored { + bytes_options = bytes_options.set_stored(); + } + + if indexed { + bytes_options = bytes_options.set_indexed(); + } + + if fast { + bytes_options = bytes_options.set_fast(); + } + + schema_builder.add_bytes_field(&name, bytes_options); + } _ => { return Err(format!("Unsupported field type: {}", field_type)); } diff --git a/native/muninn/src/searcher.rs b/native/muninn/src/searcher.rs index 05136eb..d916ef0 100644 --- a/native/muninn/src/searcher.rs +++ b/native/muninn/src/searcher.rs @@ -2,11 +2,13 @@ use rustler::{Env, ResourceArc}; use std::collections::HashMap; use std::ops::Bound; use std::panic::RefUnwindSafe; -use tantivy::collector::TopDocs; -use tantivy::query::{FuzzyTermQuery, Query, 
QueryParser, RangeQuery, RegexQuery, TermQuery}; -use tantivy::schema::FieldType; +use tantivy::collector::{Count, TopDocs}; +use tantivy::query::{ + FuzzyTermQuery, MoreLikeThisQuery, Query, QueryParser, RangeQuery, RegexQuery, TermQuery, +}; +use tantivy::schema::{FieldType, OwnedValue}; use tantivy::snippet::SnippetGenerator; -use tantivy::{Searcher, TantivyDocument, Term}; +use tantivy::{Order, Searcher, TantivyDocument, Term}; use crate::reader::ReaderResource; @@ -691,7 +693,13 @@ fn document_to_hit_map<'a>( tantivy::schema::OwnedValue::Bool(b) => { doc_fields.insert(field_name, b.encode(env)); } - _ => {} // Skip unsupported types + tantivy::schema::OwnedValue::Bytes(ref b) => { + let mut binary = rustler::NewBinary::new(env, b.len()); + binary.as_mut_slice().copy_from_slice(b); + let bin: rustler::Binary = binary.into(); + doc_fields.insert(field_name, bin.encode(env)); + } + _ => {} } } } @@ -747,6 +755,12 @@ fn document_to_hit_map_with_snippets<'a>( tantivy::schema::OwnedValue::Bool(b) => { doc_fields.insert(field_name, b.encode(env)); } + tantivy::schema::OwnedValue::Bytes(ref b) => { + let mut binary = rustler::NewBinary::new(env, b.len()); + binary.as_mut_slice().copy_from_slice(b); + let bin: rustler::Binary = binary.into(); + doc_fields.insert(field_name, bin.encode(env)); + } _ => {} // Skip unsupported types } } @@ -779,6 +793,330 @@ fn document_to_hit_map_with_snippets<'a>( .unwrap() } +/// Counts documents matching a query without retrieving them +pub fn searcher_count( + searcher_res: ResourceArc<SearcherResource>, + query_string: String, + default_fields: Vec<String>, +) -> Result<u64, String> { + let searcher = &searcher_res.searcher; + let schema = searcher.index().schema(); + + let mut fields = Vec::new(); + for field_name in &default_fields { + let field = schema + .get_field(field_name) + .map_err(|_| format!("Field '{}' not found in schema", field_name))?; + fields.push(field); + } + + if fields.is_empty() { + return Err("At least one default field must be
provided".to_string()); + } + + let query_parser = QueryParser::for_index(searcher.index(), fields); + let query = query_parser + .parse_query(&query_string) + .map_err(|e| format!("Failed to parse query '{}': {}", query_string, e))?; + + let count = searcher + .search(&*query, &Count) + .map_err(|e| format!("Count failed: {}", e))?; + + Ok(count as u64) +} + +/// Performs a regex query on a specific field +pub fn searcher_search_regex<'a>( + env: rustler::Env<'a>, + searcher_res: ResourceArc<SearcherResource>, + field_name: String, + pattern: String, + limit: usize, +) -> Result<rustler::Term<'a>, String> { + let searcher = &searcher_res.searcher; + let schema = searcher.index().schema(); + + let field = schema + .get_field(&field_name) + .map_err(|_| format!("Field '{}' not found in schema", field_name))?; + + let field_entry = schema.get_field_entry(field); + if !matches!(field_entry.field_type(), FieldType::Str(_)) { + return Err(format!( + "Field '{}' is not a text field. Regex search only works on text fields.", + field_name + )); + } + + let regex_query = RegexQuery::from_pattern(&pattern, field) + .map_err(|e| format!("Invalid regex pattern '{}': {}", pattern, e))?; + + execute_query(env, searcher, &schema, &regex_query, limit) +} + +/// Performs a MoreLikeThis query to find similar documents +pub fn searcher_search_more_like_this<'a>( + env: rustler::Env<'a>, + searcher_res: ResourceArc<SearcherResource>, + document_fields: HashMap<String, String>, + min_doc_freq: u64, + min_term_freq: usize, + max_doc_freq: u64, + min_word_length: usize, + max_word_length: usize, + max_query_terms: usize, + boost_factor: f32, + limit: usize, +) -> Result<rustler::Term<'a>, String> { + let searcher = &searcher_res.searcher; + let schema = searcher.index().schema(); + + if document_fields.is_empty() { + return Err("Document fields map cannot be empty".to_string()); + } + + // Convert field name/value pairs to (Field, Vec<OwnedValue>) + let mut doc_fields_vec: Vec<(tantivy::schema::Field, Vec<OwnedValue>)> = Vec::new(); + for (field_name, text_value) in &document_fields { + let field =
schema + .get_field(field_name) + .map_err(|_| format!("Field '{}' not found in schema", field_name))?; + doc_fields_vec.push((field, vec![OwnedValue::Str(text_value.clone())])); + } + + // Build the MoreLikeThis query + let mut builder = MoreLikeThisQuery::builder() + .with_min_doc_frequency(min_doc_freq) + .with_min_term_frequency(min_term_freq) + .with_max_query_terms(max_query_terms) + .with_boost_factor(boost_factor); + + if max_doc_freq < u64::MAX { + builder = builder.with_max_doc_frequency(max_doc_freq); + } + + if min_word_length > 0 { + builder = builder.with_min_word_length(min_word_length); + } + + if max_word_length > 0 { + builder = builder.with_max_word_length(max_word_length); + } + + let query = builder.with_document_fields(doc_fields_vec); + + execute_query(env, searcher, &schema, &query, limit) +} + +/// Performs a search sorted by a fast field value instead of relevance score +pub fn searcher_search_query_sorted<'a>( + env: rustler::Env<'a>, + searcher_res: ResourceArc<SearcherResource>, + query_string: String, + default_fields: Vec<String>, + sort_field: String, + reverse: bool, + limit: usize, +) -> Result<rustler::Term<'a>, String> { + let searcher = &searcher_res.searcher; + let schema = searcher.index().schema(); + + // Parse the query + let mut fields = Vec::new(); + for field_name in &default_fields { + let field = schema + .get_field(field_name) + .map_err(|_| format!("Field '{}' not found in schema", field_name))?; + fields.push(field); + } + + if fields.is_empty() { + return Err("At least one default field must be provided".to_string()); + } + + let query_parser = QueryParser::for_index(searcher.index(), fields); + let query = query_parser + .parse_query(&query_string) + .map_err(|e| format!("Failed to parse query '{}': {}", query_string, e))?; + + // Resolve sort field + let sort_field_ref = schema + .get_field(&sort_field) + .map_err(|_| format!("Sort field '{}' not found in schema", sort_field))?; + + let field_entry = schema.get_field_entry(sort_field_ref); + let order = if
reverse { Order::Desc } else { Order::Asc }; + + use rustler::types::map; + use rustler::Encoder; + + // Dispatch based on field type + match field_entry.field_type() { + FieldType::U64(_) => { + let collector = + TopDocs::with_limit(limit).order_by_fast_field::<u64>(&sort_field, order); + let top_docs = searcher + .search(&*query, &collector) + .map_err(|e| format!("Search failed: {}", e))?; + + let total_hits = top_docs.len(); + let mut hits = Vec::new(); + + for (sort_value, doc_address) in top_docs { + let doc: TantivyDocument = searcher + .doc(doc_address) + .map_err(|e| format!("Failed to retrieve document: {}", e))?; + + let mut doc_fields: HashMap<String, rustler::Term> = HashMap::new(); + build_doc_fields(env, &schema, &doc, &mut doc_fields); + let doc_map = doc_fields.encode(env); + + let hit = map::map_new(env) + .map_put("sort_value".encode(env), sort_value.encode(env)) + .ok() + .unwrap() + .map_put("doc".encode(env), doc_map) + .ok() + .unwrap(); + hits.push(hit); + } + + let result = map::map_new(env) + .map_put("total_hits".encode(env), total_hits.encode(env)) + .ok() + .unwrap() + .map_put("hits".encode(env), hits.encode(env)) + .ok() + .unwrap(); + Ok(result) + } + FieldType::I64(_) => { + let collector = + TopDocs::with_limit(limit).order_by_fast_field::<i64>(&sort_field, order); + let top_docs = searcher + .search(&*query, &collector) + .map_err(|e| format!("Search failed: {}", e))?; + + let total_hits = top_docs.len(); + let mut hits = Vec::new(); + + for (sort_value, doc_address) in top_docs { + let doc: TantivyDocument = searcher + .doc(doc_address) + .map_err(|e| format!("Failed to retrieve document: {}", e))?; + + let mut doc_fields: HashMap<String, rustler::Term> = HashMap::new(); + build_doc_fields(env, &schema, &doc, &mut doc_fields); + let doc_map = doc_fields.encode(env); + + let hit = map::map_new(env) + .map_put("sort_value".encode(env), sort_value.encode(env)) + .ok() + .unwrap() + .map_put("doc".encode(env), doc_map) + .ok() + .unwrap(); + hits.push(hit); + } + + let result =
map::map_new(env) + .map_put("total_hits".encode(env), total_hits.encode(env)) + .ok() + .unwrap() + .map_put("hits".encode(env), hits.encode(env)) + .ok() + .unwrap(); + Ok(result) + } + FieldType::F64(_) => { + let collector = + TopDocs::with_limit(limit).order_by_fast_field::<f64>(&sort_field, order); + let top_docs = searcher + .search(&*query, &collector) + .map_err(|e| format!("Search failed: {}", e))?; + + let total_hits = top_docs.len(); + let mut hits = Vec::new(); + + for (sort_value, doc_address) in top_docs { + let doc: TantivyDocument = searcher + .doc(doc_address) + .map_err(|e| format!("Failed to retrieve document: {}", e))?; + + let mut doc_fields: HashMap<String, rustler::Term> = HashMap::new(); + build_doc_fields(env, &schema, &doc, &mut doc_fields); + let doc_map = doc_fields.encode(env); + + let hit = map::map_new(env) + .map_put("sort_value".encode(env), sort_value.encode(env)) + .ok() + .unwrap() + .map_put("doc".encode(env), doc_map) + .ok() + .unwrap(); + hits.push(hit); + } + + let result = map::map_new(env) + .map_put("total_hits".encode(env), total_hits.encode(env)) + .ok() + .unwrap() + .map_put("hits".encode(env), hits.encode(env)) + .ok() + .unwrap(); + Ok(result) + } + _ => Err(format!( + "Sort field '{}' must be a numeric type (u64, i64, f64)", + sort_field + )), + } +} + +/// Helper to extract document fields into a HashMap for Elixir encoding +fn build_doc_fields<'a>( + env: rustler::Env<'a>, + schema: &tantivy::schema::Schema, + doc: &TantivyDocument, + doc_fields: &mut HashMap<String, rustler::Term<'a>>, +) { + use rustler::Encoder; + + for field in schema.fields() { + let field_name = field.1.name().to_string(); + let values: Vec<_> = doc.get_all(field.0).collect(); + + if let Some(value) = values.first() { + let owned_value: tantivy::schema::OwnedValue = (*value).into(); + match owned_value { + tantivy::schema::OwnedValue::Str(s) => { + doc_fields.insert(field_name, s.as_str().encode(env)); + } + tantivy::schema::OwnedValue::U64(n) => { + doc_fields.insert(field_name,
n.encode(env)); + } + tantivy::schema::OwnedValue::I64(n) => { + doc_fields.insert(field_name, n.encode(env)); + } + tantivy::schema::OwnedValue::F64(n) => { + doc_fields.insert(field_name, n.encode(env)); + } + tantivy::schema::OwnedValue::Bool(b) => { + doc_fields.insert(field_name, b.encode(env)); + } + tantivy::schema::OwnedValue::Bytes(ref b) => { + let mut binary = rustler::NewBinary::new(env, b.len()); + binary.as_mut_slice().copy_from_slice(b); + let bin: rustler::Binary = binary.into(); + doc_fields.insert(field_name, bin.encode(env)); + } + _ => {} + } + } + } +} + pub fn load(env: Env) -> bool { rustler::resource!(SearcherResource, env); true diff --git a/native/muninn/src/writer.rs b/native/muninn/src/writer.rs index 3d4ad41..8901438 100644 --- a/native/muninn/src/writer.rs +++ b/native/muninn/src/writer.rs @@ -66,6 +66,11 @@ pub fn writer_add_document( tantivy_doc.add_bool(field, bool_val); } } + FieldType::Bytes(_) => { + if let Ok(bin) = value.decode::<rustler::Binary>() { + tantivy_doc.add_bytes(field, bin.as_slice()); + } + } _ => { // Unsupported field type, skip } diff --git a/test/muninn/aggregation_integration_test.exs b/test/muninn/aggregation_integration_test.exs new file mode 100644 index 0000000..3236232 --- /dev/null +++ b/test/muninn/aggregation_integration_test.exs @@ -0,0 +1,214 @@ +defmodule Muninn.AggregationIntegrationTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + alias Muninn.Aggregation + alias Muninn.Aggregation.{Bucket, Metric} + + setup do + test_path = "/tmp/muninn_agg_int_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + defp create_product_index(test_path) do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true) + |> Schema.add_text_field("category", stored: true, tokenizer: "raw", fast: true) + |> Schema.add_f64_field("price", stored: true, fast: true) + |>
Schema.add_u64_field("quantity", stored: true, fast: true) + + {:ok, index} = Index.create(test_path, schema) + + products = [ + %{"title" => "laptop", "category" => "electronics", "price" => 999.0, "quantity" => 5}, + %{"title" => "phone", "category" => "electronics", "price" => 699.0, "quantity" => 20}, + %{"title" => "tablet", "category" => "electronics", "price" => 499.0, "quantity" => 15}, + %{"title" => "shirt", "category" => "clothing", "price" => 29.0, "quantity" => 100}, + %{"title" => "pants", "category" => "clothing", "price" => 49.0, "quantity" => 80}, + %{"title" => "book", "category" => "books", "price" => 15.0, "quantity" => 200} + ] + + Enum.each(products, &IndexWriter.add_document(index, &1)) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + searcher + end + + describe "stats aggregation" do + test "computes stats over all documents", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add("price_stats", Metric.stats("price")) + + {:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + + stats = results["price_stats"] + assert stats["count"] == 6 + assert stats["min"] == 15.0 + assert stats["max"] == 999.0 + assert_in_delta stats["avg"], 381.66, 1.0 + end + end + + describe "avg aggregation" do + test "computes average price", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add("avg_price", Metric.avg("price")) + + {:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + assert_in_delta results["avg_price"]["value"], 381.66, 1.0 + end + end + + describe "terms aggregation" do + test "groups by category", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add("by_category", Bucket.terms("category", size: 10)) + + {:ok, results} = 
Searcher.aggregate(searcher, "*", ["title"], aggs) + + buckets = results["by_category"]["buckets"] + assert length(buckets) == 3 + + electronics = Enum.find(buckets, &(&1["key"] == "electronics")) + assert electronics["doc_count"] == 3 + + clothing = Enum.find(buckets, &(&1["key"] == "clothing")) + assert clothing["doc_count"] == 2 + + books = Enum.find(buckets, &(&1["key"] == "books")) + assert books["doc_count"] == 1 + end + end + + describe "nested aggregations" do + test "stats per category", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add( + "by_category", + Bucket.terms("category", size: 10) + |> Aggregation.sub("price_stats", Metric.stats("price")) + ) + + {:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + + buckets = results["by_category"]["buckets"] + electronics = Enum.find(buckets, &(&1["key"] == "electronics")) + assert electronics["price_stats"]["count"] == 3 + assert electronics["price_stats"]["min"] == 499.0 + assert electronics["price_stats"]["max"] == 999.0 + end + end + + describe "range aggregation" do + test "groups by price ranges", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add( + "price_ranges", + Bucket.range("price", [ + %{"to" => 50.0}, + %{"from" => 50.0, "to" => 500.0}, + %{"from" => 500.0} + ]) + ) + + {:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + + buckets = results["price_ranges"]["buckets"] + assert length(buckets) == 3 + + # Under 50: book (15), shirt (29), pants (49) + cheap = Enum.find(buckets, &(&1["to"] == 50.0)) + assert cheap["doc_count"] == 3 + + # 50-500: tablet (499) + mid = Enum.find(buckets, &(&1["from"] == 50.0 && &1["to"] == 500.0)) + assert mid["doc_count"] == 1 + + # 500+: laptop (999), phone (699) + expensive = Enum.find(buckets, &(&1["from"] == 500.0)) + assert expensive["doc_count"] == 2 + end + end + + describe "histogram 
aggregation" do + test "creates fixed-width buckets", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add("price_hist", Bucket.histogram("price", 100.0)) + + {:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + + buckets = results["price_hist"]["buckets"] + assert is_list(buckets) + assert length(buckets) > 0 + end + end + + describe "query-scoped aggregation" do + test "aggregates only matching documents", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add("price_stats", Metric.stats("price")) + + # Only aggregate electronics (search for electronics in category) + {:ok, results} = + Searcher.aggregate(searcher, "category:electronics", ["title", "category"], aggs) + + stats = results["price_stats"] + assert stats["count"] == 3 + assert stats["min"] == 499.0 + end + end + + describe "multiple aggregations" do + test "runs multiple aggregations at once", %{test_path: test_path} do + searcher = create_product_index(test_path) + + aggs = + Aggregation.new() + |> Aggregation.add("avg_price", Metric.avg("price")) + |> Aggregation.add("max_price", Metric.max("price")) + |> Aggregation.add("min_price", Metric.min("price")) + + {:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs) + + assert results["min_price"]["value"] == 15.0 + assert results["max_price"]["value"] == 999.0 + assert_in_delta results["avg_price"]["value"], 381.66, 1.0 + end + end + + describe "error handling" do + test "invalid aggregation JSON returns error", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:error, reason} = + Searcher.aggregate(searcher, "*", ["title"], "{invalid json") + + assert reason =~ "Failed to parse" + end + end +end diff --git a/test/muninn/aggregation_test.exs b/test/muninn/aggregation_test.exs new file mode 100644 index 0000000..a170b46 --- /dev/null +++ 
b/test/muninn/aggregation_test.exs @@ -0,0 +1,97 @@ +defmodule Muninn.AggregationTest do + use ExUnit.Case, async: true + + alias Muninn.Aggregation + alias Muninn.Aggregation.{Bucket, Metric} + + describe "builder DSL" do + test "new/0 creates empty map" do + assert Aggregation.new() == %{} + end + + test "add/3 adds aggregation" do + aggs = + Aggregation.new() + |> Aggregation.add("avg_price", Metric.avg("price")) + + assert aggs == %{"avg_price" => %{"avg" => %{"field" => "price"}}} + end + + test "sub/3 adds sub-aggregation" do + parent = Bucket.terms("category") + + result = + parent + |> Aggregation.sub("avg_price", Metric.avg("price")) + + assert result["aggs"]["avg_price"] == %{"avg" => %{"field" => "price"}} + end + + test "multiple aggregations" do + aggs = + Aggregation.new() + |> Aggregation.add("avg_price", Metric.avg("price")) + |> Aggregation.add("max_price", Metric.max("price")) + + assert Map.has_key?(aggs, "avg_price") + assert Map.has_key?(aggs, "max_price") + end + end + + describe "bucket builders" do + test "terms with options" do + result = Bucket.terms("category", size: 10, min_doc_count: 2) + assert result["terms"]["field"] == "category" + assert result["terms"]["size"] == 10 + assert result["terms"]["min_doc_count"] == 2 + end + + test "range" do + ranges = [%{"to" => 50.0}, %{"from" => 50.0, "to" => 100.0}, %{"from" => 100.0}] + result = Bucket.range("price", ranges) + assert result["range"]["field"] == "price" + assert result["range"]["ranges"] == ranges + end + + test "histogram" do + result = Bucket.histogram("price", 10.0, min_doc_count: 1) + assert result["histogram"]["field"] == "price" + assert result["histogram"]["interval"] == 10.0 + assert result["histogram"]["min_doc_count"] == 1 + end + end + + describe "metric builders" do + test "avg" do + assert Metric.avg("price") == %{"avg" => %{"field" => "price"}} + end + + test "sum" do + assert Metric.sum("price") == %{"sum" => %{"field" => "price"}} + end + + test "min" do + assert 
Metric.min("price") == %{"min" => %{"field" => "price"}} + end + + test "max" do + assert Metric.max("price") == %{"max" => %{"field" => "price"}} + end + + test "stats" do + assert Metric.stats("price") == %{"stats" => %{"field" => "price"}} + end + + test "cardinality with precision" do + result = Metric.cardinality("category", precision_threshold: 100) + assert result["cardinality"]["field"] == "category" + assert result["cardinality"]["precision_threshold"] == 100 + end + + test "percentiles with custom percents" do + result = Metric.percentiles("price", percents: [25, 50, 75, 99]) + assert result["percentiles"]["field"] == "price" + assert result["percentiles"]["percents"] == [25, 50, 75, 99] + end + end +end diff --git a/test/muninn/bytes_field_test.exs b/test/muninn/bytes_field_test.exs new file mode 100644 index 0000000..b78c987 --- /dev/null +++ b/test/muninn/bytes_field_test.exs @@ -0,0 +1,81 @@ +defmodule Muninn.BytesFieldTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + + setup do + test_path = "/tmp/muninn_bytes_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + describe "bytes field type" do + test "round-trip binary storage and retrieval", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true) + |> Schema.add_bytes_field("payload", stored: true) + + {:ok, index} = Index.create(test_path, schema) + + binary_data = <<1, 2, 3, 4, 5, 255, 0, 128>> + IndexWriter.add_document(index, %{"title" => "test doc", "payload" => binary_data}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_query(searcher, "test", ["title"], limit: 10) + + assert results["total_hits"] == 1 + hit = List.first(results["hits"]) + assert hit["doc"]["title"] == "test doc" + assert 
hit["doc"]["payload"] == binary_data + end + + test "empty binary data", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true) + |> Schema.add_bytes_field("data", stored: true) + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "empty", "data" => <<>>}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_query(searcher, "empty", ["title"], limit: 10) + assert results["total_hits"] == 1 + assert List.first(results["hits"])["doc"]["data"] == <<>> + end + + test "bytes field with stored: false is not returned", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true) + |> Schema.add_bytes_field("hidden", stored: false) + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "doc", "hidden" => <<1, 2, 3>>}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_query(searcher, "doc", ["title"], limit: 10) + assert results["total_hits"] == 1 + refute Map.has_key?(List.first(results["hits"])["doc"], "hidden") + end + + test "schema includes bytes field", %{test_path: _test_path} do + schema = Schema.new() |> Schema.add_bytes_field("data", stored: true) + assert length(schema.fields) == 1 + assert hd(schema.fields).type == :bytes + assert hd(schema.fields).name == "data" + end + end +end diff --git a/test/muninn/count_test.exs b/test/muninn/count_test.exs new file mode 100644 index 0000000..a957da6 --- /dev/null +++ b/test/muninn/count_test.exs @@ -0,0 +1,71 @@ +defmodule Muninn.CountTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + + setup do + test_path = "/tmp/muninn_count_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> 
Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + describe "count/3" do + test "returns 0 on empty index", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, count} = Searcher.count(searcher, "anything", ["title"]) + assert count == 0 + end + + test "returns correct count for matching documents", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "Elixir Programming"}) + IndexWriter.add_document(index, %{"title" => "Elixir in Action"}) + IndexWriter.add_document(index, %{"title" => "Phoenix Framework"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, count} = Searcher.count(searcher, "elixir", ["title"]) + assert count == 2 + end + + test "returns 0 for non-matching query", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "Elixir Programming"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, count} = Searcher.count(searcher, "rust", ["title"]) + assert count == 0 + end + + test "works with boolean queries", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "Elixir and Phoenix"}) + IndexWriter.add_document(index, %{"title" => "Elixir Basics"}) + IndexWriter.add_document(index, %{"title" => "Phoenix Guide"}) + IndexWriter.commit(index) + + 
{:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, count} = Searcher.count(searcher, "elixir AND phoenix", ["title"]) + assert count == 1 + end + end +end diff --git a/test/muninn/more_like_this_test.exs b/test/muninn/more_like_this_test.exs new file mode 100644 index 0000000..af7ffd9 --- /dev/null +++ b/test/muninn/more_like_this_test.exs @@ -0,0 +1,100 @@ +defmodule Muninn.MoreLikeThisTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + + setup do + test_path = "/tmp/muninn_mlt_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + describe "search_more_like_this/3" do + test "finds similar documents", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("content", stored: true) + + {:ok, index} = Index.create(test_path, schema) + + # Create a corpus with overlapping vocabulary + docs = [ + "elixir is a functional programming language", + "elixir runs on the erlang virtual machine", + "phoenix is a web framework for elixir", + "rust is a systems programming language", + "rust provides memory safety without garbage collection", + "python is a popular programming language", + "python is great for data science", + "java is an object oriented programming language", + "java runs on the jvm virtual machine", + "ruby is a dynamic programming language" + ] + + Enum.each(docs, fn doc -> + IndexWriter.add_document(index, %{"content" => doc}) + end) + + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # Find docs similar to "elixir programming" + {:ok, results} = + Searcher.search_more_like_this( + searcher, + %{"content" => "elixir is a programming language"}, + min_doc_freq: 1, + min_term_freq: 1, + limit: 5 + ) + + assert results["total_hits"] >= 1 + # Results should include docs about elixir or 
programming + contents = Enum.map(results["hits"], & &1["doc"]["content"]) + assert Enum.any?(contents, &String.contains?(&1, "elixir")) + end + + test "returns empty results when no similar documents", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("content", stored: true) + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"content" => "hello world"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # Use a document whose terms are all absent from the index, so nothing can match + {:ok, results} = + Searcher.search_more_like_this( + searcher, + %{"content" => "zzzzzzz xyzxyz uniqueterm"}, + min_doc_freq: 1, + min_term_freq: 1, + limit: 5 + ) + + assert results["total_hits"] == 0 + end + + test "empty document fields returns error", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("content", stored: true) + + {:ok, index} = Index.create(test_path, schema) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:error, reason} = Searcher.search_more_like_this(searcher, %{}) + assert reason =~ "empty" + end + end +end diff --git a/test/muninn/native_test.exs b/test/muninn/native_test.exs index 2523154..02d1da1 100644 --- a/test/muninn/native_test.exs +++ b/test/muninn/native_test.exs @@ -6,8 +6,8 @@ defmodule Muninn.NativeTest do describe "schema_build/1" do test "builds a schema from field list" do fields = [ - {"title", "text", true, true}, - {"body", "text", true, false} + {"title", "text", true, true, false, "default"}, + {"body", "text", true, false, false, "default"} ] schema_resource = Native.schema_build(fields) @@ -25,8 +25,8 @@ defmodule Muninn.NativeTest do describe "schema_num_fields/1" do test "returns number of fields in schema" do fields = [ - {"field1", "text", true, true}, - {"field2", "text", false, true} + {"field1", "text", 
true, true, false, "default"}, + {"field2", "text", false, true, false, "default"} ] schema = Native.schema_build(fields) @@ -43,7 +43,7 @@ defmodule Muninn.NativeTest do test "counts many fields correctly" do fields = for i <- 1..10 do - {"field_#{i}", "text", true, true} + {"field_#{i}", "text", true, true, false, "default"} end schema = Native.schema_build(fields) @@ -57,7 +57,7 @@ defmodule Muninn.NativeTest do on_exit(fn -> File.rm_rf!(path) end) - fields = [{"title", "text", true, true}] + fields = [{"title", "text", true, true, false, "default"}] assert {:ok, index} = Native.index_create(path, fields) assert is_reference(index) @@ -71,7 +71,7 @@ defmodule Muninn.NativeTest do on_exit(fn -> File.rm_rf!(path) end) # Create first - fields = [{"field", "text", true, true}] + fields = [{"field", "text", true, true, false, "default"}] {:ok, _} = Native.index_create(path, fields) # Then open diff --git a/test/muninn/regex_query_test.exs b/test/muninn/regex_query_test.exs new file mode 100644 index 0000000..9d20ba3 --- /dev/null +++ b/test/muninn/regex_query_test.exs @@ -0,0 +1,107 @@ +defmodule Muninn.RegexQueryTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + + setup do + test_path = "/tmp/muninn_regex_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + describe "search_regex/4" do + test "basic regex match", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "Elixir Programming"}) + IndexWriter.add_document(index, %{"title" => "Phoenix Framework"}) + IndexWriter.add_document(index, %{"title" => "Erlang Basics"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # Match terms starting with "eli" + {:ok, 
results} = Searcher.search_regex(searcher, "title", "eli.*") + assert results["total_hits"] == 1 + assert List.first(results["hits"])["doc"]["title"] == "Elixir Programming" + end + + test "character class matching", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "cat sat"}) + IndexWriter.add_document(index, %{"title" => "bat mat"}) + IndexWriter.add_document(index, %{"title" => "dog run"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # Match terms like [cbm]at + {:ok, results} = Searcher.search_regex(searcher, "title", "[cbm]at") + assert results["total_hits"] >= 2 + end + + test "no matches returns 0 hits", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "Hello World"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_regex(searcher, "title", "zzz.*xyz") + assert results["total_hits"] == 0 + end + + test "invalid regex returns error", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:error, reason} = Searcher.search_regex(searcher, "title", "[invalid") + assert is_binary(reason) + end + + test "limit is respected", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + for i <- 1..10 do + IndexWriter.add_document(index, %{"title" => "item#{i}"}) + end + + IndexWriter.commit(index) + + {:ok, 
reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_regex(searcher, "title", "item.*", limit: 3) + assert length(results["hits"]) == 3 + end + + test "error on non-text field", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true) + |> Schema.add_u64_field("count", stored: true) + + {:ok, index} = Index.create(test_path, schema) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:error, reason} = Searcher.search_regex(searcher, "count", ".*") + assert reason =~ "not a text field" + end + end +end diff --git a/test/muninn/sort_query_test.exs b/test/muninn/sort_query_test.exs new file mode 100644 index 0000000..fd56399 --- /dev/null +++ b/test/muninn/sort_query_test.exs @@ -0,0 +1,130 @@ +defmodule Muninn.SortQueryTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + + setup do + test_path = "/tmp/muninn_sort_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + defp create_product_index(test_path) do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true) + |> Schema.add_u64_field("views", stored: true, fast: true) + |> Schema.add_i64_field("score", stored: true, fast: true) + |> Schema.add_f64_field("price", stored: true, fast: true) + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{ + "title" => "cheap item", + "views" => 100, + "score" => -5, + "price" => 9.99 + }) + + IndexWriter.add_document(index, %{ + "title" => "popular item", + "views" => 5000, + "score" => 42, + "price" => 29.99 + }) + + IndexWriter.add_document(index, %{ + "title" => "expensive item", + "views" => 500, + "score" => 10, + "price" => 199.99 + }) + + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + 
{:ok, searcher} = Searcher.new(reader) + searcher + end + + describe "search_query_sorted/5" do + test "sort by u64 ascending", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:ok, results} = + Searcher.search_query_sorted(searcher, "item", ["title"], "views", limit: 10) + + hits = results["hits"] + assert length(hits) == 3 + views = Enum.map(hits, & &1["doc"]["views"]) + assert views == [100, 500, 5000] + end + + test "sort by u64 descending", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:ok, results} = + Searcher.search_query_sorted(searcher, "item", ["title"], "views", + reverse: true, + limit: 10 + ) + + hits = results["hits"] + views = Enum.map(hits, & &1["doc"]["views"]) + assert views == [5000, 500, 100] + end + + test "sort by i64 with negative values", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:ok, results} = + Searcher.search_query_sorted(searcher, "item", ["title"], "score", limit: 10) + + hits = results["hits"] + scores = Enum.map(hits, & &1["doc"]["score"]) + assert scores == [-5, 10, 42] + end + + test "sort by f64", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:ok, results} = + Searcher.search_query_sorted(searcher, "item", ["title"], "price", + reverse: true, + limit: 10 + ) + + hits = results["hits"] + prices = Enum.map(hits, & &1["doc"]["price"]) + assert prices == [199.99, 29.99, 9.99] + end + + test "results include sort_value", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:ok, results} = + Searcher.search_query_sorted(searcher, "item", ["title"], "views", limit: 10) + + hit = List.first(results["hits"]) + assert Map.has_key?(hit, "sort_value") + end + + test "limit is respected", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:ok, results} = + Searcher.search_query_sorted(searcher, "item", ["title"], "price", limit: 2) + + assert length(results["hits"]) == 2 
+ end + + test "error on non-numeric sort field", %{test_path: test_path} do + searcher = create_product_index(test_path) + + {:error, reason} = + Searcher.search_query_sorted(searcher, "item", ["title"], "title", limit: 10) + + assert reason =~ "numeric" + end + end +end diff --git a/test/muninn/tokenizer_test.exs b/test/muninn/tokenizer_test.exs new file mode 100644 index 0000000..34ca6d9 --- /dev/null +++ b/test/muninn/tokenizer_test.exs @@ -0,0 +1,156 @@ +defmodule Muninn.TokenizerTest do + use ExUnit.Case, async: true + + alias Muninn.{Index, IndexWriter, IndexReader, Searcher, Schema} + + setup do + test_path = "/tmp/muninn_tokenizer_#{:erlang.unique_integer([:positive])}" + on_exit(fn -> Muninn.TestHelpers.safe_rm_rf(test_path) end) + {:ok, test_path: test_path} + end + + describe "default tokenizer" do + test "splits and lowercases as before", %{test_path: test_path} do + schema = Schema.new() |> Schema.add_text_field("title", stored: true) + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"title" => "Hello World"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_query(searcher, "hello", ["title"]) + assert results["total_hits"] == 1 + + {:ok, results} = Searcher.search_query(searcher, "world", ["title"]) + assert results["total_hits"] == 1 + end + end + + describe "en_stem tokenizer" do + test "stems English words", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("content", stored: true, tokenizer: "en_stem") + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"content" => "running quickly"}) + IndexWriter.add_document(index, %{"content" => "she runs every morning"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # "run" should match both "running" and "runs" via stemming + {:ok, 
results} = Searcher.search_query(searcher, "run", ["content"]) + assert results["total_hits"] == 2 + end + + test "stemming matches plural forms", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("content", stored: true, tokenizer: "en_stem") + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"content" => "the cats sat on the mat"}) + IndexWriter.add_document(index, %{"content" => "a cat on a mat"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + {:ok, results} = Searcher.search_query(searcher, "cat", ["content"]) + assert results["total_hits"] == 2 + end + end + + describe "raw tokenizer" do + test "stores entire field value as single token", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("category", stored: true, tokenizer: "raw") + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"category" => "Hello World"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # Exact match of full value should work + {:ok, results} = Searcher.search_query(searcher, ~s(category:"Hello World"), ["category"]) + assert results["total_hits"] == 1 + + # Partial word should NOT match with raw tokenizer + {:ok, results} = Searcher.search_query(searcher, "hello", ["category"]) + assert results["total_hits"] == 0 + end + end + + describe "whitespace tokenizer" do + test "splits on whitespace only", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("content", stored: true, tokenizer: "whitespace") + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{"content" => "hello-world foo_bar"}) + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # "hello-world" is kept as one token (whitespace tokenizer 
doesn't split on hyphens) + {:ok, results} = Searcher.search_query(searcher, "hello-world", ["content"]) + assert results["total_hits"] == 1 + + # "foo_bar" is also one token + {:ok, results} = Searcher.search_query(searcher, "foo_bar", ["content"]) + assert results["total_hits"] == 1 + end + end + + describe "different tokenizers on same schema" do + test "fields can use different tokenizers", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true, tokenizer: "default") + |> Schema.add_text_field("keyword", stored: true, tokenizer: "raw") + + {:ok, index} = Index.create(test_path, schema) + + IndexWriter.add_document(index, %{ + "title" => "Hello World", + "keyword" => "Hello World" + }) + + IndexWriter.commit(index) + + {:ok, reader} = IndexReader.new(index) + {:ok, searcher} = Searcher.new(reader) + + # "hello" matches title (tokenized) but not keyword (raw) + {:ok, results} = Searcher.search_query(searcher, "title:hello", ["title"]) + assert results["total_hits"] == 1 + + {:ok, results} = Searcher.search_query(searcher, "keyword:hello", ["keyword"]) + assert results["total_hits"] == 0 + end + end + + describe "invalid tokenizer" do + test "returns error for unknown tokenizer", %{test_path: test_path} do + schema = + Schema.new() + |> Schema.add_text_field("title", stored: true, tokenizer: "nonexistent") + + assert {:error, reason} = Index.create(test_path, schema) + assert reason =~ "Unknown tokenizer" + end + end +end