25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,31 @@

All notable changes to this project will be documented in this file.

## [0.5.4] - 2026-04-11

### Changed
- Switched Tantivy dependency from git (commit 51f340f) to crates.io release 0.26.0
- Extended internal FieldDef tuple from 4 to 6 elements (added fast, tokenizer)

### Added
- **Bytes field type**: `Schema.add_bytes_field/3` for binary data storage and retrieval
- **Custom tokenizers**: Per-field tokenizer option for text fields (`default`, `raw`, `en_stem`, `whitespace`)
- **Fast fields**: `fast: true` option on numeric/bool/text fields for columnar storage
- **Count collector**: `Searcher.count/3` for lightweight document counting without retrieval
- **Regex queries**: `Searcher.search_regex/4` for programmatic regex pattern matching on text fields
- **MoreLikeThis queries**: `Searcher.search_more_like_this/3` for finding similar documents by term distribution
- **Sort by field value**: `Searcher.search_query_sorted/5` for sorting results by fast field instead of BM25 score
- **Aggregations**: Full aggregation framework with JSON pass-through NIF
  - `Searcher.aggregate/5` for executing aggregations over search results
  - `Muninn.Aggregation` builder DSL with `new/0`, `add/3`, `sub/3`
  - `Muninn.Aggregation.Bucket` — terms, range, histogram, filter bucket aggregations
  - `Muninn.Aggregation.Metric` — avg, sum, min, max, stats, count, cardinality, percentiles

### Tantivy 0.26.0 Highlights (since previous git pin)
- **Bugfixes**: Fixed phrase queries prefixed with `*`, a vint buffer overflow during index creation, an integer overflow in `ExpUnrolledLinkedList` for large datasets, an integer overflow in segment sorting and merge policy truncation, merging of intermediate aggregation results, doc count deduplication in term aggregations for multi-valued fields, and lenient elastic range queries with trailing closing parentheses
- **Features**: Filter aggregation, composite aggregation, include/exclude filtering for term aggregations, regex support in query parser, TermQuery fallback for non-indexed fast fields, fast field support for Bytes values, natural-order-with-none-highest in TopDocs ordering, stemming behind feature flag
- **Performance**: High cardinality aggregation speed improvements, saturated posting list optimization, lazy scorers, union performance improvements, seek_danger for efficient intersections

## [0.5.3] - 2026-02-16

### Changed
245 changes: 191 additions & 54 deletions README.md
@@ -37,15 +37,20 @@ This library embodies that spirit: it flies through your documents, indexes what

- **Fast**: Rust-powered search via native NIFs
- **Full-text search**: Text indexing with customizable tokenization
- **Multiple field types**: text, u64, i64, f64, bool
- **Flexible schemas**: Define stored and indexed fields
- **Advanced queries**: Field-specific search, boolean operators, phrase matching, range queries
- **Multiple field types**: text, u64, i64, f64, bool, bytes
- **Custom tokenizers**: Per-field tokenizer support (`default`, `raw`, `en_stem`, `whitespace`)
- **Flexible schemas**: Define stored, indexed, and fast fields
- **Advanced queries**: Field-specific search, boolean operators, phrase matching, range queries, regex
- **Range queries**: Numeric range filtering with flexible boundaries
- **Fuzzy matching**: Error-tolerant search with Levenshtein distance for handling typos
- **MoreLikeThis**: Find similar documents by term distribution
- **Aggregations**: Terms, range, histogram buckets + avg, sum, stats, cardinality metrics with nesting
- **Sort by field**: Order results by fast field value instead of relevance score
- **Count queries**: Lightweight document counting without retrieval
- **Highlighting**: HTML snippets with highlighted matching words
- **Autocomplete**: Prefix search for typeahead functionality (with fuzzy support)
- **Thread-safe**: Concurrent index operations supported
- **Production-ready**: Comprehensive error handling and 175+ tests
- **Production-ready**: Comprehensive error handling and 229+ tests

## Installation

@@ -61,7 +66,7 @@ end

**Requirements:**
- Elixir ~> 1.18
- Rust ~> 1.85 (for compilation, Tantivy 0.25 requires Edition 2024)
- Rust ~> 1.92 (for compilation, Tantivy 0.26 + Rustler 0.37.2 require Rust 1.91+)

## Quick Start

@@ -71,9 +76,11 @@ end
alias Muninn.Schema

schema = Schema.new()
  |> Schema.add_text_field("title", stored: true, indexed: true)
  |> Schema.add_text_field("title", stored: true, indexed: true, tokenizer: "en_stem")
  |> Schema.add_text_field("body", stored: true, indexed: true)
  |> Schema.add_u64_field("views", stored: true, indexed: true)
  |> Schema.add_text_field("category", stored: true, tokenizer: "raw", fast: true)
  |> Schema.add_u64_field("views", stored: true, indexed: true, fast: true)
  |> Schema.add_f64_field("price", stored: true, fast: true)
  |> Schema.add_bool_field("published", stored: true, indexed: true)
```

@@ -295,6 +302,135 @@ Handle spelling errors and typos automatically using Levenshtein distance:
- **Distance=2**: ~5-50x slower than exact search (use for suggestions only)
- Transposition cost enabled by default (more intuitive for users)
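
To make the transposition point concrete, here is a minimal sketch of a transposition-aware edit distance (optimal string alignment) in plain Elixir. `EditDistanceSketch` is a hypothetical module written for illustration only. It is not Muninn's or Tantivy's implementation, and its naive recursion is only practical for short terms.

```elixir
defmodule EditDistanceSketch do
  @moduledoc """
  Naive edit distance with adjacent transpositions (optimal string alignment).
  Illustration only: exponential time, fine for short terms.
  """

  # Public entry point: compare two strings grapheme by grapheme.
  def distance(a, b), do: osa(String.graphemes(a), String.graphemes(b))

  # One side exhausted: the rest must be inserted or deleted.
  defp osa([], b), do: length(b)
  defp osa(a, []), do: length(a)

  defp osa([x | ra] = a, [y | rb] = b) do
    substitution = osa(ra, rb) + if(x == y, do: 0, else: 1)
    deletion = osa(ra, b) + 1
    insertion = osa(a, rb) + 1
    best = Enum.min([substitution, deletion, insertion])

    # Swapping two adjacent characters counts as a single edit.
    case {a, b} do
      {[p, q | ta], [q, p | tb]} when p != q -> min(best, osa(ta, tb) + 1)
      _ -> best
    end
  end
end

EditDistanceSketch.distance("elixr", "elixir")   # 1: one missing letter
EditDistanceSketch.distance("elixri", "elixir")  # 1: one swap (2 without transpositions)
```

Under this definition a swapped pair of letters stays within `distance: 1`, so the cheaper setting already covers that kind of typo.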

### Regex Search

Search with regular expressions on text fields:

```elixir
# Programmatic regex query
{:ok, results} = Searcher.search_regex(searcher, "title", "elix.*", limit: 10)

# Also supported via query parser syntax
{:ok, results} = Searcher.search_query(searcher, "/elix.*/", ["title"])
```

### MoreLikeThis (Find Similar Documents)

Find documents similar to a reference document by analyzing term distributions:

```elixir
{:ok, results} = Searcher.search_more_like_this(
  searcher,
  %{"title" => "Elixir programming", "body" => "Functional programming with Elixir"},
  min_doc_freq: 1,
  min_term_freq: 1,
  max_query_terms: 25,
  limit: 5
)
```
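
As a rough intuition for what `min_term_freq` and `max_query_terms` control, the sketch below picks query terms from a reference document in plain Elixir. `MoreLikeThisSketch` is a hypothetical module and a deliberate simplification: it ignores document frequency (so `min_doc_freq` has no counterpart here), counts terms across all fields together, and ranks by raw frequency only. The real query may weight and select terms differently.

```elixir
defmodule MoreLikeThisSketch do
  @moduledoc "Simplified term selection for a similarity query. Illustration only."

  def query_terms(doc, opts \\ []) do
    min_term_freq = Keyword.get(opts, :min_term_freq, 1)
    max_query_terms = Keyword.get(opts, :max_query_terms, 25)

    doc
    |> Map.values()
    # Lowercase and split on anything that is not a letter or digit.
    |> Enum.flat_map(fn text ->
      text |> String.downcase() |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    end)
    |> Enum.frequencies()
    # Drop terms that occur too rarely in the reference document.
    |> Enum.filter(fn {_term, freq} -> freq >= min_term_freq end)
    # Most frequent first, ties broken alphabetically for stable output.
    |> Enum.sort_by(fn {term, freq} -> {-freq, term} end)
    |> Enum.take(max_query_terms)
    |> Enum.map(fn {term, _freq} -> term end)
  end
end

doc = %{"title" => "Elixir programming", "body" => "Functional programming with Elixir"}
MoreLikeThisSketch.query_terms(doc, min_term_freq: 2)
# => ["elixir", "programming"]
```

Raising `min_term_freq` narrows the query to terms the reference document repeats, while `max_query_terms` caps how many terms are used.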

### Count Queries

Efficiently count matching documents without retrieving them:

```elixir
{:ok, count} = Searcher.count(searcher, "elixir AND phoenix", ["title", "body"])
# Returns: {:ok, 42}
```

### Sort by Field Value

Sort results by a fast field instead of relevance score:

```elixir
# Sort by price ascending
{:ok, results} = Searcher.search_query_sorted(
  searcher,
  "category:electronics",
  ["title"],
  "price"
)

# Sort by views descending
{:ok, results} = Searcher.search_query_sorted(
  searcher,
  "*",
  ["title"],
  "views",
  reverse: true,
  limit: 20
)

# Results include sort_value instead of score:
# %{"sort_value" => 5000, "doc" => %{"title" => "Popular Item", ...}}
```

> **Note:** Sort fields must be numeric (u64, i64, f64) with `fast: true` in the schema.

### Aggregations

Compute analytics over search results using the aggregation framework:

```elixir
alias Muninn.Aggregation
alias Muninn.Aggregation.{Bucket, Metric}

# Simple metric aggregation
aggs = Aggregation.new()
  |> Aggregation.add("avg_price", Metric.avg("price"))
  |> Aggregation.add("price_stats", Metric.stats("price"))

{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs)
# results["avg_price"]["value"] => 381.66
# results["price_stats"] => %{"count" => 6, "min" => 15.0, "max" => 999.0, ...}

# Terms aggregation (group by category)
aggs = Aggregation.new()
  |> Aggregation.add("by_category", Bucket.terms("category", size: 10))

{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs)
# results["by_category"]["buckets"] => [
#   %{"key" => "electronics", "doc_count" => 3},
#   %{"key" => "clothing", "doc_count" => 2},
#   ...
# ]

# Nested aggregation (stats per category)
aggs = Aggregation.new()
  |> Aggregation.add("by_category",
    Bucket.terms("category", size: 10)
    |> Aggregation.sub("price_stats", Metric.stats("price"))
  )

# Range buckets
aggs = Aggregation.new()
  |> Aggregation.add("price_ranges",
    Bucket.range("price", [
      %{"to" => 50.0},
      %{"from" => 50.0, "to" => 500.0},
      %{"from" => 500.0}
    ])
  )

# Histogram
aggs = Aggregation.new()
  |> Aggregation.add("price_hist", Bucket.histogram("price", 100.0))

# Scoped to a query (only aggregate matching docs)
{:ok, results} = Searcher.aggregate(
  searcher,
  "category:electronics",
  ["title", "category"],
  aggs
)
```

> **Note:** Aggregated fields must have `fast: true` in the schema. For text field aggregation (e.g., terms), use `tokenizer: "raw"` with `fast: true`.

**Available Bucket Aggregations:** `Bucket.terms/2`, `Bucket.range/2`, `Bucket.histogram/3`, `Bucket.filter/1`

**Available Metric Aggregations:** `Metric.avg/1`, `Metric.sum/1`, `Metric.min/1`, `Metric.max/1`, `Metric.stats/1`, `Metric.count/1`, `Metric.cardinality/2`, `Metric.percentiles/2`
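
To show what a histogram interval such as `100.0` means, the sketch below groups values in plain Elixir using the usual rule `key = floor(value / interval) * interval`. `HistogramSketch` is a hypothetical module for illustration and does not call Muninn. The assumption is that the engine follows this Elasticsearch-style bucketing rule, and the engine may also emit empty buckets that this sketch omits.

```elixir
defmodule HistogramSketch do
  @moduledoc "Fixed-interval bucketing of numeric values. Illustration only."

  # Lower bound of the bucket that `value` falls into.
  def bucket_key(value, interval) when interval > 0 do
    Float.floor(value / interval) * interval
  end

  # Count values per bucket, returning only non-empty buckets in key order.
  def histogram(values, interval) do
    values
    |> Enum.frequencies_by(&bucket_key(&1, interval))
    |> Enum.sort()
    |> Enum.map(fn {key, count} -> %{"key" => key, "doc_count" => count} end)
  end
end

HistogramSketch.histogram([15.0, 49.99, 120.0, 150.0, 999.0], 100.0)
# => [
#   %{"key" => 0.0, "doc_count" => 2},
#   %{"key" => 100.0, "doc_count" => 2},
#   %{"key" => 900.0, "doc_count" => 1}
# ]
```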

## Field Types

| Type | Description | Example Use Case |
@@ -304,12 +440,15 @@ Handle spelling errors and typos automatically using Levenshtein distance:
| `i64` | Signed 64-bit integers | Scores, offsets, differences |
| `f64` | 64-bit floating point | Prices, ratings, coordinates |
| `bool` | Boolean values | Flags, states (published, active) |
| `bytes` | Arbitrary binary data | Embeddings, serialized data, hashes |

**Field Options:**
- `stored: true/false` - Store the original value (retrievable in search results)
- `indexed: true/false` - Index the field for searching/filtering
- `fast: true/false` - Enable columnar storage (required for sorting and aggregations)
- `tokenizer: "name"` - Tokenizer for text fields: `"default"`, `"raw"`, `"en_stem"`, `"whitespace"`

**Defaults:** `stored: false`, `indexed: true`
**Defaults:** `stored: false`, `indexed: true`, `fast: false`, `tokenizer: nil` (uses `"default"`)
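
The tokenizer choice decides which terms end up in the index. The plain-Elixir sketch below approximates three of the named tokenizers to show the difference. `TokenizerSketch` is a hypothetical module, the rules are approximations of the built-in Tantivy tokenizers rather than their actual code, and `"en_stem"` is left out because it needs a stemmer.

```elixir
defmodule TokenizerSketch do
  @moduledoc "Rough approximations of three tokenizers. Illustration only."

  # "raw": the whole value is kept as a single term.
  def tokenize(text, "raw"), do: [text]

  # "whitespace": split on whitespace, keep case and punctuation.
  def tokenize(text, "whitespace"), do: String.split(text)

  # "default": lowercase and split on anything that is not a letter or digit.
  def tokenize(text, "default") do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end
end

TokenizerSketch.tokenize("Home & Garden", "raw")        # ["Home & Garden"]
TokenizerSketch.tokenize("Home & Garden", "whitespace") # ["Home", "&", "Garden"]
TokenizerSketch.tokenize("Home & Garden", "default")    # ["home", "garden"]
```

This is why a field used for terms aggregation is usually declared with `tokenizer: "raw"`: the bucket key is then the full value rather than one bucket per word.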

## Examples

@@ -320,6 +459,7 @@ See the `examples/` directory for complete working examples:
- `highlighting_demo.exs` - Highlighted snippets and prefix search
- `range_functions_demo.exs` - Range queries (QueryParser vs dedicated functions)
- `fuzzy_search_demo.exs` - Fuzzy matching for typo tolerance
- `aggregation_demo.exs` - Aggregations, sorting, and analytics
- `complete_search_demo.exs` - Full feature showcase
- `comparison_demo.exs` - Side-by-side comparison of search methods

@@ -332,49 +472,35 @@ mix run examples/complete_search_demo.exs

### Core Modules

- `Muninn.Schema` - Define index schema with field types
- `Muninn.Schema` - Define index schema with field types and options
- `Muninn.Index` - Create and open indices
- `Muninn.IndexWriter` - Add, update documents, commit/rollback
- `Muninn.IndexReader` - Read access to index
- `Muninn.Searcher` - Execute search queries
- `Muninn.Searcher` - Execute search queries, sorting, counting, and aggregations
- `Muninn.Query` - Build search queries
- `Muninn.Aggregation` - Builder DSL for aggregation requests
- `Muninn.Aggregation.Bucket` - Bucket aggregation builders (terms, range, histogram, filter)
- `Muninn.Aggregation.Metric` - Metric aggregation builders (avg, sum, min, max, stats, etc.)

### Search Methods

**Basic Term Search** - Simple, direct term matching:
```elixir
query = Query.term("field", "value")
Searcher.search(searcher, query, limit: 10)
```

**Query Parser** - Natural syntax with boolean operators:
```elixir
Searcher.search_query(searcher, "field:value AND other", ["field", "other"])
```

**With Snippets** - Highlighted search results:
```elixir
Searcher.search_with_snippets(searcher, query, search_fields, snippet_fields, opts)
```

**Prefix Search** - Autocomplete functionality:
```elixir
Searcher.search_prefix(searcher, "field", "prefix", limit: 10)
```

**Range Queries** - Numeric filtering with flexible boundaries:
```elixir
Searcher.search_range_u64(searcher, "views", 100, 1000, inclusive: :both)
Searcher.search_range_i64(searcher, "temperature", -10, 30)
Searcher.search_range_f64(searcher, "price", 10.0, 100.0)
```

**Fuzzy Search** - Error-tolerant matching with Levenshtein distance:
```elixir
Searcher.search_fuzzy(searcher, "title", "elixr", distance: 1)
Searcher.search_fuzzy_prefix(searcher, "author", "jse", distance: 1)
Searcher.search_fuzzy_with_snippets(searcher, "content", "elixr", ["content"])
```
| Method | Description |
|--------|-------------|
| `Searcher.search/3` | Term query — direct term matching |
| `Searcher.search_query/4` | Query parser — boolean operators, phrase queries, field-specific |
| `Searcher.search_with_snippets/5` | Query parser + highlighted HTML snippets |
| `Searcher.search_prefix/4` | Prefix matching for autocomplete |
| `Searcher.search_range_u64/5` | Numeric range query (u64) |
| `Searcher.search_range_i64/5` | Numeric range query (i64) |
| `Searcher.search_range_f64/5` | Numeric range query (f64) |
| `Searcher.search_fuzzy/4` | Fuzzy matching with Levenshtein distance |
| `Searcher.search_fuzzy_prefix/4` | Fuzzy prefix matching for autocomplete with typo tolerance |
| `Searcher.search_fuzzy_with_snippets/5` | Fuzzy matching + highlighted snippets |
| `Searcher.search_regex/4` | Regex pattern matching on text fields |
| `Searcher.search_more_like_this/3` | Find similar documents by term distribution |
| `Searcher.search_query_sorted/5` | Query with results sorted by fast field value |
| `Searcher.count/3` | Count matching documents without retrieval |
| `Searcher.aggregate/5` | Execute aggregations over matching documents |

## Architecture

@@ -416,12 +542,15 @@ mix test --cover
mix test test/muninn/searcher_test.exs
```

**Test Coverage:** 175+ tests covering:
- Schema and index operations
**Test Coverage:** 229+ tests covering:
- Schema and index operations (including bytes field, custom tokenizers, fast fields)
- Document CRUD operations
- All query types (term, boolean, phrase, prefix, range, fuzzy)
- All query types (term, boolean, phrase, prefix, range, fuzzy, regex, MoreLikeThis)
- Fuzzy search with distance levels (0-2), transposition handling
- Range queries with different numeric types and boundary options
- Sort by field value (ascending/descending)
- Count queries
- Aggregations (terms, range, histogram, stats, nested)
- Snippet generation and highlighting
- Concurrent operations
- Edge cases and error handling
@@ -439,27 +568,35 @@ View at `doc/index.html`

## Development Status

**Current:** Phase 7 Complete - Fuzzy Matching and Typo Tolerance
**Current:** Phase 8 Complete - Tantivy 0.26.0 Features

**Implemented:**
- Schema definition and validation
- Index creation and management
- Document indexing with batch operations
- Basic term search
- Advanced query parser (field:value, AND/OR, phrases, ranges)
- Advanced query parser (field:value, AND/OR, phrases, ranges, regex)
- Range queries for all numeric types (u64, i64, f64)
- Fuzzy search with Levenshtein distance (3 functions: fuzzy, fuzzy_prefix, fuzzy_with_snippets)
- Fuzzy search with Levenshtein distance (fuzzy, fuzzy_prefix, fuzzy_with_snippets)
- Highlighted snippets for search results
- Prefix search for autocomplete
- Regex search on text fields
- MoreLikeThis (find similar documents)
- Count queries (lightweight document counting)
- Sort by fast field value (ascending/descending)
- Aggregations (terms, range, histogram, filter buckets + all metric types)
- Custom tokenizers (default, raw, en_stem, whitespace)
- Bytes field type for binary data
- Fast fields for columnar storage
- Transaction support (commit/rollback)
- Upgraded to Tantivy 0.25
- Upgraded to Tantivy 0.26.0 (crates.io)

**Roadmap:**
- QueryParser integration for fuzzy syntax (`term~N`)
- Advanced suggestions system ("did you mean?")
- Faceted search and aggregations
- Custom analyzers and tokenizers
- Sorting and custom scoring
- Document deletion and updates
- Date field type
- Custom scoring and boosting

## License
