Skip to content

Expose Tantivy 0.26.0 features: aggregations, tokenizers, regex, sort, bytes, count, MLT#23

Merged
nyo16 merged 12 commits into
masterfrom
feature/tantivy-0.26-features
Apr 11, 2026
Merged

Expose Tantivy 0.26.0 features: aggregations, tokenizers, regex, sort, bytes, count, MLT#23
nyo16 merged 12 commits into
masterfrom
feature/tantivy-0.26-features

Conversation

@nyo16
Copy link
Copy Markdown
Owner

@nyo16 nyo16 commented Apr 11, 2026

Summary

  • Switch Tantivy from git pin (commit 51f340f) to crates.io 0.26.0 release
  • Extend FieldDef tuple from 4 to 6 elements to support fast and tokenizer options
  • Add 7 new features across schema, query, and analytics layers
  • 229 tests (54 new), 0 failures, clippy clean

New Features

Schema

Feature API Description
Bytes field Schema.add_bytes_field/3 Binary data storage/retrieval via rustler::Binary
Custom tokenizers add_text_field("f", tokenizer: "en_stem") Per-field tokenizer: default, raw, en_stem, whitespace
Fast fields add_u64_field("f", fast: true) Columnar storage for sort & aggregations

Queries

Feature API Description
Count Searcher.count/3 Lightweight doc count without retrieval
Regex Searcher.search_regex/4 Programmatic RegexQuery on text fields
MoreLikeThis Searcher.search_more_like_this/3 Find similar documents by term distribution
Sort by field Searcher.search_query_sorted/5 Sort by fast field value instead of BM25 score

Aggregations

Feature API Description
JSON pass-through NIF Searcher.aggregate/5 Single DirtyCpu-scheduled NIF for all aggregation types
Builder DSL Aggregation.Bucket / Aggregation.Metric Terms, range, histogram, filter + avg, sum, min, max, stats, cardinality, percentiles

Example Usage

# Custom tokenizer + fast field
schema = Schema.new()
  |> Schema.add_text_field("title", stored: true, tokenizer: "en_stem")
  |> Schema.add_text_field("category", stored: true, tokenizer: "raw", fast: true)
  |> Schema.add_f64_field("price", stored: true, fast: true)

# Count without fetching docs
{:ok, count} = Searcher.count(searcher, "elixir", ["title"])

# Sort by price descending
{:ok, results} = Searcher.search_query_sorted(searcher, "*", ["title"], "price", reverse: true)

# Aggregation: avg price per category
alias Muninn.Aggregation
alias Muninn.Aggregation.{Bucket, Metric}

aggs = Aggregation.new()
  |> Aggregation.add("by_category",
    Bucket.terms("category", size: 10)
    |> Aggregation.sub("avg_price", Metric.avg("price"))
  )

{:ok, results} = Searcher.aggregate(searcher, "*", ["title"], aggs)

Test Plan

- All 175 existing tests pass (backward compatible)
- 54 new tests covering bytes, tokenizers, count, regex, MLT, sort, aggregation builder, aggregation integration
- cargo clippy clean
- mix format + cargo fmt clean

nyo16 added 12 commits December 31, 2025 12:57
- Update aarch64-apple-darwin to use macos-14 (Apple Silicon)
- Update x86_64-apple-darwin to use macos-13-large (Intel)
- macOS-13 standard runners have been retired by GitHub Actions
…regex, sort, bytes, count, MLT

Switch Tantivy dependency from git pin to crates.io 0.26.0 release and expose
7 new features through the Elixir API:

Schema:
- Bytes field type with round-trip binary storage via rustler::Binary
- Custom tokenizers per text field (default, raw, en_stem, whitespace)
- Fast field option for numeric/bool/text fields (required for sort + aggregations)

Query features:
- Count collector (Searcher.count/3) - lightweight doc counting without retrieval
- Regex queries (Searcher.search_regex/4) - programmatic RegexQuery API
- MoreLikeThis (Searcher.search_more_like_this/3) - find similar documents
- Sort by field value (Searcher.search_query_sorted/5) - TopDocs::order_by_fast_field

Aggregations:
- Single JSON pass-through NIF (DirtyCpu scheduled) for all aggregation types
- Elixir builder DSL: Aggregation, Aggregation.Bucket, Aggregation.Metric
- Supports terms, range, histogram, filter buckets + avg/sum/min/max/stats/
  cardinality/percentiles metrics with nesting

FieldDef tuple extended from 4 to 6 elements (name, type, stored, indexed, fast, tokenizer).
229 tests (54 new), 0 failures. Clippy clean.
- CHANGELOG: document all 7 new features added in 0.5.4
- README: update features list, field types table, add sections for regex,
  MLT, count, sort, aggregations; update API reference and dev status
- Add examples/aggregation_demo.exs showcasing all new features
@nyo16 nyo16 merged commit 909d167 into master Apr 11, 2026
3 checks passed
@nyo16 nyo16 deleted the feature/tantivy-0.26-features branch April 11, 2026 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant