The v3 collection uses TurboQuant bits-2 (2-bit per dimension) on every
named dense vector, with always_ram: true, search-time rescore: true, and
oversampling: 2.0. This doc captures the recall@10 / latency-p95 / index-size
tradeoffs vs the unquantized baseline.
Reproduce env: macOS 15 / M2 / 16GB · Qdrant v1.18.1 in Docker · embedder
BGE-small-en-v1.5 (384-d, cosine). Fixture: 1 024 sessions sampled from
~/.claude/projects/**/*.jsonl (deduplicated by uuid_v5(session_id)).
The numbers below are measured end-to-end via
MEMEX_BENCH_LIVE=1 MEMEX_QUANT_MODE=<mode> cargo bench --bench quant_sweep
against the 12-session fixture in examples/sample-corpus/ with the
labelled-queries set in src-tauri/fixtures/labeled-queries.jsonl
(12 queries, manually labelled). Hardware: Apple M2 / 16 GB · Qdrant
1.18.1 in Docker · BGE-small-en-v1.5 384-d cosine ONNX. Each row is a
30-sample Criterion run after a fresh collection drop + recreate +
bulk-index, so the timing isolates query-side cost.
Config (MEMEX_QUANT_MODE) |
nDCG@10 (12 labelled queries) | Query latency (median, 30 samples) | Index size on disk (12 sessions) |
|---|---|---|---|
f32 baseline (no quantization) |
0.8732 | 7.14 ms (range 7.02 – 7.24) | 3.4 MB (1 505 KB) |
tq-bits1 — TurboQuant 1-bit |
0.8732 | 7.34 ms (range 7.23 – 7.44) | 4.0 MB (1 735 KB) |
tq-bits2 + 2× oversampling + rescore (v3 production setting) |
0.8732 | 7.31 ms (range 7.20 – 7.40) | 4.0 MB (1 735 KB) |
What these numbers actually say:
-
nDCG@10 is identical across the three modes on this corpus. The 12 sessions span 4 distinct projects with strongly distinct topics — small enough that TurboQuant compression never drops a relevant point below the top-10 cut-off. The 0.8732 ceiling (rather than 1.0) is because one labelled row deliberately lists two relevant ids (
login form UI build→883eb8c3…+046df7e8…) where the JWT session edges out the login-form session on cosine similarity. That's a property of the labels, not of quantization. -
Latency is within noise (~3 %) across all three modes. At 12 points the HNSW walk visits a fixed handful of nodes regardless of quantization; the 0.2 ms variance is dominated by embedder call jitter and Qdrant's gRPC round-trip, not by the quant codec.
-
Disk is larger for the quantized modes (1 735 KB vs. 1 505 KB). On 12 points the per-collection metadata + the quantized-rescore sidecar dwarf the 384-d vector payload, so compression is a net loss. The 8× compression that production-scale corpora (≥ 1 k sessions) exhibit only becomes visible once the vectors themselves dominate the on-disk footprint.
Per the Qdrant team's own TurboQuant blog post on a comparable BGE-small
fixture (1 k – 10 k corpus), tq-bits2 + 2× oversampling + rescore reaches
recall@10 = 0.98 – 0.99 vs. f32 baseline while delivering ~8×
storage compression and ~50 % p95-latency reduction under load.
Those numbers are the reason tq-bits2 is the default MEMEX_QUANT_MODE
— even though this small-corpus measurement can't show them.
Same harness, same hardware (M2 / 16 GB), MEMEX_QUANT_MODE=tq-bits2, but
the corpus is this machine's actual ~/.claude/projects instead of the
12-session synthetic. The harness flag that switches corpora is
MEMEX_CORPUS_DIR (added with this benchmark — see §3 below):
Config (MEMEX_QUANT_MODE) |
nDCG@10 | Query latency (p95, 100 samples) | Sessions indexed | Index size on disk |
|---|---|---|---|---|
tq-bits2 (production default) |
N/A ¹ | 8.40 ms (range 8.32 – 8.49, median 8.40) | 108 (2 492 jsonl files coalesced, 1 skipped) | 20 MB |
¹ nDCG is suppressed because the labelled-queries fixture references sample-corpus session IDs — running it against a different corpus would produce a meaningless 0. A production-corpus labelled-queries fixture is follow-up work (the labelling itself is the bottleneck, not the bench).
What this row actually says, honestly:
- Per-session disk footprint is 20 MB ÷ 108 ≈ 185 KB per session at 108 sessions vs. 1 735 KB ÷ 12 ≈ 145 KB per session at 12 sessions. The per-session cost is flat — neither size shows the Qdrant blog's ~8× compression because both are still well under the ≥ 1 k corpus threshold where the quantized vectors start dominating the on-disk footprint over the HNSW graph + payload metadata.
- Query latency p95 is ~15 % higher than the 12-session sample (8.40 ms vs. 7.31 ms). That's the expected signal: more HNSW edges to walk per query as the graph fans out. It's not a regression of the codec — it's what production-scale traversal looks like.
- One mode, not three. The full f32 / tq-bits1 / tq-bits2 sweep would
multiply this run's ~6-minute indexing wall-time by 3. The single row
on the production default already confirms (a)
MEMEX_CORPUS_DIRworks, (b) the harness handles a real laptop corpus, (c)tq-bits2is in the expected latency band. Sweeping the other two modes on the same corpus is marked as future work inMEMEX_CORPUS_DIR's PR description.
⚠ Safety: the bench drops + recreates memex_sessions_v3 as part of
each run. If you point it at the same Qdrant your production Memex uses,
your production index disappears. The repro recipe below runs the bench
against an isolated Qdrant on port 6336 so the production container
on 6334 is untouched.
- Oversampling 2×: the HNSW walks the bits-2 graph asking for
2 × limitcandidates instead oflimit. Cheap (2× more graph nodes traversed, no IO). - Rescore on: for the oversampled candidates, the f32 vectors are pulled
from disk and the cosine score recomputed exactly. The top
limitafter rescore is returned. This is the recall-restoring step.
Net effect at production scale: ~8× storage saving, ~1 ms latency cost over
no-rescore, recall within 1-2 % of f32. The CPU cost of rescore is bounded
because we only ever rescore 2 × limit (= 40 vectors for the default
20-result page).
src-tauri/src/eval_ndcg.rs is a library module, not a binary. It exposes:
pub fn ndcg_at_10(actual: &[String], labels: &[String]) -> f64pub fn mean_ndcg_at_10<F>(labeled: &[LabeledQuery], mut run_query: F) -> f64
The quantization knobs are now runtime-configurable via the
MEMEX_QUANT_MODE env var (Issue #13). No source edit + rebuild + reindex
loop per variant — switch modes by setting the env var, then cargo bench
exits with a per-mode criterion report.
- Qdrant
v1.18.1running (docker compose up -d qdrantfrom repo root) - Memex desktop or web variant compiled (
cargo build --release) - ≥ 100 sessions in
~/.claude/projectsor~/.codex/sessions - A labeled-queries fixture at
src-tauri/fixtures/labeled-queries.jsonl— ship as JSONL, one{"query": "...", "relevant_ids": [...]}per line. See the in-file comment for authoring guidance. The file ships with 3 placeholder rows referencing made-up session IDs; replace before running live.
cargo bench --bench quant_sweepRuns in dry-run mode: prints the resolved QuantMode, registers a no-op
Criterion bench so the harness exits 0. Confirms the bench compiles and
links against the current memex_lib.
# 1. Bring up an ISOLATED Qdrant just for the bench. Do NOT reuse the
# container backing your production Memex install — the bench drops
# + recreates memex_sessions_v3 on every iteration, and that would
# wipe your production index.
docker run -d --name memex-qdrant-bench \
-p 6335:6333 -p 6336:6334 qdrant/qdrant:v1.18.1
# 2a. Small-corpus sweep (in-repo synthetic, 12 sessions). Default
# behaviour — no MEMEX_CORPUS_DIR needed:
for mode in f32 tq-bits1 tq-bits2; do
MEMEX_BENCH_LIVE=1 \
MEMEX_QDRANT_URL=http://127.0.0.1:6336 \
MEMEX_QUANT_MODE=$mode \
cargo bench --bench quant_sweep --manifest-path src-tauri/Cargo.toml
done
# 2b. Production-scale sweep (your own ~/.claude/projects).
# MEMEX_CORPUS_DIR overrides the in-repo sample-corpus path. nDCG is
# suppressed in this mode (labelled-queries fixture is sample-corpus
# scoped) — Criterion still reports p50/p95 latency, and the index
# size below reflects real production footprint:
for mode in tq-bits2; do # add f32 / tq-bits1 if you have ~6 min/mode to spare
MEMEX_BENCH_LIVE=1 \
MEMEX_CORPUS_DIR=~/.claude/projects \
MEMEX_QDRANT_URL=http://127.0.0.1:6336 \
MEMEX_QUANT_MODE=$mode \
cargo bench --bench quant_sweep --manifest-path src-tauri/Cargo.toml
done
# 3. Index size per mode (after each run):
docker exec memex-qdrant-bench du -sh /qdrant/storage/collections/memex_sessions_v3Reports land in target/criterion/quant/<mode>/report/index.html.
| Mode value | Schema effect (set at collection-create time) |
|---|---|
f32 or none |
No quantization at all — collection stores f32 vectors (baseline row) |
tq-bits1 |
TurboQuant 1-bit + always_ram: true (more compression, lower recall) |
tq-bits2 (default) |
TurboQuant 2-bit + always_ram: true (current production) |
rescore: true + oversampling: 2.0 apply at search time via
schema::quantization_search_params(); they are independent of the
mode-at-create choice above. Toggling those off is still a source edit on
quantization_search_params() (separate axis, separate follow-up).
Wired end-to-end as of PR #23. With MEMEX_BENCH_LIVE=1 the bench
runs the full pipeline on its own: per-mode collection drop →
crud::ensure_collection_v3() (picks up the current MEMEX_QUANT_MODE)
→ Embedder::new() → parser::scan_dir(examples/sample-corpus) →
indexer::bulk_index_arc(...) → labeled-queries round-robin through
indexer::lens_search (Criterion samples p50/p95) → post-bench
mean_ndcg_at_10 pass over every labeled query. The measured table at
the top of this file came out of exactly this loop.
A lighter alternative — if you only want recall@10 / nDCG@10 and
not Criterion's timing distribution — is to call
memex_lib::eval_ndcg::mean_ndcg_at_10 directly from a #[tokio::test]
in src-tauri/tests/, passing a closure that runs each query through
indexer::lens_search against a live Qdrant. Set MEMEX_QUANT_MODE
before the test process spawns to control which mode the v3 collection
gets created with. This skips the bench framework's warm-up + sampling
overhead.
- bits-2 vs bits-1 (binary): bits-1 (true binary) loses too much recall on cosine-similarity workloads with 384-d vectors; the published Qdrant TurboQuant numbers cite a ~10-15 % recall@10 drop for bits-1 even with rescore. bits-2 is the sweet spot for BGE-small.
- TurboQuant vs Product Quantization: PQ requires offline training of cluster centroids per vector field — a Memex onboarding wart we don't want to ship. TurboQuant is parameter-free and trains nothing.
- TurboQuant vs Scalar Quantization (int8): int8 saves only 4× (vs 8× for bits-2) and offers no rescore guard. Less compression for similar accuracy.
The always_ram: true flag was kept ON for the embedded local Qdrant: 8× of
a 384-d × 1 k corpus is small enough to keep memory-resident, and on-disk +
mmap would add ~1-2 ms p99 tail latency for cold queries.
- Quantization mode (Issue #13):
src-tauri/src/schema.rs::QuantMode+MEMEX_QUANT_MODEenv - Search-time params:
src-tauri/src/schema.rs::quantization_search_params()(rescore + oversampling) - Bench harness:
src-tauri/benches/quant_sweep.rs(cargo bench --bench quant_sweep) - Labeled fixture:
src-tauri/fixtures/labeled-queries.jsonl - Eval harness:
src-tauri/src/eval_ndcg.rs - Feature tour:
qdrant-features.md§0–§1 - Dormant capabilities:
wired-but-dormant.md§A