Idea: a small retrieval-quality eval for msgvault (+ a finding) — worth pursuing?

Hi Wes — thanks for msgvault, I've enjoyed digging through the code.

I've been teaching myself RAG by building a little email-RAG side project (https://github.com/fmasi/mailrag), and while messing with retrieval I started benchmarking against msgvault. A few things turned up that feel more useful to a real project than to my hobby repo, so I figured I'd share them.

I wanted it reproducible by someone other than me, so I ran msgvault's fts/vector/hybrid modes against the public TREC 2010 Legal Track (Enron) qrels — real human relevance judgments, so anyone can re-run it. Two observations, and I might just be holding it wrong:

- msgvault has no built-in way to measure retrieval quality today, which felt like the main gap. I hacked together a tiny `msgvault eval` command locally (recall@k, nDCG, and MAP over a qrels file) just to see, and it works. Happy to share it.
- A reranking stage helps a lot on the real qrels: P@10 went from about 0.23 to 0.40 on the verbose queries.
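
For anyone who wants the metrics spelled out, here is a minimal Python sketch of what an eval like this computes. It is not the code from my local `msgvault eval`. The function names are made up for illustration, and it assumes the common four-column TREC qrels layout (`topic iteration doc_id relevance`), which may not match every Legal Track file exactly.

```python
import math
from collections import defaultdict


def parse_qrels(lines):
    """Parse 'topic iteration doc_id relevance' lines into {topic: {doc_id: rel}}.

    Assumes the standard four-column TREC layout; other lines are skipped.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        topic, _iteration, doc_id, rel = parts
        qrels[topic][doc_id] = int(rel)
    return dict(qrels)


def precision_at_k(ranked, rels, k):
    """Fraction of the top-k results that are judged relevant (rel > 0)."""
    return sum(1 for d in ranked[:k] if rels.get(d, 0) > 0) / k


def recall_at_k(ranked, rels, k):
    """Fraction of all judged-relevant documents that appear in the top k."""
    relevant = {d for d, r in rels.items() if r > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def ndcg_at_k(ranked, rels, k):
    """nDCG with linear gain and log2 discount; unjudged docs count as 0."""
    dcg = sum(max(rels.get(d, 0), 0) / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]))
    ideal = sorted((r for r in rels.values() if r > 0), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def average_precision(ranked, rels):
    """Mean of precision values at each rank where a relevant doc appears."""
    relevant = {d for d, r in rels.items() if r > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    # Toy data, not real TREC judgments.
    qrels = parse_qrels(["1 0 a 1", "1 0 b 0", "1 0 c 1"])
    ranked = ["a", "b", "c"]  # system output for topic "1", best first
    print(precision_at_k(ranked, qrels["1"], 2))
    print(recall_at_k(ranked, qrels["1"], 2))
    print(ndcg_at_k(ranked, qrels["1"], 3))
    print(average_precision(ranked, qrels["1"]))
```

MAP is just `average_precision` averaged over topics. The sketch assumes each ranked list has no duplicate document IDs.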

(I also hit one clean bug while doing this, filed separately as #366: hybrid mode errors on FTS5 special characters in the query.)
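
One common way to sidestep this class of error, sketched in Python: turn each whitespace-separated token into an FTS5 string literal by wrapping it in double quotes and doubling any embedded double quote. This is only an illustration with a made-up helper name, not the fix proposed in #366 and not how msgvault builds its queries.

```python
def fts5_quote(user_query: str) -> str:
    """Return a MATCH expression where every token is a quoted FTS5 string.

    Inside double quotes, FTS5 treats characters like '-', ':' or '(' as
    plain text; an embedded double quote is escaped by doubling it.
    """
    tokens = user_query.split()
    return " ".join('"' + t.replace('"', '""') + '"' for t in tokens)


if __name__ == "__main__":
    print(fts5_quote('re: foo-bar (draft)'))
```

The trade-off is that quoting also disables query operators the user may have typed on purpose (`OR`, `NOT`, prefix `*`), so it suits free-text input better than an advanced-search box.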

Why I bothered: on my own ~20k-email work mailbox, the same stack (hybrid, reranking, thread expansion, contextual summaries) took coverage@3 from 45% to 84%, and recall@1 from 36% to 70%. Big caveat: those 360 eval queries are LLM-generated and LLM-judged, which is honestly the whole reason I wanted to re-test on a public set with real human judgments. The private numbers are what I live with day to day; the public run is there so you don't have to take my synthetic set on faith. Method's written up in the repo.

Anyway — would an in-tree retrieval eval (and maybe an optional reranker) be useful to you, or is it out of scope? If you're open to it I'm glad to do the work; I'd just want a steer on the shape you'd prefer (an `internal/eval` package plus a dev command, a separate tool, whatever fits). No worries if it's not a fit.

Thanks for reading.


Idea: a small retrieval-quality eval for msgvault (+ a finding) — worth pursuing? #367
