Hi Wes — thanks for msgvault, I've enjoyed digging through the code.
I've been teaching myself RAG by building a little email-RAG side project (https://github.com/fmasi/mailrag), and while messing with retrieval I started benchmarking against msgvault. A few things turned up that feel more useful to a real project than to my hobby repo, so I figured I'd share them.
I wanted it reproducible by someone other than me, so I ran msgvault's fts/vector/hybrid modes against the public TREC 2010 Legal Track (Enron) qrels — real human relevance judgments, so anyone can re-run it. Two observations, and I might just be holding it wrong:
- There's no way to measure retrieval quality today, which felt like the main gap. I hacked together a tiny msgvault eval locally (recall@k, nDCG, MAP over a qrels file) just to see, and it works. Happy to share it.
- A reranking stage helps a lot on the real qrels: P@10 went from about 0.23 to 0.40 on the verbose queries.
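To make the metrics concrete, here is a rough Python sketch of what an eval like that computes. It is illustrative only, not the code I ran: the function names are made up, and it assumes the plain four-column TREC qrels layout (topic, iteration, doc id, relevance) and ranked lists with no duplicate doc ids.

```python
import math
from collections import defaultdict


def load_qrels(lines):
    """Parse qrels lines of the assumed form: topic iteration doc_id relevance."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, rel = parts
        qrels[topic][doc_id] = int(rel)
    return dict(qrels)


def precision_at_k(ranked, rels, k):
    """Fraction of the top-k results judged relevant (relevance > 0)."""
    return sum(1 for d in ranked[:k] if rels.get(d, 0) > 0) / k


def recall_at_k(ranked, rels, k):
    """Fraction of all judged-relevant docs that appear in the top k."""
    relevant = {d for d, r in rels.items() if r > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def average_precision(ranked, rels):
    """Mean of precision values at each rank where a relevant doc appears."""
    relevant = {d for d, r in rels.items() if r > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked, rels, k):
    """DCG of the top k divided by the DCG of an ideal ordering of the judged docs."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [max(rels.get(d, 0), 0) for d in ranked[:k]]
    ideal = sorted((max(r, 0) for r in rels.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_average_precision(run, qrels):
    """MAP over the judged topics; `run` maps topic -> ranked doc ids."""
    if not qrels:
        return 0.0
    return sum(average_precision(run.get(t, []), rels) for t, rels in qrels.items()) / len(qrels)
```

Running each mode's ranked ids per topic through the same functions gives numbers that are directly comparable across fts, vector, and hybrid.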
(I also hit one clean bug while doing this, filed separately as #366: hybrid mode errors on FTS5 special characters in the query.)
Why I bothered: on my own ~20k-email work mailbox, the same stack (hybrid, reranking, thread expansion, contextual summaries) took coverage@3 from 45% to 84%, and recall@1 from 36% to 70%. Big caveat: those 360 eval queries are LLM-generated and LLM-judged, which is honestly the whole reason I wanted to re-test on a public set with real human judgments. The private numbers are what I live with day to day; the public run is there so you don't have to take my synthetic set on faith. Method's written up in the repo.
Anyway — would an in-tree retrieval eval (and maybe an optional reranker) be useful to you, or is it out of scope? If you're open to it I'm glad to do the work; I'd just want a steer on the shape you'd prefer (an internal/eval package plus a dev command, a separate tool, whatever fits). No worries if it's not a fit.
Thanks for reading.