v3.13.0 — 5-Tier Scoring Pyramid + SaaS Middleware
What's New
5-Tier Scoring Pyramid
| Tier | Backend | Accuracy | Latency | Install |
|---|---|---|---|---|
| 5 | NLI (FactCG) | 75.6% BA | 14.6 ms | pip install director-ai[nli] |
| 4 | Distilled NLI (preview) | ~70% BA | 5 ms | pip install director-ai[nli-lite] |
| 3 | Embedding (bge-small) | ~65% BA | 3 ms | pip install director-ai[embed] |
| 2 | Rules engine (8 rules) | rule-based | <1 ms | pip install director-ai |
| 1 | Heuristic (lite) | ~55% BA | <1 ms | pip install director-ai |
Select a tier via config by setting scorer_backend to "rules", "embed", "deberta", or "lite".
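One way to read the table is as a latency budget: pick the highest tier whose listed latency still fits. The sketch below only illustrates that reasoning. `TIER_LATENCY_MS` and `highest_tier_within` are hypothetical names, not part of the director-ai API, and "<1 ms" is treated as 1.0 ms as a simplifying assumption.

```python
from typing import Optional

# Latencies copied from the tier table; "<1 ms" is approximated as 1.0 ms.
# Hypothetical helper data, not a director-ai structure.
TIER_LATENCY_MS = {
    5: 14.6,  # NLI (FactCG)
    4: 5.0,   # Distilled NLI (preview)
    3: 3.0,   # Embedding (bge-small)
    2: 1.0,   # Rules engine
    1: 1.0,   # Heuristic (lite)
}


def highest_tier_within(budget_ms: float) -> Optional[int]:
    """Return the highest tier whose listed latency fits the budget, else None."""
    fitting = [tier for tier, ms in TIER_LATENCY_MS.items() if ms <= budget_ms]
    return max(fitting) if fitting else None


print(highest_tier_within(4.0))   # 3
print(highest_tier_within(20.0))  # 5
```

The chosen tier still has to be mapped to a scorer_backend value and the matching install extra by hand.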
SaaS Middleware
- APIKeyMiddleware: Bearer/X-API-Key auth, constant-time hmac validation, audit-safe key hashing
- RateLimitMiddleware: per-key token-bucket with configurable RPM and burst
- Cloud Run Dockerfile (deploy/cloud-run/Dockerfile.saas): FactCG ONNX pre-baked, non-root, healthcheck
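The primitives named in the middleware bullets can be sketched with the standard library alone. This is a minimal illustration, not the shipped APIKeyMiddleware or RateLimitMiddleware code. `key_is_valid`, `audit_id`, and `TokenBucket` are hypothetical names, and the 12-character hash prefix is an arbitrary choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def key_is_valid(presented: str, expected: str) -> bool:
    """Compare keys with hmac.compare_digest to reduce timing side channels."""
    return hmac.compare_digest(presented.encode(), expected.encode())


def audit_id(api_key: str) -> str:
    """SHA-256 prefix so logs can refer to a key without storing the key itself."""
    return hashlib.sha256(api_key.encode()).hexdigest()[:12]


class TokenBucket:
    """One bucket per key: refills at rpm/60 tokens per second, capped at burst."""

    def __init__(self, rpm: float, burst: int, now: Optional[float] = None):
        self.rate = rpm / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Explicit timestamps make the behaviour deterministic for the example.
bucket = TokenBucket(rpm=60, burst=2, now=0.0)
print([bucket.allow(now=0.0) for _ in range(3)])  # [True, True, False]
print(bucket.allow(now=1.0))                      # True
```

A real middleware would keep one bucket per hashed key and return HTTP 429 when `allow()` is false. That wiring depends on the web framework and is left out here.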
Rust Fast-Path
4 rules from the rules scorer are wired to the Rust backfire_kernel: EntityGrounding, NumericConsistency, NegationFlip, WordOverlap
Fixes
- FactCG accuracy corrected: 75.8% → 75.6% per-dataset mean BA (AggreFact leaderboard convention)
- Leaderboard position: #6 (verified). With per-dataset tuning: 77.76% (potential #1)
- FaithLens 86.4% retraction (fabricated figure removed)
- HuggingFace supply-chain hardening: MODEL_REGISTRY with pinned revision SHAs
- Circular reasoning Python fallback punctuation bug
- 169 new CLI integration tests for 11 benchmark scripts
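To make the "per-dataset mean BA" wording in the accuracy fix concrete, the sketch below computes balanced accuracy for each dataset and then averages, so every dataset counts equally. It assumes binary labels (1 = supported, 0 = unsupported). The function names and toy data are illustrative and are not benchmark results.

```python
from statistics import mean


def balanced_accuracy(labels, preds):
    """Mean of recall on the positive class and recall on the negative class."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == 0) / len(neg)
    return (tpr + tnr) / 2


def per_dataset_mean_ba(datasets):
    """Average the per-dataset BA values (unweighted by dataset size)."""
    return mean(balanced_accuracy(y, p) for y, p in datasets.values())


# Toy data only: dataset "a" has BA 0.75, dataset "b" has BA 1.0.
toy = {
    "a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "b": ([1, 1, 0, 0], [1, 1, 0, 0]),
}
print(per_dataset_mean_ba(toy))  # 0.875
```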
Stats
- 9 scorer backends: deberta, onnx, minicheck, lite, rules, embed, nli-lite, rust, backfire
- 5000+ tests (5157 collected)
- 25 files updated in version bump
Note: Distilled NLI (Tier 4, nli-lite) is in preview: the code and backend are functional, but model training is pending.
Full changelog: https://github.com/anulum/director-ai/blob/main/CHANGELOG.md#3130--2026-04-13