feat(eval): clean A/B (generate once → verification off/on) + README: robustness designed for small models #18
Merged
…ME small-model design

The first eval ran review() twice (once per config), so the precision delta between baseline and verify was confounded with DeepSeek's generation randomness. The report itself lists this as its #1 limitation.

Fix: generate findings ONCE per PR, then score two configs off that single generation: baseline = the raw findings, optimized = the SAME findings after adversarial verification. The only variable left is verification, so the precision attribution is clean.

- runCleanAB() injects two clients (generation, verification) so token cost splits cleanly: baseline = generation, optimized = generation + verify overhead, with the generation shared and identical between configs.
- Entry guard (import.meta.url vs argv[1]) so importing run.ts for tests does not kick off a live eval / exit the process.
- New run.test.ts with fakes: asserts a single shared generation (genCalls==1), FP refuted away (precision 1/2 -> 1/1, recall unchanged), token attribution (110 / 135), and graceful per-case error handling.
- Methodology strings updated to describe the clean A/B + token split.

README: new section "为低成本 / 小模型而设计的鲁棒性" (robustness designed for low-cost / small models) articulating that the pipeline is built so cheap/small models don't break: forced structured output + regen, graceful degradation, adversarial verification (over-flagging), deterministic dedup (drift), caching (cost). Reframes the DeepSeek note: small models run reliably; Opus is for the deepest reasoning, not for not crashing.

Eval-only + docs; no Action-bundle code touched, so dist/ is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Clean A/B: generate once → run verification off/on over the same batch
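The first version of the eval called review() once per config, generating findings from scratch each time. That mixed the verification effect with DeepSeek's generation randomness, which the eval report itself names as its number-one limitation. This PR changes the procedure: findings are generated only once per PR, and that same batch is then scored twice, with baseline = no verification and optimized = the same batch after adversarial verification. Verification is the only remaining variable, so the precision attribution is clean.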
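The generate-once, score-twice idea above can be sketched roughly as follows. This is a minimal illustration, not the code in run.ts: the names `cleanAB`, `score`, `Finding`, and `Call` are hypothetical, the clients are modeled as synchronous functions for brevity (real model clients would presumably be async), and the fake token counts (110 for generation, 25 for verification) are assumed values chosen only so the totals match the 110 / 135 figures quoted in this PR.

```typescript
// Hypothetical types; the real ones in run.ts are not shown in this PR.
interface Finding { id: string; }
interface Call<T> { value: T; tokens: number; }
interface Score { precision: number; recall: number; tokens: number; }

// precision = true positives / reported; recall = true positives / expected.
function score(findings: Finding[], expected: Set<string>, tokens: number): Score {
  const tp = findings.filter((f) => expected.has(f.id)).length;
  return {
    precision: findings.length === 0 ? 0 : tp / findings.length,
    recall: expected.size === 0 ? 0 : tp / expected.size,
    tokens,
  };
}

function cleanAB(
  generate: () => Call<Finding[]>,
  verify: (findings: Finding[]) => Call<Finding[]>,
  expected: Set<string>,
): { baseline: Score; optimized: Score } {
  const gen = generate();        // the only generation call; both configs share it
  const ver = verify(gen.value); // findings that survive adversarial verification
  return {
    // baseline pays for generation only
    baseline: score(gen.value, expected, gen.tokens),
    // optimized pays for the same generation plus the verification overhead
    optimized: score(ver.value, expected, gen.tokens + ver.tokens),
  };
}

// Usage with fake clients (assumed numbers, see the note above).
let genCalls = 0;
const result = cleanAB(
  () => {
    genCalls++;
    return { value: [{ id: "real-bug" }, { id: "false-alarm" }], tokens: 110 };
  },
  (fs) => ({ value: fs.filter((f) => f.id !== "false-alarm"), tokens: 25 }),
  new Set(["real-bug"]),
);
console.log(genCalls, result.baseline, result.optimized);
```

Because both scores are computed from the same `gen.value`, any difference between them can only come from the verification step, which is the property the clean A/B is meant to guarantee.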
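- runCleanAB() injects two clients (generation / verification) so tokens split cleanly: baseline = generation, optimized = generation + verification overhead. The generation part is one and the same call for both configs, so the two are directly comparable.
- Entry guard (import.meta.url vs argv[1]) lets tests import run.ts safely without triggering a live eval or exiting the process.
- New run.test.ts (with fake clients) asserts: findings are generated only once (genCalls==1), the false positive is refuted away (precision 1/2 → 1/1, recall unchanged), token attribution (110 / 135), and an error in a single case does not crash the run.

README: robustness designed for low-cost / small models

Adds a section explaining that the whole pipeline is designed so that nothing breaks even when it runs on a cheap, small model. Each failure mode has a matching mechanism: forced structured output + regeneration (format hallucination), graceful degradation (total parse failure), adversarial verification (over-flagging), deterministic dedup (wording drift), caching + progressive disclosure (cost). The DeepSeek note is also reframed from "weaker, only validates the pipeline" to: small models run reliably too, and Opus is there for the deepest reasoning, not "to avoid crashing".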
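To make the first two mechanisms concrete, here is a rough sketch of forced structured output with regeneration, falling back to graceful degradation when every attempt fails to parse. It is an assumption-laden illustration rather than the pipeline's actual code: `parseFindings`, `generateWithRetry`, the `Finding` shape, the `degraded` flag, and the default of three attempts are all hypothetical, and the model call is modeled as a synchronous function for brevity.

```typescript
// Hypothetical finding shape; the real schema is not shown in this PR.
interface Finding { file: string; line: number; message: string; }

// Returns validated findings, or null when the reply is not usable JSON.
function parseFindings(raw: string): Finding[] | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (!Array.isArray(data)) return null;
    const ok = data.every((d) =>
      typeof d === "object" && d !== null &&
      typeof (d as Finding).file === "string" &&
      typeof (d as Finding).line === "number" &&
      typeof (d as Finding).message === "string");
    return ok ? (data as Finding[]) : null;
  } catch {
    return null; // format hallucination: the reply was not JSON at all
  }
}

function generateWithRetry(
  callModel: (attempt: number) => string,
  maxAttempts = 3, // assumed default, not taken from the PR
): { findings: Finding[]; degraded: boolean } {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const parsed = parseFindings(callModel(attempt));
    if (parsed !== null) return { findings: parsed, degraded: false };
    // otherwise regenerate: ask the model again
  }
  // Graceful degradation: report nothing instead of crashing the run.
  return { findings: [], degraded: true };
}

// Usage with fake model replies: the first is chatty prose, the second is valid.
const replies = [
  "Sure! Here are the findings:",
  '[{"file":"a.ts","line":3,"message":"off-by-one"}]',
];
const recovered = generateWithRetry((n) => replies[n - 1] ?? "");
const gaveUp = generateWithRetry(() => "not json");
console.log(recovered, gaveUp);
```

The point of the design is that a malformed reply costs one extra call at most, and a persistently malformed one yields an empty, flagged result that the caller can report, rather than an exception.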
Only eval code and docs are changed; the Action bundle code is untouched, so dist/ is unchanged. All 112 tests pass.
🤖 Generated with Claude Code