feat(eval): clean A/B (generate once → verification off/on) + README robustness designed for small models by NeverENG · Pull Request #18 · NeverENG/BanGD-AI-PR-Review-Tool

NeverENG · 2026-05-31T14:16:30Z

Clean A/B: generate once → run verification off/on against the same batch

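The first version of the eval called review() once per configuration, generating a fresh set of findings each time, so the effect of verification was confounded with DeepSeek's generation randomness. The eval report itself names this as its number one limitation.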

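This PR changes that: findings are generated only once per PR, and the same batch of findings is then scored twice. Baseline = no verification; optimized = the same batch after adversarial verification. Verification itself is the only variable, so the precision attribution is clean.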
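The generate-once, score-twice flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in run.ts: the `Finding`, `GenClient`, `VerifyClient`, and `Score` shapes and the `cleanABSketch` name are assumptions made for the example, as is the convention that precision is 0 when there are no findings.

```typescript
// Hypothetical shapes: the real types in run.ts are not shown in this PR.
interface Finding { id: string; }
interface Usage { tokens: number; }
interface GenClient {
  generate(pr: string): Promise<{ findings: Finding[]; usage: Usage }>;
}
interface VerifyClient {
  verify(findings: Finding[]): Promise<{ kept: Finding[]; usage: Usage }>;
}
interface Score { precision: number; recall: number; tokens: number; }

// Score one batch of findings against the ground-truth finding ids.
// Assumption: an empty batch scores precision 0 rather than undefined.
function score(findings: Finding[], truth: Set<string>, tokens: number): Score {
  const truePositives = findings.filter((f) => truth.has(f.id)).length;
  return {
    precision: findings.length === 0 ? 0 : truePositives / findings.length,
    recall: truth.size === 0 ? 0 : truePositives / truth.size,
    tokens,
  };
}

// Generate once, then score the SAME batch with verification off and on.
async function cleanABSketch(
  pr: string,
  truth: Set<string>,
  gen: GenClient,
  ver: VerifyClient,
): Promise<{ baseline: Score; optimized: Score }> {
  const generated = await gen.generate(pr);            // the only generation call
  const verified = await ver.verify(generated.findings); // same batch, verification on
  return {
    // Baseline pays for generation only.
    baseline: score(generated.findings, truth, generated.usage.tokens),
    // Optimized pays for the shared generation plus the verification overhead.
    optimized: score(
      verified.kept,
      truth,
      generated.usage.tokens + verified.usage.tokens,
    ),
  };
}
```

With a fake generator that returns one real finding and one false positive, and a fake verifier that refutes the false positive, precision moves from 1/2 to 1/1 while recall stays at 1/1, and the optimized token count is the baseline count plus whatever the verifier reports.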

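runCleanAB() injects two clients (generation / verification) so that token cost splits cleanly: baseline = generation, optimized = generation + verification overhead. The generation part is a single call shared by both configurations, so it is directly comparable.
Entry guard (import.meta.url vs argv[1]) so that tests can safely import run.ts without triggering a live eval or exiting the process.
New run.test.ts (using fake clients) asserts that generation happens only once (genCalls==1), that the false positive is refuted away (precision 1/2 → 1/1, recall unchanged), that tokens are attributed correctly (110 / 135), and that an error in a single case does not crash the run.
Methodology copy updated to describe "generate once → verification off/on" plus the token split.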
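For readers unfamiliar with the entry-guard idiom mentioned above, here is one common way to write it in a Node.js ES module. It is a sketch under assumptions: the helper name `isEntryPoint` is hypothetical, and the PR does not show how run.ts actually performs the comparison.

```typescript
import { pathToFileURL } from "node:url";

// Returns true only when the module at `moduleUrl` is the script Node was
// launched with. `moduleUrl` is expected to be that module's import.meta.url
// and `argv1` the value of process.argv[1].
function isEntryPoint(moduleUrl: string, argv1: string | undefined): boolean {
  if (argv1 === undefined) return false; // e.g. REPL or unusual launchers
  // Convert the script path to a file:// URL so both sides use the same form.
  return pathToFileURL(argv1).href === moduleUrl;
}

// At the bottom of the entry file (ESM only), the guard would read roughly:
//
//   if (isEntryPoint(import.meta.url, process.argv[1])) {
//     void main(); // hypothetical function that runs the live eval
//   }
//
// A test that imports the file gets a different argv[1] (the test runner or
// the test file), so main() is never called on import.
```

One caveat of this form: if the script is launched through a symlink, argv[1] and import.meta.url can disagree, so a stricter guard may need to resolve real paths first.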

README: robustness designed for low-cost / small models

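Adds a new section explaining that the whole pipeline is designed to hold up even when it runs on a cheap, small model: forced structured output + regeneration (format hallucination), graceful degradation (complete parse failure), adversarial verification (over-flagging), deterministic deduplication (wording drift), and caching + progressive disclosure (cost). The DeepSeek note is also reframed from "weaker, only validates the plumbing" to: small models run reliably too, and Opus is there only for the deepest reasoning, not to keep the pipeline from crashing.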
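To make the first two mechanisms concrete, the sketch below shows one way "forced structured output + regeneration" and "graceful degradation" can fit together. Everything here is an assumption for illustration: the `Finding` fields, the `parseFindings` and `reviewWithRegen` names, and the default of three attempts are not taken from the repository.

```typescript
// Hypothetical finding shape; the tool's real schema is not shown in this PR.
interface Finding { file: string; line: number; message: string; }
interface ReviewResult { findings: Finding[]; degraded: boolean; attempts: number; }

// Strictly validate the model output. Anything that is not a JSON array of
// well-formed findings is rejected (returns null) rather than partially used.
function parseFindings(raw: string): Finding[] | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (!Array.isArray(data)) return null;
    const wellFormed = data.every(
      (d) =>
        typeof d === "object" &&
        d !== null &&
        typeof (d as Finding).file === "string" &&
        typeof (d as Finding).line === "number" &&
        typeof (d as Finding).message === "string",
    );
    return wellFormed ? (data as Finding[]) : null;
  } catch {
    return null; // not valid JSON at all
  }
}

// Ask the model again when the output fails validation. After maxAttempts
// failures, degrade gracefully: return no findings and flag the result
// instead of throwing and failing the whole review.
async function reviewWithRegen(
  generate: (attempt: number) => Promise<string>,
  maxAttempts = 3,
): Promise<ReviewResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const parsed = parseFindings(await generate(attempt));
    if (parsed !== null) {
      return { findings: parsed, degraded: false, attempts: attempt };
    }
  }
  return { findings: [], degraded: true, attempts: maxAttempts };
}
```

The sketch only covers malformed output; errors thrown by `generate` itself (network failures, rate limits) would need their own handling.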

Only the eval and docs are changed; no Action bundle code is touched, so dist/ is unchanged. All 112 tests pass.

🤖 Generated with Claude Code

…ME small-model design

The first eval ran review() twice (once per config), so the precision delta between baseline and verify was confounded with DeepSeek's generation randomness, which is the report's own #1 limitation.

Fix: generate findings ONCE per PR, then score two configs off that single generation: baseline = the raw findings, optimized = the SAME findings after adversarial verification. The only variable left is verification, so the precision attribution is clean.

- runCleanAB() injects two clients (generation, verification) so token cost splits cleanly: baseline = generation, optimized = generation + verify overhead, with the generation shared and identical between configs.
- Entry guard (import.meta.url vs argv[1]) so importing run.ts for tests does not kick off a live eval / exit the process.
- New run.test.ts with fakes: asserts a single shared generation (genCalls==1), FP refuted away (precision 1/2 -> 1/1, recall unchanged), token attribution (110 / 135), and graceful per-case error handling.
- Methodology strings updated to describe the clean A/B + token split.

README: new section "为低成本 / 小模型而设计的鲁棒性" (robustness designed for low-cost / small models) articulating that the pipeline is built so cheap/small models don't break: forced structured output + regen, graceful degradation, adversarial verification (over-flagging), deterministic dedup (drift), caching (cost). Reframes the DeepSeek note: small models run reliably; Opus is for the deepest reasoning, not for not crashing.

Eval-only + docs; no Action-bundle code touched, so dist/ is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NeverENG merged commit aa60007 into main May 31, 2026
1 check passed

NeverENG merged 1 commit into main from feat/clean-ab-eval
