feat: evaluation corpus + scaffolding, with baseline vs adversarial-verification report (DESIGN §六.2) by NeverENG · Pull Request #14 · NeverENG/BanGD-AI-PR-Review-Tool

NeverENG · 2026-05-31T07:28:51Z

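What this is

The evaluation corpus and evaluation scaffolding from DESIGN §六.2. It turns "feels accurate" into precision/recall/F1 and runs the first real A/B comparison.

⚠️ Depends on #13 (feat/general-niche-coverage): the evaluation also scores result.generalFindings, so this branch is stacked on top of it. Recommend merging this PR after #13 lands.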

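Corpus (real, trustworthy, leak-proof)

6 real BanDB PRs (eval/cases.json, frozen from GitHub by eval/build-corpus.mjs).
Labels are anchored on the author's explicitly stated intent or on defects the maintainer confirmed in a fix PR. For example, #58's crash-consistency issue and Stat→Seek race are exactly what BanGD originally flagged and the maintainer fixed in #63.
Leak prevention: real PR descriptions often name the defect outright (#43 describes itself as "deliberately written in a concurrent style to test BanGD"), so the real descriptions are discarded and replaced with a neutral one-liner; only the diff reaches the model.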

First-run results (DeepSeek, honest and counterintuitive)

Config	P	R	F1	Architecture false positives	Tokens
baseline (no verification)	21.4%	100%	35.3%	11	74.6k
verification, 3 votes per finding	0%	0%	0%	1	203.9k

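Adversarial verification backfires on a cheap model: the DeepSeek refuters over-refute and kill all 3 true findings (recall 100%→0%), F1 drops to zero, and the run burns 173% more tokens. This empirically backs DESIGN §二 (model choice is the dividing line: DeepSeek is only fit for smoke tests) and §六.3 (homogeneous refuters are a defect; refuters should be diversified by perspective). The full analysis is in docs/评测报告.md.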
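The figures in the table can be reproduced from raw counts. The sketch below assumes the baseline run produced 3 true positives, 11 false positives, and 0 false negatives, and the verified run 0 true positives, 1 false positive, and 3 false negatives. These counts are inferred from the reported percentages and the false-positive column, not quoted from the report, and `prf1` is a hypothetical helper rather than the scorer in src/eval/score.ts.

```typescript
// Hypothetical helper: not the actual src/eval/score.ts implementation.
interface Counts { tp: number; fp: number; fn: number; }
interface Metrics { precision: number; recall: number; f1: number; }

function prf1({ tp, fp, fn }: Counts): Metrics {
  // Treat each ratio as 0 when its denominator is 0 (an assumed convention).
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;

// Assumed counts, inferred from the table rather than taken from the report.
const baseline = prf1({ tp: 3, fp: 11, fn: 0 });
const verified = prf1({ tp: 0, fp: 1, fn: 3 });

// Relative token overhead of the verified run over the baseline (74.6k -> 203.9k).
const tokenOverhead = (203.9 - 74.6) / 74.6;

console.log(pct(baseline.precision), pct(baseline.recall), pct(baseline.f1));
console.log(pct(verified.precision), pct(verified.recall), pct(verified.f1));
console.log(`+${Math.round(tokenOverhead * 100)}% tokens`);
```

Under those assumed counts, the baseline precision is 3/14 and the token overhead is (203.9 − 74.6)/74.6, which round to the 21.4% and 173% quoted above.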

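Action items (in the report)

The verify_votes default should scale with model strength; on weak models it should be lowered or turned off.
Real accuracy must be re-run on an Opus-tier model (this report only represents the cheap-endpoint path).
Implement perspective-diverse refuters first, then validate them with an A/B on this corpus.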

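Scaffolding

src/eval/score.ts: pure-function scoring (13 unit tests); run.ts: single-variable A/B with an independent client per config to isolate token counts; npm run eval.
The key is read from env or the gitignored eval/.apikey and is never committed.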
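The key lookup described above might look roughly like the sketch below. `resolveApiKey`, its parameters, and the precedence between the two environment variables are assumptions for illustration; the real logic in run.ts may differ. File access is injected as a callback so the function stays pure and testable.

```typescript
// Hypothetical sketch; not the actual lookup in src/eval/run.ts.
type Env = Record<string, string | undefined>;

function resolveApiKey(env: Env, readKeyFile: () => string | undefined): string {
  // Environment variables first; the order between the two is an assumption.
  for (const name of ["ANTHROPIC_API_KEY", "DEEPSEEK_API_KEY"]) {
    const value = env[name]?.trim();
    if (value) return value;
  }
  // Fall back to the gitignored key file (eval/.apikey in this PR).
  const fromFile = readKeyFile()?.trim();
  if (fromFile) return fromFile;
  throw new Error(
    "No API key: set ANTHROPIC_API_KEY or DEEPSEEK_API_KEY, or create eval/.apikey",
  );
}
```

In a Node script, the caller would pass `process.env` and a reader that wraps `fs.readFileSync` and returns `undefined` when the file is missing.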

typecheck / lint / all 102 tests pass.

🤖 Generated with Claude Code

…eralFindings)

Architecture findings are BanGD's differentiator, but emitting only those cedes ordinary bugs (off-by-one, swallowed errors, nil deref, races) to Copilot. Add a second result class, generalFindings: concrete diff-evidenced correctness/logic defects, lightweight (no four-part format), filed inline in the PR comment (not as tracked issues).

- schema: GeneralFindingSchema + generalFindings (.default([]) for graceful degradation), kept in sync with the tool JSON schema by a test
- prompt/system-prompt: 4th output part + quality red-lines (diff-evidenced only, no style nits, no dup of architecture findings, <=6, [] when none)
- verify: generalize VerifyOutcome<T>/verifyItems<T>; verifyGeneralFindings runs the SAME adversarial majority-refute pass (refuter prompt reframed to "is this a real correctness defect")
- review: verify both finding kinds in parallel; ReviewOutcome.droppedGeneralFindings
- format: render general findings inline; omit the section when empty
- action(.yml): general_finding_count output; dropped count covers both kinds
- DESIGN.md §七: why both classes coexist, structural/delivery differences
- tests: rendering, verification, schema-sync, default-omit coverage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
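The majority-refute pass named in the verify bullet can be sketched as a pure tally over already-collected votes. The types and the name `applyMajorityRefute` are hypothetical, and the strict-majority threshold is an assumption; the repository's VerifyOutcome<T>/verifyItems<T> may use different shapes and rules.

```typescript
// Hypothetical sketch of majority-refute filtering; not the repository's verify module.
type Vote = "refute" | "uphold";

interface Ballot<T> { item: T; votes: Vote[]; }
interface Outcome<T> { kept: T[]; dropped: T[]; }

function applyMajorityRefute<T>(ballots: Ballot<T>[]): Outcome<T> {
  const outcome: Outcome<T> = { kept: [], dropped: [] };
  for (const { item, votes } of ballots) {
    const refutes = votes.filter((v) => v === "refute").length;
    // Assumed rule: drop only on a strict majority of refutations,
    // so with 3 votes, 2 refutations are enough to drop a finding.
    if (refutes * 2 > votes.length) outcome.dropped.push(item);
    else outcome.kept.push(item);
  }
  return outcome;
}
```

Because the tally is generic in `T`, the same function would serve both architecture findings and general findings, which is the point of generalizing the verifier.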

PR #13 shipped the generalFindings niche described in the system prompt but with no worked example, while every architecture dimension has one. A wrong-or-absent few-shot is the single biggest quality lever per CLAUDE.md, so add the missing exemplar.

- prompts/examples/general-findings.md: a worked example (binary-search off-by-one that infinite-loops) showing the 7-field generalFinding shape AND, just as important, the red-line: what NOT to report (style/naming nits, "add a test", unfounded speculation, dup of an architecture finding), plus the boundary vs architecture findings.
- prompt.ts: assembleSystemPrompt takes an always-on generalExample, appended in its own block regardless of selected dimensions (generalFindings is requested on every review). Architecture examples relabeled "架构级 Few-shot 范例" (architecture-level few-shot examples).
- prompts.ts: PromptTexts.generalExample, loaded unconditionally (not dimension-gated); lives in the prompt-cached system block so marginal token cost ~= nil.
- DESIGN.md §七: note the niche now ships with its own few-shot (parity with the per-dimension architecture examples).
- tests: new prompt.test.ts (assembleSystemPrompt always-on behavior + user prompt parts); prompts.test.ts loads + red-line check; review.test.ts asserts the exemplar is present regardless of dimension selection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
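A minimal sketch of the always-on behavior described for assembleSystemPrompt follows. The `PromptParts` shape, the function name `assemblePrompt`, and the English block headings are all hypothetical; only the rule "architecture examples are dimension-gated, the general example is not" comes from the commit message.

```typescript
// Hypothetical shapes; the real assembleSystemPrompt in prompt.ts may differ.
interface PromptParts {
  base: string;
  // Few-shot examples keyed by architecture dimension.
  dimensionExamples: Record<string, string | undefined>;
  // Included on every review, regardless of selected dimensions.
  generalExample: string;
}

function assemblePrompt(parts: PromptParts, selected: string[]): string {
  const blocks: string[] = [parts.base];
  const archExamples = selected
    .map((d) => parts.dimensionExamples[d])
    .filter((e): e is string => e !== undefined);
  if (archExamples.length > 0) {
    blocks.push(["## Architecture few-shot examples", ...archExamples].join("\n\n"));
  }
  // Appended unconditionally: general findings are requested on every review.
  blocks.push(["## General-findings few-shot example", parts.generalExample].join("\n\n"));
  return blocks.join("\n\n");
}
```

With an empty dimension selection the architecture block is omitted entirely, while the general example still appears.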

…DESIGN §六.2)

Builds the measurement scaffolding that turns "feels accurate" into precision/recall, and runs the first real A/B. The corpus is 6 REAL BanDB PRs (eval/cases.json, frozen from GitHub), labels anchored on author intent or a maintainer-confirmed fix (e.g. #58's crash-consistency + Stat→Seek race, which BanGD flagged and the maintainer fixed in #63). PR bodies are neutralized so the model detects from code, not from a body that states the answer.

- src/eval/score.ts: pure precision/recall/F1 scorer; fuzzy match on basename + accepted type/category, one-to-one greedy. Fully unit-tested (13 tests).
- src/eval/corpus.ts + eval/cases.json + eval/build-corpus.mjs: the frozen real-PR corpus and its (re-runnable) builder.
- src/eval/run.ts: single-variable A/B (adversarial verify off vs on, all else held constant, fresh client per config to isolate tokens); writes docs/评测报告.md. Key from ANTHROPIC_API_KEY/DEEPSEEK_API_KEY or the gitignored eval/.apikey; npm run eval.
- docs/评测报告.md: the first run (DeepSeek). Honest, counterintuitive result: on the cheap model the adversarial refuters OVER-refute: recall 100%→0% (true findings killed), F1 35.3%→0%, +173% tokens. Empirically backs DESIGN §二 (model choice is the divide) and motivates §六.3 (perspective-diverse refuters instead of homogeneous ones). Real accuracy must be re-run on Opus.

Stacks on the generalFindings branch (eval scores result.generalFindings too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
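The matching rule in the score.ts bullet (basename plus accepted type, one-to-one greedy) can be illustrated as follows. The `Finding` and `Label` field names and the function `greedyMatch` are assumptions for illustration, not the shapes used in src/eval/score.ts, and the file paths in any usage are made up.

```typescript
// Hypothetical shapes; field names are assumptions, not those in src/eval/score.ts.
interface Finding { file: string; type: string; }
interface Label { file: string; acceptedTypes: string[]; }

// Last path segment, so "pkg/store/log.go" and "log.go" compare equal.
const basename = (p: string): string => p.split("/").pop() ?? p;

function greedyMatch(
  findings: Finding[],
  labels: Label[],
): { tp: number; fp: number; fn: number } {
  const used = new Set<number>(); // indices of labels already claimed
  let tp = 0;
  for (const f of findings) {
    // Take the first unclaimed label on the same basename with an accepted type.
    const hit = labels.findIndex(
      (l, i) =>
        !used.has(i) &&
        basename(l.file) === basename(f.file) &&
        l.acceptedTypes.includes(f.type),
    );
    if (hit !== -1) {
      used.add(hit);
      tp += 1;
    }
  }
  // Unmatched findings are false positives; unclaimed labels are false negatives.
  return { tp, fp: findings.length - tp, fn: labels.length - tp };
}
```

The one-to-one constraint matters for precision: two findings that restate the same labeled defect earn one true positive and one false positive, so duplicates cannot inflate the score.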

NeverENG and others added 3 commits May 31, 2026 10:55

NeverENG mentioned this pull request May 31, 2026

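fix: eliminate over-refutation in adversarial verification (A keep the default + B perspective diversification + C refute only on a unanimous vote) #15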

Merged

NeverENG merged commit f8d848f into main May 31, 2026
1 check passed

NeverENG merged 3 commits into main from feat/eval-corpus
