feat: evaluation corpus + scaffolding, with baseline vs. adversarial-verification report (DESIGN §六.2) #14

Merged
NeverENG merged 3 commits into main from feat/eval-corpus
May 31, 2026
@NeverENG

Owner

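What this is

The evaluation corpus and evaluation scaffolding from DESIGN §六.2. It turns "feels accurate" into precision/recall/F1 and produces the first real A/B run.

⚠️ Depends on #13 (feat/general-niche-coverage): the evaluation also scores result.generalFindings, so this branch is stacked on top of it. Recommend merging this PR after #13 lands.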

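Corpus (real, trustworthy, leak-proof)

  • 6 real BanDB PRs (eval/cases.json, frozen from GitHub by eval/build-corpus.mjs).
  • Labels are anchored on the author's stated intent or on a defect the maintainer confirmed in a fix PR. For example, #58's crash-consistency issue and Stat→Seek race are exactly what BanGD originally flagged and the maintainer fixed in #63.
  • Leak prevention: real PR descriptions often name the defect outright (#43 describes itself as "deliberately written in a concurrent style to test BanGD"), so the real descriptions are discarded and replaced with a neutral one-liner; only the diff reaches the model.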

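First-run results (DeepSeek; honest and counterintuitive)

| Config | P | R | F1 | Architecture false positives | Tokens |
| --- | --- | --- | --- | --- | --- |
| baseline (no verification) | 21.4% | 100% | 35.3% | 11 | 74.6k |
| verification, 3 votes per finding | 0% | 0% | 0% | 1 | 203.9k |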

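Adversarial verification backfires on a cheap model: the DeepSeek refuters over-refute and kill all 3 true findings (recall 100%→0%), F1 drops to zero, and the run burns 173% more tokens. This backs DESIGN §二 (model choice is the dividing line: DeepSeek is only fit for smoke tests) and §六.3 (homogeneous refuters are a flaw; refuters should be differentiated by perspective) with data. The full analysis is in docs/评测报告.md.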
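The reported percentages can be cross-checked with a few lines of arithmetic. The sketch below is not the project's scorer; it assumes raw counts that the PR does not state directly (3 true positives and 11 false positives for the baseline, 0 true positives and 1 false positive after verification), chosen because they are consistent with the figures reported above.

```typescript
// Minimal sketch: precision/recall/F1 from raw counts.
// A metric is defined as 0 when its denominator is 0.
interface Counts { tp: number; fp: number; fn: number }
interface Metrics { precision: number; recall: number; f1: number }

function metrics({ tp, fp, fn }: Counts): Metrics {
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;

// ASSUMED counts (not stated in the PR), consistent with the reported figures.
const baseline = metrics({ tp: 3, fp: 11, fn: 0 });
const verified = metrics({ tp: 0, fp: 1, fn: 3 });

// Token overhead of the verified run relative to the baseline (in thousands).
const tokenOverhead = 203.9 / 74.6 - 1;

console.log(pct(baseline.precision), pct(baseline.recall), pct(baseline.f1)); // 21.4% 100.0% 35.3%
console.log(pct(verified.precision), pct(verified.recall), pct(verified.f1)); // 0.0% 0.0% 0.0%
console.log(pct(tokenOverhead)); // 173.3%
```

Under these assumed counts, the baseline precision of 21.4% is 3 / (3 + 11), and the F1 of 35.3% follows from combining that with 100% recall.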

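Action items (in the report)

  • The verify_votes default should scale with model strength; on weak models it should be lowered or turned off.
  • Real accuracy must be re-run on the Opus tier (this report only represents the cheap-endpoint path).
  • Implement perspective-differentiated refuters first, then validate them with an A/B on this corpus.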

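Scaffolding

  • src/eval/score.ts: pure-function scoring (13 unit tests); run.ts: single-variable A/B with a separate client per configuration to isolate token counts; npm run eval.
  • The key is read from env or the gitignored eval/.apikey and is never committed.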
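The key lookup described above can be sketched as a small pure function. This is an illustration only, not the code in run.ts: the function name `resolveApiKey`, the injected `readKeyFile` callback, and the precedence between the two environment variables are all assumptions. The variable names ANTHROPIC_API_KEY and DEEPSEEK_API_KEY come from the commit message.

```typescript
// Hypothetical sketch of "env first, then the gitignored key file".
type Env = Record<string, string | undefined>;

function resolveApiKey(
  env: Env,
  // Injected so the function stays pure; returns undefined if eval/.apikey is absent.
  readKeyFile: () => string | undefined,
): string {
  // ASSUMPTION: Anthropic key wins over DeepSeek key when both are set.
  const candidates = [env["ANTHROPIC_API_KEY"], env["DEEPSEEK_API_KEY"]];
  for (const candidate of candidates) {
    const key = candidate?.trim();
    if (key) return key; // blank or whitespace-only values count as unset
  }
  const fromFile = readKeyFile()?.trim();
  if (fromFile) return fromFile;
  throw new Error(
    "no API key: set ANTHROPIC_API_KEY or DEEPSEEK_API_KEY, or create eval/.apikey",
  );
}
```

In a Node script the callback could wrap a file read of eval/.apikey that returns undefined when the file does not exist; keeping that I/O outside the function makes the precedence logic easy to unit-test.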

typecheck / lint / 102 tests all green.

🤖 Generated with Claude Code

NeverENG and others added 3 commits May 31, 2026 10:55
…eralFindings)

Architecture findings are BanGD's differentiator, but emitting only those
cedes ordinary bugs (off-by-one, swallowed errors, nil deref, races) to
Copilot. Add a second result class, generalFindings: concrete diff-evidenced
correctness/logic defects, lightweight (no four-part structure), filed inline in the PR
comment (not as tracked issues).

- schema: GeneralFindingSchema + generalFindings (.default([]) for graceful
  degradation), kept in sync with the tool JSON schema by a test
- prompt/system-prompt: 4th output part + quality red-lines (diff-evidenced
  only, no style nits, no dup of architecture findings, <=6, [] when none)
- verify: generalize VerifyOutcome<T>/verifyItems<T>; verifyGeneralFindings
  runs the SAME adversarial majority-refute pass (refuter prompt reframed to
  "is this a real correctness defect")
- review: verify both finding kinds in parallel; ReviewOutcome.droppedGeneralFindings
- format: render general findings inline; omit the section when empty
- action(.yml): general_finding_count output; dropped count covers both kinds
- DESIGN.md §七: why both classes coexist, structural/delivery differences
- tests: rendering, verification, schema-sync, default-omit coverage
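The majority-refute rule mentioned in the verify bullet can be sketched independently of any model call. The sketch below is not the project's verifyItems<T>; the names `isRefutedByMajority`, `partitionByVotes`, and `VoteOutcome` are hypothetical, and the refuter responses are assumed to be already collected as booleans (true means "refuted").

```typescript
// Hypothetical sketch: drop an item when a strict majority of ballots refute it.
interface VoteOutcome<T> {
  kept: T[];
  dropped: T[];
}

function isRefutedByMajority(ballots: readonly boolean[]): boolean {
  const refutations = ballots.filter((refuted) => refuted).length;
  // Strict majority; with zero ballots (verification off) nothing is refuted.
  return refutations * 2 > ballots.length;
}

function partitionByVotes<T>(
  items: readonly T[],
  ballotsFor: (item: T) => readonly boolean[],
): VoteOutcome<T> {
  const outcome: VoteOutcome<T> = { kept: [], dropped: [] };
  for (const item of items) {
    const bucket = isRefutedByMajority(ballotsFor(item)) ? outcome.dropped : outcome.kept;
    bucket.push(item);
  }
  return outcome;
}
```

Because the function is generic in T, the same rule applies to architecture findings and general findings alike, which is the point of generalizing the verifier.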

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PR #13 shipped the generalFindings niche described in the system prompt but
with no worked example, while every architecture dimension has one. A wrong-or-
absent few-shot is the single biggest quality lever per CLAUDE.md, so add the
missing exemplar.

- prompts/examples/general-findings.md: a worked example (binary-search off-by-
  one that infinite-loops) showing the 7-field generalFinding shape AND, just as
  important, the red-line — what NOT to report (style/naming nits, "add a test",
  unfounded speculation, dup of an architecture finding), plus the boundary vs
  architecture findings.
- prompt.ts: assembleSystemPrompt takes an always-on generalExample, appended in
  its own block regardless of selected dimensions (generalFindings is requested
  on every review). Architecture examples relabeled "架构级 Few-shot 范例".
- prompts.ts: PromptTexts.generalExample, loaded unconditionally (not dimension-
  gated); lives in the prompt-cached system block so marginal token cost ~= nil.
- DESIGN.md §七: note the niche now ships with its own few-shot (parity with the
  per-dimension architecture examples).
- tests: new prompt.test.ts (assembleSystemPrompt always-on behavior + user
  prompt parts); prompts.test.ts loads + red-line check; review.test.ts asserts
  the exemplar is present regardless of dimension selection.
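The always-on behavior described for assembleSystemPrompt can be illustrated roughly as follows. This is a sketch under assumptions, not the code in prompt.ts: the function name `assembleSystemPromptSketch`, the `PromptPartsSketch` shape, the block separator, and the "General-finding example" label are hypothetical. Only the architecture label "架构级 Few-shot 范例" is taken from the commit message.

```typescript
// Hypothetical sketch: dimension examples are gated, the general example is not.
interface PromptPartsSketch {
  base: string;
  dimensionExamples: Record<string, string>; // keyed by dimension name
  generalExample: string;
}

function assembleSystemPromptSketch(
  parts: PromptPartsSketch,
  selectedDimensions: readonly string[],
): string {
  const blocks: string[] = [parts.base];

  // Architecture examples appear only for the dimensions selected for this review.
  const archExamples = selectedDimensions
    .map((dimension) => parts.dimensionExamples[dimension])
    .filter((example): example is string => example !== undefined);
  if (archExamples.length > 0) {
    blocks.push(["架构级 Few-shot 范例", ...archExamples].join("\n\n"));
  }

  // The general-finding example is appended in its own block on every review.
  blocks.push(["General-finding example", parts.generalExample].join("\n\n"));
  return blocks.join("\n\n");
}
```

Keeping the general example in its own block, after the gated ones, means its presence never depends on which dimensions were selected.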

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…DESIGN §六.2)

Builds the measurement scaffolding that turns "feels accurate" into
precision/recall, and runs the first real A/B. The corpus is 6 REAL BanDB PRs
(eval/cases.json, frozen from GitHub), labels anchored on author intent or a
maintainer-confirmed fix (e.g. #58's crash-consistency + Stat→Seek race, which
BanGD flagged and the maintainer fixed in #63). PR bodies are neutralized so the
model detects from code, not from a body that states the answer.

- src/eval/score.ts: pure precision/recall/F1 scorer; fuzzy match on basename +
  accepted type/category, one-to-one greedy. Fully unit-tested (13 tests).
- src/eval/corpus.ts + eval/cases.json + eval/build-corpus.mjs: the frozen
  real-PR corpus and its (re-runnable) builder.
- src/eval/run.ts: single-variable A/B (adversarial verify off vs on, all else
  held constant, fresh client per config to isolate tokens); writes
  docs/评测报告.md. Key from ANTHROPIC_API_KEY/DEEPSEEK_API_KEY or the gitignored
  eval/.apikey; npm run eval.
- docs/评测报告.md: the first run (DeepSeek). Honest, counterintuitive result —
  on the cheap model the adversarial refuters OVER-refute: recall 100%→0% (true
  findings killed), F1 35.3%→0%, +173% tokens. Empirically backs DESIGN §二
  (model choice is the divide) and motivates §六.3 (perspective-diverse refuters
  instead of homogeneous ones). Real accuracy must be re-run on Opus.
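The matching step described for score.ts (fuzzy match on basename plus accepted type/category, one-to-one greedy) might look roughly like the sketch below. It is not the project's scorer: the `PredictedSketch` and `ExpectedSketch` shapes, their field names, and the function names are hypothetical, and the file paths in the usage are made up for illustration.

```typescript
// Hypothetical sketch of one-to-one greedy matching between findings and labels.
interface PredictedSketch { file: string; kind: string }
interface ExpectedSketch { file: string; acceptedKinds: string[] }

// Compare only the last path segment, so "src/wal/log.go" matches "wal/log.go".
const baseName = (path: string): string => path.split(/[\\/]/).pop() ?? path;

function isMatch(predicted: PredictedSketch, expected: ExpectedSketch): boolean {
  return (
    baseName(predicted.file) === baseName(expected.file) &&
    expected.acceptedKinds.includes(predicted.kind)
  );
}

function greedyCounts(
  predicted: readonly PredictedSketch[],
  expected: readonly ExpectedSketch[],
): { tp: number; fp: number; fn: number } {
  const used = new Set<number>(); // each expected label can be claimed only once
  let tp = 0;
  for (const finding of predicted) {
    const index = expected.findIndex((label, i) => !used.has(i) && isMatch(finding, label));
    if (index !== -1) {
      used.add(index);
      tp += 1;
    }
  }
  return { tp, fp: predicted.length - tp, fn: expected.length - tp };
}

const counts = greedyCounts(
  [
    { file: "src/wal/log.go", kind: "race" },
    { file: "log.go", kind: "race" }, // duplicate hit on an already-claimed label
    { file: "x.go", kind: "style" },
  ],
  [{ file: "wal/log.go", acceptedKinds: ["race", "concurrency"] }],
);
console.log(counts); // { tp: 1, fp: 2, fn: 0 }
```

The one-to-one constraint is what keeps a model from inflating precision by reporting the same defect several times: the second hit on an already-claimed label counts as a false positive.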

Stacks on the generalFindings branch (eval scores result.generalFindings too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NeverENG NeverENG merged commit f8d848f into main May 31, 2026
1 check passed