feat: evaluation corpus + scaffolding, with baseline vs. adversarial-verification report (DESIGN §六.2) #14
Merged
…eralFindings)

Architecture findings are BanGD's differentiator, but emitting only those cedes ordinary bugs (off-by-one, swallowed errors, nil deref, races) to Copilot. Add a second result class, generalFindings: concrete, diff-evidenced correctness/logic defects, lightweight (no four-part format), filed inline in the PR comment (not as tracked issues).

- schema: GeneralFindingSchema + generalFindings (.default([]) for graceful degradation), kept in sync with the tool JSON schema by a test
- prompt/system-prompt: 4th output part + quality red-lines (diff-evidenced only, no style nits, no dup of architecture findings, <=6, [] when none)
- verify: generalize VerifyOutcome<T>/verifyItems<T>; verifyGeneralFindings runs the SAME adversarial majority-refute pass (refuter prompt reframed to "is this a real correctness defect")
- review: verify both finding kinds in parallel; ReviewOutcome.droppedGeneralFindings
- format: render general findings inline; omit the section when empty
- action(.yml): general_finding_count output; dropped count covers both kinds
- DESIGN.md §七: why both classes coexist, structural/delivery differences
- tests: rendering, verification, schema-sync, default-omit coverage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
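The generic majority-refute pass named in the verify bullet can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the names VerifyOutcome and verifyItems come from the commit message, but their bodies, the Refuter type, and the three-field GeneralFinding shape are assumptions. Real refuters would presumably be asynchronous model calls; they are synchronous here so the sketch runs offline.

```typescript
// Hypothetical sketch; only the names VerifyOutcome/verifyItems are from the commit.
interface VerifyOutcome<T> {
  kept: T[];
  dropped: T[];
}

// A refuter returns true when it judges the finding to be wrong.
// Assumption: synchronous here; the real ones are likely async LLM calls.
type Refuter<T> = (item: T) => boolean;

function verifyItems<T>(items: T[], refuters: Refuter<T>[]): VerifyOutcome<T> {
  const kept: T[] = [];
  const dropped: T[] = [];
  for (const item of items) {
    const refutations = refuters.filter((refute) => refute(item)).length;
    // Strict majority: drop only when more than half of the refuters object.
    // With zero refuters nothing is dropped, i.e. verification is off.
    if (refutations * 2 > refuters.length) dropped.push(item);
    else kept.push(item);
  }
  return { kept, dropped };
}

// Simplified, hypothetical finding shape (the real one has more fields).
interface GeneralFinding {
  file: string;
  line: number;
  message: string;
}

const candidates: GeneralFinding[] = [
  { file: "search.ts", line: 12, message: "off-by-one: loop never terminates" },
  { file: "util.ts", line: 0, message: "maybe a race" },
];

// Three toy refuters standing in for model votes.
const refuters: Refuter<GeneralFinding>[] = [
  (f) => f.message.startsWith("maybe"), // objects to speculation
  (f) => f.line <= 0,                   // objects to findings with no diff anchor
  () => false,                          // never objects
];

const outcome = verifyItems(candidates, refuters);
console.log(outcome.kept.map((f) => f.file), outcome.dropped.map((f) => f.file));
```

Because verifyItems is generic over T, the same function can serve both architecture findings and general findings; only the refuter prompt (here, the toy predicates) differs.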
PR #13 shipped the generalFindings niche described in the system prompt but with no worked example, while every architecture dimension has one. A wrong-or-absent few-shot is the single biggest quality lever per CLAUDE.md, so add the missing exemplar.

- prompts/examples/general-findings.md: a worked example (binary-search off-by-one that infinite-loops) showing the 7-field generalFinding shape AND, just as important, the red-line: what NOT to report (style/naming nits, "add a test", unfounded speculation, dup of an architecture finding), plus the boundary vs architecture findings.
- prompt.ts: assembleSystemPrompt takes an always-on generalExample, appended in its own block regardless of selected dimensions (generalFindings is requested on every review). Architecture examples relabeled "架构级 Few-shot 范例".
- prompts.ts: PromptTexts.generalExample, loaded unconditionally (not dimension-gated); lives in the prompt-cached system block so marginal token cost ~= nil.
- DESIGN.md §七: note the niche now ships with its own few-shot (parity with the per-dimension architecture examples).
- tests: new prompt.test.ts (assembleSystemPrompt always-on behavior + user prompt parts); prompts.test.ts loads + red-line check; review.test.ts asserts the exemplar is present regardless of dimension selection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
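The always-on behavior described for assembleSystemPrompt could look something like the sketch below. The function name, PromptTexts.generalExample, and the "架构级 Few-shot 范例" label are taken from the commit message; the other fields (base, dimensionExamples), the heading for the general block, and the sample texts are assumptions made only for illustration.

```typescript
// Hypothetical sketch of dimension-gated vs. always-on few-shot assembly.
interface PromptTexts {
  base: string;                              // assumed: the core system prompt
  dimensionExamples: Record<string, string>; // assumed: one example per dimension
  generalExample: string;                    // from the commit: loaded unconditionally
}

function assembleSystemPrompt(texts: PromptTexts, selectedDimensions: string[]): string {
  const parts: string[] = [texts.base];

  // Architecture examples are gated on the dimensions selected for this review.
  const archExamples = selectedDimensions
    .map((d) => texts.dimensionExamples[d])
    .filter((e): e is string => e !== undefined);
  if (archExamples.length > 0) {
    parts.push(["## 架构级 Few-shot 范例", ...archExamples].join("\n\n"));
  }

  // The general-findings example is appended in its own block on every review.
  // The heading text below is an assumption.
  parts.push("## generalFindings few-shot\n\n" + texts.generalExample);
  return parts.join("\n\n");
}

const texts: PromptTexts = {
  base: "You are a code reviewer.",
  dimensionExamples: { layering: "Example: a handler reaching into storage internals." },
  generalExample: "Example: binary search with an off-by-one that loops forever.",
};

const withNoDimensions = assembleSystemPrompt(texts, []);
const withLayering = assembleSystemPrompt(texts, ["layering"]);
console.log(withNoDimensions);
console.log(withLayering);
```

Keeping the general example outside the dimension gate is what lets a test assert that the exemplar is present regardless of which dimensions were selected.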
…DESIGN §六.2)

Builds the measurement scaffolding that turns "feels accurate" into precision/recall, and runs the first real A/B. The corpus is 6 REAL BanDB PRs (eval/cases.json, frozen from GitHub), labels anchored on author intent or a maintainer-confirmed fix (e.g. #58's crash-consistency + Stat→Seek race, which BanGD flagged and the maintainer fixed in #63). PR bodies are neutralized so the model detects from code, not from a body that states the answer.

- src/eval/score.ts: pure precision/recall/F1 scorer; fuzzy match on basename + accepted type/category, one-to-one greedy. Fully unit-tested (13 tests).
- src/eval/corpus.ts + eval/cases.json + eval/build-corpus.mjs: the frozen real-PR corpus and its (re-runnable) builder.
- src/eval/run.ts: single-variable A/B (adversarial verify off vs on, all else held constant, fresh client per config to isolate tokens); writes docs/评测报告.md. Key from ANTHROPIC_API_KEY/DEEPSEEK_API_KEY or the gitignored eval/.apikey; npm run eval.
- docs/评测报告.md: the first run (DeepSeek). Honest, counterintuitive result: on the cheap model the adversarial refuters OVER-refute: recall 100%→0% (true findings killed), F1 35.3%→0%, +173% tokens. Empirically backs DESIGN §二 (model choice is the divide) and motivates §六.3 (perspective-diverse refuters instead of homogeneous ones). Real accuracy must be re-run on Opus.

Stacks on the generalFindings branch (eval scores result.generalFindings too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
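A scorer of the kind described for src/eval/score.ts (basename match plus an accepted-type list, greedy one-to-one pairing) might look roughly like this. It is a sketch under assumptions: the Finding and Expected shapes, the function names, and the file names in the example are hypothetical and not taken from the repository.

```typescript
// Hypothetical sketch of a pure precision/recall/F1 scorer.
interface Finding {
  file: string;
  type: string;
}

interface Expected {
  file: string;
  acceptedTypes: string[]; // any of these types counts as a hit
}

interface Score {
  tp: number;
  fp: number;
  fn: number;
  precision: number;
  recall: number;
  f1: number;
}

// Compare on the last path segment so differing path prefixes still match.
function basename(path: string): string {
  return path.split("/").pop() ?? path;
}

function matches(f: Finding, e: Expected): boolean {
  return basename(f.file) === basename(e.file) && e.acceptedTypes.includes(f.type);
}

function score(findings: Finding[], expected: Expected[]): Score {
  const used = new Set<number>();
  let tp = 0;
  for (const f of findings) {
    // Greedy one-to-one: take the first unused label this finding matches.
    const i = expected.findIndex((e, idx) => !used.has(idx) && matches(f, e));
    if (i >= 0) {
      used.add(i);
      tp++;
    }
  }
  const fp = findings.length - tp;
  const fn = expected.length - tp;
  const precision = findings.length === 0 ? 0 : tp / findings.length;
  const recall = expected.length === 0 ? 0 : tp / expected.length;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, precision, recall, f1 };
}

// Made-up labels and findings, for illustration only.
const expected: Expected[] = [
  { file: "src/wal/log.go", acceptedTypes: ["crash-consistency", "durability"] },
  { file: "src/wal/reader.go", acceptedTypes: ["race"] },
];
const findings: Finding[] = [
  { file: "wal/log.go", type: "durability" },
  { file: "log.go", type: "durability" }, // duplicate: the label is already used, so a false positive
  { file: "reader.go", type: "race" },
  { file: "main.go", type: "race" },      // no matching label: false positive
];

const baseline = score(findings, expected);
const allDropped = score([], expected);
console.log(baseline, allDropped);
```

The second call shows the failure mode reported for the DeepSeek run: when verification drops every finding, the findings list is empty, so recall and F1 both fall to 0 no matter how good the original findings were.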
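What this is
The evaluation corpus and evaluation scaffolding from DESIGN §六.2, which turn "feels accurate" into precision/recall/F1 and produce the first real A/B run.
Corpus (real, trustworthy, leak-proof)
eval/cases.json, frozen from GitHub by eval/build-corpus.mjs.
Results of the first run (DeepSeek; honest and counterintuitive)
Adversarial verification backfires on a cheap model: the DeepSeek refuters over-refute and kill all 3 true findings (recall 100%→0%), F1 drops to zero, and token usage rises by 173%. The data backs DESIGN §二 (model choice is the dividing line: DeepSeek is only fit for smoke tests) and §六.3 (homogeneous refuters are a flaw; refuters should take differentiated perspectives). The full analysis is in docs/评测报告.md.
Action items (in the report)
The default for verify_votes should scale with model strength; on weak models it should be lowered or turned off.
Scaffolding
src/eval/score.ts: pure-function scoring (13 unit tests); run.ts: single-variable A/B with a separate client per configuration to isolate token counts; npm run eval.
The API key is read from eval/.apikey and is never committed.
typecheck / lint / 102 tests all green.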
🤖 Generated with Claude Code