feat: lead issue titles with the core problem, not just type + file #16
Merged
…eralFindings)

Architecture findings are BanGD's differentiator, but emitting only those cedes ordinary bugs (off-by-one, swallowed errors, nil deref, races) to Copilot. Add a second result class, generalFindings: concrete, diff-evidenced correctness/logic defects that are lightweight (no four-section write-up) and filed inline in the PR comment (not as tracked issues).

- schema: GeneralFindingSchema + generalFindings (.default([]) for graceful degradation), kept in sync with the tool JSON schema by a test
- prompt/system-prompt: 4th output part + quality red-lines (diff-evidenced only, no style nits, no dup of architecture findings, <=6, [] when none)
- verify: generalize VerifyOutcome<T>/verifyItems<T>; verifyGeneralFindings runs the SAME adversarial majority-refute pass (refuter prompt reframed to "is this a real correctness defect")
- review: verify both finding kinds in parallel; ReviewOutcome.droppedGeneralFindings
- format: render general findings inline; omit the section when empty
- action(.yml): general_finding_count output; dropped count covers both kinds
- DESIGN.md §七: why both classes coexist, structural/delivery differences
- tests: rendering, verification, schema-sync, default-omit coverage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
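The graceful-degradation and omit-when-empty behavior described above can be sketched roughly as follows. This is a minimal illustration without Zod, not the project's actual schema: `normalizeResult`, `renderGeneralSection`, the `findings` field name, and the section heading are hypothetical, and finding contents are left opaque because the commit message does not list the fields.

```typescript
// Sketch only: finding contents stay opaque (unknown) because the real
// GeneralFindingSchema fields are not listed in the commit message.
interface ReviewResult {
  findings: unknown[];        // architecture findings (field name assumed)
  generalFindings: unknown[]; // lightweight correctness findings
}

// Mirrors the effect of `.default([])`: a missing generalFindings key
// becomes an empty array instead of failing validation.
function normalizeResult(raw: Record<string, unknown>): ReviewResult {
  const findings = raw["findings"];
  if (!Array.isArray(findings)) throw new Error("findings must be an array");

  const general = raw["generalFindings"];
  if (general === undefined) return { findings, generalFindings: [] };
  if (!Array.isArray(general)) throw new Error("generalFindings must be an array");
  return { findings, generalFindings: general };
}

// Renders pre-formatted finding lines; returns "" so the caller can omit
// the whole section when there is nothing to report.
function renderGeneralSection(lines: string[]): string {
  if (lines.length === 0) return "";
  return ["### General findings"].concat(lines.map((l) => `- ${l}`)).join("\n");
}
```

With this shape, output from a model that never emits `generalFindings` still parses, and the PR comment simply has no general-findings section.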
PR #13 shipped the generalFindings niche described in the system prompt but with no worked example, while every architecture dimension has one. A wrong-or-absent few-shot is the single biggest quality lever per CLAUDE.md, so add the missing exemplar.

- prompts/examples/general-findings.md: a worked example (binary-search off-by-one that infinite-loops) showing the 7-field generalFinding shape AND, just as important, the red-line: what NOT to report (style/naming nits, "add a test", unfounded speculation, dup of an architecture finding), plus the boundary vs architecture findings.
- prompt.ts: assembleSystemPrompt takes an always-on generalExample, appended in its own block regardless of selected dimensions (generalFindings is requested on every review). Architecture examples relabeled "架构级 Few-shot 范例".
- prompts.ts: PromptTexts.generalExample, loaded unconditionally (not dimension-gated); lives in the prompt-cached system block so marginal token cost ~= nil.
- DESIGN.md §七: note the niche now ships with its own few-shot (parity with the per-dimension architecture examples).
- tests: new prompt.test.ts (assembleSystemPrompt always-on behavior + user prompt parts); prompts.test.ts loads + red-line check; review.test.ts asserts the exemplar is present regardless of dimension selection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
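The "always-on" assembly described above might look something like the sketch below. Only `assembleSystemPrompt`, `PromptTexts.generalExample`, and the "架构级 Few-shot 范例" label come from the commit message; the real signature is not shown there, so the `base` and `dimensionExamples` fields and the second heading are assumptions made for illustration.

```typescript
interface PromptTexts {
  base: string;                              // hypothetical: core system prompt
  dimensionExamples: Record<string, string>; // hypothetical: per-dimension few-shots
  generalExample: string;                    // named in the commit message
}

function assembleSystemPrompt(texts: PromptTexts, selected: string[]): string {
  const blocks: string[] = [texts.base];

  // Architecture examples are dimension-gated: only selected ones are included.
  const arch: string[] = [];
  for (const dim of selected) {
    const example = texts.dimensionExamples[dim];
    if (example !== undefined) arch.push(example);
  }
  if (arch.length > 0) {
    blocks.push("## 架构级 Few-shot 范例\n" + arch.join("\n\n"));
  }

  // The general example is appended in its own block no matter what was selected.
  blocks.push("## General findings few-shot\n" + texts.generalExample);
  return blocks.join("\n\n");
}
```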
…DESIGN §六.2)

Builds the measurement scaffolding that turns "feels accurate" into precision/recall, and runs the first real A/B. The corpus is 6 REAL BanDB PRs (eval/cases.json, frozen from GitHub), with labels anchored on author intent or a maintainer-confirmed fix (e.g. #58's crash-consistency + Stat→Seek race, which BanGD flagged and the maintainer fixed in #63). PR bodies are neutralized so the model detects from code, not from a body that states the answer.

- src/eval/score.ts: pure precision/recall/F1 scorer; fuzzy match on basename + accepted type/category, one-to-one greedy. Fully unit-tested (13 tests).
- src/eval/corpus.ts + eval/cases.json + eval/build-corpus.mjs: the frozen real-PR corpus and its (re-runnable) builder.
- src/eval/run.ts: single-variable A/B (adversarial verify off vs on, all else held constant, fresh client per config to isolate tokens); writes docs/评测报告.md. The key comes from ANTHROPIC_API_KEY/DEEPSEEK_API_KEY or the gitignored eval/.apikey; run with npm run eval.
- docs/评测报告.md: the first run (DeepSeek). Honest, counterintuitive result: on the cheap model the adversarial refuters OVER-refute. Recall 100%→0% (true findings killed), F1 35.3%→0%, +173% tokens. Empirically backs DESIGN §二 (model choice is the divide) and motivates §六.3 (perspective-diverse refuters instead of homogeneous ones). Real accuracy must be re-run on Opus.

Stacks on the generalFindings branch (eval scores result.generalFindings too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
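To make the scoring rule concrete, here is a minimal sketch of a one-to-one greedy matcher on basename plus accepted type, followed by precision/recall/F1. It is not the contents of src/eval/score.ts: the `Label` and `Prediction` shapes and the `scoreFindings` name are assumptions, and the zero-denominator convention (report 0) is a choice made here.

```typescript
interface Label { file: string; acceptedTypes: string[]; } // hypothetical shape
interface Prediction { file: string; type: string; }       // hypothetical shape
interface Score { tp: number; fp: number; fn: number; precision: number; recall: number; f1: number; }

// Compare files by basename so "a/b/x.go" matches "x.go".
function basename(path: string): string {
  const parts = path.split("/");
  return parts[parts.length - 1];
}

function scoreFindings(predictions: Prediction[], labels: Label[]): Score {
  const used: boolean[] = labels.map(() => false);
  let tp = 0;
  for (const p of predictions) {
    // Greedy: take the first unused label that matches; each label matches at most once.
    for (let i = 0; i < labels.length; i++) {
      const l = labels[i];
      if (!used[i] && basename(l.file) === basename(p.file) && l.acceptedTypes.indexOf(p.type) !== -1) {
        used[i] = true;
        tp++;
        break;
      }
    }
  }
  const fp = predictions.length - tp;
  const fn = labels.length - tp;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, precision, recall, f1 };
}
```

Because matching is one-to-one, two predictions that hit the same labeled defect count as one true positive and one false positive, so duplicates cannot inflate precision.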
…findings)

The first eval run (docs/评测报告.md) exposed a disaster: on a weak refuter model the verification pass deleted EVERY true finding (recall 100%→0%, F1→0, +173% tokens). Root cause: the refuter prompt said "default to refuted when evidence is thin", and N identical refuters echoed the same over-refute.

Fix:

- (A) keep-by-default: a refuter rejects a finding only when it can cite concrete evidence it is wrong; "unsure" keeps it. Inverts the old uncertain⇒refute bias.
- (B) perspective-diverse lenses: each refuter scrutinizes one angle hardest (diff-grounding / misread / can-it-happen) but returns a HOLISTIC verdict, so N refuters decorrelate. (Holistic, not per-axis: per-axis refuting would let a one-axis FP survive unanimity; DESIGN §六.3.)
- (C) unanimous-to-drop: a finding is dropped only if every refuter rejects it; one defender keeps it. Makes killing a real finding hard.

Re-run on DeepSeek confirms the fix: recall held 33.3%→33.3% (the true finding #43 survived verification, vs being killed before); F1 22.2%→40%; +5% tokens. The report is honest about confounds: part of the apparent precision gain is a DeepSeek parse-error artifact (#35), and the A/B confounds verify-effect with generation non-determinism, so the clean claim is "over-refute eliminated". Also surfaced: DeepSeek returned unparseable output 3/12 times (#58 ×2, #35 ×1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
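The difference between the earlier majority rule and rules (A) and (C) can be illustrated with a small sketch. Everything here is hypothetical apart from the three lens names: the real verifier presumably calls a model asynchronously, whereas this version takes a synchronous `refute` callback so the decision logic can be read in isolation.

```typescript
type Verdict = "refuted" | "kept" | "unsure";

// One lens per refuter, as named in the commit message.
const LENSES: string[] = ["diff-grounding", "misread", "can-it-happen"];

// Earlier rule (for contrast): drop when more than half of the refuters reject.
function majorityDrop(verdicts: Verdict[]): boolean {
  const refuted = verdicts.filter((v) => v === "refuted").length;
  return refuted * 2 > verdicts.length;
}

// (A) + (C): "unsure" counts as keep, and a single non-refuting verdict saves the finding.
function unanimousDrop(verdicts: Verdict[]): boolean {
  return verdicts.length > 0 && verdicts.every((v) => v === "refuted");
}

// Generic over the finding type so both finding kinds can share the pass.
function verifyUnanimous<T>(
  items: T[],
  refute: (item: T, lens: string) => Verdict, // hypothetical callback
): { kept: T[]; dropped: T[] } {
  const kept: T[] = [];
  const dropped: T[] = [];
  for (const item of items) {
    const verdicts = LENSES.map((lens) => refute(item, lens));
    if (unanimousDrop(verdicts)) dropped.push(item);
    else kept.push(item);
  }
  return { kept, dropped };
}
```

With verdicts `["refuted", "refuted", "unsure"]`, `majorityDrop` removes the finding while `unanimousDrop` keeps it, which is the behavior change the fix is after.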
Architecture findings gain an optional `title` field (a concise one-line headline of the core problem), so a filed issue reads `🐯 [并发] 读路径无同步自增 hits 计数 · memtable.go` (roughly "[concurrency] read path increments the hits counter without synchronization") instead of the old `🐯 BanGD [并发] storage/zstorage/memtable.go`. The headline also surfaces in the finding header and the PR comment's issue list.

`title` is optional with a graceful fallback (type + full path) on the rare miss: a cosmetic headline must not be able to fail validation and sink an otherwise-good review. The dedup key stays code-computed (pr<N>:<file>:<type>), independent of the title, so re-reviews still de-spam correctly.

Schema (Zod + JSON-schema mirror, sync test green), system prompt (≤20-char specific headline), formatter, refute prompt, the primary few-shot exemplar, and fixtures all updated. dist/ rebuilt so the node20 Action ships it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
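A minimal sketch of the title fallback and the title-independent dedup key follows. The `Finding` shape and both function names are hypothetical; the two title formats and the key format are taken from the examples in the commit message, and it is assumed here that the fallback is identical to the old title format.

```typescript
interface Finding {
  type: string;   // e.g. "并发"
  file: string;   // full path
  title?: string; // optional headline; may be missing or blank
}

function basename(path: string): string {
  const parts = path.split("/");
  return parts[parts.length - 1];
}

// Headline first when present; otherwise fall back to type + full path.
function issueTitle(f: Finding): string {
  const headline = f.title?.trim();
  return headline
    ? `🐯 [${f.type}] ${headline} · ${basename(f.file)}`
    : `🐯 BanGD [${f.type}] ${f.file}`;
}

// Computed from code-level facts only, so a reworded title on a
// re-review still maps to the same issue.
function dedupKey(prNumber: number, f: Finding): string {
  return `pr${prNumber}:${f.file}:${f.type}`;
}
```

Keeping the headline out of the key matters because model-written titles can vary from run to run, while the PR number, file, and type do not.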
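Architecture findings gain an optional `title` field (a one-sentence headline of the core problem), so a filed issue's title is upgraded from `🐯 BanGD [并发] storage/zstorage/memtable.go` to `🐯 [并发] 读路径无同步自增 hits 计数 · memtable.go`, making the problem itself visible at a glance. The title also appears in the finding header and in the PR comment's issue list.

`title` is optional with a graceful fallback (type + full path when it is missing): a purely cosmetic title must not be able to fail validation and sink the whole review. The dedup key is still the code-computed `pr<N>:<file>:<type>`, independent of the title, so repeated reviews are deduplicated as before. Schema (Zod + JSON-schema mirror, sync test passing), system prompt (≤20 characters, specific and locatable), formatter, refute prompt, the primary few-shot exemplar, and test fixtures are all updated.

`dist/` has been rebuilt, so the node20 Action picks it up directly.

🤖 Generated with Claude Code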