
feat: lead issue titles with the core problem, not just type + file #16

Merged
NeverENG merged 5 commits into main from feat/issue-title-headline
May 31, 2026

Conversation

@NeverENG

Owner

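Architecture findings gain an optional title field (a one-sentence headline of the core problem), so that a filed issue title is upgraded from
🐯 BanGD [并发] storage/zstorage/memtable.go
to
🐯 [并发] 读路径无同步自增 hits 计数 · memtable.go
(roughly "[concurrency] read path increments the hits counter without synchronization"), which makes the problem itself visible at a glance. The title appears both in the finding header and in the PR comment's issue list.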

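title is optional with a graceful fallback (when it is missing, the title falls back to type + full path): a purely cosmetic headline must not be able to fail validation and sink the whole review. The dedup key is still the code-computed pr<N>:<file>:<type>, independent of the title, so repeated reviews are deduplicated as before.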
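The fallback and the title-independent dedup key can be sketched as below. This is a minimal illustration, not the project's formatter: the `ArchitectureFinding` shape, the function names, and the exact fallback string are assumptions; only the `title` field, the new title layout, and the `pr<N>:<file>:<type>` key come from the description above.

```typescript
// Hypothetical finding shape: only `title`, file, and type are named in this PR.
interface ArchitectureFinding {
  file: string;   // full repo-relative path
  type: string;   // finding type, e.g. "并发" (concurrency)
  title?: string; // optional one-line headline of the core problem
}

// Last path segment, e.g. "storage/zstorage/memtable.go" -> "memtable.go".
function basename(path: string): string {
  const parts = path.split("/");
  return parts[parts.length - 1];
}

// Headline first when present; otherwise fall back to type + full path.
// The exact fallback layout is an assumption.
function issueTitle(f: ArchitectureFinding): string {
  const headline = f.title?.trim();
  if (headline) {
    return `🐯 [${f.type}] ${headline} · ${basename(f.file)}`;
  }
  return `🐯 [${f.type}] ${f.file}`;
}

// The dedup key uses code-level facts only, never the model-written title.
function dedupKey(prNumber: number, f: ArchitectureFinding): string {
  return `pr${prNumber}:${f.file}:${f.type}`;
}
```

Because `dedupKey` never reads `title`, two reviews of the same PR that phrase the headline differently still map to the same key.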

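Schema (Zod + JSON-schema mirror, sync test passing), system prompt (≤20 characters, specific and locatable), formatter, refute prompt, the primary few-shot exemplar, and test fixtures are all updated. dist/ has been rebuilt, so the node20 Action picks up the change directly.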

🤖 Generated with Claude Code

NeverENG and others added 5 commits May 31, 2026 10:55
…eralFindings)

Architecture findings are BanGD's differentiator, but emitting only those
cedes ordinary bugs (off-by-one, swallowed errors, nil deref, races) to
Copilot. Add a second result class, generalFindings: concrete diff-evidenced
correctness/logic defects, lightweight (no four-part format), filed inline in the PR
comment (not as tracked issues).

- schema: GeneralFindingSchema + generalFindings (.default([]) for graceful
  degradation), kept in sync with the tool JSON schema by a test
- prompt/system-prompt: 4th output part + quality red-lines (diff-evidenced
  only, no style nits, no dup of architecture findings, <=6, [] when none)
- verify: generalize VerifyOutcome<T>/verifyItems<T>; verifyGeneralFindings
  runs the SAME adversarial majority-refute pass (refuter prompt reframed to
  "is this a real correctness defect")
- review: verify both finding kinds in parallel; ReviewOutcome.droppedGeneralFindings
- format: render general findings inline; omit the section when empty
- action(.yml): general_finding_count output; dropped count covers both kinds
- DESIGN.md §七: why both classes coexist, structural/delivery differences
- tests: rendering, verification, schema-sync, default-omit coverage
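The two degradation behaviors in the list above (a missing `generalFindings` defaults to an empty list, and an empty list renders no section) can be sketched in plain TypeScript. The real code uses a Zod schema with `.default([])`; the types, function names, and section heading here are hypothetical stand-ins.

```typescript
// Hypothetical subset of the generalFinding fields.
interface GeneralFinding {
  file: string;
  summary: string;
}

// Raw model output: generalFindings may be absent.
interface RawReview {
  findings: string[];
  generalFindings?: GeneralFinding[];
}

interface Review {
  findings: string[];
  generalFindings: GeneralFinding[];
}

// Mirrors the effect of a `.default([])` on the schema field:
// a missing list degrades to "no general findings" instead of failing.
function withDefaults(raw: RawReview): Review {
  return { ...raw, generalFindings: raw.generalFindings ?? [] };
}

// Omit the whole section when there is nothing to report.
function renderGeneralSection(items: GeneralFinding[]): string {
  if (items.length === 0) return "";
  const lines = items.map((g) => `- ${g.file}: ${g.summary}`);
  return ["### General findings", ...lines].join("\n");
}
```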

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PR #13 shipped the generalFindings niche described in the system prompt but
with no worked example, while every architecture dimension has one. A wrong-or-
absent few-shot is the single biggest quality lever per CLAUDE.md, so add the
missing exemplar.

- prompts/examples/general-findings.md: a worked example (binary-search off-by-
  one that infinite-loops) showing the 7-field generalFinding shape AND, just as
  important, the red-line — what NOT to report (style/naming nits, "add a test",
  unfounded speculation, dup of an architecture finding), plus the boundary vs
  architecture findings.
- prompt.ts: assembleSystemPrompt takes an always-on generalExample, appended in
  its own block regardless of selected dimensions (generalFindings is requested
  on every review). Architecture examples relabeled "架构级 Few-shot 范例" (architecture-level few-shot examples).
- prompts.ts: PromptTexts.generalExample, loaded unconditionally (not dimension-
  gated); lives in the prompt-cached system block so marginal token cost ~= nil.
- DESIGN.md §七: note the niche now ships with its own few-shot (parity with the
  per-dimension architecture examples).
- tests: new prompt.test.ts (assembleSystemPrompt always-on behavior + user
  prompt parts); prompts.test.ts loads + red-line check; review.test.ts asserts
  the exemplar is present regardless of dimension selection.
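The always-on behavior described above can be sketched as follows. The commit only states that `assembleSystemPrompt` takes a `generalExample` that is appended regardless of the selected dimensions; the parameter list, the `base` and `dimensionExamples` fields, and the second block heading are assumptions made for illustration.

```typescript
// Hypothetical container; only `generalExample` is named in the commit.
interface PromptTexts {
  base: string;                              // core system prompt (assumed)
  dimensionExamples: Record<string, string>; // per-dimension architecture examples (assumed)
  generalExample: string;                    // always-on generalFindings exemplar
}

function assembleSystemPrompt(texts: PromptTexts, selected: string[]): string {
  const blocks: string[] = [texts.base];

  // Architecture examples are gated by the selected dimensions.
  const arch = selected
    .map((d) => texts.dimensionExamples[d])
    .filter((e): e is string => Boolean(e));
  if (arch.length > 0) {
    blocks.push("## 架构级 Few-shot 范例\n\n" + arch.join("\n\n"));
  }

  // The general exemplar is appended in its own block on every review.
  blocks.push("## generalFindings few-shot\n\n" + texts.generalExample);
  return blocks.join("\n\n");
}
```

Keeping the exemplar in the system block, rather than the per-request user prompt, is what lets it sit inside the prompt-cached portion mentioned above.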

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…DESIGN §六.2)

Builds the measurement scaffolding that turns "feels accurate" into
precision/recall, and runs the first real A/B. The corpus is 6 REAL BanDB PRs
(eval/cases.json, frozen from GitHub), labels anchored on author intent or a
maintainer-confirmed fix (e.g. #58's crash-consistency + Stat→Seek race, which
BanGD flagged and the maintainer fixed in #63). PR bodies are neutralized so the
model detects from code, not from a body that states the answer.

- src/eval/score.ts: pure precision/recall/F1 scorer; fuzzy match on basename +
  accepted type/category, one-to-one greedy. Fully unit-tested (13 tests).
- src/eval/corpus.ts + eval/cases.json + eval/build-corpus.mjs: the frozen
  real-PR corpus and its (re-runnable) builder.
- src/eval/run.ts: single-variable A/B (adversarial verify off vs on, all else
  held constant, fresh client per config to isolate tokens); writes
  docs/评测报告.md. Key from ANTHROPIC_API_KEY/DEEPSEEK_API_KEY or the gitignored
  eval/.apikey; npm run eval.
- docs/评测报告.md: the first run (DeepSeek). Honest, counterintuitive result —
  on the cheap model the adversarial refuters OVER-refute: recall 100%→0% (true
  findings killed), F1 35.3%→0%, +173% tokens. Empirically backs DESIGN §二
  (model choice is the divide) and motivates §六.3 (perspective-diverse refuters
  instead of homogeneous ones). Real accuracy must be re-run on Opus.
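A scorer of the kind described for src/eval/score.ts might look like the sketch below: match on file basename plus an accepted type, consume each label at most once (one-to-one greedy), then compute precision, recall, and F1. The type and function names are hypothetical and the real matching rules may differ.

```typescript
interface Finding { file: string; type: string }
interface Label { file: string; acceptedTypes: string[] }
interface Score { tp: number; fp: number; fn: number; precision: number; recall: number; f1: number }

// Compare on the last path segment only, so path prefixes do not matter.
function fileName(path: string): string {
  const parts = path.split("/");
  return parts[parts.length - 1];
}

function score(predicted: Finding[], labels: Label[]): Score {
  const used = new Set<number>(); // labels already matched (one-to-one)
  let tp = 0;
  for (const p of predicted) {
    // Greedy: take the first unused label that matches this prediction.
    const i = labels.findIndex(
      (l, idx) =>
        !used.has(idx) &&
        fileName(l.file) === fileName(p.file) &&
        l.acceptedTypes.includes(p.type),
    );
    if (i >= 0) {
      used.add(i);
      tp++;
    }
  }
  const fp = predicted.length - tp;
  const fn = labels.length - tp;
  const precision = predicted.length === 0 ? 0 : tp / predicted.length;
  const recall = labels.length === 0 ? 0 : tp / labels.length;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, precision, recall, f1 };
}
```

With this definition, perfect recall does not imply a high F1: every extra unmatched prediction lowers precision, which is how a run can show 100% recall alongside a much lower F1.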

Stacks on the generalFindings branch (eval scores result.generalFindings too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…findings)

The first eval run (docs/评测报告.md) exposed a disaster: on a weak refuter
model the verification pass deleted EVERY true finding (recall 100%→0%, F1→0,
+173% tokens). Root cause: the refuter prompt said "default to refuted when
evidence is thin", and N identical refuters echoed the same over-refute. Fix:

- (A) keep-by-default: a refuter rejects a finding only when it can cite concrete
  evidence it is wrong; "unsure" keeps it. Inverts the old uncertain⇒refute bias.
- (B) perspective-diverse lenses: each refuter scrutinizes one angle hardest
  (diff-grounding / misread / can-it-happen) but returns a HOLISTIC verdict, so N
  refuters decorrelate. (Holistic, not per-axis — per-axis refuting would let a
  one-axis FP survive unanimity; DESIGN §六.3.)
- (C) unanimous-to-drop: a finding is dropped only if every refuter rejects it —
  one defender keeps it. Makes killing a real finding hard.
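The change in drop rule can be sketched as follows, contrasting the earlier majority-refute pass with unanimous-to-drop. The `Verdict` values and function names are hypothetical, and treating an empty verdict list as "keep" is an assumption.

```typescript
// Hypothetical verdict values; "unsure" counts as keeping the finding (rule A).
type Verdict = "refuted" | "kept" | "unsure";

// Earlier behavior: drop when more than half of the refuters reject.
function shouldDropMajority(verdicts: Verdict[]): boolean {
  const refuted = verdicts.filter((v) => v === "refuted").length;
  return refuted > verdicts.length / 2;
}

// Rule C: drop only when every refuter rejects; one defender keeps the finding.
// With no verdicts at all, the finding is kept (assumption).
function shouldDropUnanimous(verdicts: Verdict[]): boolean {
  return verdicts.length > 0 && verdicts.every((v) => v === "refuted");
}
```

For the verdicts `["refuted", "refuted", "unsure"]`, the majority rule drops the finding while the unanimous rule keeps it, which is the asymmetry that makes killing a real finding hard.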

Re-run on DeepSeek confirms the fix: recall held 33.3%→33.3% (the true finding
#43 survived verification, vs being killed before); F1 22.2%→40%; +5% tokens.
The report is honest about confounds: part of the apparent precision gain is a
DeepSeek parse-error artifact (#35) and the A/B confounds verify-effect with
generation non-determinism — the clean claim is "over-refute eliminated". Also
surfaced: DeepSeek returned unparseable output 3/12 times (#58 ×2, #35 ×1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Architecture findings gain an optional `title` field — a concise one-line
headline of the core problem — so a filed issue reads
`🐯 [并发] 读路径无同步自增 hits 计数 · memtable.go` instead of the old
`🐯 BanGD [并发] storage/zstorage/memtable.go`. The headline also surfaces in
the finding header and the PR comment's issue list.

`title` is optional with a graceful fallback (type + full path) on the rare
miss: a cosmetic headline must not be able to fail validation and sink an
otherwise-good review. The dedup key stays code-computed (pr<N>:<file>:<type>),
independent of the title, so re-reviews still de-spam correctly.

Schema (Zod + JSON-schema mirror, sync test green), system prompt (≤20-char
specific headline), formatter, refute prompt, the primary few-shot exemplar,
and fixtures all updated. dist/ rebuilt so the node20 Action ships it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NeverENG merged commit b1573eb into main on May 31, 2026
1 check passed