feat(cairo-auditor): quality + cost improvements for deep mode (9 changes)#447
feat(cairo-auditor): quality + cost improvements for deep mode (9 changes)#447adrienlacombe wants to merge 1 commit into
Conversation
Implements nine reviewer suggestions across the cairo-auditor skill. Quality: - #3 Component resolution: shared `component_resolution.py` resolves embedded OZ Ownable/AccessControl/Upgradeable surfaces; wired into the surface map, the FP gate (judging.md), vuln-db notes, and the canonical detectors. Closes the IRREVOCABLE_ADMIN / CRITICAL_ADDRESS_INIT external-triage FP family. Adds a unit test and an AccessControl-rotation regression case. - #4 Cross-partition pass: adversarial agent explicitly hunts boundary-spanning chains; [CROSS-AGENT] semantics clarified (vector+adversarial corroboration). - #1 Recall + taxonomy coverage eval against 217 independent real-audit findings (recall_eval_cairo_auditor.py) — replaces the tautological "recall 1.0". Current coverage is an honest ~13% (0% on Critical), making the gap visible. - #2 Default-vs-deep A/B harness (ab_default_vs_deep.py) quantifying whether Agent 5 earns its premium cost. Cost: - #5 Shared cacheable bundle prefix so prompt-caching hosts reuse source across vector agents instead of 4x re-ingestion. - #6 Relevance-routed source per partition for large scopes (surface_map routing). - #7 Complexity-gated Agent 5 (skip/downgrade/full) via deterministic signal. - #8 Density-based model escalation (escalation_plan.py): opus only for files with first-pass signal. Housekeeping: - #9 Retired the redundant synthetic benchmark pack (duplicated the real-world pack); real-world pack is now the single deterministic regression gate. - Wired new tests/evals into ci.yml and cairo-skills-full-evals.yml. All local suites pass (preflight, deep smoke, component resolution, harness unittests, realworld benchmark 1.0/1.0, recall eval). VERSION intentionally not bumped — the release-hygiene gate requires a maintainer tag/release. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WalkthroughThis PR implements a comprehensive evaluation and cost-optimization overhaul for the Cairo Auditor skill. It migrates from synthetic to real-world deterministic benchmarks, introduces component-resolution-based false-positive mitigation, adds new evaluation harnesses for recall and A/B comparison, and enables cost-optimized deep-mode orchestration via routing and escalation planning. ChangesCairo Auditor Evaluation & False-Positive Mitigation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (2 errors, 1 warning)
✅ Passed checks (7 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Review Summary by QodoQuality and cost improvements for cairo-auditor deep mode with component resolution and recall evaluation
WalkthroughsDescription• Implements nine quality and cost improvements for cairo-auditor deep mode • Adds component resolution module to eliminate false positives on embedded OZ components • Introduces real-world recall evaluation against 217 independent audit findings • Adds default-vs-deep A/B comparison harness to quantify Agent 5 value • Implements cost controls: shared cached prefix, relevance routing, complexity gating, density escalation • Retires redundant synthetic benchmark; real-world pack is now single regression gate Diagramflowchart LR
A["Quality Improvements"] --> B["Component Resolution"]
A --> C["Real Recall Eval"]
A --> D["Default vs Deep A/B"]
A --> E["Cross-Partition Pass"]
F["Cost Controls"] --> G["Shared Cached Prefix"]
F --> H["Relevance Routing"]
F --> I["Complexity Gating"]
F --> J["Density Escalation"]
K["Housekeeping"] --> L["Retire Synthetic Benchmark"]
K --> M["Update CI/Evals"]
B --> N["FP Gate Integration"]
C --> O["Taxonomy Coverage"]
D --> P["Verdict Logic"]
File Changes1. scripts/quality/recall_eval_cairo_auditor.py
|
Code Review by Qodo
Context used 1. Escalation plan wrong targets
|
| labels = routing.get("partition_labels", {}) | ||
| if isinstance(partition_files, dict): | ||
| lines += ["| Partition | Lens | Relevant files |", "|---|---|---|"] | ||
| for idx in sorted(partition_files, key=lambda x: int(x)): |
| for drop in data.get("dropped_candidates", []) if isinstance(data.get("dropped_candidates"), list) else []: | ||
| if not isinstance(drop, dict): | ||
| continue | ||
| if str(drop.get("drop_reason")) in BORDERLINE_DROP_REASONS: | ||
| # Dropped candidates rarely carry a file; attribute to the agent | ||
| # scope so the orchestrator re-checks broadly when borderline. | ||
| mark(str(drop.get("candidate", "")) or "(scope)", "borderline candidate in first pass") | ||
|
|
There was a problem hiding this comment.
1. Escalation plan wrong targets 🐞 Bug ≡ Correctness
In escalation_plan.build_plan(), borderline dropped candidates are added to the escalation set using dropped_candidates[].candidate as if it were a file path, so escalate_files can contain arbitrary titles/trace text instead of real files. This breaks the documented contract that the plan lists files to re-check with the adversarial model and can cause downstream consumers to fail when they treat these entries as paths.
Agent Prompt
## Issue description
`skills/cairo-auditor/scripts/quality/escalation_plan.py` uses `drop["candidate"]` as the `file_path` key when a dropped candidate is borderline. But `candidate` is not a file path (it’s a human label/title), so the generated `escalate_files` list becomes invalid and cannot reliably drive file-based re-check routing.
## Issue Context
- The script is documented to “list the files that surfaced a finding … or a borderline candidate” for targeted adversarial re-check.
- Agent output normalization treats `dropped_candidates[].candidate` as a string label, not a path.
## Fix Focus Areas
- skills/cairo-auditor/scripts/quality/escalation_plan.py[39-85]
## Suggested fix
- For borderline drops:
- Prefer a real file field if present (e.g., `drop.get("file")` if the upstream schema ever includes it).
- Otherwise, **do not** use `candidate` as a file key. Instead either:
1) mark a sentinel like `"(scope)"` (constant) and add a top-level flag like `"escalate_scope": true`, or
2) escalate all files already seen in first-pass findings (`seen_files`) by calling `mark(f, ...)` for each `f in seen_files`.
- Ensure `escalate_files` contains only valid repo-relative file paths (or a single, well-defined sentinel/flag that downstream tooling can interpret), never arbitrary candidate strings.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/quality/ab_default_vs_deep.py`:
- Around line 44-54: The current dedup logic using _key and dict comprehensions
(default_by_key / deep_by_key) overwrites duplicate findings with the same key
and undercounts differences; change it to preserve duplicates by making the key
include stable location fields even when root_cause exists (e.g., include
class_id/file/title/line or a stable location tuple) or switch from dicts to
multisets/lists (e.g., map key -> list of findings or use collections.Counter
with a deterministic serialized finding) so that comparisons operate on counts
of identical findings rather than single entries; update usages that compute
only_default/only_deep to consume the multiset/list counts accordingly.
In `@scripts/quality/benchmark_cairo_auditor.py`:
- Around line 32-42: The current _load_component_resolution function silently
returns None on any import/runtime error which masks broken in-repo helpers;
change the behavior so that only a genuinely missing file/spec causes a None
fallback, but any exception thrown during spec.loader.exec_module(module) is
propagated (or re-raised) with contextual information (include module_path and a
short message) so CI fails fast; apply the same change to the other loader block
referenced at lines 48-63 (the alternate component_resolution import logic) so
both loaders only swallow absence but not execution/import errors.
In `@scripts/quality/recall_eval_cairo_auditor.py`:
- Around line 107-117: Malformed JSON and detector exceptions are currently
being swallowed in the loop that reads findings (the block using findings_dir,
Path(fp).read_text, json.loads and findings.append), which changes the
denominator silently; instead, stop silently skipping: when json.loads fails or
when a detector throws, raise an informative exception (or aggregate and raise
after scanning) that includes the offending file path (fp), the raw line content
(or line number), and the parser/detector exception/class name so the CI fails
loudly; update the JSON parse except block to raise (or append to an errors
list) with fp and line, and ensure the detector-exception-to-miss handling (the
similar logic at lines ~141-146) is changed to surface the offending
finding/detector class and error rather than converting to a miss.
In `@skills/cairo-auditor/README.md`:
- Line 503: Update the real-world corpus case count in the README table: find
the table row containing the text "| Real-world corpus | 42 | 1.000 | 1.000 |
[v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md)
|" and change the "42" to "43" to match the v0.2.0 scorecard (which added
rw_irrevocable_admin_safe_accesscontrol_mixin_grant_role_03).
In `@skills/cairo-auditor/scripts/quality/component_resolution.py`:
- Around line 88-97: The code currently collects all
"<receiver>.initializer(...)" occurrences into nonzero_initializers and treats
any as guarded when ownable or access_control are present; change this so only
initializer calls on the actual Ownable/AccessControl storage receivers are
considered. In resolve_components() narrow the set comprehension that builds
nonzero_initializers to include only match.group(1) values that match the
identified ownable/access_control receiver names (e.g., the variables that
signal OwnableComponent/AccessControlComponent), and update
initializer_guards_param() to check against that filtered set (not just the
booleans enforces_nonzero_on_init or any ".initializer" occurrence). Also apply
the same receiver-scoped filtering to the related logic around
upgrade_nonzero_guard (lines referenced 121-127) so only initializer calls on
the concrete upgradeable/ownable/access_control receivers suppress the
CRITICAL_ADDRESS_INIT_WITHOUT_NONZERO_GUARD.
In `@skills/cairo-auditor/scripts/quality/escalation_plan.py`:
- Around line 39-84: The build_plan currently escalates any file referenced in
agent outputs; update it to accept an optional scope set (e.g., add parameter
scope_files: Optional[set[str]] = None to build_plan) and validate escalated
entries against that set before returning results. Specifically, either have
mark() ignore file_path values not present in scope_files (treat "(scope)"
specially) or, after populating escalate, filter escalate_files/escalate_detail
to only include keys present in scope_files when scope_files is provided; keep
functions build_plan and mark as the touch points to implement this check so
out-of-scope references are not propagated to model_for_escalated.
In `@skills/cairo-auditor/scripts/quality/surface_map.py`:
- Around line 346-348: Update the inline comment that begins "Interaction
followed by state mutation in the same file is the classic cross-function / CEI
surface..." to expand the CEI acronym to "Checks-Effects-Interactions (CEI)" so
readers unfamiliar with the abbreviation understand the reference; keep the rest
of the comment unchanged and ensure the expanded form appears the first time CEI
is used in that comment.
In `@skills/cairo-auditor/SKILL.md`:
- Around line 302-306: The doc mentions "relevance-routed source per
`cairo-audit-routing.json`" and `partition_files["N"]` but doesn't show how the
orchestrator applies that routing; either add a forward reference to the
detailed substitution logic in `workflows/deep.md` (e.g., "See deep.md Cost
Controls → Relevance-routed source for substitution logic") or include a short
example in SKILL.md explaining how to build a per-agent filtered prefix by
reading `cairo-audit-routing.json`'s `partition_files` entry and replacing the
shared prefix with only those files for agent N (reference
`cairo-audit-routing.json`, `partition_files`, and the orchestrator/agent prefix
substitution step).
In `@skills/cairo-auditor/workflows/deep.md`:
- Around line 62-63: Clarify whether routing.json.complexity.threshold is
configurable: if it is adjustable, update the docs to state the configuration
mechanism (e.g., which CLI flag or config key to set the threshold, expected
value range, and where to place it) and show an example usage; if it is
intentionally fixed at 3, add a short rationale line near
routing.json.complexity explaining why the threshold is fixed and that
adversarial_action behavior depends on that hard-coded value. Ensure you
reference routing.json.complexity, the threshold value, and adversarial_action
in the added sentence so readers can find the relevant behavior.
In `@skills/cairo-auditor/workflows/default.md`:
- Around line 14-16: The sentence referencing "The shared cacheable bundle
prefix (`#5`)" uses a non-existent numeric cross-reference; locate the line
containing that exact phrase in workflows/default.md and replace "(`#5`)" with a
proper cross-reference such as a section title or markdown link (e.g., "(see
Shared cacheable bundle prefix)" or a link to the relevant header) so the
reference resolves correctly; ensure you update the phrase "The shared cacheable
bundle prefix (`#5`)" to read "The shared cacheable bundle prefix (see Shared
cacheable bundle prefix)" or similar, preserving the rest of the sentence.
- Around line 17-19: Update the sentence that mentions "cross-partition chains"
and "boundary-spanning composition" to include a brief parenthetical example
clarifying the terms for new readers (e.g., "(root causes spanning two vector
lenses, e.g., access control + reentrancy)"). Locate the phrase "cross-partition
chains" / "boundary-spanning composition" in the default mode description and
insert the parenthetical immediately after "cross-partition chains" so the
sentence reads clearly and preserves the recommendation to use `deep` mode.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2a91bfd6-1efc-4b96-a7ab-89570514a79a
📒 Files selected for processing (27)
.github/workflows/cairo-skills-full-evals.yml.github/workflows/ci.ymlevals/README.mdevals/cases/cairo_auditor_benchmark.jsonlevals/cases/cairo_auditor_realworld_benchmark.jsonlevals/scorecards/v0.2.0-cairo-auditor-benchmark.mdevals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.mdevals/scorecards/v0.2.0-cairo-auditor-recall-coverage.jsonevals/scorecards/v0.2.0-cairo-auditor-recall-coverage.mdscripts/quality/ab_default_vs_deep.pyscripts/quality/benchmark_cairo_auditor.pyscripts/quality/recall_eval_cairo_auditor.pyscripts/quality/test_cairo_auditor_quality.pyskills/cairo-auditor/README.mdskills/cairo-auditor/SKILL.mdskills/cairo-auditor/agents/adversarial.mdskills/cairo-auditor/references/checklists/release-gate.mdskills/cairo-auditor/references/judging.mdskills/cairo-auditor/references/report-formatting.mdskills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.mdskills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.mdskills/cairo-auditor/scripts/quality/component_resolution.pyskills/cairo-auditor/scripts/quality/escalation_plan.pyskills/cairo-auditor/scripts/quality/surface_map.pyskills/cairo-auditor/tests/validate_component_resolution.pyskills/cairo-auditor/workflows/deep.mdskills/cairo-auditor/workflows/default.md
💤 Files with no reviewable changes (2)
- evals/cases/cairo_auditor_benchmark.jsonl
- evals/scorecards/v0.2.0-cairo-auditor-benchmark.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (6)
skills/**/
📄 CodeRabbit inference engine (AGENTS.md)
skills/**/: Runpython3 scripts/quality/validate_skills.pyandpython3 scripts/skills_manifest.py --checkfor skills/install UX changes
Runpython3 scripts/quality/check_codex_distribution.pyandpython3 -m unittest scripts/quality/test_codex_distribution.pyfor skills/install UX changes
Runpython3 scripts/quality/validate_marketplace.pyfor skills/install UX changes
Files:
skills/cairo-auditor/references/checklists/release-gate.mdskills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.mdskills/cairo-auditor/references/report-formatting.mdskills/cairo-auditor/README.mdskills/cairo-auditor/scripts/quality/escalation_plan.pyskills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.mdskills/cairo-auditor/workflows/deep.mdskills/cairo-auditor/tests/validate_component_resolution.pyskills/cairo-auditor/SKILL.mdskills/cairo-auditor/workflows/default.mdskills/cairo-auditor/references/judging.mdskills/cairo-auditor/scripts/quality/component_resolution.pyskills/cairo-auditor/agents/adversarial.mdskills/cairo-auditor/scripts/quality/surface_map.py
skills/**
⚙️ CodeRabbit configuration file
skills/**: Skills are composable agent capabilities. Each skill must have a clear interface contract, proper input validation, deterministic output, and graceful error handling. Flag any skill that makes unscoped network calls, stores state without cleanup, or lacks proper type definitions. Verify skill metadata matches implementation.
Files:
skills/cairo-auditor/references/checklists/release-gate.mdskills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.mdskills/cairo-auditor/references/report-formatting.mdskills/cairo-auditor/README.mdskills/cairo-auditor/scripts/quality/escalation_plan.pyskills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.mdskills/cairo-auditor/workflows/deep.mdskills/cairo-auditor/tests/validate_component_resolution.pyskills/cairo-auditor/SKILL.mdskills/cairo-auditor/workflows/default.mdskills/cairo-auditor/references/judging.mdskills/cairo-auditor/scripts/quality/component_resolution.pyskills/cairo-auditor/agents/adversarial.mdskills/cairo-auditor/scripts/quality/surface_map.py
evals/**
⚙️ CodeRabbit configuration file
evals/**: Evaluation code must be deterministic and reproducible. Flag any test that depends on external state, non-deterministic ordering, or hardcoded network endpoints. Verify that eval metrics are properly documented and assertions are meaningful.
Files:
evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.mdevals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.mdevals/scorecards/v0.2.0-cairo-auditor-recall-coverage.jsonevals/README.mdevals/cases/cairo_auditor_realworld_benchmark.jsonl
scripts/**
⚙️ CodeRabbit configuration file
scripts/**: Flag any script that runs with elevated privileges without justification. Verify input sanitization in deployment scripts. Flag hardcoded credentials, URLs, or environment-specific assumptions.
Files:
scripts/quality/test_cairo_auditor_quality.pyscripts/quality/ab_default_vs_deep.pyscripts/quality/benchmark_cairo_auditor.pyscripts/quality/recall_eval_cairo_auditor.py
.github/workflows/**
📄 CodeRabbit inference engine (Custom checks)
.github/workflows/**: For changes in .github/workflows/, flag any new or expanded permissions
For changes in .github/workflows/, flag unpinned action versions
For changes in .github/workflows/**, flag secrets referenced without need-to-know scope
Files:
.github/workflows/ci.yml.github/workflows/cairo-skills-full-evals.yml
⚙️ CodeRabbit configuration file
.github/workflows/**: Ensure least-privilege permissions and safe CI execution patterns. Flag any workflow with write permissions that doesn't strictly need them. Verify secrets are not exposed in logs or artifacts. Dependency actions must use pinned SHA versions, not tags.
Files:
.github/workflows/ci.yml.github/workflows/cairo-skills-full-evals.yml
skills/**/SKILL.md
📄 CodeRabbit inference engine (CLAUDE.md)
skills/**/SKILL.md: Skills YAML frontmatter must include:name,description,keywords,allowed-tools,user-invocable
Skill name format: lowercase, hyphens only, 1-64 characters
Skills should include code examples with starknet.js patterns
List error codes with recovery steps in skills
Files:
skills/cairo-auditor/SKILL.md
🪛 Ruff (0.15.15)
scripts/quality/test_cairo_auditor_quality.py
[warning] 16-16: Missing return type annotation for private function _load
(ANN202)
[warning] 18-18: Assertion should be broken down into multiple parts
Break down assertion into multiple parts
(PT018)
[warning] 29-29: Missing return type annotation for private function _finding
(ANN202)
[warning] 29-29: Missing type annotation for **kw
(ANN003)
scripts/quality/ab_default_vs_deep.py
[warning] 154-154: Use list.extend to create a transformed list
(PERF401)
skills/cairo-auditor/tests/validate_component_resolution.py
[warning] 58-58: Boolean-typed positional argument in function definition
(FBT001)
scripts/quality/benchmark_cairo_auditor.py
[warning] 16-16: Missing return type annotation for private function _load_component_resolution
(ANN202)
[warning] 40-40: Do not catch blind exception: Exception
(BLE001)
[warning] 53-53: Do not catch blind exception: Exception
(BLE001)
[warning] 62-62: Do not catch blind exception: Exception
(BLE001)
skills/cairo-auditor/scripts/quality/component_resolution.py
[warning] 62-62: Unnecessary assignment to text before return statement
Remove unnecessary assignment
(RET504)
scripts/quality/recall_eval_cairo_auditor.py
[warning] 109-109: for loop variable line overwritten by assignment target
(PLW2901)
[warning] 145-145: Do not catch blind exception: Exception
(BLE001)
[warning] 185-185: Too many statements (56 > 50)
(PLR0915)
skills/cairo-auditor/scripts/quality/surface_map.py
[warning] 403-403: Use list.extend to create a transformed list
(PERF401)
[warning] 409-409: Lambda may be unnecessary; consider inlining inner function
Inline function call
(PLW0108)
🔇 Additional comments (32)
.github/workflows/cairo-skills-full-evals.yml (1)
142-149: LGTM!.github/workflows/ci.yml (1)
406-406: LGTM!Also applies to: 418-418
evals/README.md (1)
38-82: LGTM!evals/cases/cairo_auditor_realworld_benchmark.jsonl (1)
43-43: LGTM!evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md (1)
3-3: LGTM!Also applies to: 9-9, 19-19, 31-31, 85-85
evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.json (1)
1-64: LGTM!evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md (1)
1-63: LGTM!skills/cairo-auditor/README.md (1)
505-514: LGTM!skills/cairo-auditor/references/checklists/release-gate.md (1)
7-7: LGTM!skills/cairo-auditor/scripts/quality/escalation_plan.py (2)
94-104: LGTM!
67-70: ⚡ Quick winClarify “(scope)” sentinel handling and add scope validation for escalation_plan output
skills/cairo-auditor/scripts/quality/escalation_plan.pyadds the literal"(scope)"toescalate_fileswhen a borderlinedropped_candidatelacks a usablecandidate/file.skills/cairo-auditor/workflows/deep.mddescribes passing the resulting “file list” to Agent 5, but the repo doesn’t show any consumer logic that treats"(scope)"as a non-path sentinel (or that filters escalations to in-scope files). If the orchestrator treatsescalate_filesentries as concrete paths, this could mis-route escalations; if it’s intended to trigger a broad re-check, encode that intent explicitly (separate flag/sentinel with documented schema) and enforce in-scope validation.skills/cairo-auditor/scripts/quality/surface_map.py (6)
10-15: LGTM!
209-209: LGTM!
217-223: LGTM!
451-461: LGTM!
330-369: ⚡ Quick winComplexity gate threshold + adversarial_action semantics match the orchestrator contract.
surface_map.pyhard-codesthreshold = 3and emitsadversarial_actionasfull(score ≥ 3),downgrade(score 1–2), orskip(score == 0); the deep-mode orchestration inskills/cairo-auditor/workflows/deep.mdandskills/cairo-auditor/SKILL.mddescribes the same Agent 5 behaviors, with--force-adversarialoverriding the gate.
313-327: ⚡ Quick winRelevance fallback partitions
{1,2,3,4}are intentional for “no resolved signal” files.
surface_map.py’s fallback (“no resolved signal stays in all partitions so nothing is orphaned”) matchesworkflows/deep.md, which explicitly states that for relevance-routed large scopes, these files remain in all four partitions for Agents 1–4 (and Agent 5 reads the full in-scope set). The tradeoff is deliberate: it prevents orphaning even if it can dilute per-lens focus.skills/cairo-auditor/SKILL.md (4)
94-94: LGTM!
232-242: LGTM!
246-306: ⚡ Quick winTighten the caching determinism claim; host caching support remains unproven.
The shared-prefix bash block itself has notimestamp/date/random/time()usage; therghit for{timestamp}is for the report filename (security-review-{timestamp}.md), not for$PREFIXgeneration. The caching benefit still depends on whetherclaude-code/codexactually reuse identical leading source prefixes, which isn’t established here; also confirm theIN_SCOPE="$WORKDIR/cairo-audit-files.txt"ordering is deterministic (e.g., sorted) since prefix byte-identity depends on that list.
337-337: ⚡ Quick winCost Controls: deep-mode matches for escalation + complexity gate; Vector corroboration needs traceability.
skills/cairo-auditor/workflows/deep.mddocumentsadversarial_actionunderrouting.json.complexityand includesescalation_plan.py --workdir {workdir}to list files surfaced a Wave A pass (before Agent 5).- The exact “Vector findings so far” corroboration block (and how it’s constructed from Agents 1–4 outputs / fed into Agent 5) isn’t present in the workflow docs matched by the check—confirm the source of that block and how
[CROSS-AGENT]is produced.skills/cairo-auditor/agents/adversarial.md (2)
19-26: LGTM!
41-48: LGTM!skills/cairo-auditor/references/judging.md (3)
12-13: LGTM!
14-37: LGTM!
85-85: LGTM!skills/cairo-auditor/references/report-formatting.md (1)
188-188: LGTM!skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md (1)
27-34: LGTM!skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md (1)
29-35: LGTM!skills/cairo-auditor/workflows/deep.md (2)
26-84: LGTM!
48-63: Align routing/escalation docs with implementation
skills/cairo-auditor/scripts/quality/surface_map.pyexposes--output-routingand writes theroutingJSON;routing.complexityincludesscore,threshold(hardcoded3), andadversarial_action(fullwhenscore >= threshold,skipwhenscore == 0, otherwisedowngrade), plusadversarial_recommendedandreasons.skills/cairo-auditor/scripts/quality/escalation_plan.pyexposes--workdirand globscairo-audit-agent-*-findings.jsonfrom that location.skills/cairo-auditor/workflows/default.md (1)
8-10: LGTM!
| def _key(finding: dict[str, Any]) -> str: | ||
| rc = str(finding.get("root_cause", "")).strip().lower() | ||
| if rc: | ||
| return rc | ||
| return "|".join( | ||
| [ | ||
| str(finding.get("class_id", "")).lower(), | ||
| str(finding.get("file", "")).lower(), | ||
| str(finding.get("title", "")).lower(), | ||
| ] | ||
| ) |
There was a problem hiding this comment.
Preserve duplicate findings when comparing the two runs.
default_by_key = {_key(f): f ...} and the deep equivalent overwrite earlier findings with the same key. Because _key() uses only root_cause when present, two distinct findings in different files/contracts can collapse into one, undercounting only_default/only_deep and potentially flipping the verdict. Compare multisets, or include stable location fields in the key even when root_cause exists.
Also applies to: 72-83
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/quality/ab_default_vs_deep.py` around lines 44 - 54, The current
dedup logic using _key and dict comprehensions (default_by_key / deep_by_key)
overwrites duplicate findings with the same key and undercounts differences;
change it to preserve duplicates by making the key include stable location
fields even when root_cause exists (e.g., include class_id/file/title/line or a
stable location tuple) or switch from dicts to multisets/lists (e.g., map key ->
list of findings or use collections.Counter with a deterministic serialized
finding) so that comparisons operate on counts of identical findings rather than
single entries; update usages that compute only_default/only_deep to consume the
multiset/list counts accordingly.
| if not module_path.is_file(): | ||
| return None | ||
| spec = importlib.util.spec_from_file_location("cairo_auditor_component_resolution", module_path) | ||
| if spec is None or spec.loader is None: | ||
| return None | ||
| module = importlib.util.module_from_spec(spec) | ||
| try: | ||
| spec.loader.exec_module(module) | ||
| except Exception: | ||
| return None | ||
| return module |
There was a problem hiding this comment.
Fail when the shared helper is present but broken.
If component_resolution.py exists but import/execution fails, _load_component_resolution() silently returns None, and later helper failures degrade to {}/False. That changes detector outcomes while letting the deterministic benchmark continue as if the shared logic were absent. Keep the fallback only for a genuinely missing helper; in-repo import/runtime errors should raise with context so CI catches the regression.
Also applies to: 48-63
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 40-40: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/quality/benchmark_cairo_auditor.py` around lines 32 - 42, The current
_load_component_resolution function silently returns None on any import/runtime
error which masks broken in-repo helpers; change the behavior so that only a
genuinely missing file/spec causes a None fallback, but any exception thrown
during spec.loader.exec_module(module) is propagated (or re-raised) with
contextual information (include module_path and a short message) so CI fails
fast; apply the same change to the other loader block referenced at lines 48-63
(the alternate component_resolution import logic) so both loaders only swallow
absence but not execution/import errors.
| for fp in sorted(glob.glob(str(findings_dir / "*.findings.jsonl"))): | ||
| for line in Path(fp).read_text(encoding="utf-8").splitlines(): | ||
| line = line.strip() | ||
| if not line: | ||
| continue | ||
| try: | ||
| row = json.loads(line) | ||
| except json.JSONDecodeError: | ||
| continue | ||
| if isinstance(row, dict) and row.get("finding_id"): | ||
| findings.append(row) |
There was a problem hiding this comment.
Do not silently drop broken findings or crashing detectors.
Malformed JSON rows are skipped, and detector exceptions are converted into misses. Both cases change the denominator/result without surfacing that the evaluation input or harness is broken, which makes the scorecard unreliable as a deterministic gate. Please raise with the offending file/finding/class context instead of continuing.
Also applies to: 141-146
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 109-109: for loop variable line overwritten by assignment target
(PLW2901)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/quality/recall_eval_cairo_auditor.py` around lines 107 - 117,
Malformed JSON and detector exceptions are currently being swallowed in the loop
that reads findings (the block using findings_dir, Path(fp).read_text,
json.loads and findings.append), which changes the denominator silently;
instead, stop silently skipping: when json.loads fails or when a detector
throws, raise an informative exception (or aggregate and raise after scanning)
that includes the offending file path (fp), the raw line content (or line
number), and the parser/detector exception/class name so the CI fails loudly;
update the JSON parse except block to raise (or append to an errors list) with
fp and line, and ensure the detector-exception-to-miss handling (the similar
logic at lines ~141-146) is changed to surface the offending finding/detector
class and error rather than converting to a miss.
| | Suite | Cases | Precision | Recall | Scorecard | | ||
| | --- | ---: | ---: | ---: | --- | | ||
| | Core deterministic | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-benchmark.md) | | ||
| | Real-world corpus | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) | |
There was a problem hiding this comment.
Update the case count to match the scorecard.
The benchmark table shows 42 cases for the real-world corpus, but the v0.2.0 scorecard reports 43 cases after adding rw_irrevocable_admin_safe_accesscontrol_mixin_grant_role_03. Please update this line to reflect the current count.
📊 Proposed fix
-| Real-world corpus | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |
+| Real-world corpus | 43 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | Real-world corpus | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) | | |
| | Real-world corpus | 43 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/README.md` at line 503, Update the real-world corpus
case count in the README table: find the table row containing the text "|
Real-world corpus | 42 | 1.000 | 1.000 |
[v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md)
|" and change the "42" to "43" to match the v0.2.0 scorecard (which added
rw_irrevocable_admin_safe_accesscontrol_mixin_grant_role_03).
| nonzero_initializers = sorted( | ||
| { | ||
| match.group(1) | ||
| for match in re.finditer(r"\b([a-z_][a-z0-9_]*)\.initializer\s*\(", lower) | ||
| } | ||
| ) | ||
| enforces_nonzero_on_init = bool(nonzero_initializers) and (ownable or access_control) | ||
|
|
||
| # Upgradeable component performs its own non-zero class-hash check. | ||
| upgrade_nonzero_guard = upgradeable and "upgradeable.upgrade" in lower |
There was a problem hiding this comment.
Restrict non-zero initializer guards to the OZ receiver(s).
resolve_components() currently records every <receiver>.initializer(...) call once any Ownable/AccessControl marker exists, and initializer_guards_param() later treats any matching param as guarded. That means a contract that embeds OwnableComponent but also calls something like foo.initializer(admin) will suppress CRITICAL_ADDRESS_INIT_WITHOUT_NONZERO_GUARD for admin even if foo does no non-zero validation. Please scope this to the actual Ownable/AccessControl storage receivers instead of any .initializer(...) call.
Also applies to: 121-127
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/scripts/quality/component_resolution.py` around lines 88
- 97, The code currently collects all "<receiver>.initializer(...)" occurrences
into nonzero_initializers and treats any as guarded when ownable or
access_control are present; change this so only initializer calls on the actual
Ownable/AccessControl storage receivers are considered. In resolve_components()
narrow the set comprehension that builds nonzero_initializers to include only
match.group(1) values that match the identified ownable/access_control receiver
names (e.g., the variables that signal OwnableComponent/AccessControlComponent),
and update initializer_guards_param() to check against that filtered set (not
just the booleans enforces_nonzero_on_init or any ".initializer" occurrence).
Also apply the same receiver-scoped filtering to the related logic around
upgrade_nonzero_guard (lines referenced 121-127) so only initializer calls on
the concrete upgradeable/ownable/access_control receivers suppress the
CRITICAL_ADDRESS_INIT_WITHOUT_NONZERO_GUARD.
| # Interaction followed by state mutation in the same file is the classic | ||
| # cross-function / CEI surface that pattern scanners miss — this is exactly | ||
| # what the adversarial pass exists to catch. |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Clarify "CEI" acronym for maintainability.
The comment references "the classic cross-function / CEI surface" but doesn't expand the CEI acronym. In smart contract security, CEI typically stands for "Checks-Effects-Interactions," a well-known reentrancy mitigation pattern. Adding the expansion would improve readability for reviewers unfamiliar with the shorthand.
Clarify acronym
- # Interaction followed by state mutation in the same file is the classic
- # cross-function / CEI surface that pattern scanners miss — this is exactly
- # what the adversarial pass exists to catch.
+ # Interaction followed by state mutation in the same file is the classic
+ # cross-function Checks-Effects-Interactions (CEI) violation that pattern
+ # scanners miss — this is exactly what the adversarial pass exists to catch.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Interaction followed by state mutation in the same file is the classic | |
| # cross-function / CEI surface that pattern scanners miss — this is exactly | |
| # what the adversarial pass exists to catch. | |
| # Interaction followed by state mutation in the same file is the classic | |
| # cross-function Checks-Effects-Interactions (CEI) violation that pattern | |
| # scanners miss — this is exactly what the adversarial pass exists to catch. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/scripts/quality/surface_map.py` around lines 346 - 348,
Update the inline comment that begins "Interaction followed by state mutation in
the same file is the classic cross-function / CEI surface..." to expand the CEI
acronym to "Checks-Effects-Interactions (CEI)" so readers unfamiliar with the
abbreviation understand the reference; keep the rest of the comment unchanged
and ensure the expanded form appears the first time CEI is used in that comment.
| For **large scopes** (any bundle > 1400 lines), prefer relevance-routed source | ||
| per `{workdir}/cairo-audit-routing.json` instead of the full shared prefix: give | ||
| agent N only the files listed under `partition_files["N"]`. See deep mode Cost | ||
| Controls → "Relevance-routed source". | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Document relevance-routed source mechanism or defer to deep.md.
The section references "relevance-routed source per cairo-audit-routing.json" and mentions giving "agent N only the files listed under partition_files[\"N\"]" for large scopes (lines 302-306). However, the concrete orchestration logic for how this routing is applied (i.e., how the orchestrator substitutes the full prefix with a filtered prefix per agent) isn't shown in this code block.
If workflows/deep.md documents the relevance-routing substitution logic, consider adding a forward reference here (e.g., "See deep.md Cost Controls → Relevance-routed source for substitution logic"). Otherwise, add a brief example showing how to build a per-agent filtered prefix using partition_files from the routing JSON.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/SKILL.md` around lines 302 - 306, The doc mentions
"relevance-routed source per `cairo-audit-routing.json`" and
`partition_files["N"]` but doesn't show how the orchestrator applies that
routing; either add a forward reference to the detailed substitution logic in
`workflows/deep.md` (e.g., "See deep.md Cost Controls → Relevance-routed source
for substitution logic") or include a short example in SKILL.md explaining how
to build a per-agent filtered prefix by reading `cairo-audit-routing.json`'s
`partition_files` entry and replacing the shared prefix with only those files
for agent N (reference `cairo-audit-routing.json`, `partition_files`, and the
orchestrator/agent prefix substitution step).
| `routing.json.complexity` carries `score`, `threshold` (3), and | ||
| `adversarial_action`: |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Clarify whether complexity threshold is configurable.
The documentation specifies threshold (3) as a fixed value, but doesn't indicate whether users can adjust this via flag or configuration. If the threshold is intentionally fixed, consider adding a brief note explaining the rationale. If it should be configurable, document the mechanism.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/workflows/deep.md` around lines 62 - 63, Clarify whether
routing.json.complexity.threshold is configurable: if it is adjustable, update
the docs to state the configuration mechanism (e.g., which CLI flag or config
key to set the threshold, expected value range, and where to place it) and show
an example usage; if it is intentionally fixed at 3, add a short rationale line
near routing.json.complexity explaining why the threshold is fixed and that
adversarial_action behavior depends on that hard-coded value. Ensure you
reference routing.json.complexity, the threshold value, and adversarial_action
in the added sentence so readers can find the relevant behavior.
| - The shared cacheable bundle prefix (#5) applies here too: the four bundles | ||
| differ only in their trailing attack-vector partition, so the leading source + | ||
| reference block is reused across agents on prompt-caching hosts. |
There was a problem hiding this comment.
Fix broken reference to "(#5)".
The text references "The shared cacheable bundle prefix (#5)" but no numbered section #5 exists in this document. Replace with a proper cross-reference format.
📝 Proposed fix
-- The shared cacheable bundle prefix (`#5`) applies here too: the four bundles
+- The shared cacheable bundle prefix (see deep.md "Shared cached source prefix") applies here too: the four bundles📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - The shared cacheable bundle prefix (#5) applies here too: the four bundles | |
| differ only in their trailing attack-vector partition, so the leading source + | |
| reference block is reused across agents on prompt-caching hosts. | |
| - The shared cacheable bundle prefix (see deep.md "Shared cached source prefix") applies here too: the four bundles | |
| differ only in their trailing attack-vector partition, so the leading source + | |
| reference block is reused across agents on prompt-caching hosts. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/workflows/default.md` around lines 14 - 16, The sentence
referencing "The shared cacheable bundle prefix (`#5`)" uses a non-existent
numeric cross-reference; locate the line containing that exact phrase in
workflows/default.md and replace "(`#5`)" with a proper cross-reference such as a
section title or markdown link (e.g., "(see Shared cacheable bundle prefix)" or
a link to the relevant header) so the reference resolves correctly; ensure you
update the phrase "The shared cacheable bundle prefix (`#5`)" to read "The shared
cacheable bundle prefix (see Shared cacheable bundle prefix)" or similar,
preserving the rest of the sentence.
| - Default mode does **not** run the adversarial agent, so cross-partition chains | ||
| (root causes spanning two vector lenses) are weakly covered. Use `deep` mode | ||
| when boundary-spanning composition matters — see [deep.md](deep.md). |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
Consider clarifying "cross-partition chains" for first-time readers.
The terms "cross-partition chains" and "boundary-spanning composition" assume familiarity with the vector-partition structure (the 4 agents scanning different attack-vector categories). A brief parenthetical like "(root causes spanning two vector lenses, e.g., access control + reentrancy)" would help readers who start with default.md.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@skills/cairo-auditor/workflows/default.md` around lines 17 - 19, Update the
sentence that mentions "cross-partition chains" and "boundary-spanning
composition" to include a brief parenthetical example clarifying the terms for
new readers (e.g., "(root causes spanning two vector lenses, e.g., access
control + reentrancy)"). Locate the phrase "cross-partition chains" /
"boundary-spanning composition" in the default mode description and insert the
parenthetical immediately after "cross-partition chains" so the sentence reads
clearly and preserves the recommendation to use `deep` mode.
Implements the nine reviewer suggestions for the
cairo-auditordeep mode — four quality, four cost, one housekeeping. Each is grounded in the actual code/eval data and individually tested.Quality
component_resolution.py) for embedded OZ Ownable/AccessControl/Upgradeable surfaces. Wired into the runtime surface map (new Component Resolution section), the FP gate (judging.md), vuln-db notes, and the canonical detectors. Directly closes theIRREVOCABLE_ADMIN/CRITICAL_ADDRESS_INITfalse-positive family seen in external triage (e.g. ForgeYieldsOwnableImpl).component_resolution.py,surface_map.py,judging.md,benchmark_cairo_auditor.py, vuln-db[CROSS-AGENT]clarified as vector+adversarial corroboration (it was perceived as dead because vector partitions are disjoint).adversarial.md,deep.md,judging.mdrecall_eval_cairo_auditor.pyab_default_vs_deep.pyThe recall eval is sobering and intentional: taxonomy coverage is ~13% of real findings and 0% of Critical — the supported 13-class taxonomy skews structural/access-control, missing the economic/logic bugs that dominate real audits. This is the gap the review asked to make visible; the scorecard is committed at
evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md.Cost
surface_map.py --output-routingmaps files to partitions; large scopes route each agent only its relevant files.full/downgrade/skip;--force-adversarialoverrides. Trivial getter-only contracts skip the opus pass; interaction+state surfaces keep it.escalation_plan.pypromotes only files with first-pass signal to the opus re-check.Full detail in
skills/cairo-auditor/workflows/deep.md→ Cost Controls.Housekeeping
cairo_auditor_benchmark.jsonl+ its scorecard): it duplicated the real-world pack's class distribution and 1.000/1.000 result without real-repo provenance. The real-world pack is now the single deterministic regression gate. CI, parity check, and docs updated.ci.yml(unit tests + component-resolution validator) andcairo-skills-full-evals.yml(recall eval replaces the retired synthetic benchmark step).Testing
All local suites pass:
validate_preflight.py(exact-match fixtures unchanged),validate_deep_smoke.py, newvalidate_component_resolution.pytest_cairo_auditor_quality.pyunittests (A/B + recall mapping)validate_skills.pygreen (SKILL.md back under the 500-line cap)Reviewer notes (judgment calls)
VERSIONwithout a matchingv<version>tag/release, so bumping here would break CI. Bump + tag at release time (maintainer step).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests