Skip to content

feat(cairo-auditor): quality + cost improvements for deep mode (9 changes)#447

Open
adrienlacombe wants to merge 1 commit into
mainfrom
feat/cairo-auditor-quality-and-cost
Open

feat(cairo-auditor): quality + cost improvements for deep mode (9 changes)#447
adrienlacombe wants to merge 1 commit into
mainfrom
feat/cairo-auditor-quality-and-cost

Conversation

@adrienlacombe

@adrienlacombe adrienlacombe commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Implements the nine reviewer suggestions for the cairo-auditor deep mode — four quality, four cost, one housekeeping. Each is grounded in the actual code/eval data and individually tested.

Quality

# Change Where
#3 Component resolution — single source of truth (component_resolution.py) for embedded OZ Ownable/AccessControl/Upgradeable surfaces. Wired into the runtime surface map (new Component Resolution section), the FP gate (judging.md), vuln-db notes, and the canonical detectors. Directly closes the IRREVOCABLE_ADMIN / CRITICAL_ADDRESS_INIT false-positive family seen in external triage (e.g. ForgeYields OwnableImpl). component_resolution.py, surface_map.py, judging.md, benchmark_cairo_auditor.py, vuln-db
#4 Cross-partition pass — the adversarial agent now explicitly hunts root causes that span two vector lenses; [CROSS-AGENT] clarified as vector+adversarial corroboration (it was perceived as dead because vector partitions are disjoint). adversarial.md, deep.md, judging.md
#1 Real recall eval — measures coverage against 217 independent human-confirmed audit findings, replacing the tautological "recall 1.0" (precision on the tool's own output). recall_eval_cairo_auditor.py
#2 Default-vs-deep A/B — quantifies whether Agent 5 earns its premium cost (deep-only findings, adversarial attribution, verdict). ab_default_vs_deep.py

The recall eval is sobering and intentional: taxonomy coverage is ~13% of real findings and 0% of Critical — the supported 13-class taxonomy skews structural/access-control, missing the economic/logic bugs that dominate real audits. This is the gap the review asked to make visible; the scorecard is committed at evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md.

Cost

# Change
#5 Shared cacheable bundle prefix — source + references built once as a byte-identical prefix; prompt-caching hosts reuse it across the 4 vector agents instead of 4× re-ingestion.
#6 Relevance-routed sourcesurface_map.py --output-routing maps files to partitions; large scopes route each agent only its relevant files.
#7 Complexity-gated Agent 5 — deterministic signal decides full / downgrade / skip; --force-adversarial overrides. Trivial getter-only contracts skip the opus pass; interaction+state surfaces keep it.
#8 Density model escalationescalation_plan.py promotes only files with first-pass signal to the opus re-check.

Full detail in skills/cairo-auditor/workflows/deep.mdCost Controls.

Housekeeping

  • 🗺️ Roadmap Discussion: Strategic Plan for Completing the Agentic Infrastructure #9 Retired the redundant synthetic benchmark (cairo_auditor_benchmark.jsonl + its scorecard): it duplicated the real-world pack's class distribution and 1.000/1.000 result without real-repo provenance. The real-world pack is now the single deterministic regression gate. CI, parity check, and docs updated.
  • New tests/evals wired into ci.yml (unit tests + component-resolution validator) and cairo-skills-full-evals.yml (recall eval replaces the retired synthetic benchmark step).

Testing

All local suites pass:

  • validate_preflight.py (exact-match fixtures unchanged), validate_deep_smoke.py, new validate_component_resolution.py
  • new test_cairo_auditor_quality.py unittests (A/B + recall mapping)
  • real-world benchmark 1.000 / 1.000 at 43 cases (+1 AccessControl-rotation regression case)
  • recall eval passes the 0.10 advisory floors; validate_skills.py green (SKILL.md back under the 500-line cap)

Reviewer notes (judgment calls)

  1. Benchmark retirement (🗺️ Roadmap Discussion: Strategic Plan for Completing the Agentic Infrastructure #9) is the one destructive change — a deletion on this branch, fully reversible via git. I retired the synthetic pack and kept the real-world one; flag if you'd rather keep both or merge them.
  2. No version bump. The release-hygiene gate fails any VERSION without a matching v<version> tag/release, so bumping here would break CI. Bump + tag at release time (maintainer step).
  3. The recall eval's class→class keyword mapping is approximate; treat coverage as directional. The verdict (low taxonomy coverage) holds regardless.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Real-world benchmark evaluation suite with recall and taxonomy coverage metrics
    • Deep-mode cost controls: shared bundle caching, relevance-based routing, complexity gating, and density-based escalation
    • Component-aware false-positive filtering for critical-address and role-rotation checks
    • A/B comparison tooling for deep-mode performance analysis
  • Documentation

    • Updated evaluation and deep-mode workflow guidance with routing and cost-control specifications
    • Enhanced component-resolution reference for improved false-positive mitigation
  • Tests

    • Unit tests for evaluation harnesses and A/B comparisons
    • Validation suite for component-resolution behavior

Implements nine reviewer suggestions across the cairo-auditor skill.

Quality:
- #3 Component resolution: shared `component_resolution.py` resolves embedded OZ
  Ownable/AccessControl/Upgradeable surfaces; wired into the surface map, the
  FP gate (judging.md), vuln-db notes, and the canonical detectors. Closes the
  IRREVOCABLE_ADMIN / CRITICAL_ADDRESS_INIT external-triage FP family. Adds a
  unit test and an AccessControl-rotation regression case.
- #4 Cross-partition pass: adversarial agent explicitly hunts boundary-spanning
  chains; [CROSS-AGENT] semantics clarified (vector+adversarial corroboration).
- #1 Recall + taxonomy coverage eval against 217 independent real-audit findings
  (recall_eval_cairo_auditor.py) — replaces the tautological "recall 1.0".
  Current coverage is an honest ~13% (0% on Critical), making the gap visible.
- #2 Default-vs-deep A/B harness (ab_default_vs_deep.py) quantifying whether
  Agent 5 earns its premium cost.

Cost:
- #5 Shared cacheable bundle prefix so prompt-caching hosts reuse source across
  vector agents instead of 4x re-ingestion.
- #6 Relevance-routed source per partition for large scopes (surface_map routing).
- #7 Complexity-gated Agent 5 (skip/downgrade/full) via deterministic signal.
- #8 Density-based model escalation (escalation_plan.py): opus only for files
  with first-pass signal.

Housekeeping:
- #9 Retired the redundant synthetic benchmark pack (duplicated the real-world
  pack); real-world pack is now the single deterministic regression gate.
- Wired new tests/evals into ci.yml and cairo-skills-full-evals.yml.

All local suites pass (preflight, deep smoke, component resolution, harness
unittests, realworld benchmark 1.0/1.0, recall eval). VERSION intentionally not
bumped — the release-hygiene gate requires a maintainer tag/release.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

This PR implements a comprehensive evaluation and cost-optimization overhaul for the Cairo Auditor skill. It migrates from synthetic to real-world deterministic benchmarks, introduces component-resolution-based false-positive mitigation, adds new evaluation harnesses for recall and A/B comparison, and enables cost-optimized deep-mode orchestration via routing and escalation planning.

Changes

Cairo Auditor Evaluation & False-Positive Mitigation

Layer / File(s) Summary
Component Resolution: Detection & Integration
skills/cairo-auditor/scripts/quality/component_resolution.py, scripts/quality/benchmark_cairo_auditor.py, skills/cairo-auditor/tests/validate_component_resolution.py
New module detects embedded OpenZeppelin Ownable/AccessControl/Upgradeable constructs and derives rotation surfaces and non-zero guard semantics. Integrates into detect_critical_address_init_without_nonzero_guard and detect_irrevocable_admin detectors via optional dynamic imports. Includes validation tests covering ownable, access-control, and no-component cases.
Evaluation Harnesses: Recall Coverage & A/B Comparison
scripts/quality/recall_eval_cairo_auditor.py, scripts/quality/ab_default_vs_deep.py, scripts/quality/test_cairo_auditor_quality.py
Adds two deterministic quality harnesses: recall_eval_cairo_auditor measures detector coverage against ground-truth findings with taxonomy mapping and severity-weighted metrics; ab_default_vs_deep compares default and deep mode outputs to quantify adversarial agent marginal value. Both include unit tests for mapping, aggregation, and verdict logic.
Cost Controls: Escalation Planning & Routing/Complexity
skills/cairo-auditor/scripts/quality/escalation_plan.py, skills/cairo-auditor/scripts/quality/surface_map.py
Adds escalation plan generation to select high-priority files for premium-model re-checking. Enhances surface map with vector-partition relevance mapping, signal-based complexity scoring, and routing JSON output for Agent 5 gating decisions in deep mode.
Benchmark Migration: Synthetic → Real-World
.github/workflows/cairo-skills-full-evals.yml, .github/workflows/ci.yml, evals/README.md, evals/cases/cairo_auditor_realworld_benchmark.jsonl, evals/scorecards/v0.2.0-cairo-auditor-*, skills/cairo-auditor/README.md, skills/cairo-auditor/references/checklists/release-gate.md
Removes synthetic benchmark pack, updates CI workflow to run recall evaluation harness instead of benchmark, adds one real-world benchmark case, updates scorecard references and release-gate checklist, and documents retirement of synthetic benchmarks.
Workflow & Judging Documentation: Deep Mode & Component Semantics
skills/cairo-auditor/SKILL.md, skills/cairo-auditor/agents/adversarial.md, skills/cairo-auditor/references/judging.md, skills/cairo-auditor/references/report-formatting.md, skills/cairo-auditor/references/vulnerability-db/*.md, skills/cairo-auditor/workflows/deep.md, skills/cairo-auditor/workflows/default.md
Adds --force-adversarial deep-mode flag, documents bundle-prefix caching and relevance routing, defines cost-control mechanics including complexity gating and escalation. Adds mandatory component-resolution judging gate, expands adversarial agent guidance for cross-partition boundary passes, and adds false-positive exemptions for OZ component rotation surfaces in vulnerability DB references.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • omarespejel

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (2 errors, 1 warning)

Check name Status Explanation Resolution
Spec Impact Declaration ❌ Error PR modifies 360 files across qualifying paths (.github/workflows/, skills/, docs/, contracts/session-account/, packages/) but lacks required "Spec impact" section in PR description. Add "Spec impact" section to PR description with either "none" or concrete compatibility/migration notes for the changes.
Security Rationale For Account Semantics ❌ Error Account contracts in contracts/agent-account/ and contracts/session-account/ were added but lack documented security rationale and explicit invariant preservation statements as required. Add security design document explicitly covering threat models, key invariants (nonce monotonicity, spending limits, session isolation, signature validation), and design rationale for each security-critical feature.
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title uses a clear scope prefix ('feat(cairo-auditor)') and accurately summarizes the main behavioral impact (quality + cost improvements for deep mode), matching the subsystem-focused requirement.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Cross-Repo Boundary Awareness ✅ Passed PR does not modify contracts/session-account/** or packages/**; no boundary-directory changes found. Triggering condition not met.
Cairo Contract Safety Gate ✅ Passed No Cairo contract files (contracts/**/*.cairo) were modified in this PR; it focuses on Python quality scripts, workflows, and documentation. The safety check is not applicable.
Ci/Cd Security Gate ✅ Passed All 41 workflow actions pinned to SHA hashes, minimal read-only permissions, no secrets referenced, GITHUB_TOKEN used safely for version validation only.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/cairo-auditor-quality-and-cost

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Quality and cost improvements for cairo-auditor deep mode with component resolution and recall evaluation

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Implements nine quality and cost improvements for cairo-auditor deep mode
• Adds component resolution module to eliminate false positives on embedded OZ components
• Introduces real-world recall evaluation against 217 independent audit findings
• Adds default-vs-deep A/B comparison harness to quantify Agent 5 value
• Implements cost controls: shared cached prefix, relevance routing, complexity gating, density
  escalation
• Retires redundant synthetic benchmark; real-world pack is now single regression gate
Diagram
flowchart LR
  A["Quality Improvements"] --> B["Component Resolution"]
  A --> C["Real Recall Eval"]
  A --> D["Default vs Deep A/B"]
  A --> E["Cross-Partition Pass"]
  F["Cost Controls"] --> G["Shared Cached Prefix"]
  F --> H["Relevance Routing"]
  F --> I["Complexity Gating"]
  F --> J["Density Escalation"]
  K["Housekeeping"] --> L["Retire Synthetic Benchmark"]
  K --> M["Update CI/Evals"]
  B --> N["FP Gate Integration"]
  C --> O["Taxonomy Coverage"]
  D --> P["Verdict Logic"]

Loading

Grey Divider

File Changes

1. scripts/quality/recall_eval_cairo_auditor.py 🧪 Tests +316/-0

Real-world recall evaluation against independent audit findings

scripts/quality/recall_eval_cairo_auditor.py


2. scripts/quality/ab_default_vs_deep.py 🧪 Tests +215/-0

Default-vs-deep A/B comparison harness for Agent 5 value

scripts/quality/ab_default_vs_deep.py


3. skills/cairo-auditor/scripts/quality/component_resolution.py ✨ Enhancement +191/-0

Shared component resolution for embedded OZ surfaces

skills/cairo-auditor/scripts/quality/component_resolution.py


View more (24)
4. skills/cairo-auditor/scripts/quality/escalation_plan.py ✨ Enhancement +117/-0

Density-based model escalation for cost optimization

skills/cairo-auditor/scripts/quality/escalation_plan.py


5. skills/cairo-auditor/scripts/quality/surface_map.py ✨ Enhancement +211/-4

Add component resolution, routing, and complexity signals

skills/cairo-auditor/scripts/quality/surface_map.py


6. scripts/quality/benchmark_cairo_auditor.py ✨ Enhancement +62/-0

Integrate component resolution into deterministic detectors

scripts/quality/benchmark_cairo_auditor.py


7. scripts/quality/test_cairo_auditor_quality.py 🧪 Tests +116/-0

Unit tests for recall eval and A/B comparison harnesses

scripts/quality/test_cairo_auditor_quality.py


8. skills/cairo-auditor/tests/validate_component_resolution.py 🧪 Tests +101/-0

Validation tests for component resolution module

skills/cairo-auditor/tests/validate_component_resolution.py


9. evals/cases/cairo_auditor_realworld_benchmark.jsonl 🧪 Tests +1/-0

Add AccessControl rotation regression test case

evals/cases/cairo_auditor_realworld_benchmark.jsonl


10. evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md 📝 Documentation +5/-4

Update benchmark scorecard with new test case

evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md


11. evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md 📝 Documentation +62/-0

New recall and taxonomy coverage scorecard

evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md


12. evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.json 📝 Documentation +64/-0

Recall evaluation metrics in JSON format

evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.json


13. skills/cairo-auditor/SKILL.md 📝 Documentation +40/-18

Document cost controls and new flags

skills/cairo-auditor/SKILL.md


14. skills/cairo-auditor/workflows/deep.md 📝 Documentation +64/-5

Document cost controls and complexity gating

skills/cairo-auditor/workflows/deep.md


15. skills/cairo-auditor/workflows/default.md 📝 Documentation +11/-2

Document shared cached prefix and cost notes

skills/cairo-auditor/workflows/default.md


16. skills/cairo-auditor/agents/adversarial.md 📝 Documentation +16/-2

Add cross-partition boundary pass instructions

skills/cairo-auditor/agents/adversarial.md


17. skills/cairo-auditor/references/judging.md 📝 Documentation +27/-1

Add component resolution FP gate rules

skills/cairo-auditor/references/judging.md


18. skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md 📝 Documentation +7/-0

Document OZ component rotation surface exceptions

skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md


19. skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md 📝 Documentation +8/-0

Document OZ initializer non-zero enforcement exceptions

skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md


20. skills/cairo-auditor/references/report-formatting.md 📝 Documentation +1/-1

Clarify cross-agent corroboration semantics

skills/cairo-auditor/references/report-formatting.md


21. skills/cairo-auditor/references/checklists/release-gate.md 📝 Documentation +1/-1

Update baseline scorecard reference

skills/cairo-auditor/references/checklists/release-gate.md


22. skills/cairo-auditor/README.md 📝 Documentation +7/-1

Document recall eval and retire synthetic benchmark

skills/cairo-auditor/README.md


23. evals/README.md 📝 Documentation +40/-2

Add recall eval and A/B comparison instructions

evals/README.md


24. .github/workflows/ci.yml ⚙️ Configuration changes +2/-0

Wire new quality tests into CI pipeline

.github/workflows/ci.yml


25. .github/workflows/cairo-skills-full-evals.yml ⚙️ Configuration changes +8/-12

Replace synthetic benchmark with recall eval

.github/workflows/cairo-skills-full-evals.yml


26. evals/cases/cairo_auditor_benchmark.jsonl Additional files +0/-42

...

evals/cases/cairo_auditor_benchmark.jsonl


27. evals/scorecards/v0.2.0-cairo-auditor-benchmark.md Additional files +0/-90

...

evals/scorecards/v0.2.0-cairo-auditor-benchmark.md


Grey Divider

Qodo Logo

@qodo-code-review qodo-code-review Bot added enhancement New feature or request Tests labels Jun 1, 2026
@qodo-code-review

qodo-code-review Bot commented Jun 1, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

Context used

Grey Divider


Action required

1. Escalation plan wrong targets 🐞 Bug ≡ Correctness
Description
In escalation_plan.build_plan(), borderline dropped candidates are added to the escalation set using
dropped_candidates[].candidate as if it were a file path, so escalate_files can contain arbitrary
titles/trace text instead of real files. This breaks the documented contract that the plan lists
files to re-check with the adversarial model and can cause downstream consumers to fail when they
treat these entries as paths.
Code

skills/cairo-auditor/scripts/quality/escalation_plan.py[R64-71]

Evidence
The escalation planner currently keys escalation entries by drop['candidate'], but elsewhere in
the codebase dropped_candidates are normalized to objects where candidate is just a label/title.
The deep-mode workflow explicitly expects escalation_plan to output a list of files to re-check, so
these non-path entries violate the intended contract.

skills/cairo-auditor/scripts/quality/escalation_plan.py[39-84]
skills/cairo-auditor/scripts/quality/structured_report.py[95-127]
skills/cairo-auditor/workflows/deep.md[75-83]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`skills/cairo-auditor/scripts/quality/escalation_plan.py` uses `drop["candidate"]` as the `file_path` key when a dropped candidate is borderline. But `candidate` is not a file path (it’s a human label/title), so the generated `escalate_files` list becomes invalid and cannot reliably drive file-based re-check routing.

## Issue Context
- The script is documented to “list the files that surfaced a finding … or a borderline candidate” for targeted adversarial re-check.
- Agent output normalization treats `dropped_candidates[].candidate` as a string label, not a path.

## Fix Focus Areas
- skills/cairo-auditor/scripts/quality/escalation_plan.py[39-85]

## Suggested fix
- For borderline drops:
 - Prefer a real file field if present (e.g., `drop.get("file")` if the upstream schema ever includes it).
 - Otherwise, **do not** use `candidate` as a file key. Instead either:
   1) mark a sentinel like `"(scope)"` (constant) and add a top-level flag like `"escalate_scope": true`, or
   2) escalate all files already seen in first-pass findings (`seen_files`) by calling `mark(f, ...)` for each `f in seen_files`.
- Ensure `escalate_files` contains only valid repo-relative file paths (or a single, well-defined sentinel/flag that downstream tooling can interpret), never arbitrary candidate strings.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

labels = routing.get("partition_labels", {})
if isinstance(partition_files, dict):
lines += ["| Partition | Lens | Relevant files |", "|---|---|---|"]
for idx in sorted(partition_files, key=lambda x: int(x)):
Comment on lines +64 to +71
for drop in data.get("dropped_candidates", []) if isinstance(data.get("dropped_candidates"), list) else []:
if not isinstance(drop, dict):
continue
if str(drop.get("drop_reason")) in BORDERLINE_DROP_REASONS:
# Dropped candidates rarely carry a file; attribute to the agent
# scope so the orchestrator re-checks broadly when borderline.
mark(str(drop.get("candidate", "")) or "(scope)", "borderline candidate in first pass")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Escalation plan wrong targets 🐞 Bug ≡ Correctness

In escalation_plan.build_plan(), borderline dropped candidates are added to the escalation set using
dropped_candidates[].candidate as if it were a file path, so escalate_files can contain arbitrary
titles/trace text instead of real files. This breaks the documented contract that the plan lists
files to re-check with the adversarial model and can cause downstream consumers to fail when they
treat these entries as paths.
Agent Prompt
## Issue description
`skills/cairo-auditor/scripts/quality/escalation_plan.py` uses `drop["candidate"]` as the `file_path` key when a dropped candidate is borderline. But `candidate` is not a file path (it’s a human label/title), so the generated `escalate_files` list becomes invalid and cannot reliably drive file-based re-check routing.

## Issue Context
- The script is documented to “list the files that surfaced a finding … or a borderline candidate” for targeted adversarial re-check.
- Agent output normalization treats `dropped_candidates[].candidate` as a string label, not a path.

## Fix Focus Areas
- skills/cairo-auditor/scripts/quality/escalation_plan.py[39-85]

## Suggested fix
- For borderline drops:
  - Prefer a real file field if present (e.g., `drop.get("file")` if the upstream schema ever includes it).
  - Otherwise, **do not** use `candidate` as a file key. Instead either:
    1) mark a sentinel like `"(scope)"` (constant) and add a top-level flag like `"escalate_scope": true`, or
    2) escalate all files already seen in first-pass findings (`seen_files`) by calling `mark(f, ...)` for each `f in seen_files`.
- Ensure `escalate_files` contains only valid repo-relative file paths (or a single, well-defined sentinel/flag that downstream tooling can interpret), never arbitrary candidate strings.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/quality/ab_default_vs_deep.py`:
- Around line 44-54: The current dedup logic using _key and dict comprehensions
(default_by_key / deep_by_key) overwrites duplicate findings with the same key
and undercounts differences; change it to preserve duplicates by making the key
include stable location fields even when root_cause exists (e.g., include
class_id/file/title/line or a stable location tuple) or switch from dicts to
multisets/lists (e.g., map key -> list of findings or use collections.Counter
with a deterministic serialized finding) so that comparisons operate on counts
of identical findings rather than single entries; update usages that compute
only_default/only_deep to consume the multiset/list counts accordingly.

In `@scripts/quality/benchmark_cairo_auditor.py`:
- Around line 32-42: The current _load_component_resolution function silently
returns None on any import/runtime error which masks broken in-repo helpers;
change the behavior so that only a genuinely missing file/spec causes a None
fallback, but any exception thrown during spec.loader.exec_module(module) is
propagated (or re-raised) with contextual information (include module_path and a
short message) so CI fails fast; apply the same change to the other loader block
referenced at lines 48-63 (the alternate component_resolution import logic) so
both loaders only swallow absence but not execution/import errors.

In `@scripts/quality/recall_eval_cairo_auditor.py`:
- Around line 107-117: Malformed JSON and detector exceptions are currently
being swallowed in the loop that reads findings (the block using findings_dir,
Path(fp).read_text, json.loads and findings.append), which changes the
denominator silently; instead, stop silently skipping: when json.loads fails or
when a detector throws, raise an informative exception (or aggregate and raise
after scanning) that includes the offending file path (fp), the raw line content
(or line number), and the parser/detector exception/class name so the CI fails
loudly; update the JSON parse except block to raise (or append to an errors
list) with fp and line, and ensure the detector-exception-to-miss handling (the
similar logic at lines ~141-146) is changed to surface the offending
finding/detector class and error rather than converting to a miss.

In `@skills/cairo-auditor/README.md`:
- Line 503: Update the real-world corpus case count in the README table: find
the table row containing the text "| Real-world corpus | 42 | 1.000 | 1.000 |
[v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md)
|" and change the "42" to "43" to match the v0.2.0 scorecard (which added
rw_irrevocable_admin_safe_accesscontrol_mixin_grant_role_03).

In `@skills/cairo-auditor/scripts/quality/component_resolution.py`:
- Around line 88-97: The code currently collects all
"<receiver>.initializer(...)" occurrences into nonzero_initializers and treats
any as guarded when ownable or access_control are present; change this so only
initializer calls on the actual Ownable/AccessControl storage receivers are
considered. In resolve_components() narrow the set comprehension that builds
nonzero_initializers to include only match.group(1) values that match the
identified ownable/access_control receiver names (e.g., the variables that
signal OwnableComponent/AccessControlComponent), and update
initializer_guards_param() to check against that filtered set (not just the
booleans enforces_nonzero_on_init or any ".initializer" occurrence). Also apply
the same receiver-scoped filtering to the related logic around
upgrade_nonzero_guard (lines referenced 121-127) so only initializer calls on
the concrete upgradeable/ownable/access_control receivers suppress the
CRITICAL_ADDRESS_INIT_WITHOUT_NONZERO_GUARD.

In `@skills/cairo-auditor/scripts/quality/escalation_plan.py`:
- Around line 39-84: The build_plan currently escalates any file referenced in
agent outputs; update it to accept an optional scope set (e.g., add parameter
scope_files: Optional[set[str]] = None to build_plan) and validate escalated
entries against that set before returning results. Specifically, either have
mark() ignore file_path values not present in scope_files (treat "(scope)"
specially) or, after populating escalate, filter escalate_files/escalate_detail
to only include keys present in scope_files when scope_files is provided; keep
functions build_plan and mark as the touch points to implement this check so
out-of-scope references are not propagated to model_for_escalated.

In `@skills/cairo-auditor/scripts/quality/surface_map.py`:
- Around line 346-348: Update the inline comment that begins "Interaction
followed by state mutation in the same file is the classic cross-function / CEI
surface..." to expand the CEI acronym to "Checks-Effects-Interactions (CEI)" so
readers unfamiliar with the abbreviation understand the reference; keep the rest
of the comment unchanged and ensure the expanded form appears the first time CEI
is used in that comment.

In `@skills/cairo-auditor/SKILL.md`:
- Around line 302-306: The doc mentions "relevance-routed source per
`cairo-audit-routing.json`" and `partition_files["N"]` but doesn't show how the
orchestrator applies that routing; either add a forward reference to the
detailed substitution logic in `workflows/deep.md` (e.g., "See deep.md Cost
Controls → Relevance-routed source for substitution logic") or include a short
example in SKILL.md explaining how to build a per-agent filtered prefix by
reading `cairo-audit-routing.json`'s `partition_files` entry and replacing the
shared prefix with only those files for agent N (reference
`cairo-audit-routing.json`, `partition_files`, and the orchestrator/agent prefix
substitution step).

In `@skills/cairo-auditor/workflows/deep.md`:
- Around line 62-63: Clarify whether routing.json.complexity.threshold is
configurable: if it is adjustable, update the docs to state the configuration
mechanism (e.g., which CLI flag or config key to set the threshold, expected
value range, and where to place it) and show an example usage; if it is
intentionally fixed at 3, add a short rationale line near
routing.json.complexity explaining why the threshold is fixed and that
adversarial_action behavior depends on that hard-coded value. Ensure you
reference routing.json.complexity, the threshold value, and adversarial_action
in the added sentence so readers can find the relevant behavior.

In `@skills/cairo-auditor/workflows/default.md`:
- Around line 14-16: The sentence referencing "The shared cacheable bundle
prefix (`#5`)" uses a non-existent numeric cross-reference; locate the line
containing that exact phrase in workflows/default.md and replace "(`#5`)" with a
proper cross-reference such as a section title or markdown link (e.g., "(see
Shared cacheable bundle prefix)" or a link to the relevant header) so the
reference resolves correctly; ensure you update the phrase "The shared cacheable
bundle prefix (`#5`)" to read "The shared cacheable bundle prefix (see Shared
cacheable bundle prefix)" or similar, preserving the rest of the sentence.
- Around line 17-19: Update the sentence that mentions "cross-partition chains"
and "boundary-spanning composition" to include a brief parenthetical example
clarifying the terms for new readers (e.g., "(root causes spanning two vector
lenses, e.g., access control + reentrancy)"). Locate the phrase "cross-partition
chains" / "boundary-spanning composition" in the default mode description and
insert the parenthetical immediately after "cross-partition chains" so the
sentence reads clearly and preserves the recommendation to use `deep` mode.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2a91bfd6-1efc-4b96-a7ab-89570514a79a

📥 Commits

Reviewing files that changed from the base of the PR and between 51c4f7e and 41f7195.

📒 Files selected for processing (27)
  • .github/workflows/cairo-skills-full-evals.yml
  • .github/workflows/ci.yml
  • evals/README.md
  • evals/cases/cairo_auditor_benchmark.jsonl
  • evals/cases/cairo_auditor_realworld_benchmark.jsonl
  • evals/scorecards/v0.2.0-cairo-auditor-benchmark.md
  • evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md
  • evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.json
  • evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md
  • scripts/quality/ab_default_vs_deep.py
  • scripts/quality/benchmark_cairo_auditor.py
  • scripts/quality/recall_eval_cairo_auditor.py
  • scripts/quality/test_cairo_auditor_quality.py
  • skills/cairo-auditor/README.md
  • skills/cairo-auditor/SKILL.md
  • skills/cairo-auditor/agents/adversarial.md
  • skills/cairo-auditor/references/checklists/release-gate.md
  • skills/cairo-auditor/references/judging.md
  • skills/cairo-auditor/references/report-formatting.md
  • skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md
  • skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md
  • skills/cairo-auditor/scripts/quality/component_resolution.py
  • skills/cairo-auditor/scripts/quality/escalation_plan.py
  • skills/cairo-auditor/scripts/quality/surface_map.py
  • skills/cairo-auditor/tests/validate_component_resolution.py
  • skills/cairo-auditor/workflows/deep.md
  • skills/cairo-auditor/workflows/default.md
💤 Files with no reviewable changes (2)
  • evals/cases/cairo_auditor_benchmark.jsonl
  • evals/scorecards/v0.2.0-cairo-auditor-benchmark.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (6)
skills/**/

📄 CodeRabbit inference engine (AGENTS.md)

skills/**/: Run python3 scripts/quality/validate_skills.py and python3 scripts/skills_manifest.py --check for skills/install UX changes
Run python3 scripts/quality/check_codex_distribution.py and python3 -m unittest scripts/quality/test_codex_distribution.py for skills/install UX changes
Run python3 scripts/quality/validate_marketplace.py for skills/install UX changes

Files:

  • skills/cairo-auditor/references/checklists/release-gate.md
  • skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md
  • skills/cairo-auditor/references/report-formatting.md
  • skills/cairo-auditor/README.md
  • skills/cairo-auditor/scripts/quality/escalation_plan.py
  • skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md
  • skills/cairo-auditor/workflows/deep.md
  • skills/cairo-auditor/tests/validate_component_resolution.py
  • skills/cairo-auditor/SKILL.md
  • skills/cairo-auditor/workflows/default.md
  • skills/cairo-auditor/references/judging.md
  • skills/cairo-auditor/scripts/quality/component_resolution.py
  • skills/cairo-auditor/agents/adversarial.md
  • skills/cairo-auditor/scripts/quality/surface_map.py
skills/**

⚙️ CodeRabbit configuration file

skills/**: Skills are composable agent capabilities. Each skill must have a clear interface contract, proper input validation, deterministic output, and graceful error handling. Flag any skill that makes unscoped network calls, stores state without cleanup, or lacks proper type definitions. Verify skill metadata matches implementation.

Files:

  • skills/cairo-auditor/references/checklists/release-gate.md
  • skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md
  • skills/cairo-auditor/references/report-formatting.md
  • skills/cairo-auditor/README.md
  • skills/cairo-auditor/scripts/quality/escalation_plan.py
  • skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md
  • skills/cairo-auditor/workflows/deep.md
  • skills/cairo-auditor/tests/validate_component_resolution.py
  • skills/cairo-auditor/SKILL.md
  • skills/cairo-auditor/workflows/default.md
  • skills/cairo-auditor/references/judging.md
  • skills/cairo-auditor/scripts/quality/component_resolution.py
  • skills/cairo-auditor/agents/adversarial.md
  • skills/cairo-auditor/scripts/quality/surface_map.py
evals/**

⚙️ CodeRabbit configuration file

evals/**: Evaluation code must be deterministic and reproducible. Flag any test that depends on external state, non-deterministic ordering, or hardcoded network endpoints. Verify that eval metrics are properly documented and assertions are meaningful.

Files:

  • evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md
  • evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md
  • evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.json
  • evals/README.md
  • evals/cases/cairo_auditor_realworld_benchmark.jsonl
scripts/**

⚙️ CodeRabbit configuration file

scripts/**: Flag any script that runs with elevated privileges without justification. Verify input sanitization in deployment scripts. Flag hardcoded credentials, URLs, or environment-specific assumptions.

Files:

  • scripts/quality/test_cairo_auditor_quality.py
  • scripts/quality/ab_default_vs_deep.py
  • scripts/quality/benchmark_cairo_auditor.py
  • scripts/quality/recall_eval_cairo_auditor.py
.github/workflows/**

📄 CodeRabbit inference engine (Custom checks)

.github/workflows/**: For changes in .github/workflows/, flag any new or expanded permissions
For changes in .github/workflows/
, flag unpinned action versions
For changes in .github/workflows/**, flag secrets referenced without need-to-know scope

Files:

  • .github/workflows/ci.yml
  • .github/workflows/cairo-skills-full-evals.yml

⚙️ CodeRabbit configuration file

.github/workflows/**: Ensure least-privilege permissions and safe CI execution patterns. Flag any workflow with write permissions that doesn't strictly need them. Verify secrets are not exposed in logs or artifacts. Dependency actions must use pinned SHA versions, not tags.

Files:

  • .github/workflows/ci.yml
  • .github/workflows/cairo-skills-full-evals.yml
skills/**/SKILL.md

📄 CodeRabbit inference engine (CLAUDE.md)

skills/**/SKILL.md: Skills YAML frontmatter must include: name, description, keywords, allowed-tools, user-invocable
Skill name format: lowercase, hyphens only, 1-64 characters
Skills should include code examples with starknet.js patterns
List error codes with recovery steps in skills

Files:

  • skills/cairo-auditor/SKILL.md
🪛 Ruff (0.15.15)
scripts/quality/test_cairo_auditor_quality.py

[warning] 16-16: Missing return type annotation for private function _load

(ANN202)


[warning] 18-18: Assertion should be broken down into multiple parts

Break down assertion into multiple parts

(PT018)


[warning] 29-29: Missing return type annotation for private function _finding

(ANN202)


[warning] 29-29: Missing type annotation for **kw

(ANN003)

scripts/quality/ab_default_vs_deep.py

[warning] 154-154: Use list.extend to create a transformed list

(PERF401)

skills/cairo-auditor/tests/validate_component_resolution.py

[warning] 58-58: Boolean-typed positional argument in function definition

(FBT001)

scripts/quality/benchmark_cairo_auditor.py

[warning] 16-16: Missing return type annotation for private function _load_component_resolution

(ANN202)


[warning] 40-40: Do not catch blind exception: Exception

(BLE001)


[warning] 53-53: Do not catch blind exception: Exception

(BLE001)


[warning] 62-62: Do not catch blind exception: Exception

(BLE001)

skills/cairo-auditor/scripts/quality/component_resolution.py

[warning] 62-62: Unnecessary assignment to text before return statement

Remove unnecessary assignment

(RET504)

scripts/quality/recall_eval_cairo_auditor.py

[warning] 109-109: for loop variable line overwritten by assignment target

(PLW2901)


[warning] 145-145: Do not catch blind exception: Exception

(BLE001)


[warning] 185-185: Too many statements (56 > 50)

(PLR0915)

skills/cairo-auditor/scripts/quality/surface_map.py

[warning] 403-403: Use list.extend to create a transformed list

(PERF401)


[warning] 409-409: Lambda may be unnecessary; consider inlining inner function

Inline function call

(PLW0108)

🔇 Additional comments (32)
.github/workflows/cairo-skills-full-evals.yml (1)

142-149: LGTM!

.github/workflows/ci.yml (1)

406-406: LGTM!

Also applies to: 418-418

evals/README.md (1)

38-82: LGTM!

evals/cases/cairo_auditor_realworld_benchmark.jsonl (1)

43-43: LGTM!

evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md (1)

3-3: LGTM!

Also applies to: 9-9, 19-19, 31-31, 85-85

evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.json (1)

1-64: LGTM!

evals/scorecards/v0.2.0-cairo-auditor-recall-coverage.md (1)

1-63: LGTM!

skills/cairo-auditor/README.md (1)

505-514: LGTM!

skills/cairo-auditor/references/checklists/release-gate.md (1)

7-7: LGTM!

skills/cairo-auditor/scripts/quality/escalation_plan.py (2)

94-104: LGTM!


67-70: ⚡ Quick win

Clarify “(scope)” sentinel handling and add scope validation for escalation_plan output

skills/cairo-auditor/scripts/quality/escalation_plan.py adds the literal "(scope)" to escalate_files when a borderline dropped_candidate lacks a usable candidate/file. skills/cairo-auditor/workflows/deep.md describes passing the resulting “file list” to Agent 5, but the repo doesn’t show any consumer logic that treats "(scope)" as a non-path sentinel (or that filters escalations to in-scope files). If the orchestrator treats escalate_files entries as concrete paths, this could mis-route escalations; if it’s intended to trigger a broad re-check, encode that intent explicitly (separate flag/sentinel with documented schema) and enforce in-scope validation.

skills/cairo-auditor/scripts/quality/surface_map.py (6)

10-15: LGTM!


209-209: LGTM!


217-223: LGTM!


451-461: LGTM!


330-369: ⚡ Quick win

Complexity gate threshold + adversarial_action semantics match the orchestrator contract.
surface_map.py hard-codes threshold = 3 and emits adversarial_action as full (score ≥ 3), downgrade (score 1–2), or skip (score == 0); the deep-mode orchestration in skills/cairo-auditor/workflows/deep.md and skills/cairo-auditor/SKILL.md describes the same Agent 5 behaviors, with --force-adversarial overriding the gate.


313-327: ⚡ Quick win

Relevance fallback partitions {1,2,3,4} are intentional for “no resolved signal” files.

surface_map.py’s fallback (“no resolved signal stays in all partitions so nothing is orphaned”) matches workflows/deep.md, which explicitly states that for relevance-routed large scopes, these files remain in all four partitions for Agents 1–4 (and Agent 5 reads the full in-scope set). The tradeoff is deliberate: it prevents orphaning even if it can dilute per-lens focus.

skills/cairo-auditor/SKILL.md (4)

94-94: LGTM!


232-242: LGTM!


246-306: ⚡ Quick win

Tighten the caching determinism claim; host caching support remains unproven.
The shared-prefix bash block itself has no timestamp/date/random/time() usage; the rg hit for {timestamp} is for the report filename (security-review-{timestamp}.md), not for $PREFIX generation. The caching benefit still depends on whether claude-code / codex actually reuse identical leading source prefixes, which isn’t established here; also confirm the IN_SCOPE="$WORKDIR/cairo-audit-files.txt" ordering is deterministic (e.g., sorted) since prefix byte-identity depends on that list.


337-337: ⚡ Quick win

Cost Controls: deep-mode matches for escalation + complexity gate; Vector corroboration needs traceability.

  • skills/cairo-auditor/workflows/deep.md documents adversarial_action under routing.json.complexity and includes escalation_plan.py --workdir {workdir} to list files surfaced a Wave A pass (before Agent 5).
  • The exact “Vector findings so far” corroboration block (and how it’s constructed from Agents 1–4 outputs / fed into Agent 5) isn’t present in the workflow docs matched by the check—confirm the source of that block and how [CROSS-AGENT] is produced.
skills/cairo-auditor/agents/adversarial.md (2)

19-26: LGTM!


41-48: LGTM!

skills/cairo-auditor/references/judging.md (3)

12-13: LGTM!


14-37: LGTM!


85-85: LGTM!

skills/cairo-auditor/references/report-formatting.md (1)

188-188: LGTM!

skills/cairo-auditor/references/vulnerability-db/CRITICAL-ADDRESS-INIT-WITHOUT-NONZERO-GUARD.md (1)

27-34: LGTM!

skills/cairo-auditor/references/vulnerability-db/IRREVOCABLE-ADMIN.md (1)

29-35: LGTM!

skills/cairo-auditor/workflows/deep.md (2)

26-84: LGTM!


48-63: Align routing/escalation docs with implementation

  • skills/cairo-auditor/scripts/quality/surface_map.py exposes --output-routing and writes the routing JSON; routing.complexity includes score, threshold (hardcoded 3), and adversarial_action (full when score >= threshold, skip when score == 0, otherwise downgrade), plus adversarial_recommended and reasons.
  • skills/cairo-auditor/scripts/quality/escalation_plan.py exposes --workdir and globs cairo-audit-agent-*-findings.json from that location.
skills/cairo-auditor/workflows/default.md (1)

8-10: LGTM!

Comment on lines +44 to +54
def _key(finding: dict[str, Any]) -> str:
rc = str(finding.get("root_cause", "")).strip().lower()
if rc:
return rc
return "|".join(
[
str(finding.get("class_id", "")).lower(),
str(finding.get("file", "")).lower(),
str(finding.get("title", "")).lower(),
]
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve duplicate findings when comparing the two runs.

default_by_key = {_key(f): f ...} and the deep equivalent overwrite earlier findings with the same key. Because _key() uses only root_cause when present, two distinct findings in different files/contracts can collapse into one, undercounting only_default/only_deep and potentially flipping the verdict. Compare multisets, or include stable location fields in the key even when root_cause exists.

Also applies to: 72-83

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/quality/ab_default_vs_deep.py` around lines 44 - 54, The current
dedup logic using _key and dict comprehensions (default_by_key / deep_by_key)
overwrites duplicate findings with the same key and undercounts differences;
change it to preserve duplicates by making the key include stable location
fields even when root_cause exists (e.g., include class_id/file/title/line or a
stable location tuple) or switch from dicts to multisets/lists (e.g., map key ->
list of findings or use collections.Counter with a deterministic serialized
finding) so that comparisons operate on counts of identical findings rather than
single entries; update usages that compute only_default/only_deep to consume the
multiset/list counts accordingly.

Comment on lines +32 to +42
if not module_path.is_file():
return None
spec = importlib.util.spec_from_file_location("cairo_auditor_component_resolution", module_path)
if spec is None or spec.loader is None:
return None
module = importlib.util.module_from_spec(spec)
try:
spec.loader.exec_module(module)
except Exception:
return None
return module

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail when the shared helper is present but broken.

If component_resolution.py exists but import/execution fails, _load_component_resolution() silently returns None, and later helper failures degrade to {}/False. That changes detector outcomes while letting the deterministic benchmark continue as if the shared logic were absent. Keep the fallback only for a genuinely missing helper; in-repo import/runtime errors should raise with context so CI catches the regression.

Also applies to: 48-63

🧰 Tools
🪛 Ruff (0.15.15)

[warning] 40-40: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/quality/benchmark_cairo_auditor.py` around lines 32 - 42, The current
_load_component_resolution function silently returns None on any import/runtime
error which masks broken in-repo helpers; change the behavior so that only a
genuinely missing file/spec causes a None fallback, but any exception thrown
during spec.loader.exec_module(module) is propagated (or re-raised) with
contextual information (include module_path and a short message) so CI fails
fast; apply the same change to the other loader block referenced at lines 48-63
(the alternate component_resolution import logic) so both loaders only swallow
absence but not execution/import errors.

Comment on lines +107 to +117
for fp in sorted(glob.glob(str(findings_dir / "*.findings.jsonl"))):
for line in Path(fp).read_text(encoding="utf-8").splitlines():
line = line.strip()
if not line:
continue
try:
row = json.loads(line)
except json.JSONDecodeError:
continue
if isinstance(row, dict) and row.get("finding_id"):
findings.append(row)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently drop broken findings or crashing detectors.

Malformed JSON rows are skipped, and detector exceptions are converted into misses. Both cases change the denominator/result without surfacing that the evaluation input or harness is broken, which makes the scorecard unreliable as a deterministic gate. Please raise with the offending file/finding/class context instead of continuing.

Also applies to: 141-146

🧰 Tools
🪛 Ruff (0.15.15)

[warning] 109-109: for loop variable line overwritten by assignment target

(PLW2901)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/quality/recall_eval_cairo_auditor.py` around lines 107 - 117,
Malformed JSON and detector exceptions are currently being swallowed in the loop
that reads findings (the block using findings_dir, Path(fp).read_text,
json.loads and findings.append), which changes the denominator silently;
instead, stop silently skipping: when json.loads fails or when a detector
throws, raise an informative exception (or aggregate and raise after scanning)
that includes the offending file path (fp), the raw line content (or line
number), and the parser/detector exception/class name so the CI fails loudly;
update the JSON parse except block to raise (or append to an errors list) with
fp and line, and ensure the detector-exception-to-miss handling (the similar
logic at lines ~141-146) is changed to surface the offending finding/detector
class and error rather than converting to a miss.

| Suite | Cases | Precision | Recall | Scorecard |
| --- | ---: | ---: | ---: | --- |
| Core deterministic | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-benchmark.md) |
| Real-world corpus | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the case count to match the scorecard.

The benchmark table shows 42 cases for the real-world corpus, but the v0.2.0 scorecard reports 43 cases after adding rw_irrevocable_admin_safe_accesscontrol_mixin_grant_role_03. Please update this line to reflect the current count.

📊 Proposed fix
-| Real-world corpus | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |
+| Real-world corpus | 43 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| Real-world corpus | 42 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |
| Real-world corpus | 43 | 1.000 | 1.000 | [v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/README.md` at line 503, Update the real-world corpus
case count in the README table: find the table row containing the text "|
Real-world corpus | 42 | 1.000 | 1.000 |
[v0.2.0-cairo-auditor-realworld-benchmark.md](../../evals/scorecards/v0.2.0-cairo-auditor-realworld-benchmark.md)
|" and change the "42" to "43" to match the v0.2.0 scorecard (which added
rw_irrevocable_admin_safe_accesscontrol_mixin_grant_role_03).

Comment on lines +88 to +97
nonzero_initializers = sorted(
{
match.group(1)
for match in re.finditer(r"\b([a-z_][a-z0-9_]*)\.initializer\s*\(", lower)
}
)
enforces_nonzero_on_init = bool(nonzero_initializers) and (ownable or access_control)

# Upgradeable component performs its own non-zero class-hash check.
upgrade_nonzero_guard = upgradeable and "upgradeable.upgrade" in lower

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict non-zero initializer guards to the OZ receiver(s).

resolve_components() currently records every <receiver>.initializer(...) call once any Ownable/AccessControl marker exists, and initializer_guards_param() later treats any matching param as guarded. That means a contract that embeds OwnableComponent but also calls something like foo.initializer(admin) will suppress CRITICAL_ADDRESS_INIT_WITHOUT_NONZERO_GUARD for admin even if foo does no non-zero validation. Please scope this to the actual Ownable/AccessControl storage receivers instead of any .initializer(...) call.

Also applies to: 121-127

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/scripts/quality/component_resolution.py` around lines 88
- 97, The code currently collects all "<receiver>.initializer(...)" occurrences
into nonzero_initializers and treats any as guarded when ownable or
access_control are present; change this so only initializer calls on the actual
Ownable/AccessControl storage receivers are considered. In resolve_components()
narrow the set comprehension that builds nonzero_initializers to include only
match.group(1) values that match the identified ownable/access_control receiver
names (e.g., the variables that signal OwnableComponent/AccessControlComponent),
and update initializer_guards_param() to check against that filtered set (not
just the booleans enforces_nonzero_on_init or any ".initializer" occurrence).
Also apply the same receiver-scoped filtering to the related logic around
upgrade_nonzero_guard (lines referenced 121-127) so only initializer calls on
the concrete upgradeable/ownable/access_control receivers suppress the
CRITICAL_ADDRESS_INIT_WITHOUT_NONZERO_GUARD.

Comment on lines +346 to +348
# Interaction followed by state mutation in the same file is the classic
# cross-function / CEI surface that pattern scanners miss — this is exactly
# what the adversarial pass exists to catch.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Clarify "CEI" acronym for maintainability.

The comment references "the classic cross-function / CEI surface" but doesn't expand the CEI acronym. In smart contract security, CEI typically stands for "Checks-Effects-Interactions," a well-known reentrancy mitigation pattern. Adding the expansion would improve readability for reviewers unfamiliar with the shorthand.

Clarify acronym
-    # Interaction followed by state mutation in the same file is the classic
-    # cross-function / CEI surface that pattern scanners miss — this is exactly
-    # what the adversarial pass exists to catch.
+    # Interaction followed by state mutation in the same file is the classic
+    # cross-function Checks-Effects-Interactions (CEI) violation that pattern
+    # scanners miss — this is exactly what the adversarial pass exists to catch.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Interaction followed by state mutation in the same file is the classic
# cross-function / CEI surface that pattern scanners miss — this is exactly
# what the adversarial pass exists to catch.
# Interaction followed by state mutation in the same file is the classic
# cross-function Checks-Effects-Interactions (CEI) violation that pattern
# scanners miss — this is exactly what the adversarial pass exists to catch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/scripts/quality/surface_map.py` around lines 346 - 348,
Update the inline comment that begins "Interaction followed by state mutation in
the same file is the classic cross-function / CEI surface..." to expand the CEI
acronym to "Checks-Effects-Interactions (CEI)" so readers unfamiliar with the
abbreviation understand the reference; keep the rest of the comment unchanged
and ensure the expanded form appears the first time CEI is used in that comment.

Comment on lines +302 to +306
For **large scopes** (any bundle > 1400 lines), prefer relevance-routed source
per `{workdir}/cairo-audit-routing.json` instead of the full shared prefix: give
agent N only the files listed under `partition_files["N"]`. See deep mode Cost
Controls → "Relevance-routed source".

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Document relevance-routed source mechanism or defer to deep.md.

The section references "relevance-routed source per cairo-audit-routing.json" and mentions giving "agent N only the files listed under partition_files[\"N\"]" for large scopes (lines 302-306). However, the concrete orchestration logic for how this routing is applied (i.e., how the orchestrator substitutes the full prefix with a filtered prefix per agent) isn't shown in this code block.

If workflows/deep.md documents the relevance-routing substitution logic, consider adding a forward reference here (e.g., "See deep.md Cost Controls → Relevance-routed source for substitution logic"). Otherwise, add a brief example showing how to build a per-agent filtered prefix using partition_files from the routing JSON.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/SKILL.md` around lines 302 - 306, The doc mentions
"relevance-routed source per `cairo-audit-routing.json`" and
`partition_files["N"]` but doesn't show how the orchestrator applies that
routing; either add a forward reference to the detailed substitution logic in
`workflows/deep.md` (e.g., "See deep.md Cost Controls → Relevance-routed source
for substitution logic") or include a short example in SKILL.md explaining how
to build a per-agent filtered prefix by reading `cairo-audit-routing.json`'s
`partition_files` entry and replacing the shared prefix with only those files
for agent N (reference `cairo-audit-routing.json`, `partition_files`, and the
orchestrator/agent prefix substitution step).

Comment on lines +62 to +63
`routing.json.complexity` carries `score`, `threshold` (3), and
`adversarial_action`:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Clarify whether complexity threshold is configurable.

The documentation specifies threshold (3) as a fixed value, but doesn't indicate whether users can adjust this via flag or configuration. If the threshold is intentionally fixed, consider adding a brief note explaining the rationale. If it should be configurable, document the mechanism.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/workflows/deep.md` around lines 62 - 63, Clarify whether
routing.json.complexity.threshold is configurable: if it is adjustable, update
the docs to state the configuration mechanism (e.g., which CLI flag or config
key to set the threshold, expected value range, and where to place it) and show
an example usage; if it is intentionally fixed at 3, add a short rationale line
near routing.json.complexity explaining why the threshold is fixed and that
adversarial_action behavior depends on that hard-coded value. Ensure you
reference routing.json.complexity, the threshold value, and adversarial_action
in the added sentence so readers can find the relevant behavior.

Comment on lines +14 to +16
- The shared cacheable bundle prefix (#5) applies here too: the four bundles
differ only in their trailing attack-vector partition, so the leading source +
reference block is reused across agents on prompt-caching hosts.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix broken reference to "(#5)".

The text references "The shared cacheable bundle prefix (#5)" but no numbered section #5 exists in this document. Replace with a proper cross-reference format.

📝 Proposed fix
-- The shared cacheable bundle prefix (`#5`) applies here too: the four bundles
+- The shared cacheable bundle prefix (see deep.md "Shared cached source prefix") applies here too: the four bundles
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- The shared cacheable bundle prefix (#5) applies here too: the four bundles
differ only in their trailing attack-vector partition, so the leading source +
reference block is reused across agents on prompt-caching hosts.
- The shared cacheable bundle prefix (see deep.md "Shared cached source prefix") applies here too: the four bundles
differ only in their trailing attack-vector partition, so the leading source +
reference block is reused across agents on prompt-caching hosts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/workflows/default.md` around lines 14 - 16, The sentence
referencing "The shared cacheable bundle prefix (`#5`)" uses a non-existent
numeric cross-reference; locate the line containing that exact phrase in
workflows/default.md and replace "(`#5`)" with a proper cross-reference such as a
section title or markdown link (e.g., "(see Shared cacheable bundle prefix)" or
a link to the relevant header) so the reference resolves correctly; ensure you
update the phrase "The shared cacheable bundle prefix (`#5`)" to read "The shared
cacheable bundle prefix (see Shared cacheable bundle prefix)" or similar,
preserving the rest of the sentence.

Comment on lines +17 to +19
- Default mode does **not** run the adversarial agent, so cross-partition chains
(root causes spanning two vector lenses) are weakly covered. Use `deep` mode
when boundary-spanning composition matters — see [deep.md](deep.md).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider clarifying "cross-partition chains" for first-time readers.

The terms "cross-partition chains" and "boundary-spanning composition" assume familiarity with the vector-partition structure (the 4 agents scanning different attack-vector categories). A brief parenthetical like "(root causes spanning two vector lenses, e.g., access control + reentrancy)" would help readers who start with default.md.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cairo-auditor/workflows/default.md` around lines 17 - 19, Update the
sentence that mentions "cross-partition chains" and "boundary-spanning
composition" to include a brief parenthetical example clarifying the terms for
new readers (e.g., "(root causes spanning two vector lenses, e.g., access
control + reentrancy)"). Locate the phrase "cross-partition chains" /
"boundary-spanning composition" in the default mode description and insert the
parenthetical immediately after "cross-partition chains" so the sentence reads
clearly and preserves the recommendation to use `deep` mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant