refactor(github-review): replace confidence scoring with evidence-based validation #628
Conversation
…ed validation

Removes confidence scores (0-100) from the PR review system in favor of evidence-based validation. This addresses the root cause of false positives: reviews reporting issues they couldn't prove with actual code.

Key changes:
- Remove confidence field from PRReviewFinding and pydantic models
- Add required 'evidence' field for code snippets proving issues exist
- Deprecate confidence.py module with migration guidance
- Update response_parsers.py to filter by evidence presence, not score
- Add "NEVER ASSUME - ALWAYS VERIFY" sections to all reviewer prompts
- Update finding-validator to use binary evidence verification
- Update tests for new evidence-based validation model

The new model is simple: either you can show the problematic code, or you don't report the finding. No more "80% confident" hedging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
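To make the shift concrete, here is a minimal sketch of what an evidence-first finding model and filter could look like. The `evidence` field and the "no evidence, no finding" rule come from the PR description; the class name, the other fields, and the helper are illustrative, not the repository's actual code.

```python
from pydantic import BaseModel, Field


class Finding(BaseModel):
    """Illustrative evidence-first finding (field set is a sketch)."""

    file: str
    line: int
    severity: str
    description: str
    # Replaces the old 0-100 confidence score: a copy-pasted snippet
    # proving the issue exists, or None if the reviewer has no proof.
    evidence: str | None = Field(
        None, description="Actual code snippet proving the issue exists."
    )


def keep_finding(finding: Finding, min_evidence_chars: int = 20) -> bool:
    """Binary gate: either the finding shows the problematic code, or it is dropped."""
    return finding.evidence is not None and len(finding.evidence.strip()) >= min_evidence_chars
```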
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Migration from confidence-score-based review to an evidence-first validation process: prompts, schemas, parsers, tests, and docs now require concrete code evidence and boolean verification flags; the confidence module is deprecated and some reviewer/orchestrator enums updated.
Sequence Diagram(s): none generated. Changes are primarily schema/prompts/tests and do not introduce a new multi-component control flow that requires visualization.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
📜 Recent review details

Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro

📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
🧰 Additional context used

📓 Path-based instructions (1): apps/backend/**/*.py — 📄 CodeRabbit inference engine (CLAUDE.md)

⚙️ CodeRabbit configuration file
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (2)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Summary of Changes

Hello @AndyMik90, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally refactors the PR review process by moving from a subjective confidence scoring system to an objective, evidence-based validation approach. The core principle is that any reported issue must be verifiable with actual code evidence, eliminating speculative or hallucinated findings. This change aims to significantly reduce false positives and improve the reliability and trustworthiness of the review system by ensuring that all reported issues are grounded in concrete code examples.
Updated the find_worktree function to first check the new worktree path at .auto-claude/worktrees/tasks/ for task directories. Added a fallback to check the legacy path at .worktrees/ for backwards compatibility, ensuring that existing workflows are not disrupted. This change improves the robustness of worktree handling in the project.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Code Review
This pull request introduces a significant and well-executed refactoring, replacing the subjective confidence scoring system with a more robust, binary evidence-based validation approach. The changes are comprehensive, touching prompts, Pydantic models, data classes, response parsers, and tests to consistently enforce the new 'No evidence = No finding' principle. The deprecation of the old confidence module is handled excellently with clear warnings and documentation. The updated prompts provide strong, clear guidance to the AI agents. The tests have been thoroughly updated to reflect the new data models and logic. Overall, this is a high-quality change that will improve the reliability of review findings. I have one minor suggestion to improve the clarity of a field description in a Pydantic model, but otherwise, the implementation is outstanding.
```python
fixable: bool = Field(False, description="Whether this can be auto-fixed")
evidence: str | None = Field(
    None,
    description="Actual code snippet proving the issue exists. Required for validation.",
)
```
The description for the evidence field says "Required for validation", which could be slightly misleading since the field is optional in the model (str | None). To improve clarity and accurately reflect that the requirement is enforced by process rather than by this model's validation, consider rephrasing the description.
This change would apply to all models that inherit this field from BaseFinding, making the schema more self-documenting and clearer for developers.
| description="Actual code snippet proving the issue exists. Required for validation.", | |
| description="A code snippet that proves the issue exists. This is required by prompts and downstream validation; findings without it may be filtered.", |
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/backend/runners/github/confidence.py (1)
2-30: Excellent deprecation documentation. The updated docstring clearly communicates the deprecation, explains the new evidence-based approach, and provides concrete migration guidance. The structure is well-organized and actionable.
Consider adding a removal timeline.
While the deprecation is well-documented, consider adding a target version or timeline for removal (e.g., "will be removed in version 2.0" or "will be removed after Q2 2026") to help downstream code plan their migration.
Optional: Add removal timeline
""" -DEPRECATED: Review Confidence Scoring +DEPRECATED (removal planned for v2.0): Review Confidence Scoring ===================================== -This module is DEPRECATED and will be removed in a future version. +This module is DEPRECATED and will be removed in version 2.0.apps/backend/prompts/github/pr_followup_orchestrator.md (1)
178-194: JSON example still contains the deprecated `confidence` field. The example shows `"confidence": 0.92` and `"confidence": 0.88` for finding validations, but the PR removes confidence scoring in favor of `evidence_verified_in_file`. This example is inconsistent with the updated `FindingValidationResult` schema in `pydantic_models.py`.

🔎 Proposed fix
{ "finding_id": "SEC-001", "validation_status": "confirmed_valid", "code_evidence": "const query = `SELECT * FROM users WHERE id = ${userId}`;", "line_range": [45, 45], "explanation": "SQL injection is present - user input is concatenated...", - "confidence": 0.92 + "evidence_verified_in_file": true }, { "finding_id": "QUAL-002", "validation_status": "dismissed_false_positive", "code_evidence": "const sanitized = DOMPurify.sanitize(data);", "line_range": [23, 26], "explanation": "Original finding claimed XSS but code uses DOMPurify...", - "confidence": 0.88 + "evidence_verified_in_file": true }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
- apps/backend/prompts/github/pr_finding_validator.md
- apps/backend/prompts/github/pr_followup.md
- apps/backend/prompts/github/pr_followup_newcode_agent.md
- apps/backend/prompts/github/pr_followup_orchestrator.md
- apps/backend/prompts/github/pr_followup_resolution_agent.md
- apps/backend/prompts/github/pr_reviewer.md
- apps/backend/runners/github/confidence.py
- apps/backend/runners/github/models.py
- apps/backend/runners/github/services/pydantic_models.py
- apps/backend/runners/github/services/response_parsers.py
- tests/test_finding_validation.py
- tests/test_structured_outputs.py
🧰 Additional context used
📓 Path-based instructions (3)
apps/backend/prompts/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Agent prompts should be stored as markdown files in `apps/backend/prompts/` with descriptive names matching their purpose
Files:
- apps/backend/prompts/github/pr_followup_orchestrator.md
- apps/backend/prompts/github/pr_followup_resolution_agent.md
- apps/backend/prompts/github/pr_followup_newcode_agent.md
- apps/backend/prompts/github/pr_followup.md
- apps/backend/prompts/github/pr_finding_validator.md
- apps/backend/prompts/github/pr_reviewer.md
apps/backend/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
apps/backend/**/*.py: Always use the Claude Agent SDK (`claude-agent-sdk` package) for all AI interactions, never use the Anthropic API directly.
Use the `create_client()` function from `apps/backend/core/client.py` to instantiate Claude SDK clients, not direct `ClaudeSDKClient` initialization.
Files:
- apps/backend/runners/github/models.py
- apps/backend/runners/github/confidence.py
- apps/backend/runners/github/services/response_parsers.py
- apps/backend/runners/github/services/pydantic_models.py
⚙️ CodeRabbit configuration file
apps/backend/**/*.py: Focus on Python best practices, type hints, and async patterns.
Check for proper error handling and security considerations.
Verify compatibility with Python 3.12+.
Files:
- apps/backend/runners/github/models.py
- apps/backend/runners/github/confidence.py
- apps/backend/runners/github/services/response_parsers.py
- apps/backend/runners/github/services/pydantic_models.py
tests/**
⚙️ CodeRabbit configuration file
tests/**: Ensure tests are comprehensive and follow pytest conventions.
Check for proper mocking and test isolation.
Files:
- tests/test_finding_validation.py
- tests/test_structured_outputs.py
🧬 Code graph analysis (3)
tests/test_finding_validation.py (1)
apps/backend/runners/github/services/pydantic_models.py (1)
`FindingValidationResult` (584-630)
tests/test_structured_outputs.py (1)
apps/backend/runners/github/services/pydantic_models.py (2)
`OrchestratorFinding` (296-322), `DeepAnalysisFinding` (74-87)
apps/backend/runners/github/services/response_parsers.py (2)
apps/backend/runners/github/models.py (1)
`PRReviewFinding` (204-274)

apps/frontend/src/preload/api/modules/github-api.ts (1)

`PRReviewFinding` (295-306)
🪛 Gitleaks (8.30.0)
apps/backend/prompts/github/pr_followup.md
[high] 146-146: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
tests/test_structured_outputs.py
[high] 271-271: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 markdownlint-cli2 (0.18.1)
apps/backend/prompts/github/pr_followup_resolution_agent.md
54-54: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
60-60: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
66-66: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
apps/backend/prompts/github/pr_finding_validator.md
82-82: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
89-89: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
97-97: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
apps/backend/prompts/github/pr_reviewer.md
19-19: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: CodeQL (javascript-typescript)
- GitHub Check: CodeQL (python)
🔇 Additional comments (39)
apps/backend/prompts/github/pr_reviewer.md (6)
17-25: Evidence requirements section is clear and well-structured. This section effectively establishes the core principle of evidence-based validation with unambiguous language and actionable requirements. The inclusion of the decision test ("Can I prove this with actual code from the file?") provides reviewers with a clear decision framework.
27-49: "NEVER ASSUME - ALWAYS VERIFY" section is comprehensive and practical.This section effectively translates the evidence-based principle into procedural guidance with concrete rules, verification steps, and real-world false positive examples. The inclusion of specific mistakes (hallucination on line numbers, missing validation context, framework misidentification) demonstrates understanding of actual review challenges. The structure—rules → verification steps → common mistakes—provides clear decision-making guidance.
236-282: JSON output examples properly demonstrate the evidence field requirements. All three example findings include concrete evidence fields with actual code snippets, effectively demonstrating the evidence-based validation approach. The evidence varies appropriately by finding type (SQL injection query, endpoint handler, function signature) and supports the described issues. The examples serve as clear implementation guidance for parsers and validators.
285-308: Evidence field correctly marked as REQUIRED with a clear definition. The evidence field is properly categorized as required (line 301) with an explicit definition emphasizing copy-pasted code. This consistent reinforcement across multiple sections (Evidence Requirements, Field Definitions, Example Finding) effectively communicates the non-negotiable nature of evidence for all findings.
310-321: Guidelines reinforce evidence requirements prominently. Guideline 6 correctly positions evidence as a critical quality criterion, using language consistent with the core principle ("no code, no finding"). The placement among the top quality guidelines appropriately signals importance to reviewers.
323-352: Example finding and important notes effectively reinforce the evidence-based approach. The JWT secret hardcoding example (lines 332-352) demonstrates an evidence-based finding with realistic code, proportionate fix guidance, and proper field population. The example serves as a capstone to earlier guidance, tying together all requirements in a complete, actionable finding.
apps/backend/prompts/github/pr_followup_resolution_agent.md (2)
51-71: Evidence and verification sections align well with PR objectives. The new "Provide Evidence" (lines 51–56), "NEVER ASSUME - ALWAYS VERIFY" (lines 57–71), and "Evidence Requirements" (lines 102–107) sections clearly establish the shift from confidence-based to evidence-based validation. The explicit directives ("Read the actual code," "Verify the pattern still exists," etc.) and the principle "No evidence = No finding" are well-expressed and actionable.
113-134: JSON examples correctly reflect the evidence-based schema. The output examples show the `evidence` field (e.g., line 118: `"evidence": "cursor.execute(...)"`) and `resolution_notes` replacing the previous confidence field. The code snippets in `evidence` (lines 118, 124, 130–131) are concrete and verifiable, consistent with the PR's core principle.

apps/backend/prompts/github/pr_followup_newcode_agent.md (3)
94-117: Evidence and verification requirements are clear and consistent across prompts. The "NEVER ASSUME - ALWAYS VERIFY" section (lines 94–108) and "Evidence Requirements" (lines 109–116) enforce the same evidence-first principle as pr_followup_resolution_agent.md. The statement "No evidence = No finding" (line 116) is a strong, clear mandate that aligns with the PR's core objective.
122-154: JSON output format correctly uses the evidence field. The examples at lines 122–154 properly demonstrate the `evidence` field with actual code snippets (e.g., lines 133, 147) and include context (file, line number, category, severity). The format is consistent with the schema changes described in the PR objectives.
146-146: Gitleaks warning is a false positive; the example is intentional. The static analysis tool flagged line 146 as exposing an API key. This is expected: the line intentionally demonstrates a vulnerable pattern that the agent should catch and flag. No action needed.
apps/backend/prompts/github/pr_followup.md (3)
74-79: NEVER ASSUME directive is present but less detailed than in other follow-up prompts. Lines 74–79 introduce the NEVER ASSUME - ALWAYS VERIFY principle but with fewer explicit checks than pr_followup_resolution_agent.md and pr_followup_newcode_agent.md. Consider aligning the level of detail: this is the orchestrating follow-up prompt that spawns specialized agents, so the brevity is acceptable, but ensure it adequately emphasizes reading code and verifying exact lines before reporting.
Consider whether the directive needs more procedural detail (e.g., "Actually READ the code at the file/line you cite" and "Verify there's no validation in surrounding code"), similar to the specialized agent prompts, to maintain consistency.
146-146: Evidence field documentation and examples are consistent with the schema changes. Line 184 explicitly marks evidence as REQUIRED in the new_findings schema, and the example at line 146 shows the evidence field populated with actual code (`const API_KEY = 'sk-prod-abc123xyz789';`). This aligns with the PR's evidence-based validation objective.

Also applies to: 184-184
146-146: Gitleaks warning is a false positive; the example is intentional. The static analysis tool flagged line 146 as exposing an API key. This is expected: the line intentionally demonstrates a hardcoded secret that the reviewer should identify and flag. No action needed.
apps/backend/prompts/github/pr_finding_validator.md (3)
3-5: Clear articulation of the evidence-based validation principle. The core principle "Evidence, not confidence scores" and the binary validation model are well-defined. This provides clear guidance for the AI agent to produce consistent, verifiable findings.
48-70: Strong anti-hallucination guidance. The "NEVER ASSUME" block with explicit verification questions and the list of behaviors to avoid is excellent for reducing false positives. This directly addresses the PR objective of preventing unsubstantiated findings.
152-167: Well-structured evidence guidelines table. The decision rules clearly map evidence scenarios to validation statuses. The binary logic (evidence proves/disproves, or escalate) aligns with the PR's goal of eliminating ambiguous confidence scores.
apps/backend/prompts/github/pr_followup_orchestrator.md (1)
208-226: Consistent verification guidance across prompts. The "NEVER ASSUME - ALWAYS VERIFY" block mirrors the guidance in pr_finding_validator.md, ensuring consistent behavior across the agent system.
apps/backend/runners/github/services/pydantic_models.py (5)
49-52: Evidence field added to BaseFinding. The optional `evidence` field with a clear description aligns with the PR's evidence-based validation approach. This field propagates to all finding subclasses.
319-322: Consistent evidence field across orchestrator findings. Both `OrchestratorFinding` and `ParallelOrchestratorFinding` now use the `evidence` field instead of `confidence`, maintaining consistency across the finding models.

Also applies to: 395-398
461-464: ResolutionVerification requires non-empty evidence. The `min_length=1` constraint ensures evidence is always provided, which is appropriate since resolution verification should always include proof of the resolution status.
584-630: FindingValidationResult schema properly implements evidence-based validation. The model includes:

- `evidence_verified_in_file: bool` replacing confidence scores
- `code_evidence` with `min_length=1`, ensuring non-empty proof
- `explanation` with `min_length=20` for meaningful context
- A clear docstring explaining the binary validation principle

This is well-designed for the PR's objectives.
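As a sketch, the schema described above might be assembled like this (assuming Pydantic v2); field names, constraints, and status values come from the review comments, while the batch wrapper is illustrative. It also shows the forward-reference pattern the next comment verifies:

```python
from __future__ import annotations  # annotations are evaluated lazily

from typing import Literal

from pydantic import BaseModel, Field


class ValidationBatch(BaseModel):
    # Forward reference: FindingValidationResult is defined further down,
    # mirroring how the real module uses it before its definition.
    validations: list[FindingValidationResult]


class FindingValidationResult(BaseModel):
    """Binary, evidence-based validation of a previously reported finding."""

    finding_id: str
    validation_status: Literal[
        "confirmed_valid", "dismissed_false_positive", "needs_human_review"
    ]
    evidence_verified_in_file: bool           # boolean replaces the 0.0-1.0 score
    code_evidence: str = Field(min_length=1)  # non-empty proof
    explanation: str = Field(min_length=20)   # meaningful context


# Resolve the forward reference once both classes exist.
ValidationBatch.model_rebuild()
```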
540-548: Forward reference to FindingValidationResult is valid. The `from __future__ import annotations` import (line 25) enables forward references, so using `FindingValidationResult` at line 542 before its definition at line 584 is valid in Python 3.12+ (the pattern is illustrated in the sketch above).

tests/test_structured_outputs.py (4)
225-240: Test updated to verify the evidence field. The test properly validates the new evidence field with assertions checking for a non-null value and expected content.
242-252: Evidence optionality test replaces confidence normalization. Good test coverage for the optional nature of the evidence field, ensuring the model accepts findings without evidence when appropriate.
268-272: Gitleaks false positive - intentional test data. The static analysis flagged `API_KEY = 'sk-prod-12345abcdef'` as a potential API key leak. This is intentional test data demonstrating a "hardcoded secret" security finding scenario, not an actual leaked credential. The test correctly exercises the security finding model.
381-394: DeepAnalysisFinding evidence test. The test correctly validates that the evidence field works on DeepAnalysisFinding (which inherits from BaseFinding). The test data demonstrates a concrete code evidence pattern.
apps/backend/runners/github/models.py (4)
217-222: Evidence field replaces confidence in PRReviewFinding. The evidence field with clear documentation ("Actual code snippet showing the issue") properly implements the evidence-based validation model. The removal of the confidence field aligns with the PR objectives.
243-246: Serialization includes the evidence field. The `to_dict` method correctly serializes the new evidence field while maintaining all existing validation fields.
266-268: Deserialization loads the evidence field. The `from_dict` method correctly loads the evidence field using `data.get("evidence")`, maintaining backward compatibility with existing data that may not have this field.
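A minimal sketch of that tolerant round-trip; only the `data.get("evidence")` behavior is taken from the review, the dataclass subset is illustrative:

```python
from dataclasses import asdict, dataclass


@dataclass
class PRReviewFinding:  # illustrative subset of the real dataclass
    file: str
    line: int
    description: str
    evidence: str | None = None

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "PRReviewFinding":
        return cls(
            file=data["file"],
            line=data["line"],
            description=data["description"],
            # .get() keeps older payloads without an "evidence" key loadable
            evidence=data.get("evidence"),
        )
```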
44-45: New review categories for verification and redundancy. The `VERIFICATION_FAILED` and `REDUNDANCY` categories support the enhanced evidence-based review workflow, where unverifiable findings and duplicate code detection are distinct categories.
36-38: Evidence length threshold defined. `MIN_EVIDENCE_LENGTH = 20` provides a reasonable threshold for valid evidence. This matches the `min_length=20` constraint on the explanation field in `FindingValidationResult`, maintaining consistency.
68-75: Evidence-based parsing with clear documentation. The `require_evidence` parameter with default `True` enforces evidence validation by default while allowing flexibility. The docstring clearly explains the evidence-based validation approach.
85-95: Evidence extraction and filtering logic. The code correctly (see the sketch after this list):

- Extracts evidence from either `evidence` or `code_snippet` fields (backward compatibility)
- Filters findings with insufficient evidence when `require_evidence=True`
- Logs dropped findings with a clear reason (character count vs threshold)
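A condensed sketch of that flow, assuming dict-shaped findings; `MIN_EVIDENCE_LENGTH` and the `evidence`/`code_snippet` fallback are from the review, the function itself is illustrative:

```python
import logging

logger = logging.getLogger(__name__)

MIN_EVIDENCE_LENGTH = 20  # threshold named in the review


def filter_by_evidence(raw_findings: list[dict], require_evidence: bool = True) -> list[dict]:
    """Drop findings whose evidence is missing or too short to verify."""
    kept: list[dict] = []
    for raw in raw_findings:
        # Backward compatibility: accept either field name.
        evidence = (raw.get("evidence") or raw.get("code_snippet") or "").strip()
        if require_evidence and len(evidence) < MIN_EVIDENCE_LENGTH:
            logger.info(
                "Dropping finding %s: evidence is %d chars, %d required",
                raw.get("id", "<unknown>"), len(evidence), MIN_EVIDENCE_LENGTH,
            )
            continue
        kept.append(raw)
    return kept
```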
113-116: PRReviewFinding constructed with the evidence field. The evidence field is correctly populated (or set to `None` if empty), replacing the previous confidence field. This completes the parser's transition to evidence-based validation.

tests/test_finding_validation.py (4)
7-10: Clear documentation of the evidence-based validation change. The updated module docstring explains the shift from confidence scores to `evidence_verified_in_file`, helping future maintainers understand the validation model.
95-106: New test case for hallucinated findings. The `test_hallucinated_finding_not_verified` test is crucial for validating the scenario where original findings referenced non-existent code. This ensures the `evidence_verified_in_file=False` case is properly handled.
136-148: Required field validation test. The test correctly verifies that `evidence_verified_in_file` is a required field and that omitting it raises a `ValidationError`. This ensures the schema enforces evidence verification.
400-412: All validation status values tested. The test iterates through all valid status values (`confirmed_valid`, `dismissed_false_positive`, `needs_human_review`) and verifies each can be used with the new evidence-based model.
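Condensed, the two checks described above could look like this (a sketch reusing the `FindingValidationResult` shape sketched earlier; test names and payload values are illustrative):

```python
import pytest
from pydantic import ValidationError

VALID_STATUSES = ["confirmed_valid", "dismissed_false_positive", "needs_human_review"]


def test_evidence_verified_in_file_is_required():
    with pytest.raises(ValidationError):
        FindingValidationResult(
            finding_id="SEC-001",
            validation_status="confirmed_valid",
            code_evidence='query = f"SELECT * FROM users WHERE id = {user_id}"',
            explanation="User input is interpolated directly into the SQL string.",
            # evidence_verified_in_file intentionally omitted
        )


@pytest.mark.parametrize("status", VALID_STATUSES)
def test_all_validation_statuses_accepted(status):
    result = FindingValidationResult(
        finding_id="QUAL-002",
        validation_status=status,
        evidence_verified_in_file=True,
        code_evidence="sanitized = DOMPurify.sanitize(data)",
        explanation="Evidence shows the input is sanitized before rendering.",
    )
    assert result.validation_status == status
```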
```diff
 ### `confirmed_valid`
-Use when you verify the issue IS real:
+Use when your code evidence PROVES the issue IS real:
 - The problematic code pattern exists exactly as described
-- The vulnerability/bug is present and exploitable
+- You can point to the specific lines showing the vulnerability/bug
 - The code quality issue genuinely impacts the codebase
+- **Key question**: Does your code_evidence field contain the actual problematic code?

 ### `dismissed_false_positive`
-Use when you verify the issue does NOT exist:
-- The described code pattern is not actually present
-- The original finding misunderstood the code
-- There is mitigating code that prevents the issue (e.g., input validation elsewhere)
-- The finding was based on incorrect assumptions
+Use when your code evidence PROVES the issue does NOT exist:
+- The described code pattern is not actually present (code_evidence shows different code)
+- There is mitigating code that prevents the issue (code_evidence shows the mitigation)
+- The finding was based on incorrect assumptions (code_evidence shows reality)
+- The line number doesn't exist or contains different code than claimed
+- **Key question**: Does your code_evidence field show code that disproves the original finding?

 ### `needs_human_review`
-Use when you cannot determine with confidence:
-- The issue requires runtime analysis to verify
+Use when you CANNOT find definitive evidence either way:
+- The issue requires runtime analysis to verify (static code doesn't prove/disprove)
 - The code is too complex to analyze statically
-- You have conflicting evidence
-- Your confidence is below 0.70
+- You found the code but can't determine if it's actually a problem
+- **Key question**: Is your code_evidence inconclusive?
```
🧹 Nitpick | 🔵 Trivial
Minor markdown formatting: missing blank lines before headings.
The static analysis tool flagged missing blank lines before headings at lines 82, 89, and 97. This is a minor formatting issue.
🔎 Proposed fix
```diff
 ### `confirmed_valid`
+
 Use when your code evidence PROVES the issue IS real:
 - The problematic code pattern exists exactly as described
 ...
 ### `dismissed_false_positive`
+
 Use when your code evidence PROVES the issue does NOT exist:
 ...
 ### `needs_human_review`
+
 Use when you CANNOT find definitive evidence either way:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### `confirmed_valid`
Use when your code evidence PROVES the issue IS real:
- The problematic code pattern exists exactly as described
- You can point to the specific lines showing the vulnerability/bug
- The code quality issue genuinely impacts the codebase
- **Key question**: Does your code_evidence field contain the actual problematic code?

### `dismissed_false_positive`
Use when your code evidence PROVES the issue does NOT exist:
- The described code pattern is not actually present (code_evidence shows different code)
- There is mitigating code that prevents the issue (code_evidence shows the mitigation)
- The finding was based on incorrect assumptions (code_evidence shows reality)
- The line number doesn't exist or contains different code than claimed
- **Key question**: Does your code_evidence field show code that disproves the original finding?

### `needs_human_review`
Use when you CANNOT find definitive evidence either way:
- The issue requires runtime analysis to verify (static code doesn't prove/disprove)
- The code is too complex to analyze statically
- You found the code but can't determine if it's actually a problem
- **Key question**: Is your code_evidence inconclusive?
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
82-82: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
89-89: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
97-97: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🤖 Prompt for AI Agents
In apps/backend/prompts/github/pr_finding_validator.md around lines 82 to 102,
there are headings at lines 82, 89, and 97 missing the required blank lines
before them; fix by inserting a single blank line immediately before each of
those heading lines so each heading is separated from the preceding paragraph or
list item, then save and optionally run your markdown linter (e.g.,
markdownlint) to confirm formatting.
```markdown
- **Show what changed** - before vs after
- **Explain WHY** this proves resolution/non-resolution
```
Fix markdown formatting: headings require blank lines.
Headings at lines 54, 60, and 66 are missing blank lines above them, violating MD022 style rules. Add a blank line before each heading.
🔎 Proposed fix
```diff
-### 4. Provide Evidence
+
+### 4. Provide Evidence
```

```diff
-## NEVER ASSUME - ALWAYS VERIFY
+
+## NEVER ASSUME - ALWAYS VERIFY
```

```diff
-**You MUST:**
+
+**You MUST:**
```

Also applies to: 60-64, 66-70
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
54-54: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🤖 Prompt for AI Agents
In apps/backend/prompts/github/pr_followup_resolution_agent.md around lines
54-56 (and similarly for ranges 60-64 and 66-70), several Markdown headings are
missing a blank line above them (violating MD022); fix by inserting a single
blank line immediately before each heading at those line ranges so every heading
is separated from the preceding content by an empty line.
```diff
 ## Evidence Requirements

-**CRITICAL: Quality over quantity**
+**CRITICAL: No evidence = No finding**
```
Fix markdown formatting: use heading syntax instead of emphasis.
Line 19 uses bold emphasis (`**CRITICAL: No evidence = No finding**`) where a proper heading would be more appropriate and would comply with markdown linting rules.
🔎 Proposed fix
```diff
-**CRITICAL: No evidence = No finding**
+### CRITICAL: No evidence = No finding
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### CRITICAL: No evidence = No finding
```
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
19-19: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In apps/backend/prompts/github/pr_reviewer.md around line 19, the text currently
uses bold emphasis ("**CRITICAL: No evidence = No finding**") but should be a
Markdown heading; replace the bold markup with an appropriate heading level
(e.g., prepend one or more '#' characters) so the line becomes a proper heading
(for example "# CRITICAL: No evidence = No finding" or "## CRITICAL: No evidence
= No finding") to satisfy markdown linting and improve document structure.
```python
import warnings

warnings.warn(
    "The confidence module is deprecated. Use evidence-based validation instead. "
    "See models.py 'evidence' field and pr_finding_validator.md for the new approach.",
    DeprecationWarning,
    stacklevel=2,
)
```
🧩 Analysis chain
🏁 Script executed:

```sh
# First, verify the file exists and show the specified lines
cat -n apps/backend/runners/github/confidence.py | head -50
```

Repository: AndyMik90/Auto-Claude
Length of output: 2127

🏁 Script executed:

```sh
# Search for imports of the confidence module
rg -n --type=py 'from.*confidence import|import.*confidence' --max-count=20
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Search for usage of confidence-related classes/functions
rg -n --type=py '\b(ConfidenceScorer|ScoredFinding|ConfidenceLevel|score_finding|confidence\.)\b' --max-count=30
```

Repository: AndyMik90/Auto-Claude
Length of output: 1904

🏁 Script executed:

```sh
# Check if models.py and pr_finding_validator.md exist as referenced in the warning message
find . -type f -name "models.py" -o -name "pr_finding_validator.md" | grep -v ".git"
```

Repository: AndyMik90/Auto-Claude
Length of output: 668

🏁 Script executed:

```sh
# Search for pr_finding_validator.md
find . -type f -name "*pr_finding_validator*" -o -name "*finding_validator*" | grep -v ".git"
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# More comprehensive search for confidence module imports with different patterns
rg -n --type=py "confidence" apps/backend/spec/critique.py -B2 -A2
```

Repository: AndyMik90/Auto-Claude
Length of output: 816

🏁 Script executed:

```sh
# Check if confidence module is imported dynamically or through __all__
rg -n --type=py "getattr|importlib|__all__.*confidence|from \. import" apps/backend/runners/github/ | head -20
```

Repository: AndyMik90/Auto-Claude
Length of output: 2254

🏁 Script executed:

```sh
# Verify models.py files contain the 'evidence' field mentioned in the warning
rg -n "evidence.*:" apps/backend/runners/ai_analyzer/models.py apps/backend/spec/validate_pkg/models.py | head -10
```

Repository: AndyMik90/Auto-Claude
Length of output: 47
Fix deprecation warning message: referenced documentation and fields do not exist.
The warning implementation uses correct Python deprecation mechanics (`warnings.warn()`, `DeprecationWarning`, `stacklevel=2`), but the warning message directs developers to non-existent resources:
- pr_finding_validator.md - This documentation file does not exist in the repository
- models.py 'evidence' field - No 'evidence' field found in the referenced model files
Update the warning message to either:
- Reference actual documentation and fields that exist in the codebase, or
- Remove specific file/field references and provide more general guidance on the migration path
Additionally, since no imports of this module were found in active code (only internal class definitions), the transition period may already be complete or the module is unused. Consider verifying whether this deprecation warning is still necessary.
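One way to reword it while keeping the mechanics intact; the exact doc or field to cite should be whatever actually exists in the repo, so the references below are placeholders:

```python
import warnings

warnings.warn(
    "The confidence module is deprecated and will be removed in a future release. "
    "Migrate to evidence-based validation (see the evidence field on the PR review "
    "finding models).",  # placeholder pointer; cite a path that actually exists
    DeprecationWarning,
    stacklevel=2,  # attributes the warning to the importer, not this module
)
```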
🤖 Prompt for AI Agents
In apps/backend/runners/github/confidence.py around lines 34 to 41, the
DeprecationWarning message points to non-existent resources; update the warning
to reference actual, existing migration guidance (or remove specific file/field
names). Replace the current message with either a concise general migration note
(e.g., "This module is deprecated; migrate to the evidence-based validation used
elsewhere — see <actual-doc-or-path> or contact the backend team for migration
steps.") or point to the correct existing doc/field if one exists, and verify
whether the module is imported anywhere; if it is unused, consider removing the
warning entirely. Ensure the warning still uses warnings.warn(...,
DeprecationWarning, stacklevel=2).
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/backend/merge/git_utils.py (1)
20-28: Update docstring to document the two-path fallback behavior. The docstring should explain that the function checks the new location (`.auto-claude/worktrees/tasks/`) first and falls back to the legacy location (`.worktrees/`) for backward compatibility.

🔎 Proposed docstring enhancement
```diff
 def find_worktree(project_dir: Path, task_id: str) -> Path | None:
     """
     Find the worktree path for a task.
+
+    Searches for the task worktree in two locations:
+    1. New location: .auto-claude/worktrees/tasks/
+    2. Legacy location: .worktrees/ (for backward compatibility)

     Args:
         project_dir: The project root directory
         task_id: The task identifier

     Returns:
         Path to the worktree, or None if not found
     """
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- apps/backend/merge/git_utils.py
- apps/frontend/src/renderer/components/terminal/usePtyProcess.ts
🧰 Additional context used
📓 Path-based instructions (4)
apps/backend/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
apps/backend/**/*.py: Always use the Claude Agent SDK (`claude-agent-sdk` package) for all AI interactions, never use the Anthropic API directly.
Use the `create_client()` function from `apps/backend/core/client.py` to instantiate Claude SDK clients, not direct `ClaudeSDKClient` initialization.
Files:
apps/backend/merge/git_utils.py
⚙️ CodeRabbit configuration file
apps/backend/**/*.py: Focus on Python best practices, type hints, and async patterns.
Check for proper error handling and security considerations.
Verify compatibility with Python 3.12+.
Files:
apps/backend/merge/git_utils.py
apps/frontend/src/**/*.{ts,tsx,jsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Always use i18n translation keys for all user-facing text in the frontend instead of hardcoded strings
Files:
apps/frontend/src/renderer/components/terminal/usePtyProcess.ts
apps/frontend/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use the `useTranslation()` hook with namespace prefixes (e.g., 'navigation:items.key') for accessing translation strings in React components
Files:
apps/frontend/src/renderer/components/terminal/usePtyProcess.ts
apps/frontend/**/*.{ts,tsx}
⚙️ CodeRabbit configuration file
apps/frontend/**/*.{ts,tsx}: Review React patterns and TypeScript type safety.
Check for proper state management and component composition.
Files:
apps/frontend/src/renderer/components/terminal/usePtyProcess.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (2)
apps/frontend/src/renderer/components/terminal/usePtyProcess.ts (2)
114-123: Lifecycle helpers provide a clean API for controlled recreation. The `prepareForRecreate` and `resetForRecreate` functions give external callers explicit control over the recreation sequence. The naming is clear, and the implementation correctly sets/clears the ref guards.

Note: Callers must ensure `resetForRecreate()` is always called after `prepareForRecreate()`, even in error paths. Otherwise, the terminal will remain stuck in the "creating" state. Consider documenting this requirement in JSDoc comments or the component's documentation.
32-40: Guard correctly prevents a race condition during terminal recreation. The condition `if (isCreatedRef.current && !isCreatingRef.current)` properly gates the reset logic, ensuring that `isCreatedRef` is only cleared when a terminal exists and no external recreation is in progress. The `currentCwdRef.current = cwd` assignment outside the guard correctly tracks the latest working directory regardless of recreation state.

External callers in Terminal.tsx follow the required sequence correctly: `prepareForRecreate()` is called before updating the cwd, and `resetForRecreate()` is called after destroying the terminal. This prevents the creation effect from running until the old terminal is fully destroyed.
```python
# Check new path first
new_worktrees_dir = project_dir / ".auto-claude" / "worktrees" / "tasks"
if new_worktrees_dir.exists():
    for entry in new_worktrees_dir.iterdir():
        if entry.is_dir() and task_id in entry.name:
            return entry

# Legacy fallback for backwards compatibility
legacy_worktrees_dir = project_dir / ".worktrees"
if legacy_worktrees_dir.exists():
    for entry in legacy_worktrees_dir.iterdir():
        if entry.is_dir() and task_id in entry.name:
            return entry
```
Add error handling to ensure legacy fallback works even if new path is inaccessible.
If `new_worktrees_dir.iterdir()` raises a filesystem error (e.g., `PermissionError`, `OSError`), the function will crash without checking the legacy path, breaking the backward compatibility guarantee.
🔎 Proposed fix to wrap each path check in try-except
```diff
-    # Check new path first
-    new_worktrees_dir = project_dir / ".auto-claude" / "worktrees" / "tasks"
-    if new_worktrees_dir.exists():
-        for entry in new_worktrees_dir.iterdir():
-            if entry.is_dir() and task_id in entry.name:
-                return entry
-
-    # Legacy fallback for backwards compatibility
-    legacy_worktrees_dir = project_dir / ".worktrees"
-    if legacy_worktrees_dir.exists():
-        for entry in legacy_worktrees_dir.iterdir():
-            if entry.is_dir() and task_id in entry.name:
-                return entry
+    # Check new path first
+    new_worktrees_dir = project_dir / ".auto-claude" / "worktrees" / "tasks"
+    if new_worktrees_dir.exists():
+        try:
+            for entry in new_worktrees_dir.iterdir():
+                if entry.is_dir() and task_id in entry.name:
+                    return entry
+        except (PermissionError, OSError):
+            # If new path is inaccessible, continue to legacy fallback
+            pass
+
+    # Legacy fallback for backwards compatibility
+    legacy_worktrees_dir = project_dir / ".worktrees"
+    if legacy_worktrees_dir.exists():
+        try:
+            for entry in legacy_worktrees_dir.iterdir():
+                if entry.is_dir() and task_id in entry.name:
+                    return entry
+        except (PermissionError, OSError):
+            # If legacy path is also inaccessible, return None
+            pass
```
In apps/backend/merge/git_utils.py around lines 30 to 42, the current iteration
over new_worktrees_dir can raise filesystem errors and prevent the legacy
fallback from running; wrap the check/iteration over the new path in a
try/except that catches OSError and PermissionError (or a general OSError) and
on exception silently skip to the legacy path, then likewise wrap the legacy
directory's exists/iterdir in a try/except to handle filesystem errors and
return None if both checks fail; ensure you do not swallow unexpected exceptions
beyond OSError family and keep behavior of returning the matching entry when
found.
```python
new_worktrees_dir = project_dir / ".auto-claude" / "worktrees" / "tasks"
if new_worktrees_dir.exists():
    for entry in new_worktrees_dir.iterdir():
        if entry.is_dir() and task_id in entry.name:
```
🧩 Analysis chain
🏁 Script executed:

```sh
#!/bin/bash
# First, let's look at the git_utils.py file around lines 34 and 41
wc -l apps/backend/merge/git_utils.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 99

🏁 Script executed:

```sh
# Read the file with context around lines 34 and 41
sed -n '20,50p' apps/backend/merge/git_utils.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1002

🏁 Script executed:

```sh
# Search for worktree creation and directory naming patterns
rg -n -C5 --type=py 'worktrees|task.*dir|mkdir.*task' apps/backend/ -g '!test*'
```

Repository: AndyMik90/Auto-Claude
Length of output: 50378

🏁 Script executed:

```sh
# Look for any constants or functions that define task directory naming
rg -n -C3 --type=py 'task_id|worktree.*name|TASK.*DIR' apps/backend/ -g '!test*'
```

Repository: AndyMik90/Auto-Claude
Length of output: 50380

🏁 Script executed:

```sh
# Find all callers of find_worktree_path
rg -n "find_worktree_path" apps/backend/ --type=py -A 2 -B 2
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Search for how task_id is typically formatted or extracted
rg -n "task_id\s*=" apps/backend/merge/ --type=py | head -30
```

Repository: AndyMik90/Auto-Claude
Length of output: 2343

🏁 Script executed:

```sh
# Check if there's any documentation or test cases showing the format
find apps/backend -name "*test*.py" -type f | xargs rg -l "find_worktree_path" 2>/dev/null
```

Repository: AndyMik90/Auto-Claude
Length of output: 49
🏁 Script executed:

```sh
# Search more broadly for find_worktree_path usage
rg -n "find_worktree_path" apps/backend/ --type=py
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Look at how this function is defined and exported
grep -n "def find_worktree_path\|find_worktree_path" apps/backend/merge/git_utils.py -A 5 -B 2
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Check what imports this function
rg "from.*git_utils.*import|import.*git_utils" apps/backend/ --type=py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1060

🏁 Script executed:

```sh
# Look at the broader context of how specs/tasks are named
rg -n "spec_name|spec-" apps/backend/core/worktree.py -B 2 -A 2 | head -50
```

Repository: AndyMik90/Auto-Claude
Length of output: 1721

🏁 Script executed:

```sh
# Search for the actual function name - it might be find_worktree not find_worktree_path
rg -n "def find_worktree" apps/backend/merge/git_utils.py -A 10
```

Repository: AndyMik90/Auto-Claude
Length of output: 380

🏁 Script executed:

```sh
# Look at the git_utils.py file to see all exported functions
head -100 apps/backend/merge/git_utils.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1931

🏁 Script executed:

```sh
# Search for how task_id is used when calling worktree functions
rg -n "find_worktree.*task|task_id.*worktree" apps/backend/merge/ --type=py -B 3 -A 3
```

Repository: AndyMik90/Auto-Claude
Length of output: 3953
🏁 Script executed:

```sh
# Check how spec names are actually used vs task IDs
rg -n "spec.name|spec_name" apps/backend/merge/ --type=py | head -20
```

Repository: AndyMik90/Auto-Claude
Length of output: 1121

🏁 Script executed:

```sh
# Let me see how task_id is converted to spec_name more clearly
sed -n '180,210p' apps/backend/merge/timeline_git.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1101

🏁 Script executed:

```sh
# Search for TaskMergeRequest to understand the task_id format better
rg -n "TaskMergeRequest\|task_id.*=" apps/backend/merge/orchestrator.py -B 2 -A 2 | head -40
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Look for actual task_id values being used in the codebase
rg -n "task_id\s*=\s*[\"']task-|task_id\s*=\s*[\"'][0-9]" apps/backend/ --type=py | head -20
```

Repository: AndyMik90/Auto-Claude
Length of output: 448

🏁 Script executed:

```sh
# Check the actual format of spec directories being created
rg -n "spec.*dir\|spec.*name" apps/backend/spec/pipeline/models.py -B 2 -A 2 | head -40
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Look for how spec directories are named (the pattern)
rg -n "create_spec_dir\|[0-9][0-9][0-9]-" apps/backend/ --type=py | head -15
```

Repository: AndyMik90/Auto-Claude
Length of output: 47
🏁 Script executed:

```sh
# Test the actual substring matching concern - create a scenario
python3 << 'EOF'
# Simulate the find_worktree function behavior
def test_substring_matching():
    # Example directories that could exist
    dirs = ["001-auth", "0010-database", "100-setup", "10-legacy"]
    # Test cases
    test_cases = [
        ("001", "001-auth"),       # Should match 001-auth (exact prefix)
        ("001", "0010-database"),  # BUG: matches unintended!
        ("100", "100-setup"),      # Should match 100-setup
        ("100", "1000-other"),     # BUG: matches unintended!
        ("10", "10-legacy"),       # Should match 10-legacy
        ("10", "100-setup"),       # BUG: matches unintended!
    ]
    for task_id, dirname in test_cases:
        matches = task_id in dirname
        print(f"task_id='{task_id}' in '{dirname}': {matches}")

test_substring_matching()
EOF
```

Repository: AndyMik90/Auto-Claude
Length of output: 151
🏁 Script executed:

```sh
# Check actual calls to find_worktree to see what task_id format is passed
rg -n "find_worktree\(" apps/backend/ --type=py -B 5 -A 2
```

Repository: AndyMik90/Auto-Claude
Length of output: 1187

🏁 Script executed:

```sh
# Look for how task_id is set in orchestrator where it's used
sed -n '220,260p' apps/backend/merge/orchestrator.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1699

🏁 Script executed:

```sh
# Check the TaskMergeRequest class definition
rg -n "class TaskMergeRequest\|task_id:" apps/backend/merge/ --type=py -B 2 -A 5 | head -30
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Look at the actual types and usage more carefully
rg -n "class TaskMergeRequest" apps/backend/merge/ --type=py -A 10
```

Repository: AndyMik90/Auto-Claude
Length of output: 655

🏁 Script executed:

```sh
# Check how task_id is extracted or derived before calling find_worktree
rg -n "TaskMergeRequest\|merge_task\|task_id.*=" apps/backend/merge/orchestrator.py -B 3 -A 3 | head -60
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Verify the current behavior by looking at what's actually passed
sed -n '1,50p' apps/backend/merge/orchestrator.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1493

🏁 Script executed:

```sh
# Let me trace through an actual example - check how this function is really used
# by looking at what gets passed to it and what directories look like
rg -n "merge_tasks\|merge_task" apps/backend/merge/orchestrator.py -B 5 -A 15 | head -80
```

Repository: AndyMik90/Auto-Claude
Length of output: 47
🏁 Script executed:

```sh
# Create a test to demonstrate the substring matching issue
python3 << 'EOF'
# Test the actual substring matching behavior with realistic spec names
def simulate_find_worktree():
    """Simulate the current find_worktree logic"""
    # Realistic directory names that would exist
    worktree_dirs = [
        "001-auth",
        "0010-database",
        "002-api",
        "100-setup",
        "10-legacy"
    ]
    # Test cases with different task_id formats
    test_cases = [
        ("001", "Should match only 001-auth"),          # 001 in entry.name
        ("0010", "Should match only 0010-database"),    # 0010 in entry.name
        ("001-auth", "Should match only 001-auth"),     # Full spec name
        ("task-001", "Should match only if stripped"),  # With task- prefix
    ]
    for task_id, expected in test_cases:
        matches = [d for d in worktree_dirs if task_id in d]
        print(f"task_id='{task_id}': Found {matches}")
        print(f"  Expected: {expected}")
        if len(matches) > 1:
            print(f"  ⚠️ ISSUE: Substring matching found multiple results!")
        print()

simulate_find_worktree()
EOF
```

Repository: AndyMik90/Auto-Claude
Length of output: 447
🏁 Script executed:

```sh
# Check if find_worktree is actually called with just the numeric ID
# or with the full spec name (which would avoid the issue)
rg -n "find_worktree" apps/backend/merge/ --type=py -B 10 | grep -A 10 "find_worktree"
```

Repository: AndyMik90/Auto-Claude
Length of output: 2796

🏁 Script executed:

```sh
# Check the actual usage pattern in orchestrator more carefully
sed -n '100,130p' apps/backend/merge/orchestrator.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 1057

🏁 Script executed:

```sh
# Look for where task_id originates in the merge flow
rg -n "def merge_tasks" apps/backend/merge/orchestrator.py -A 20
```

Repository: AndyMik90/Auto-Claude
Length of output: 809

🏁 Script executed:

```sh
# Check if worktree_path is usually provided (avoiding find_worktree call)
rg -n "TaskMergeRequest.*task_id.*worktree_path" apps/backend/ --type=py -B 2 -A 2
```

Repository: AndyMik90/Auto-Claude
Length of output: 534

🏁 Script executed:

```sh
# See when find_worktree is actually called (when worktree_path is None)
rg -n "worktree_path.*None\|find_worktree" apps/backend/merge/orchestrator.py -B 3 -A 3
```

Repository: AndyMik90/Auto-Claude
Length of output: 47

🏁 Script executed:

```sh
# Check the TaskMergeRequest definition to see if worktree_path is optional
sed -n '70,85p' apps/backend/merge/models.py
```

Repository: AndyMik90/Auto-Claude
Length of output: 391

🏁 Script executed:

```sh
# Look for actual usages where worktree_path might be None or not provided
rg -n "TaskMergeRequest" apps/backend/ --type=py -B 2 -A 2 | grep -v "^--$"
```

Repository: AndyMik90/Auto-Claude
Length of output: 2797
Substring matching for task_id can match unintended directories.
The condition `task_id in entry.name` is problematic. For example, with worktree directories like "001-auth" and "0010-database", searching for task_id="001" would match both instead of just "001-auth". Consider more precise matching, such as an exact match or the task_id as a prefix followed by a dash: `entry.name == task_id or entry.name.startswith(f"{task_id}-")`.
Note: This function is only called as a fallback when worktree_path is not provided in TaskMergeRequest (which has both fields as required), so the practical impact is limited. However, it's still a code quality issue worth fixing for robustness.
🤖 Prompt for AI Agents
In apps/backend/merge/git_utils.py around line 34, the condition uses substring
matching (`task_id in entry.name`) which can incorrectly match names like
"0010-database" when task_id="001"; change the check to be more precise by
verifying either exact equality or a prefix followed by a dash (e.g., entry.name
== task_id or entry.name.startswith(f"{task_id}-")) so only the intended
worktree directory is matched.
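For illustration, a minimal sketch of the suggested matching rule. The `find_worktree` name comes from the finding above, but the signature and directory-walk details here are assumptions, not the real implementation:

```python
from pathlib import Path

def find_worktree(worktrees_dir: Path, task_id: str) -> Path | None:
    """Return the worktree whose name is exactly task_id or starts with
    'task_id-' (assumed signature; the real function may differ)."""
    for entry in worktrees_dir.iterdir():
        if not entry.is_dir():
            continue
        # Exact match or "<task_id>-" prefix: "001" no longer matches "0010-database".
        if entry.name == task_id or entry.name.startswith(f"{task_id}-"):
            return entry
    return None
```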
✅ Auto Claude PR Review - APPROVED
Merge Verdict: ✅ READY TO MERGE
No blocking issues found
Risk Assessment
Generated by Auto Claude PR Review. This approval was generated by Auto Claude.
1 similar comment
…untime errors

The evidence-based validation migration missed updating:
- parallel_followup_reviewer.py:647 - accessed non-existent validation.confidence
- parallel_followup_reviewer.py:663 - passed validation_confidence to PRReviewFinding
- parallel_orchestrator_reviewer.py:589,972 - passed confidence to PRReviewFinding

PRReviewFinding no longer has confidence field (replaced with evidence).
FindingValidationResult no longer has confidence field (replaced with evidence_verified_in_file).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
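A hedged before/after sketch of the field change this commit describes; only the field names come from the commit message, and the real model definitions live in the backend's pydantic models:

```python
from pydantic import BaseModel

class FindingValidationResultBefore(BaseModel):
    confidence: float  # removed: subjective 0-100 score

class FindingValidationResult(BaseModel):
    evidence_verified_in_file: bool  # binary: snippet confirmed in the file, or not
```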
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/backend/runners/github/services/parallel_orchestrator_reviewer.py (1)
998-1002: Remove unused `_normalize_confidence` method. This method is no longer used after migrating from confidence scoring to evidence-based validation. Removing dead code improves maintainability and reduces confusion.
🔎 Verify no other usages exist before removal
#!/bin/bash
# Description: Verify that _normalize_confidence is not used elsewhere in the codebase

# Search for any calls to _normalize_confidence
rg -nP '\b_normalize_confidence\b' apps/backend/
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- apps/backend/runners/github/services/parallel_followup_reviewer.py
- apps/backend/runners/github/services/parallel_orchestrator_reviewer.py
💤 Files with no reviewable changes (1)
- apps/backend/runners/github/services/parallel_followup_reviewer.py
🧰 Additional context used
📓 Path-based instructions (1)
apps/backend/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
apps/backend/**/*.py: Always use the Claude Agent SDK (`claude-agent-sdk` package) for all AI interactions, never use the Anthropic API directly
Use the `create_client()` function from `apps/backend/core/client.py` to instantiate Claude SDK clients, not direct `ClaudeSDKClient` initialization
Files:
apps/backend/runners/github/services/parallel_orchestrator_reviewer.py
⚙️ CodeRabbit configuration file
apps/backend/**/*.py: Focus on Python best practices, type hints, and async patterns.
Check for proper error handling and security considerations.
Verify compatibility with Python 3.12+.
Files:
apps/backend/runners/github/services/parallel_orchestrator_reviewer.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: CodeQL (javascript-typescript)
- GitHub Check: CodeQL (python)
🔇 Additional comments (2)
apps/backend/runners/github/services/parallel_orchestrator_reviewer.py (2)
972-972: No action required: the code is correct. The `evidence` field in `PRReviewFinding` is explicitly defined as `str | None = None` (line 218 of `apps/backend/runners/github/models.py`), which means `None` is not only acceptable but is the intended default value. The change `evidence=f_data.get("evidence")` correctly passes `None` when the key is missing, which aligns with the model's design. Likely an incorrect or invalid review comment.
589-589: Remove the unused `_normalize_confidence` method (lines 998-1002). This method is dead code from the deprecated confidence-based validation approach. Since the evidence field is optional in the `PRReviewFinding` model, both line 589's direct attribute access and line 972's dictionary `.get()` will work correctly; missing evidence will default to `None` without raising an error. The `_normalize_confidence` method is no longer needed and should be removed.
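A small sketch of why `f_data.get("evidence")` is safe given the optional field. Only the `evidence` field and its default mirror the comment above; the other fields are illustrative, not the real model:

```python
from pydantic import BaseModel

class PRReviewFindingSketch(BaseModel):
    file: str
    description: str
    evidence: str | None = None  # matches the documented default in models.py

f_data: dict = {"file": "a.py", "description": "possible bug"}  # no "evidence" key
finding = PRReviewFindingSketch(**f_data, evidence=f_data.get("evidence"))
assert finding.evidence is None  # missing key falls back to the model default
```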
AndyMik90
left a comment
🤖 Auto Claude PR Review
✅ Follow-up Review: Ready To Merge
Resolution Status
- ✅ Resolved: 0 previous findings addressed
- ❌ Unresolved: 0 previous findings remain
- 🆕 New Issues: 4 new findings in recent changes
Verdict
The confidence-to-evidence refactoring is complete and correct. All 4 findings are LOW severity quality suggestions (dead code cleanup) that do not block merge. The core migration in pydantic_models.py, models.py, and response_parsers.py properly replaces confidence/validation_confidence with evidence-based validation. The PTY race condition fix in usePtyProcess.ts is correctly implemented. No critical, high, or medium issues found. Previous review was already READY_TO_MERGE with no findings, and this commit successfully addresses the runtime errors mentioned in its commit message.
Review Process
Agents invoked: new-code-reviewer
This is an AI-generated follow-up review using parallel specialist analysis with finding validation.
Findings (4 selected of 4 total)
🔵 [LOW] Unused _normalize_confidence method left in refactoring
📁 apps/backend/runners/github/services/parallel_orchestrator_reviewer.py:998
The _normalize_confidence method was left in the codebase but is never called after the confidence-to-evidence refactoring. This is dead code that should have been removed as part of the cleanup.
Suggested fix:
Remove the _normalize_confidence method as it is no longer used after the evidence-based validation refactoring.
🔵 [LOW] IssueTriageResponse still uses confidence field (intentional)
📁 apps/backend/runners/github/services/pydantic_models.py:266
The IssueTriageResponse model still contains a confidence field. This is intentional because issue triage is a separate feature from PR review finding validation - confidence scoring is valid for categorization decisions. Documented for completeness.
Suggested fix:
No fix needed - IssueTriageResponse is for issue categorization, not PR finding validation.
🔵 [LOW] IssueBatch still uses validation_confidence (intentional)
📁 apps/backend/runners/github/batch_issues.py:306
The IssueBatch dataclass still uses validation_confidence for batch grouping validation, which is a separate concept from PR finding validation. This is intentional and not a bug.
Suggested fix:
No fix needed - batch validation is a different concept from PR finding validation.
🔵 [LOW] Dead code branch checking for confidence attribute
📁 apps/backend/runners/github/output_validator.py:412
`output_validator.py` checks for `hasattr(finding, 'confidence')`. Since `PRReviewFinding` no longer has a confidence field, this branch is effectively dead code; it only still works correctly because of defensive programming.
Suggested fix:
Consider removing the confidence attribute checks since PRReviewFinding no longer has a confidence field.
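A hypothetical sketch of the cleanup this finding suggests; the actual code in `output_validator.py` may differ:

```python
def has_usable_evidence(finding: object) -> bool:
    """Post-refactor check: the old hasattr(finding, 'confidence') branch can
    never be True, so the only useful signal is whether an evidence snippet
    is actually present (sketch, not the real validator code)."""
    evidence = getattr(finding, "evidence", None)
    return bool(evidence and str(evidence).strip())
```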
This review was generated by Auto Claude.
Auto Claude Review - APPROVED
Status: Ready to Merge
Summary: ✅ Follow-up Review: Ready To Merge
This automated review found no blocking issues. The PR can be safely merged.
Generated by Auto Claude
fixable=original.fixable,
validation_status=validation_status,
validation_evidence=validation_evidence,
validation_confidence=validation_confidence,
Evidence field not passed for new followup findings
The PR adds an `evidence` field to `ParallelFollowupFinding` and updates `parallel_orchestrator_reviewer.py` to pass `evidence=finding_data.evidence` when creating `PRReviewFinding` objects. However, the equivalent code in `parallel_followup_reviewer.py` that processes `new_findings` and `comment_findings` does not pass the `evidence` field. This means evidence collected by the AI for new findings during followup reviews is silently lost, defeating the purpose of the evidence-based validation system. The `evidence=nf.evidence` and `evidence=cf.evidence` parameters need to be added to the `PRReviewFinding` constructor calls.
Additional Locations (1)
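A hedged sketch of the missing wiring Bugbot describes; field names other than `evidence` are assumptions about the shape of `ParallelFollowupFinding`:

```python
# Inside the followup reviewer, when converting new findings (sketch):
findings = [
    PRReviewFinding(
        file=nf.file,
        line=nf.line,
        description=nf.description,
        evidence=nf.evidence,  # previously omitted, so collected evidence was dropped
    )
    for nf in new_findings
]
```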
Merged upstream changes:
- fix(ci): include update manifests for architecture-specific auto-updates (AndyMik90#611)
- fix: security hook cwd extraction and PATH issues (AndyMik90#555, AndyMik90#556) (AndyMik90#587)
- fix(frontend): filter empty env vars to prevent OAuth token override (AndyMik90#520)
- refactor(github-review): replace confidence scoring with evidence-based validation (AndyMik90#628)
- feat(terminal): add worktree support for terminals (AndyMik90#625)
- fix: human_review status persistence bug (worktree plan path fix) (AndyMik90#605)
- fix(frontend): resolve PATH and PYTHONPATH issues in insights and changelog services (AndyMik90#558) (AndyMik90#610)
- fix: pass electron version explicitly to electron-rebuild on Windows (AndyMik90#622)
- fix(kanban): complete refresh button implementation (AndyMik90#584)

Conflict resolution:
- insights-executor.ts: Combined local diagnostics with upstream error message improvements

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Merged commits:
- fix: check .claude.json for OAuth auth in profile scorer (AndyMik90#652)
- fix(mcp): use shell mode for Windows command spawning (AndyMik90#572)
- fix(ui): update TaskCard description truncation (AndyMik90#637)
- fix: change hardcoded Opus defaults to Sonnet (fix AndyMik90#433) (AndyMik90#633)
- Fix/small fixes 2.7.3 (AndyMik90#631)
- fix(ci): include update manifests for architecture-specific auto-updates (AndyMik90#611)
- fix: security hook cwd extraction and PATH issues (AndyMik90#555, AndyMik90#556) (AndyMik90#587)
- fix(frontend): filter empty env vars to prevent OAuth token override (AndyMik90#520)
- refactor(github-review): replace confidence scoring with evidence-based validation (AndyMik90#628)
- feat(terminal): add worktree support for terminals (AndyMik90#625)

Resolved conflicts:
- Model defaults: Accepted upstream change from opus to sonnet (fix AndyMik90#433)
- i18n namespaces: Combined both branches (analytics + terminal)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Summary
- `evidence_verified_in_file` boolean instead of confidence floats

Changes
Prompt Updates
Code Updates
- `evidence_verified_in_file` boolean instead of confidence

Test Updates
- `evidence_verified_in_file`
- `evidence` field

Why
The confidence scoring approach (0-100 percentages) was subjective and led to false positives. The new evidence-based approach is binary: either you can prove the issue with actual code, or you don't report it.
Key principle: No evidence = No finding
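A minimal sketch of the "no evidence = no finding" gate; the real filter lives in `response_parsers.py`, so names and data shapes here are illustrative:

```python
def keep_finding(finding: dict) -> bool:
    """Binary gate: a finding survives only if it carries a non-empty code snippet."""
    evidence = finding.get("evidence")
    return isinstance(evidence, str) and bool(evidence.strip())

raw_findings = [
    {"description": "possible bug", "evidence": "if x == None:"},
    {"description": "hunch with no proof"},  # filtered out: no evidence
]
findings = [f for f in raw_findings if keep_finding(f)]
assert len(findings) == 1
```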
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Refactor
Tests
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.