feat(diagnostics): unified 3-layer DiagnosticResult model (#204)#206
Conversation
Establishes the structured verification diagnostic contract: Layer 1 — agent_message (agent-safe, no internals leaked) Layer 2 — developer_fields (constraint_id, advisory_checks, evidence) Layer 3 — proof_ref (sha256 hash of retained proof artifact) Key design: - DiagnosticStatus: tri-state only (VERIFIED / UNVERIFIABLE / BLOCKED) — no HEURISTIC/AMBIGUOUS proliferation; distinction lives in developer_fields.constraint_id - proof_ref is the authority bit: present = admissible for control flow, None = reject. No separate 'authoritative' boolean needed (resolves #190 Keesan/Rahul debate) - VERIFIED requires proof_ref is not None — structurally enforced in __post_init__, not by caller discipline - AdvisoryCheck: non-proof-bearing analysis (LLM fallback, NLI, VLM) populates developer_fields.advisory_checks, never status or proof_ref - compute_proof_ref: deterministic sha256 hash of JSON-serialized evidence - from_legacy_dict: migration helper for ad-hoc engine dicts (fail-closed states only — raises for legacy VERIFIED since proof artifacts were discarded by pre-#204 engines) Unblocks: #129, #130, #131, #133, #134, #162, #163, #164, #190, #205 Does NOT refactor existing engines — those are separate PRs against this contract. Tests: 68 new tests covering status taxonomy, all 3 layers, authority contract, fail-closed enforcement, advisory checks, proof hashing, serialization round-trip, legacy migration, and realistic scenarios drawn from the blocked issues. Version: 5.1.2 -> 5.2.0 (architectural enhancement)
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Warning Review limit reached
More reviews will be available in 36 minutes and 28 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughIntroduces Changes3-Layer Structured Verification Diagnostics
Sequence Diagram(s)sequenceDiagram
participant Engine
participant DiagnosticResult
participant compute_proof_ref
participant DownstreamGate
Engine->>DiagnosticResult: verified(agent_message, proof_evidence, developer_fields)
DiagnosticResult->>compute_proof_ref: hash(proof_evidence) with sort_keys=True
compute_proof_ref-->>DiagnosticResult: "sha256:<hex>"
DiagnosticResult-->>Engine: DiagnosticResult(status=VERIFIED, proof_ref="sha256:...", is_authoritative=True)
Engine->>DownstreamGate: pass DiagnosticResult
DownstreamGate->>DownstreamGate: check is_authoritative == True
DownstreamGate-->>Engine: admitted
Engine->>DiagnosticResult: from_legacy_dict(legacy_engine_dict)
DiagnosticResult->>DiagnosticResult: map status/error patterns → UNVERIFIABLE or BLOCKED
DiagnosticResult-->>Engine: DiagnosticResult(status=UNVERIFIABLE|BLOCKED, proof_ref=None, is_fail_closed=True)
Engine->>DownstreamGate: pass DiagnosticResult
DownstreamGate->>DownstreamGate: check is_authoritative == False
DownstreamGate-->>Engine: rejected (fail-closed)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/qwed_new/core/diagnostics.py (1)
444-450: ⚖️ Poor tradeoffConsider raising on truly unrecognized legacy patterns.
This fallback silently returns
UNVERIFIABLEfor inputs that don't match any known legacy pattern. WhileUNVERIFIABLEis fail-closed (safe), per QWED_RULES "fail loudly" principle, unrecognized data could indicate unexpected engine behavior that should surface as an error rather than be silently absorbed.The current approach is pragmatically safe; raising here is more strict and aids debugging when legacy engines produce unexpected output formats.
Optional stricter approach
- return cls.unverifiable( - agent_message="Verification inconclusive — unrecognized legacy result", - developer_fields={ - "constraint_id": f"{engine}.legacy_unrecognized", - "legacy_status": legacy_status, - }, - ) + raise ValueError( + f"from_legacy_dict cannot interpret unrecognized legacy data from {engine!r}: " + f"status={legacy_status!r}, is_correct={is_correct!r}. " + "Review engine output format and add explicit handling." + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qwed_new/core/diagnostics.py` around lines 444 - 450, The fallback case that returns cls.unverifiable() for unrecognized legacy patterns should raise an exception instead to follow the "fail loudly" principle. Replace the cls.unverifiable() return statement with raising an appropriate exception (such as ValueError or a domain-specific exception) that includes the legacy_status and engine information in the error message. This will surface unexpected engine behavior rather than silently absorbing it as an UNVERIFIABLE result.Source: Coding guidelines
tests/test_diagnostics.py (1)
504-533: Avoid "deterministic_confidence" naming in test validation metadata.The test at line 510 uses
"deterministic_confidence": 0.4indeveloper_fieldsto validate advisory-only validation scenarios. While theDiagnosticResultstatus is correctlyUNVERIFIABLE(fail-closed), the field naming creates conceptual confusion because "deterministic" implies binary outcome, not probabilistic scoring.Since this appears only in test data and is placed in
developer_fields(which per QWED_RULES holds advisory-only information that does not affect proof authority), the concern is primarily about clarity and preventing misinterpretation by future maintainers rather than a functional bypass.Consider renaming
deterministic_confidenceto something that clarifies it represents evidence sufficiency or test context (e.g.,evidence_coverage_ratioortest_scenario_confidence_for_advisory_path) to prevent confusion with authority-level confidence scores in production verification.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_diagnostics.py` around lines 504 - 533, In the test_fact_verifier_llm_advisory_only method, rename the field "deterministic_confidence" in the developer_fields dictionary to a name that clarifies it represents advisory test context rather than a deterministic outcome. Use a name like "evidence_coverage_ratio" or "test_scenario_confidence_for_advisory_path" to better reflect that this is advisory-only metadata that does not affect proof authority, preventing conceptual confusion for future maintainers.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/diagnostics.py`:
- Around line 148-155: The json.dumps call in the proof_ref generation logic
uses default=str which causes non-deterministic hashing because str() on
non-serializable objects includes memory addresses that vary between runs,
violating the determinism requirement stated in the docstring. To fix this,
remove the default=str parameter from the json.dumps call so that
non-serializable evidence raises a TypeError instead (fail-closed approach),
which aligns with the documented requirement that callers must pre-convert
non-serializable values before calling this function. Alternatively, if you need
to support non-serializable types, use a deterministic fallback like
default=lambda o: f"<{type(o).__name__}>" instead of default=str.
In `@tests/test_diagnostics.py`:
- Around line 229-236: The tests test_non_serializable_falls_back_to_str and
test_nested_non_serializable_falls_back_to_str currently verify that
non-serializable objects are accepted and produce a hash, but this violates
determinism requirements since str(object()) includes memory addresses that vary
across runs. Update both tests to verify fail-closed behavior by asserting that
compute_proof_ref raises ValueError when passed non-serializable objects, rather
than checking that it returns a sha256 hash. Remove the assertions checking for
the sha256: prefix and instead use self.assertRaises(ValueError) or similar to
verify that the function rejects non-serializable input.
---
Nitpick comments:
In `@src/qwed_new/core/diagnostics.py`:
- Around line 444-450: The fallback case that returns cls.unverifiable() for
unrecognized legacy patterns should raise an exception instead to follow the
"fail loudly" principle. Replace the cls.unverifiable() return statement with
raising an appropriate exception (such as ValueError or a domain-specific
exception) that includes the legacy_status and engine information in the error
message. This will surface unexpected engine behavior rather than silently
absorbing it as an UNVERIFIABLE result.
In `@tests/test_diagnostics.py`:
- Around line 504-533: In the test_fact_verifier_llm_advisory_only method,
rename the field "deterministic_confidence" in the developer_fields dictionary
to a name that clarifies it represents advisory test context rather than a
deterministic outcome. Use a name like "evidence_coverage_ratio" or
"test_scenario_confidence_for_advisory_path" to better reflect that this is
advisory-only metadata that does not affect proof authority, preventing
conceptual confusion for future maintainers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 881c1e98-0a83-4a96-8fc8-7c5357a75c8d
📒 Files selected for processing (4)
pyproject.tomlsrc/qwed_new/core/__init__.pysrc/qwed_new/core/diagnostics.pytests/test_diagnostics.py
- compute_proof_ref: remove default=str — non-serializable evidence now raises ValueError (fail-closed). Prevents non-deterministic memory- address-dependent hashes from entering proof contract - AdvisoryCheck: add __post_init__ enforcing advisory_only is True — structurally enforced, no longer just docstring claim - from_legacy_dict: replace is True/is False identity checks with bool() truthiness — catches legacy engines using 1/0 instead of True/False (Sentry MEDIUM, Greptile P1) - from_legacy_dict: raise ValueError on unrecognized legacy patterns instead of silent UNVERIFIABLE — fail-loudly per QWED_RULES - from_dict: raise clear ValueError on missing/empty agent_message instead of crashing via __post_init__ with confusing error (Sentry HIGH) - tests: update non-serializable tests to expect ValueError - tests: add AdvisoryCheck(advisory_only=False) raises test - tests: add integer is_correct (1/0) bypass tests - tests: add from_dict missing agent_message raises test - tests: add unrecognized truthy legacy raises test - tests: rename deterministic_confidence -> evidence_coverage_ratio 73 tests pass (up from 68).
…ntry) Early guard was rejecting all truthy is_correct as VERIFIED, making the final unrecognized-pattern raise unreachable dead code. Now only explicit status=='VERIFIED' is rejected early; truthy is_correct with unknown status falls through to the correct unrecognized-pattern raise with the appropriate error message.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/qwed_new/core/diagnostics.py (1)
247-254: 💤 Low valueConsider stricter type validation for fail-closed consistency.
The property passes through non-dict items assuming they're already
AdvisoryCheckinstances. Ifdeveloper_fields["advisory_checks"]contains malformed data (e.g.,[1, "str"]), these would be returned in a list typed asList[AdvisoryCheck], violating the type contract.Per QWED_RULES fail-closed principle, consider validating or filtering items:
♻️ Suggested refinement
`@property` def advisory_checks(self) -> List[AdvisoryCheck]: """Advisory checks from developer_fields, deserialized to AdvisoryCheck.""" raw = self.developer_fields.get("advisory_checks", []) if not isinstance(raw, list): return [] - return [AdvisoryCheck.from_dict(item) if isinstance(item, dict) else item - for item in raw] + result = [] + for item in raw: + if isinstance(item, dict): + result.append(AdvisoryCheck.from_dict(item)) + elif isinstance(item, AdvisoryCheck): + result.append(item) + # Skip malformed items (fail-closed: don't propagate garbage) + return result🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qwed_new/core/diagnostics.py` around lines 247 - 254, The advisory_checks property method contains a type safety issue where non-dict items are passed through without validation, assuming they are already AdvisoryCheck instances. This violates the type contract List[AdvisoryCheck] if malformed data (like integers or strings) exists in developer_fields["advisory_checks"]. To fix this following the fail-closed principle, modify the list comprehension to strictly validate items: only include items that are either dicts that can be converted to AdvisoryCheck via the from_dict method, or items that are already valid AdvisoryCheck instances. Filter out or skip any items that don't meet these criteria rather than passing them through unchecked.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/qwed_new/core/diagnostics.py`:
- Around line 247-254: The advisory_checks property method contains a type
safety issue where non-dict items are passed through without validation,
assuming they are already AdvisoryCheck instances. This violates the type
contract List[AdvisoryCheck] if malformed data (like integers or strings) exists
in developer_fields["advisory_checks"]. To fix this following the fail-closed
principle, modify the list comprehension to strictly validate items: only
include items that are either dicts that can be converted to AdvisoryCheck via
the from_dict method, or items that are already valid AdvisoryCheck instances.
Filter out or skip any items that don't meet these criteria rather than passing
them through unchecked.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 57955ff1-b84f-47c2-8e99-b8bb732a0d86
📒 Files selected for processing (2)
src/qwed_new/core/diagnostics.pytests/test_diagnostics.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_diagnostics.py
- AdvisoryCheck.from_dict: coerce advisory_only to bool — fixes cross-language APIs that use 1 instead of True (Sentry HIGH) - advisory_checks property: skip malformed/invalid items instead of raising ValueError at access time (Greptile P1, CodeRabbit) - tests: add integer coercion tests, invalid-item skip tests, existing-instance passthrough test 77 tests pass (up from 73).
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/qwed_new/core/diagnostics.py (1)
263-266: ⚡ Quick winMake the intentional skip explicit in the
exceptblock.The current
passpreserves the intended defensive skip, but CodeQL flags the empty handler. A commentedcontinuekeeps behavior unchanged and makes the audit intent clear.Proposed cleanup
try: result.append(AdvisoryCheck.from_dict(item)) except ValueError: - pass + # Invalid advisory metadata is non-authoritative; skip it. + continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qwed_new/core/diagnostics.py` around lines 263 - 266, The empty except block handling ValueError in the try-except statement around AdvisoryCheck.from_dict contains only a pass statement, which CodeQL flags as an issue. Replace the pass statement with a commented continue statement to make the intentional skip explicit and audit-intent clear while preserving the existing behavior of skipping invalid items during the append operation.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/qwed_new/core/diagnostics.py`:
- Line 126: Replace the `bool(data.get("advisory_only", True))` call with
explicit validation that accepts only boolean values or integers 0 and 1,
rejecting all other types including malformed string values like "false" or "0".
Raise a validation error at construction time when the advisory_only field
contains an invalid type or value outside of the accepted boolean/integer range,
rather than silently coercing it with the bool() function.
---
Nitpick comments:
In `@src/qwed_new/core/diagnostics.py`:
- Around line 263-266: The empty except block handling ValueError in the
try-except statement around AdvisoryCheck.from_dict contains only a pass
statement, which CodeQL flags as an issue. Replace the pass statement with a
commented continue statement to make the intentional skip explicit and
audit-intent clear while preserving the existing behavior of skipping invalid
items during the append operation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c708e206-e752-4a8a-a36c-70d61abb1dc5
📒 Files selected for processing (2)
src/qwed_new/core/diagnostics.pytests/test_diagnostics.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_diagnostics.py
…tion
- AdvisoryCheck.from_dict: reject malformed strings ('false', '0') —
only accept bool or int 0/1, not truthy strings (CodeRabbit MAJOR)
- to_dict: serialize AdvisoryCheck instances to dicts — prevents
TypeError on JSON serialization (Sentry MEDIUM)
- advisory_checks property: explicit comment on except block (CodeQL)
- tests: add string rejection test, AdvisoryCheck serialization test,
JSON serializability test
80 tests pass.
…try HIGH) - DiagnosticStatus(invalid_str) now caught and re-raised with valid options and from_legacy_dict pointer, instead of unhelpful enum error - tests: add invalid status raises with guidance test 81 tests pass.
… P1) - __post_init__: change proof_ref is None to not proof_ref — catches empty string '' bypassing authority invariant - DiagnosticResult + AdvisoryCheck: add frozen=True — prevents post-construction proof_ref/status mutation bypassing the authority contract without explicit object.__setattr__ - tests: add empty string proof_ref raises test, frozen mutation test 83 tests pass.
|



Summary
Establishes the structured verification diagnostic contract from #204 — a unified
DiagnosticResultmodel with three disclosure layers:agent_message: strdeveloper_fields: dictproof_ref: Optional[str]Key Design Decisions
Tri-state status only — no proliferation
DiagnosticStatushas exactly 3 values:VERIFIED,UNVERIFIABLE,BLOCKED. NoHEURISTIC,AMBIGUOUS, orCORRECTION_NEEDED. Richer distinctions live indeveloper_fields.constraint_id. This resolves the #190 discussion (Keesan/Rahul debate) — ambiguity IS unverifiability; the distinction is structured, not status-level.proof_refis the authority bitNo separate
authoritativeboolean.proof_ref is not None= authoritative (admissible for control flow);proof_ref is None= reject. Downstream gates get a mechanical rule:VERIFIED requires proof — structurally enforced
__post_init__raisesValueErrorifstatus == VERIFIEDandproof_ref is None. This makes "VERIFIED without proof" impossible to construct — not a caller convention, a type-level invariant.Advisory checks never touch status
AdvisoryCheck(LLM fallback, NLI entailment, VLM interpretation) populatesdeveloper_fields.advisory_checkswithadvisory_only=True. They never setstatusorproof_ref. This structurally enforces the #204 constraint: "diagnostics must never originate from model reasoning, confidence, or self-assessment."Migration helper for existing engines
from_legacy_dict()converts ad-hoc engine dicts toDiagnosticResultfor fail-closed states (CORRECTION_NEEDED→UNVERIFIABLE,ERROR→BLOCKED). It raises for legacyVERIFIEDresults — proof artifacts were discarded by pre-#204 engines, so backfilling is impossible. Callers must useDiagnosticResult.verified()with explicit evidence.What This PR Does NOT Do
Files
src/qwed_new/core/diagnostics.py(new) —DiagnosticStatus,DiagnosticResult,AdvisoryCheck,compute_proof_reftests/test_diagnostics.py(new) — 68 testssrc/qwed_new/core/__init__.py(modified) — exportspyproject.toml(modified) — version 5.1.2 → 5.2.0Test Results
Unblocks
#129, #130, #131, #133, #134, #162, #163, #164, #190, #205
Closes #204.
Summary by CodeRabbit