fix(audit): make the live Phoenix-MCP calibration loop genuinely work#83
Conversation
The Phoenix-MCP consultant/writer path had three latent bugs that were never hit
because the MCP loop never ran against a live Phoenix (the deployed demo always
fell back to the table prior on an empty endpoint). Provisioning a real Phoenix
(Cloud Run + Cloud SQL) surfaced all three:
1. **p25/p75 semantics.** apply_correction clips the *score* to [p25, p75]; the
table/anchor paths use (0.0, 10.0). PhoenixMcpConsultant returned *delta*
percentiles (~1.45) → the live correction over-clipped onto the ±2 cap
(9.0→7.0 instead of 7.84). Now returns the score-bound convention so the live
MCP correction equals the deterministic prior.
2. **Wrong tool arg.** phoenix-mcp's get-/add-dataset-examples want `dataset_name`,
not `dataset` — the consultant + writer passed `dataset` → "datasetName is
required" at runtime. Fixed both.
3. **Response wrapper.** get-dataset-examples returns `{"data": {"examples": [...]}}`;
_parse_examples only read top-level `examples` → zero rows parsed. Unwrap `data`.
The fake-MCP tests never caught these because `_result` used a hand-rolled shape;
it now mirrors phoenix-mcp@4.0.13's real `{"data": {"examples": ...}}` envelope and
the arg-name assertions are corrected — a genuine regression guard.
Adds `scripts/seed_phoenix_calibration.py` (seeds the spike-D YELLOW prior as
input={hat,criterion,bucket}/output={delta} examples). Verified end-to-end against
a live Cloud-SQL-backed Phoenix: read (ConsultResult mean/n correct for all 39
cells) AND write-back (add-dataset-examples grows n) both work.
Gates: ruff · ruff format · mypy clean; adk_runtime + engine tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Ready to act? Review this PR in Change Stack to turn feedback into patch suggestions you can inspect and refine. Warning Review limit reached
More reviews will be available in 54 minutes and 4 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
개요Phoenix 캘리브레이션 데이터셋을 생성하는 새로운 시드 스크립트를 도입하고, MCP 툴 규약을 dataset 파라미터에서 dataset_name으로 통합하며, 응답 구조를 data.examples 래퍼로 변경하고, 클립 범위를 고정값으로 하드코딩하는 변경을 진행. 테스트 픽스처 및 단언을 새로운 규약에 맞게 업데이트. 변경 사항캘리브레이션 파이프라인 통합
예상 코드 리뷰 난이도🎯 3 (중간) | ⏱️ ~20분 시 (Poem)
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a seeding script for the glasshat-calibration Phoenix dataset and updates the PhoenixMcpConsultant to align with phoenix-mcp's updated API. Specifically, it changes the MCP tool argument from dataset to dataset_name, unwraps the nested data key in dataset examples, and sets fixed p25 and p75 score clip bounds. The review feedback highlights two main improvements: making the payload parsing in _parse_examples robust against non-dictionary/non-list JSON scalars to prevent potential AttributeError crashes, and ensuring the seeding script is idempotent by deleting any existing dataset with the same name before recreation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if isinstance(payload, dict) and isinstance(payload.get("data"), dict): | ||
| payload = payload["data"] # phoenix-mcp wraps rows under "data" | ||
| examples = payload if isinstance(payload, list) else payload.get("examples", []) |
There was a problem hiding this comment.
If payload is parsed as a non-dictionary/non-list JSON scalar (such as a boolean true/false, a number, or null), isinstance(payload, list) will evaluate to False, and the code will attempt to call payload.get("examples", []). Since these scalar types do not have a .get method, this will raise an unhandled AttributeError and crash the parser.
We can make this parsing logic fully robust against any JSON structure by explicitly checking if payload is a dictionary or a list before accessing attributes.
| if isinstance(payload, dict) and isinstance(payload.get("data"), dict): | |
| payload = payload["data"] # phoenix-mcp wraps rows under "data" | |
| examples = payload if isinstance(payload, list) else payload.get("examples", []) | |
| if isinstance(payload, dict): | |
| if isinstance(payload.get("data"), dict): | |
| payload = payload["data"] | |
| examples = payload.get("examples", []) | |
| elif isinstance(payload, list): | |
| examples = payload | |
| else: | |
| continue |
| client = Client(base_url=base_url, api_key=api_key) | ||
| ds = client.datasets.create_dataset( |
There was a problem hiding this comment.
The seeding script is currently not idempotent. If you run it more than once, client.datasets.create_dataset will raise an HTTP conflict/duplicate error because a dataset with the name glasshat-calibration already exists.
To make the script safely reusable and idempotent (e.g., for local resets or CI/CD pipelines), we should check for and delete any existing dataset with the same name before creating the new one.
client = Client(base_url=base_url, api_key=api_key)
for dataset in client.list_datasets():
if dataset.name == _DATASET:
client.delete_dataset(id=dataset.id)
break
ds = client.datasets.create_dataset(There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/seed_phoenix_calibration.py (1)
27-27: ⚡ Quick win데이터셋 이름을 설정값과 동기화해 계약 드리프트를 막아주세요.
_DATASET하드코딩으로 인해 런타임에서PHOENIX_CALIBRATION_DATASET가 변경되면, 시딩 대상과 실제 조회 대상이 달라질 수 있습니다.제안 패치
-_DATASET = "glasshat-calibration" +_DEFAULT_DATASET = "glasshat-calibration" @@ def main() -> None: @@ api_key = os.environ.get("PHOENIX_API_KEY") or None + dataset_name = os.environ.get("PHOENIX_CALIBRATION_DATASET", _DEFAULT_DATASET) @@ ds = client.datasets.create_dataset( - name=_DATASET, + name=dataset_name, inputs=inputs, outputs=outputs, @@ - print(f"created dataset '{_DATASET}' with {len(inputs)} examples") + print(f"created dataset '{dataset_name}' with {len(inputs)} examples")Also applies to: 72-73, 80-81
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/seed_phoenix_calibration.py` at line 27, The script currently hardcodes _DATASET = "glasshat-calibration", which can diverge from the runtime config PHOENIX_CALIBRATION_DATASET; change the code to read the dataset name from the configuration/environment (PHOENIX_CALIBRATION_DATASET) and assign it to _DATASET (or a single source-of-truth variable) so seeding and lookup use the same value, and update the other occurrences in this file that reference the hardcoded string (the other uses of _DATASET in this script) to use that variable instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@services/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.py`:
- Around line 197-200: The code currently assumes payload is a dict when calling
payload.get("examples", []) which raises AttributeError for other types; change
the examples extraction to robustly handle types: if isinstance(payload, list):
examples = payload; elif isinstance(payload, dict): examples =
payload.get("examples", []); else: examples = [] so the subsequent for ex in
examples loop never errors on non-dict/list payloads. Ensure the initial check
using isinstance(payload, dict) and payload.get("data") remains intact and apply
this safe-typing logic around the payload → examples assignment.
---
Nitpick comments:
In `@scripts/seed_phoenix_calibration.py`:
- Line 27: The script currently hardcodes _DATASET = "glasshat-calibration",
which can diverge from the runtime config PHOENIX_CALIBRATION_DATASET; change
the code to read the dataset name from the configuration/environment
(PHOENIX_CALIBRATION_DATASET) and assign it to _DATASET (or a single
source-of-truth variable) so seeding and lookup use the same value, and update
the other occurrences in this file that reference the hardcoded string (the
other uses of _DATASET in this script) to use that variable instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3c845630-262c-4fe6-a0e8-3e7242f2add3
📒 Files selected for processing (3)
scripts/seed_phoenix_calibration.pyservices/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.pyservices/pipeline-orchestrator/tests/test_adk_runtime.py
- _parse_examples: handle scalar JSON payloads (bool/number/null) without an AttributeError on .get; explicit dict/list/else branches (CodeRabbit HIGH). - seed_phoenix_calibration: delete any existing same-name dataset via REST before create, so re-running is idempotent (the phoenix client exposes no delete) (CodeRabbit medium). Verified: re-run removes the old dataset + recreates 429. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stands up a real Phoenix on GCP (Cloud Run + Cloud SQL Postgres) and fixes the three latent bugs that surfaced once the MCP loop actually ran live (it never did before — the deployed demo always fell back to the table prior on an empty endpoint, which is why the
Phoenix MCPchip is amber).Bugs fixed (all in
adk_runtime.py, none ever exercised live)apply_correctionclips the score to[p25, p75](table/anchor use(0.0, 10.0)), butPhoenixMcpConsultantreturned delta percentiles (~1.45) → the live correction over-clipped onto the ±2 cap (9.0→7.0 instead of 7.84). Now uses the score-bound convention so the live MCP correction equals the deterministic prior.dataset_name, notdataset(consultant + writer). CauseddatasetName is requiredat runtime.get-dataset-examplesreturns{"data": {"examples": [...]}}; the parser only read top-levelexamples→ 0 rows.The fake-MCP tests missed all three because
_resultused a hand-rolled shape; it now mirrors phoenix-mcp@4.0.13's real{"data": {"examples": ...}}envelope + corrected arg-name assertions = a genuine regression guard.Verified end-to-end against the live Cloud-SQL-backed Phoenix
get-dataset-examples→ConsultResultwith correctmean_delta/nfor all 39 cells (13 criteria × 3 buckets),p25/p75=0.0/10.0.add-dataset-exampleslands (n grew 7→8, then reset to a clean baseline).Adds
scripts/seed_phoenix_calibration.py(reusable; seeds the spike-D YELLOW prior).Next (separate, user-approved): redeploy
glasshat-apiwithPHOENIX_COLLECTOR_ENDPOINTset → thePhoenix MCPchip goes green and the audit does a genuine per-request MCP round-trip.Gates: ruff · format · mypy clean; adk_runtime + engine tests pass (CI runs the full suite).
Summary by CodeRabbit
릴리스 노트
New Features
Bug Fixes
Tests