Skip to content

fix(audit): make the live Phoenix-MCP calibration loop genuinely work#83

Merged
ComBba merged 2 commits into
mainfrom
feat/phoenix-mcp-live-calibration
Jun 6, 2026
Merged

fix(audit): make the live Phoenix-MCP calibration loop genuinely work#83
ComBba merged 2 commits into
mainfrom
feat/phoenix-mcp-live-calibration

Conversation

@ComBba

@ComBba ComBba commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Stands up a real Phoenix on GCP (Cloud Run + Cloud SQL Postgres) and fixes the three latent bugs that surfaced once the MCP loop actually ran live (it never did before — the deployed demo always fell back to the table prior on an empty endpoint, which is why the Phoenix MCP chip is amber).

Bugs fixed (all in adk_runtime.py, none ever exercised live)

  1. p25/p75 semanticsapply_correction clips the score to [p25, p75] (table/anchor use (0.0, 10.0)), but PhoenixMcpConsultant returned delta percentiles (~1.45) → the live correction over-clipped onto the ±2 cap (9.0→7.0 instead of 7.84). Now uses the score-bound convention so the live MCP correction equals the deterministic prior.
  2. Wrong tool arg — phoenix-mcp wants dataset_name, not dataset (consultant + writer). Caused datasetName is required at runtime.
  3. Response wrapperget-dataset-examples returns {"data": {"examples": [...]}}; the parser only read top-level examples → 0 rows.

The fake-MCP tests missed all three because _result used a hand-rolled shape; it now mirrors phoenix-mcp@4.0.13's real {"data": {"examples": ...}} envelope + corrected arg-name assertions = a genuine regression guard.

Verified end-to-end against the live Cloud-SQL-backed Phoenix

  • Read: get-dataset-examplesConsultResult with correct mean_delta/n for all 39 cells (13 criteria × 3 buckets), p25/p75=0.0/10.0.
  • Write-back: add-dataset-examples lands (n grew 7→8, then reset to a clean baseline).
  • Works with a bearer key against the no-auth Phoenix (matches how prod passes the secret).

Adds scripts/seed_phoenix_calibration.py (reusable; seeds the spike-D YELLOW prior).

Next (separate, user-approved): redeploy glasshat-api with PHOENIX_COLLECTOR_ENDPOINT set → the Phoenix MCP chip goes green and the audit does a genuine per-request MCP round-trip.

Gates: ruff · format · mypy clean; adk_runtime + engine tests pass (CI runs the full suite).

Summary by CodeRabbit

릴리스 노트

  • New Features

    • Phoenix 플랫폼 설정을 간소화하기 위한 캘리브레이션 데이터셋 시딩 기능이 추가되었습니다.
  • Bug Fixes

    • 파이프라인 오케스트레이터의 캘리브레이션 데이터 파싱 및 명명 규칙이 업데이트되었습니다.
    • 캘리브레이션 범위 계산이 개선되었습니다.
  • Tests

    • 캘리브레이션 예시 검증 테스트가 강화되었습니다.

The Phoenix-MCP consultant/writer path had three latent bugs that were never hit
because the MCP loop never ran against a live Phoenix (the deployed demo always
fell back to the table prior on an empty endpoint). Provisioning a real Phoenix
(Cloud Run + Cloud SQL) surfaced all three:

1. **p25/p75 semantics.** apply_correction clips the *score* to [p25, p75]; the
   table/anchor paths use (0.0, 10.0). PhoenixMcpConsultant returned *delta*
   percentiles (~1.45) → the live correction over-clipped onto the ±2 cap
   (9.0→7.0 instead of 7.84). Now returns the score-bound convention so the live
   MCP correction equals the deterministic prior.
2. **Wrong tool arg.** phoenix-mcp's get-/add-dataset-examples want `dataset_name`,
   not `dataset` — the consultant + writer passed `dataset` → "datasetName is
   required" at runtime. Fixed both.
3. **Response wrapper.** get-dataset-examples returns `{"data": {"examples": [...]}}`;
   _parse_examples only read top-level `examples` → zero rows parsed. Unwrap `data`.

The fake-MCP tests never caught these because `_result` used a hand-rolled shape;
it now mirrors phoenix-mcp@4.0.13's real `{"data": {"examples": ...}}` envelope and
the arg-name assertions are corrected — a genuine regression guard.

Adds `scripts/seed_phoenix_calibration.py` (seeds the spike-D YELLOW prior as
input={hat,criterion,bucket}/output={delta} examples). Verified end-to-end against
a live Cloud-SQL-backed Phoenix: read (ConsultResult mean/n correct for all 39
cells) AND write-back (add-dataset-examples grows n) both work.

Gates: ruff · ruff format · mypy clean; adk_runtime + engine tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown

Ready to act? Review this PR in Change Stack to turn feedback into patch suggestions you can inspect and refine.

Review Change Stack

Warning

Review limit reached

@ComBba, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 54 minutes and 4 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4c9b4c2a-3f51-4f8a-bf13-1735fa164737

📥 Commits

Reviewing files that changed from the base of the PR and between 8d567a0 and 2512586.

📒 Files selected for processing (2)
  • scripts/seed_phoenix_calibration.py
  • services/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.py

개요

Phoenix 캘리브레이션 데이터셋을 생성하는 새로운 시드 스크립트를 도입하고, MCP 툴 규약을 dataset 파라미터에서 dataset_name으로 통합하며, 응답 구조를 data.examples 래퍼로 변경하고, 클립 범위를 고정값으로 하드코딩하는 변경을 진행. 테스트 픽스처 및 단언을 새로운 규약에 맞게 업데이트.

변경 사항

캘리브레이션 파이프라인 통합

레이어 / 파일(s) 요약
캘리브레이션 데이터 시딩 스크립트
scripts/seed_phoenix_calibration.py
YELLOW 델타 사전분포를 버킷별 (mean_delta, n) 값으로 정의하고, 글래스햇 프리셋의 모든 크라이테리온을 순회하여 중복을 피하면서 (seen) 각 버킷당 n개의 입출력 예시를 생성. Phoenix 클라이언트를 통해 "glasshat-calibration" 데이터셋을 생성하고 결과를 출력.
MCP 툴 규약 및 클립 범위 업데이트
services/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.py
get-dataset-examples와 add-dataset-examples 호출에서 dataset 파라미터를 dataset_name으로 변경. 응답 파싱 시 {"data": {"examples": [...]}} 래퍼를 언래핑하도록 수정. 캘리브레이션 결과의 p25/p75 클립 범위를 백분위수 계산 대신 고정값 (0.0, 10.0)으로 변경하고 _percentile 헬퍼 제거.
테스트 픽스처 및 검증 업데이트
services/pipeline-orchestrator/tests/test_adk_runtime.py
_result() 헬퍼 응답 구조를 {"data": {"examples": ...}} 형태로 변경하여 언래핑 동작 검증. get-dataset-examples와 add-dataset-examples 호출 인자/페이로드에서 dataset_name 파라미터 기대 확인.

예상 코드 리뷰 난이도

🎯 3 (중간) | ⏱️ ~20분

시 (Poem)

🐰 캘리브레이션의 새로운 길을 닦고,
dataset이라는 이름을 dataset_name으로 바꾸니,
Phoenix와 MCP가 손을 잡고,
데이터는 고정된 경계 안에 담겨,
시드에서 피어나는 예시들이여! 🌱✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed PR 제목은 주요 변경 사항을 명확하게 요약하고 있습니다. 'fix(audit): make the live Phoenix-MCP calibration loop genuinely work'는 실시간 Phoenix-MCP 보정 루프의 버그 수정이라는 핵심 목표를 직접적으로 반영하고 있으며, PR 내용의 세 가지 주요 버그 수정(p25/p75 의미, 도구 인수, 응답 래퍼)과 스크립트 추가를 포함한 변경사항 전체를 대표합니다.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/phoenix-mcp-live-calibration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a seeding script for the glasshat-calibration Phoenix dataset and updates the PhoenixMcpConsultant to align with phoenix-mcp's updated API. Specifically, it changes the MCP tool argument from dataset to dataset_name, unwraps the nested data key in dataset examples, and sets fixed p25 and p75 score clip bounds. The review feedback highlights two main improvements: making the payload parsing in _parse_examples robust against non-dictionary/non-list JSON scalars to prevent potential AttributeError crashes, and ensuring the seeding script is idempotent by deleting any existing dataset with the same name before recreation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 197 to 199
if isinstance(payload, dict) and isinstance(payload.get("data"), dict):
payload = payload["data"] # phoenix-mcp wraps rows under "data"
examples = payload if isinstance(payload, list) else payload.get("examples", [])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If payload is parsed as a non-dictionary/non-list JSON scalar (such as a boolean true/false, a number, or null), isinstance(payload, list) will evaluate to False, and the code will attempt to call payload.get("examples", []). Since these scalar types do not have a .get method, this will raise an unhandled AttributeError and crash the parser.

We can make this parsing logic fully robust against any JSON structure by explicitly checking if payload is a dictionary or a list before accessing attributes.

Suggested change
if isinstance(payload, dict) and isinstance(payload.get("data"), dict):
payload = payload["data"] # phoenix-mcp wraps rows under "data"
examples = payload if isinstance(payload, list) else payload.get("examples", [])
if isinstance(payload, dict):
if isinstance(payload.get("data"), dict):
payload = payload["data"]
examples = payload.get("examples", [])
elif isinstance(payload, list):
examples = payload
else:
continue

Comment on lines +70 to +71
client = Client(base_url=base_url, api_key=api_key)
ds = client.datasets.create_dataset(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The seeding script is currently not idempotent. If you run it more than once, client.datasets.create_dataset will raise an HTTP conflict/duplicate error because a dataset with the name glasshat-calibration already exists.

To make the script safely reusable and idempotent (e.g., for local resets or CI/CD pipelines), we should check for and delete any existing dataset with the same name before creating the new one.

    client = Client(base_url=base_url, api_key=api_key)
    for dataset in client.list_datasets():
        if dataset.name == _DATASET:
            client.delete_dataset(id=dataset.id)
            break

    ds = client.datasets.create_dataset(

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/seed_phoenix_calibration.py (1)

27-27: ⚡ Quick win

데이터셋 이름을 설정값과 동기화해 계약 드리프트를 막아주세요.

_DATASET 하드코딩으로 인해 런타임에서 PHOENIX_CALIBRATION_DATASET가 변경되면, 시딩 대상과 실제 조회 대상이 달라질 수 있습니다.

제안 패치
-_DATASET = "glasshat-calibration"
+_DEFAULT_DATASET = "glasshat-calibration"
@@
 def main() -> None:
@@
     api_key = os.environ.get("PHOENIX_API_KEY") or None
+    dataset_name = os.environ.get("PHOENIX_CALIBRATION_DATASET", _DEFAULT_DATASET)
@@
     ds = client.datasets.create_dataset(
-        name=_DATASET,
+        name=dataset_name,
         inputs=inputs,
         outputs=outputs,
@@
-    print(f"created dataset '{_DATASET}' with {len(inputs)} examples")
+    print(f"created dataset '{dataset_name}' with {len(inputs)} examples")

Also applies to: 72-73, 80-81

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/seed_phoenix_calibration.py` at line 27, The script currently
hardcodes _DATASET = "glasshat-calibration", which can diverge from the runtime
config PHOENIX_CALIBRATION_DATASET; change the code to read the dataset name
from the configuration/environment (PHOENIX_CALIBRATION_DATASET) and assign it
to _DATASET (or a single source-of-truth variable) so seeding and lookup use the
same value, and update the other occurrences in this file that reference the
hardcoded string (the other uses of _DATASET in this script) to use that
variable instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.py`:
- Around line 197-200: The code currently assumes payload is a dict when calling
payload.get("examples", []) which raises AttributeError for other types; change
the examples extraction to robustly handle types: if isinstance(payload, list):
examples = payload; elif isinstance(payload, dict): examples =
payload.get("examples", []); else: examples = [] so the subsequent for ex in
examples loop never errors on non-dict/list payloads. Ensure the initial check
using isinstance(payload, dict) and payload.get("data") remains intact and apply
this safe-typing logic around the payload → examples assignment.

---

Nitpick comments:
In `@scripts/seed_phoenix_calibration.py`:
- Line 27: The script currently hardcodes _DATASET = "glasshat-calibration",
which can diverge from the runtime config PHOENIX_CALIBRATION_DATASET; change
the code to read the dataset name from the configuration/environment
(PHOENIX_CALIBRATION_DATASET) and assign it to _DATASET (or a single
source-of-truth variable) so seeding and lookup use the same value, and update
the other occurrences in this file that reference the hardcoded string (the
other uses of _DATASET in this script) to use that variable instead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3c845630-262c-4fe6-a0e8-3e7242f2add3

📥 Commits

Reviewing files that changed from the base of the PR and between 6480fad and 8d567a0.

📒 Files selected for processing (3)
  • scripts/seed_phoenix_calibration.py
  • services/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.py
  • services/pipeline-orchestrator/tests/test_adk_runtime.py

Comment thread services/pipeline-orchestrator/src/glasshat/pipeline/adk_runtime.py Outdated
- _parse_examples: handle scalar JSON payloads (bool/number/null) without an
  AttributeError on .get; explicit dict/list/else branches (CodeRabbit HIGH).
- seed_phoenix_calibration: delete any existing same-name dataset via REST before
  create, so re-running is idempotent (the phoenix client exposes no delete)
  (CodeRabbit medium). Verified: re-run removes the old dataset + recreates 429.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ComBba ComBba merged commit 57ad0f0 into main Jun 6, 2026
5 checks passed
@ComBba ComBba deleted the feat/phoenix-mcp-live-calibration branch June 6, 2026 05:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant