evals for api review command #2606

theobarberbany · 2025-11-27T12:59:36Z

This is experimental.

Part of the cluster infra f2f hackathon held last week.

Vibe coded test suite to test our vibe coded commands to review our vibe coded code.

Uses LLM as a judge (claude) to review output of the /api-review command.

Each suite can be configured to use a different model. The default judge is haiku 4.5, golden runs on sonnet, and integration on opus. I'm not sure these are the best choices (other than haiku for the judge), but we should be able to gather some data 🤞 .

Directory Structure

tests/eval/testdata/
├── golden/                     # Base truth tests - single isolated issues
│   ├── missing-optional-doc/
│   │   ├── patch.diff          # Triggers ONLY missing-optional-doc
│   │   └── expected.txt
│   ├── undocumented-enum/
│   │   ├── patch.diff          # Triggers ONLY undocumented-enum
│   │   └── expected.txt
│   ├── missing-featuregate/
│   │   ├── patch.diff          # Triggers ONLY missing-featuregate
│   │   └── expected.txt
│   └── valid-api-change/
│       ├── patch.diff          # Triggers NO issues
│       └── expected.txt
└── integration/                # Complex scenarios - multiple issues
    ├── new-field-all-issues/
    │   ├── patch.diff          # Triggers multiple issues together
    │   └── expected.txt
    └── partial-documentation/
        ├── patch.diff
        └── expected.txt

Test Case Format

patch.diff

Standard git diff format:

diff --git a/config/v1/types.go b/config/v1/types.go
--- a/config/v1/types.go
+++ b/config/v1/types.go
@@ -10,0 +11,5 @@
+// MyField does something
+// +optional
+// +kubebuilder:validation:Enum=Foo;Bar
+MyField string `json:"myField"`

expected.txt

One expected issue per line:

enum values Foo and Bar not documented in comment
optional field does not explain behavior when omitted

Empty file means the API change should pass review with no issues.

Note: Order of issues in expected.txt does not matter. Comparison uses semantic matching, not exact string matching.

Example output where we catch two instances of the command hallucinating new rules:

Look at Judge result and Judge reason for explanations of decisions made.

Details

Running Suite: API Review Eval Suite - /Users/tbarberb/projects/api/tests/eval
==============================================================================
Random Seed: 1764259960

Will run 4 of 4 specs
------------------------------
[BeforeSuite]
/Users/tbarberb/projects/api/tests/eval/eval_test.go:80
  STEP: verifying local AGENTS.md exists @ 11/27/25 17:12:40.662
  STEP: verifying local .claude/commands/api-review.md exists @ 11/27/25 17:12:40.662
  STEP: creating temp directory for clone @ 11/27/25 17:12:40.662
  STEP: shallow cloning openshift/api @ 11/27/25 17:12:40.662
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:12:54.327
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:12:54.328
[BeforeSuite] PASSED [13.667 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues missing-optional-doc
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
  STEP: verifying origin remote exists @ 11/27/25 17:12:54.329
  STEP: resetting repo to clean state @ 11/27/25 17:12:54.341
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:12:54.832
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:12:54.832
  STEP: reading and applying patch @ 11/27/25 17:12:54.833
  STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:12:54.847
  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:14:19.61
  Judge result: pass=true, reason=The review correctly identified the core issue: the optional field does not explain behavior when omitted. The output semantic matches the expected issue by identifying that the comment lacks documentation of what happens when `customLogoURL` is not set (i.e., the default logo is used). The review provides appropriate context about the validation pattern allowing empty strings and suggests a fix that documents the omitted behavior.
• [88.746 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues undocumented-enum
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
  STEP: verifying origin remote exists @ 11/27/25 17:14:23.075
  STEP: resetting repo to clean state @ 11/27/25 17:14:23.097
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:14:23.263
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:14:23.263
  STEP: reading and applying patch @ 11/27/25 17:14:23.264
  STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:14:23.278

  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:16:31.141
  Judge result: pass=true, reason=The review correctly identified that enum values Light and Dark are not documented in the comment. The output expanded on this core issue by also noting incomplete enum documentation, missing specific default behavior (Light vs Dark), and formatting issues, but all of these are semantic expansions of the same fundamental problem: the enum values are not properly documented as required by the API review standards.
• [131.341 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
  STEP: verifying origin remote exists @ 11/27/25 17:16:34.417
  STEP: resetting repo to clean state @ 11/27/25 17:16:34.433
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:16:34.589
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:16:34.59
  STEP: reading and applying patch @ 11/27/25 17:16:34.59
  STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:16:34.603
  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:17:26.436
  Judge result: pass=false, reason=The expected issues list is empty, but the API review output identifies a real issue: field documentation mismatch between 'bannerText' description and its enum validation. Since there are no expected issues to match against, and the review found an unrelated issue, this fails the comparison criteria.
  [FAILED] in [It] - /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:17:29.727
• [FAILED] [55.310 seconds]
API Review Evaluation Golden Tests should correctly identify single issues [It] valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199

  [FAILED] API review did not match expected issues.
  Judge reason: The expected issues list is empty, but the API review output identifies a real issue: field documentation mismatch between 'bannerText' description and its enum validation. Since there are no expected issues to match against, and the review found an unrelated issue, this fails the comparison criteria.
  Review output:
  ✅ Linting checks passed.

  Now analyzing the changed file for documentation compliance:

  ## API Review Report

  ### 📄 File: `config/v1/types_console.go`

  **Status: ❌ FAILED**

  ---

  **Line 49: Field name does not match documentation**

  **Current (problematic) code:**
  ```go
  // bannerText is an optional field that specifies a custom banner message
  // to display at the top of the console. Valid values are "Info", "Warning",
  // and "Error" which control the banner styling. When omitted, no banner
  // is displayed.
  // +optional
  // +kubebuilder:validation:Enum=Info;Warning;Error
  BannerText string `json:"bannerText,omitempty"`
  ```

  **Suggested change:**
  ```diff
  - // bannerText is an optional field that specifies a custom banner message
  + // bannerText is an optional field that specifies the type of banner message
    // to display at the top of the console. Valid values are "Info", "Warning",
    // and "Error" which control the banner styling. When omitted, no banner
    // is displayed.
  ```

  **Explanation:** The field name is `bannerText` but the documentation says it "specifies a custom banner message". However, this field only accepts enum values "Info", "Warning", or "Error" - it doesn't accept arbitrary custom messages. The documentation should clarify that this field specifies the **type** of banner, not the message content itself. If this truly is meant to hold custom banner text, then the `Enum` validation is incorrect and should be removed.

  ---

  ## Summary

  ❌ **Review Failed**: 1 issue found

  The new `BannerText` field has a mismatch between its documentation and validation constraints. The comment suggests it holds "a custom banner message" but the enum validation only allows three specific values ("Info", "Warning", "Error"). Please clarify:

  1. If this field should accept arbitrary custom text, remove the `Enum` validation
  2. If this field should only accept enum values, update the documentation to reflect that it specifies the banner **type**, not custom message content

  Expected issues:

  Expected
      <bool>: false
  to be true
  In [It] at: /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:17:29.727
------------------------------
API Review Evaluation Integration Tests should correctly identify multiple issues new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:216
  STEP: verifying origin remote exists @ 11/27/25 17:17:29.742
  STEP: resetting repo to clean state @ 11/27/25 17:17:29.757
  STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:17:29.915
  STEP: copying local .claude directory to temp clone @ 11/27/25 17:17:29.915
  STEP: reading and applying patch @ 11/27/25 17:17:29.916
  STEP: running API review via Claude (claude-opus-4-5@20251101) @ 11/27/25 17:17:29.931
  STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:18:31.13
  Judge result: pass=false, reason=The review identified an extra issue not in the expected list: 'missing URL validation pattern for customLogoURL' was reported as a separate finding beyond the expected 'missing length validation for customLogoURL'. While the review correctly identified all 4 expected issues (optional field behavior, URL validation, FeatureGate, and length validation), it reported validation as a separate concern than length validation, which appears to be treating them as distinct problems rather than recognizing they may be the same underlying validation gap.
  [FAILED] in [It] - /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:18:34.472
• [FAILED] [64.730 seconds]
API Review Evaluation Integration Tests should correctly identify multiple issues [It] new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:216

  [FAILED] API review did not match expected issues.
  Judge reason: The review identified an extra issue not in the expected list: 'missing URL validation pattern for customLogoURL' was reported as a separate finding beyond the expected 'missing length validation for customLogoURL'. While the review correctly identified all 4 expected issues (optional field behavior, URL validation, FeatureGate, and length validation), it reported validation as a separate concern than length validation, which appears to be treating them as distinct problems rather than recognizing they may be the same underlying validation gap.
  Review output:
  ## API Review Results

  **File reviewed:** `config/v1/types_console.go`

  **Linting:** ✅ Passed (0 issues)

  ---

  ### ❌ Review FAILED - Documentation Issues Found

  **+44: Missing optional field behavior documentation**
  **Current (problematic) code:**
  ```go
        // customLogoURL specifies a URL for a custom logo image.
        // +optional
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Suggested change:**
  ```diff
  -     // customLogoURL specifies a URL for a custom logo image.
  +     // customLogoURL specifies a URL for a custom logo image.
  +     // When omitted, the default OpenShift logo will be displayed.
  ```

  **Explanation:** Optional fields MUST explain the behavior when the field is not present.

  ---

  **+44: Missing URL validation**
  **Current (problematic) code:**
  ```go
        // customLogoURL specifies a URL for a custom logo image.
        // +optional
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Suggested change:**
  ```diff
        // customLogoURL specifies a URL for a custom logo image.
  +     // The URL must be a valid HTTPS URL.
  +     // When omitted, the default OpenShift logo will be displayed.
        // +optional
  +     // +kubebuilder:validation:Pattern=`^$|^https://.*$`
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Explanation:** URL fields should have validation to ensure proper format. Compare with `LogoutRedirect` field at line 83 which validates HTTPS URLs with a pattern.

  ---

  **+44: New stable API field should be feature-gated**
  **Current (problematic) code:**
  ```go
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Suggested change:**
  ```diff
  +     // +openshift:enable:FeatureGate=ConsoleCustomLogo
        CustomLogoURL string `json:"customLogoURL,omitempty"`
  ```

  **Explanation:** Per CLAUDE.md guidelines, new fields on stable APIs should be introduced behind a feature gate.

  ---

  ### Summary

  | Check | Status |
  |-------|--------|
  | Linting | ✅ Pass |
  | Field Documentation | ✅ Pass |
  | Optional Behavior Documented | ❌ Fail |
  | Validation Markers Present | ❌ Fail |
  | Feature Gate for New Field | ❌ Fail |

  Expected issues:
  optional field customLogoURL does not explain behavior when omitted
  missing URL validation pattern for customLogoURL
  missing FeatureGate for new field on stable API
  missing length validation for customLogoURL
  Expected
      <bool>: false
  to be true
  In [It] at: /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:18:34.472
------------------------------
[AfterSuite]
/Users/tbarberb/projects/api/tests/eval/eval_test.go:114
  STEP: cleaning up temp directory @ 11/27/25 17:18:34.49
[AfterSuite] PASSED [0.853 seconds]
------------------------------

Summarizing 2 Failures:
  [FAIL] API Review Evaluation Golden Tests should correctly identify single issues [It] valid-api-change
  /Users/tbarberb/projects/api/tests/eval/eval_test.go:304
  [FAIL] API Review Evaluation Integration Tests should correctly identify multiple issues [It] new-field-multiple-issues
  /Users/tbarberb/projects/api/tests/eval/eval_test.go:304

Ran 4 of 4 Specs in 354.680 seconds
FAIL! -- 2 Passed | 2 Failed | 0 Pending | 0 Skipped
--- FAIL: TestEval (354.68s)
FAIL

Ginkgo ran 1 suite in 5m55.161278125s

Test Suite Failed

openshift-ci-robot · 2025-11-27T12:59:39Z

Pipeline controller notification
This repository is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2025-11-27T12:59:41Z

Hello @theobarberbany! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

coderabbitai · 2025-11-27T12:59:41Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Excluded labels (none allowed) (1)

do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2025-11-27T12:59:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This builds a basic go test suite that uses claude as a judge to review the output of the /api-review command. See tests/eval/DESIGN.md for more details.

openshift-ci · 2025-12-02T10:07:08Z

@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/integration	`a999d82`	link	true	`/test integration`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

JoelSpeed · 2025-12-01T16:15:35Z

tests/eval/testdata/golden/undocumented-enum/patch.diff

+	// theme specifies the console color theme.
+	// +optional
+	// +kubebuilder:validation:Enum=Light;Dark
+ // When omitted the default theme is used.


Is the indent being off in this patch a problem here?

JoelSpeed · 2025-12-15T11:36:22Z

tests/eval/DESIGN.md

+### Step 1 - Run API Review
+
+```bash
+claude --print -p "/api-review" --allowedTools "Bash,Read,Grep,Glob,Task" <files>


Allowed tools matters here?

JoelSpeed · 2025-12-15T11:38:15Z

tests/eval/DESIGN.md

+
+## Phase 2
+
+### Cost Tracking ✅ IMPLEMENTED


Why do these have IMPLEMENTED?

JoelSpeed · 2025-12-15T11:39:49Z

tests/eval/DESIGN.md

+# Override golden tests to use Haiku
+EVAL_GOLDEN_MODEL=claude-3-haiku-20240307 go test ./tests/eval/...
+
+# Override all models


Doesn't include an example of overriding the judge model

JoelSpeed · 2025-12-15T11:40:11Z

tests/eval/DESIGN.md

+
+### Patch Stability
+
+Patches may fail to apply as `origin/master` evolves over time. Strategies:


Did we come to a conclusion on this?

JoelSpeed · 2025-12-15T11:47:40Z

tests/eval/eval_test.go

+	copyLocalFiles()
+}
+
+func discoverTestCases(testdataPath string) ([]string, error) {


It would be more efficient if this returned a list of objects containing all the relevant information for a test case, rather than just a list of names

We effectively walk the test directories twice with this pattern and duplicate the naming formation logic for the expected files

JoelSpeed · 2025-12-15T11:48:51Z

tests/eval/eval_test.go

+func stripMarkdownCodeBlock(s string) string {
+	s = strings.TrimSpace(s)
+	s = strings.TrimPrefix(s, "```json")
+	s = strings.TrimPrefix(s, "```")
+	s = strings.TrimSuffix(s, "```")
+	return strings.TrimSpace(s)
+}


How come this is needed?

JoelSpeed · 2025-12-15T11:49:19Z

tests/eval/eval_test.go

+	Expect(err).NotTo(HaveOccurred(), "git apply failed: %s", string(output))
+}
+
+// runAPIReview and runJudge can probably share some common code.


Is this a TODO?

JoelSpeed · 2025-12-15T11:50:29Z

tests/eval/eval_test.go

+	totalJudgeCost += parsed.TotalCostUSD
+
+	var result evalResult
+	jsonStr := stripMarkdownCodeBlock(parsed.Result)


The judge isn't good at obeying the instructions not to include extra formatting?

this was belt and braces, but yes at least once 😂

JoelSpeed · 2025-12-15T11:51:40Z

tests/eval/eval_test.go

+	return result, parsed.TotalCostUSD
+}
+
+func runTestCase(tier, tc, reviewModel, judgeModelName string) {


What if runTestCase accepted a struct with the patch and expected issues in it, gather in one place rather than having to stat the file then reconstruct the name and read the file again here?

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2025

openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 27, 2025

openshift-ci bot requested review from JoelSpeed and everettraven November 27, 2025 12:59

theobarberbany force-pushed the test-evals branch 2 times, most recently from 99de6d4 to 06bca4c Compare November 27, 2025 13:54

theobarberbany marked this pull request as draft November 27, 2025 14:15

Basic evals for /api-review command

4ea8b69

This builds a basic go test suite that uses claude as a judge to review the output of the /api-review command. See tests/eval/DESIGN.md for more details.

theobarberbany force-pushed the test-evals branch from 06bca4c to 0c769f0 Compare November 27, 2025 16:13

openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 27, 2025

theobarberbany mentioned this pull request Dec 1, 2025

Add a command for reviewing e2e test PRs and enforcing conventions openshift-eng/ai-helpers#186

Open

theobarberbany force-pushed the test-evals branch from 0651596 to b69d5eb Compare December 1, 2025 15:47

theobarberbany changed the title ~~WIP: evals for api review command~~ evals for api review command Dec 1, 2025

Refactor + golden/integration + different models

1d8da4f

theobarberbany force-pushed the test-evals branch from b69d5eb to 55b0c64 Compare December 1, 2025 16:04

Adds cost reporting

a999d82

theobarberbany force-pushed the test-evals branch from 55b0c64 to a999d82 Compare December 2, 2025 09:37

theobarberbany marked this pull request as ready for review December 2, 2025 09:37

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2025

JoelSpeed reviewed Dec 15, 2025

View reviewed changes


		### Patch Stability

		Patches may fail to apply as `origin/master` evolves over time. Strategies:

evals for api review command #2606

Are you sure you want to change the base?

evals for api review command #2606

Uh oh!

Conversation

theobarberbany commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

This is experimental.

Directory Structure

Test Case Format

patch.diff

expected.txt

Example output where we catch two instances of the command hallucinating new rules:

Uh oh!

openshift-ci-robot commented Nov 27, 2025

Uh oh!

openshift-ci bot commented Nov 27, 2025

Uh oh!

coderabbitai bot commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

openshift-ci bot commented Nov 27, 2025

Uh oh!

openshift-ci bot commented Dec 2, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

theobarberbany commented Nov 27, 2025 •

edited

Loading

coderabbitai bot commented Nov 27, 2025 •

edited

Loading