Commit 0c19fd8

Merge branch 'main' into fix-exclude-default-fragments
2 parents 9d386e2 + abe1ebc, commit 0c19fd8

456 files changed, +133180 −24129 lines changed


src/aks-agent/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Ignore Poetry artifacts
poetry.lock
pyproject.toml
Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# AKS Agent Evals

## Environment Setup

Create and activate a virtual environment (example shown for bash-compatible shells):

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

Optional tooling used by the eval harness (Braintrust uploads and semantic classifier helpers):

```bash
python -m pip install braintrust openai autoevals
```

## Running Live Scenarios

```bash
RUN_LIVE=true \
MODEL=azure/gpt-4.1 \
CLASSIFIER_MODEL=azure/gpt-4o \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 01_list_all_nodes -m aks_eval
```

Per-scenario overrides (`resource_group`, `cluster_name`, `kubeconfig`, `test_env_vars`) still apply. Use `--skip-setup` or `--skip-cleanup` to bypass the setup and cleanup hooks. Expect the test to log iteration progress, classifier scores, and (on the final iteration) a Braintrust link when uploads are enabled.

**Example output (live run with classifier)**

```
[iteration 1/3] running setup commands for 01_list_all_nodes
[iteration 1/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 1/3] classifier score for 01_list_all_nodes: 1
[iteration 2/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 2/3] classifier score for 01_list_all_nodes: 1
[iteration 3/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 3/3] classifier score for 01_list_all_nodes: 1
...
🔍 Braintrust: https://www.braintrust.dev/app/<org>/p/aks-agent/experiments/aks-agent/...
```
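
The override keys above are documented in this commit; the surrounding file layout below is a hypothetical sketch, since the scenario schema itself is not shown here:

```yaml
# Hypothetical scenario file layout -- only the four override keys are documented above.
resource_group: my-eval-rg        # overrides AKS_AGENT_RESOURCE_GROUP for this scenario
cluster_name: my-eval-cluster     # overrides AKS_AGENT_CLUSTER
kubeconfig: /path/to/kubeconfig   # overrides KUBECONFIG
test_env_vars:                    # extra environment variables for the CLI invocation
  SOME_FLAG: "true"
```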

## Mock Workflow

```bash
# Generate fresh mocks from a live run
RUN_LIVE=true GENERATE_MOCKS=true \
MODEL=azure/gpt-4.1 \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval

# Re-run offline using the recorded response
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval
```

If a mock is missing, pytest skips the scenario with instructions to regenerate it.

**Regression guardrails**

- Mocked answers make iterations deterministic, so you can update parsing or prompts without waiting on live infrastructure.
- If you check in a new mock after behavior changes, reviewers see the exact diff in `mocks/response.txt`, making regressions obvious.
- CI can leave `RUN_LIVE` off by default, catching logical regressions early without needing cluster credentials.

**Example skip (no mock present)**

```
azext_aks_agent/tests/evals/test_ask_agent.py::test_ask_agent_live[02_list_clusters]
SKIPPED: Mock response missing for scenario 02_list_clusters; rerun with RUN_LIVE=true GENERATE_MOCKS=true
```

## Braintrust Uploads

Set the following environment variables to push results:

- `BRAINTRUST_API_KEY` and `BRAINTRUST_ORG` (required)
- Optional overrides: `BRAINTRUST_PROJECT` (default `aks-agent`), `BRAINTRUST_DATASET` (default `aks-agent/ask`), `EXPERIMENT_ID`
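
As a sketch, the variables might be exported like this (placeholder values; the defaults shown match the docs above):

```shell
export BRAINTRUST_API_KEY="<key>"          # required
export BRAINTRUST_ORG="<org>"              # required
export BRAINTRUST_PROJECT="aks-agent"      # optional; shown with its default
export BRAINTRUST_DATASET="aks-agent/ask"  # optional; shown with its default
```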
Each iteration logs to Braintrust; when uploads succeed, the console prints a link that renders as clickable in terminals that support hyperlinks.

**Tips**

- Leave `EXPERIMENT_ID` unset to generate a fresh experiment name each run (`aks-agent/<model>/<run-id>`).
- Use `BRAINTRUST_RUN_ID=<custom>` if you want deterministic experiment names across retries.
- The upload payload includes the classifier score, rationale, raw CLI output, cluster, and resource group metadata for later filtering.
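
The fallback chain behind `aks-agent/<model>/<run-id>` can be summarised with this small sketch (it mirrors the uploader in this commit; `env` stands in for `os.environ`):

```python
import os

def experiment_name(model: str, env: dict) -> str:
    # BRAINTRUST_RUN_ID wins, then CI run ids, then a per-process fallback.
    token = (
        env.get("BRAINTRUST_RUN_ID")
        or env.get("GITHUB_RUN_ID")
        or env.get("CI_PIPELINE_ID")
        or f"{model}-{os.getpid()}"
    )
    return f"aks-agent/{model}/{token}"

print(experiment_name("azure/gpt-4.1", {"BRAINTRUST_RUN_ID": "retry-1"}))
# -> aks-agent/azure/gpt-4.1/retry-1
```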

## Semantic Classifier

- Enabled by default; set `ENABLE_CLASSIFIER=false` to opt out.
- Requires Azure OpenAI credentials: `AZURE_API_BASE`, `AZURE_API_KEY`, `AZURE_API_VERSION`, and a classifier deployment specified via `CLASSIFIER_MODEL` (e.g. `azure/<deployment>`). Defaults to the same deployment as `MODEL` when not provided.
- Install classifier dependencies when online (see Environment Setup above if not already installed).
- Scenarios can override the grading style by adding:

```yaml
evaluation:
  correctness:
    type: loose  # or strict (default)
```

Classifier scores and rationales are attached to Braintrust uploads and printed in the pytest output metadata.

**Debugging classifiers**

```bash
python -m pytest ... -o log_cli=true -o log_cli_level=DEBUG -s
```

Look for `classifier score ...` lines to confirm the semantic judge executed.

## Iterations & Tags

- `ITERATIONS=<n>` repeats every scenario, which is useful for non-deterministic models.
- Filter suites with pytest markers: `-m aks_eval`, `-m easy`, `-m medium`, etc.
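
For example, combining both knobs (marker expressions are standard pytest syntax):

```bash
ITERATIONS=3 python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -m "aks_eval and easy"
```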
## Troubleshooting

- Missing mocks: rerun with `RUN_LIVE=true GENERATE_MOCKS=true`.
- Cleanup always executes unless `--skip-cleanup` is provided; check the `[cleanup]` log line.
- "Braintrust disabled" messages mean credentials or the SDK are missing.
- "Classifier disabled" messages usually indicate missing Azure settings (`AZURE_API_BASE`, `AZURE_API_KEY`, `AZURE_API_VERSION`).

## Quick Checklist

- Install dependencies inside a virtual environment (`python -m pip install -e .`) and, if needed, the optional tooling (`python -m pip install braintrust openai autoevals`).
- `RUN_LIVE=true`: set Azure creds (`AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`), `MODEL`, kubeconfig, and optional Braintrust vars.
- `RUN_LIVE` unset/false: ensure each scenario directory has `mocks/response.txt`.
- Classifier overrides: `CLASSIFIER_MODEL` (defaults to `MODEL`) and per-scenario `evaluation.correctness.type`.
- Optional: `BRAINTRUST_RUN_ID=<identifier>` to reuse experiment names across retries.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------
Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from __future__ import annotations

import logging
import os
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional
from urllib.parse import quote

LOGGER = logging.getLogger(__name__)


@dataclass
class BraintrustMetadata:
    project: str
    dataset: str
    experiment: Optional[str]
    api_key: str
    org: str

class BraintrustUploader:
    """Uploads eval results to Braintrust when credentials and SDK are available."""

    def __init__(self, env: Mapping[str, str | None]) -> None:
        self._env = env
        self._metadata = self._load_metadata(env)
        self._enabled = self._metadata is not None
        self._braintrust = None
        self._dataset = None
        self._experiments: Dict[str, Any] = {}
        self._warning_emitted = False

    @staticmethod
    def _load_metadata(env: Mapping[str, str | None]) -> Optional[BraintrustMetadata]:
        api_key = env.get("BRAINTRUST_API_KEY") or ""
        org = env.get("BRAINTRUST_ORG") or ""
        if not api_key or not org:
            return None
        project = env.get("BRAINTRUST_PROJECT") or "aks-agent"
        dataset = env.get("BRAINTRUST_DATASET") or "aks-agent/ask"
        experiment = env.get("EXPERIMENT_ID")
        return BraintrustMetadata(
            project=project,
            dataset=dataset,
            experiment=experiment,
            api_key=api_key,
            org=org,
        )

    @property
    def enabled(self) -> bool:
        return self._enabled

    def _warn_once(self, message: str) -> None:
        if not self._warning_emitted:
            LOGGER.warning("[braintrust] %s", message)
            self._warning_emitted = True

    def _ensure_braintrust(self) -> bool:
        if self._braintrust is not None:
            return True
        if not self._enabled or not self._metadata:
            return False
        try:
            import braintrust  # type: ignore
        except ImportError:
            self._warn_once(
                "braintrust package not installed; skipping Braintrust uploads"
            )
            self._enabled = False
            return False

        # Configure environment for the braintrust SDK
        os.environ.setdefault("BRAINTRUST_API_KEY", self._metadata.api_key)
        os.environ.setdefault("BRAINTRUST_ORG", self._metadata.org)
        self._braintrust = braintrust
        return True

    def _ensure_dataset(self) -> Optional[Any]:
        if not self._ensure_braintrust():
            return None
        if self._dataset is None and self._metadata:
            try:
                self._dataset = self._braintrust.init_dataset(  # type: ignore[attr-defined]
                    project=self._metadata.project,
                    name=self._metadata.dataset,
                )
            except Exception as exc:  # pragma: no cover - SDK specific failure
                self._warn_once(f"Unable to initialise Braintrust dataset: {exc}")
                self._enabled = False
                return None
        return self._dataset

    def _get_experiment(self, experiment_name: str) -> Optional[Any]:
        if experiment_name in self._experiments:
            return self._experiments[experiment_name]
        dataset = self._ensure_dataset()
        if dataset is None or not self._metadata:
            return None
        try:
            experiment = self._braintrust.init(  # type: ignore[attr-defined]
                project=self._metadata.project,
                experiment=experiment_name,
                dataset=dataset,
                open=False,
                update=True,
                metadata={"aks_agent": True},
            )
        except Exception as exc:  # pragma: no cover - SDK specific failure
            self._warn_once(f"Unable to initialise Braintrust experiment: {exc}")
            self._enabled = False
            return None
        self._experiments[experiment_name] = experiment
        return experiment

    def _build_url(
        self,
        experiment_name: str,
        span_id: Optional[str],
        root_span_id: Optional[str],
    ) -> Optional[str]:
        if not self._metadata:
            return None
        encoded_exp = quote(experiment_name, safe="")
        base = (
            f"https://www.braintrust.dev/app/{self._metadata.org}/p/"
            f"{self._metadata.project}/experiments/{encoded_exp}"
        )
        if span_id and root_span_id:
            return f"{base}?r={span_id}&s={root_span_id}"
        return base

    def record(
        self,
        *,
        scenario_name: str,
        iteration: int,
        total_iterations: int,
        prompt: str,
        answer: str,
        expected_output: list[str],
        model: str,
        tags: list[str],
        passed: bool,
        run_live: bool,
        raw_output: str,
        resource_group: str,
        cluster_name: str,
        error_message: Optional[str] = None,
        classifier_score: Optional[float] = None,
        classifier_rationale: Optional[str] = None,
    ) -> Optional[Dict[str, Optional[str]]]:
        if not self._enabled:
            return None
        metadata = self._metadata
        if not metadata:
            return None

        if metadata.experiment:
            experiment_name = metadata.experiment
        else:
            iteration_token = (
                os.environ.get("BRAINTRUST_RUN_ID")
                or os.environ.get("GITHUB_RUN_ID")
                or os.environ.get("CI_PIPELINE_ID")
            )
            if not iteration_token:
                iteration_token = f"{model}-{os.getpid()}"
            experiment_name = f"aks-agent/{model}/{iteration_token}"
        experiment = self._get_experiment(experiment_name)
        if experiment is None:
            return None

        span = experiment.start_span(
            name=f"{scenario_name} [iter {iteration + 1}/{total_iterations}]"
        )
        # Distinct name so the span metadata does not shadow the uploader metadata above.
        span_metadata: Dict[str, Any] = {
            "raw_output": raw_output,
            "resource_group": resource_group,
            "cluster_name": cluster_name,
            "error": error_message,
        }
        if classifier_score is not None:
            span_metadata["classifier_score"] = classifier_score
        if classifier_rationale:
            span_metadata["classifier_rationale"] = classifier_rationale

        span.log(
            input=prompt,
            output=answer,
            expected="\n".join(expected_output),
            dataset_record_id=scenario_name,
            scores={
                "correctness": 1 if passed else 0,
                "classifier": classifier_score,
            },
            tags=list(tags) + [f"model:{model}", f"run_live:{run_live}"],
            metadata=span_metadata,
        )
        span_id = getattr(span, "id", None)
        root_span_id = getattr(span, "root_span_id", None)
        span.end()
        try:
            experiment.flush()
        except Exception as exc:  # pragma: no cover - SDK specific failure
            self._warn_once(f"Failed flushing Braintrust experiment: {exc}")
            self._enabled = False
            return None
        return {
            "span_id": span_id,
            "root_span_id": root_span_id,
            "url": self._build_url(experiment_name, span_id, root_span_id),
            "classifier_score": classifier_score,
            "classifier_rationale": classifier_rationale,
        }
