RFC: SDK-native evaluation harness (local + Azure/OTel sinks, LLM-as-judge, analysis agent, suite runner) #1753

## Summary

Proposed: an **opt-in, SDK-native evaluation harness** that would measure and grade Copilot agent runs directly from the SDK's event stream, score them (operationally and/or with an LLM-as-judge), and fan results out to pluggable destinations — a **local file** and/or **OpenTelemetry backends such as Azure Monitor / Application Insights**. The proposal also includes an optional, fully-customizable **analysis ("improve") agent** and a **suite runner** for evaluating across many sessions.

> Filing per `CONTRIBUTING.md` ("Please discuss any feature work with us before writing code"). This is a design proposal for team alignment, not a request to merge. A reference prototype exists in the .NET SDK to validate the shape, but everything below is proposed and open to change. If accepted, it would be designed to land across all supported SDKs.

## Motivation

The SDK exposes a rich typed event stream (`assistant.usage`, `tool.execution_start/complete`, `abort`, `model.call_failure`, `user.message`, `assistant.message`) and a runtime-side `TelemetryConfig`, but there is **no first-class way to evaluate a run**: aggregate tokens/cost/tool-economy/reliability, score output quality, persist a journal, compare runs over time, or batch-run a dataset. Consumers re-implement this per project. An idiomatic, observe-only eval surface would make the SDKs "nicer to use" (an explicitly welcomed contribution category) without changing agent behavior.

## Proposed pipeline

`EvalCollector` (taps events) → `EvalRunRecord` (operational metrics + optional transcript) → `IEvalEvaluator` (pluggable scorer) → `EvalInsight` (generic verdict + named `[0,1]` scores) → `IEvalSink[]` (fan-out). Orchestrated by `EvalHarness`; batched by `EvalSuite`. (Names are illustrative and open to bikeshedding.)
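
To make the flow above concrete, here is a minimal sketch of the evaluator and sink seams with a fan-out step. Every signature is an assumption made for illustration (the RFC names the types but does not fix their members), and `ConstantEvaluator` and `InMemorySink` are hypothetical stand-ins that exist only so the pipeline can run without a model or a backend.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical shapes for illustration; the RFC does not fix these signatures.
public sealed record EvalRunRecord(string RunId, string TaskLabel, bool Success, long TotalTokens);

public sealed record EvalInsight(string Verdict, double OverallScore, IReadOnlyDictionary<string, double> Scores);

public interface IEvalEvaluator
{
    Task<EvalInsight> EvaluateAsync(EvalRunRecord run);
}

public interface IEvalSink
{
    Task WriteAsync(EvalRunRecord run, EvalInsight insight);
}

// Trivial stand-ins so the pipeline can be exercised without a model or a backend.
public sealed class ConstantEvaluator : IEvalEvaluator
{
    private readonly double _score;
    public ConstantEvaluator(double score) => _score = score;

    public Task<EvalInsight> EvaluateAsync(EvalRunRecord run) =>
        Task.FromResult(new EvalInsight("Inconclusive", _score,
            new Dictionary<string, double> { ["reliability"] = _score }));
}

public sealed class InMemorySink : IEvalSink
{
    public List<(EvalRunRecord Run, EvalInsight Insight)> Written { get; } = new();

    public Task WriteAsync(EvalRunRecord run, EvalInsight insight)
    {
        lock (Written) { Written.Add((run, insight)); }
        return Task.CompletedTask;
    }
}

public static class EvalPipeline
{
    // Score one finished run, then hand the same record and insight to every sink concurrently.
    public static async Task<EvalInsight> CompleteRunAsync(
        EvalRunRecord run, IEvalEvaluator evaluator, IReadOnlyList<IEvalSink> sinks)
    {
        EvalInsight insight = await evaluator.EvaluateAsync(run);
        await Task.WhenAll(sinks.Select(sink => sink.WriteAsync(run, insight)));
        return insight;
    }
}
```

Because both seams are interfaces, an operational scorer, an LLM judge, a file journal, and a telemetry exporter can all plug into the same `CompleteRunAsync` call without knowing about each other.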

## Proposed functionality

**Capture — `EvalCollector`**
- Would subscribe to a session via the idiomatic `On<T>` pattern and aggregate per-task buckets (`BeginTask`) of token usage (input/output/cache/reasoning), cost, per-model usage, tool-call counts + failures, aborts, and model-call failures.
- Would map `tool.execution_complete` (which carries only `ToolCallId`) back to the tool name from `tool.execution_start`.
- Optional **transcript capture** (user/assistant/tool turns) — proposed **off by default for privacy**, enabled via `CaptureContent`. This is what would unlock output-quality grading.
- Designed to be usable standalone (no harness), exposing an immutable `EvalMetricsSnapshot`.
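
The id-to-name mapping described above could look roughly like the following. This is a sketch under stated assumptions: `ToolExecutionStart`, `ToolExecutionComplete`, `ToolStats`, and `ToolCallAggregator` are hypothetical stand-ins, not the SDK's event types, and the `Success` flag on the completion event is assumed for illustration.

```csharp
#nullable enable
using System.Collections.Generic;

// Stand-in event types; the real SDK events carry more fields than shown here.
public sealed record ToolExecutionStart(string ToolCallId, string ToolName);
public sealed record ToolExecutionComplete(string ToolCallId, bool Success);

public sealed record ToolStats(int Calls, int Failures);

public sealed class ToolCallAggregator
{
    private readonly object _gate = new();
    private readonly Dictionary<string, string> _nameByCallId = new();
    private readonly Dictionary<string, ToolStats> _statsByTool = new();

    public void OnStart(ToolExecutionStart e)
    {
        // Remember which tool this call id belongs to until the call completes.
        lock (_gate) { _nameByCallId[e.ToolCallId] = e.ToolName; }
    }

    public void OnComplete(ToolExecutionComplete e)
    {
        lock (_gate)
        {
            // The completion event only has the call id, so recover the name recorded at start.
            if (!_nameByCallId.Remove(e.ToolCallId, out string? name)) name = "(unknown)";

            _statsByTool.TryGetValue(name, out ToolStats? current);
            current ??= new ToolStats(0, 0);
            _statsByTool[name] = new ToolStats(current.Calls + 1, current.Failures + (e.Success ? 0 : 1));
        }
    }

    // Returns a copy, so callers never observe later mutations.
    public IReadOnlyDictionary<string, ToolStats> Snapshot()
    {
        lock (_gate) { return new Dictionary<string, ToolStats>(_statsByTool); }
    }
}
```

Removing the id on completion keeps the lookup table bounded by the number of in-flight calls, and a completion with no matching start is counted under a placeholder name rather than dropped.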

**Records & journal — `EvalRunRecord` / `EvalJournal` / `EvalJournalEntry`**
- One append-only record per run: identity, timing, dominant model, task label, success flag, metrics snapshot, the OTel trace-id (to join runtime-native traces), and the optional transcript.
- A JSON Lines journal: append-only, stream-friendly, corruption-tolerant reads, and **thread-safe concurrent appends** so parallel suites don't collide. Source-generated JSON (AOT-safe).
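
As a rough illustration of the journal properties listed above, the sketch below appends one JSON object per line under a lock and skips unparseable lines on read. `JournalEntry` and `JsonlJournal` are hypothetical names. The lock only serializes writers inside one process, and the sketch uses reflection-based `System.Text.Json` for brevity rather than the source-generated serializer the proposal calls for.

```csharp
#nullable enable
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical, trimmed-down entry; the proposed record carries far more fields.
public sealed record JournalEntry(string RunId, bool Success, long TotalTokens);

public sealed class JsonlJournal
{
    private readonly string _path;
    private readonly object _gate = new();

    public JsonlJournal(string path) => _path = path;

    public void Append(JournalEntry entry)
    {
        // Serialize outside the lock; the default serializer output is a single line.
        string line = JsonSerializer.Serialize(entry);
        lock (_gate)
        {
            File.AppendAllText(_path, line + "\n");
        }
    }

    public List<JournalEntry> ReadAll()
    {
        var entries = new List<JournalEntry>();
        string[] lines;
        lock (_gate)
        {
            if (!File.Exists(_path)) return entries;
            lines = File.ReadAllLines(_path);
        }

        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            try
            {
                JournalEntry? entry = JsonSerializer.Deserialize<JournalEntry>(line);
                if (entry is not null) entries.Add(entry);
            }
            catch (JsonException)
            {
                // A torn or corrupt line is skipped so the rest of the history stays readable.
            }
        }
        return entries;
    }
}
```

Writers in separate processes would need an OS-level file lock or per-process journal files; that choice is left open here.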

**Scoring — `IEvalEvaluator` / `EvalInsight` / `OperationalEvaluator`**
- `EvalInsight` proposed as **generic and unopinionated**: `Verdict` + `OverallScore` + a named `Scores` map + `Summary` + factual `Notes`. An evaluator may emit any dimensions.
- A default `OperationalEvaluator` would be **deterministic, free, no-model**: scoring `efficiency` (token economy vs prior run), `tool_economy` (tool success rate), `reliability` (aborts/model-failures/success), plus a verdict (Improved/Flat/Degraded/Inconclusive) from explicit run-over-run signals.
- Model-performance / **LLM-as-judge** grading would simply be a custom `IEvalEvaluator` over the captured transcript. Operational and quality evals would coexist and emit the same insight.
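
One possible reading of the deterministic scorer is sketched below. The formulas, the 10% "flat" band, and the type names (`RunMetrics`, `OperationalInsight`, `OperationalScoring`) are all assumptions chosen for illustration; the RFC names the three dimensions and the four verdicts but does not specify how they are computed.

```csharp
#nullable enable
using System.Collections.Generic;

public enum Verdict { Improved, Flat, Degraded, Inconclusive }

public sealed record RunMetrics(
    long TotalTokens, int ToolCalls, int ToolFailures, int Aborts, int ModelFailures, bool Success);

public sealed record OperationalInsight(
    Verdict Verdict, double OverallScore, IReadOnlyDictionary<string, double> Scores);

public static class OperationalScoring
{
    // Assumed threshold: token changes within 10% of the prior run count as "flat".
    private const double FlatBand = 0.10;

    public static OperationalInsight Evaluate(RunMetrics current, RunMetrics? prior)
    {
        // tool_economy: share of tool calls that succeeded (1.0 when no tools were used).
        double toolEconomy = current.ToolCalls == 0
            ? 1.0
            : 1.0 - (double)current.ToolFailures / current.ToolCalls;

        // reliability: 0 for a failed run, otherwise decays with each abort or model-call failure.
        double reliability = current.Success
            ? 1.0 / (1 + current.Aborts + current.ModelFailures)
            : 0.0;

        // efficiency: 0.5 means "same tokens as the prior run"; fewer tokens pushes it toward 1.
        long combined = (prior?.TotalTokens ?? 0) + current.TotalTokens;
        double efficiency = prior is null || combined == 0
            ? 0.5
            : (double)prior.TotalTokens / combined;

        var scores = new Dictionary<string, double>
        {
            ["efficiency"] = efficiency,
            ["tool_economy"] = toolEconomy,
            ["reliability"] = reliability,
        };
        double overall = (efficiency + toolEconomy + reliability) / 3.0;
        return new OperationalInsight(Compare(current, prior), overall, scores);
    }

    private static Verdict Compare(RunMetrics current, RunMetrics? prior)
    {
        if (prior is null) return Verdict.Inconclusive;
        if (current.Success != prior.Success) return current.Success ? Verdict.Improved : Verdict.Degraded;
        if (current.TotalTokens < prior.TotalTokens * (1 - FlatBand)) return Verdict.Improved;
        if (current.TotalTokens > prior.TotalTokens * (1 + FlatBand)) return Verdict.Degraded;
        return Verdict.Flat;
    }
}
```

A quality grader such as an LLM judge could return the same insight shape with different dimension names, which is what lets operational and quality scores be stored and charted together.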

**Destinations — `IEvalSink` / `IEvalHistorySource` (the flexibility)**
- Every completed run would **fan out to all configured sinks at once**, so one run can land locally *and* in Azure simultaneously.
- **Local — `JsonlEvalSink`:** zero-infra append-only JSONL journal that also serves run history; proposed as the default when no sink is configured.
- **Azure / OTLP — `OpenTelemetryEvalSink`:** would emit run metrics and scores through **public** `EvalTelemetry` instruments, picking **no** destination itself — wherever the host routes the meter is where signals land:
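  ```csharp
  // Host-side wiring: the sink only emits; the host decides where signals go.
  builder.Services.AddOpenTelemetry()
      .WithMetrics(m => m.AddMeter(EvalTelemetry.MeterName))
      .WithTracing(t => t.AddSource(EvalTelemetry.ActivitySourceName))
      .UseAzureMonitor();          // App Insights. Alternatives: .UseOtlpExporter() here,
                                   // or a per-signal .AddOtlpExporter() / .AddConsoleExporter() inside WithMetrics/WithTracing.
  ```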
  Proposed metrics: `copilot.evals.tokens`, `copilot.evals.tool_calls`, `copilot.evals.cost`, `copilot.evals.score` (tagged by `dimension`, so operational and quality trends sit side by side), plus a per-run span whose trace-id is stamped onto each record.
- `IEvalSink` would be public so consumers can write a database/queue/dashboard sink in a few lines.
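
The emitting side of such a sink can be sketched with the in-box `System.Diagnostics.Metrics` API, as below. `EvalMetricsSketch`, its meter name, and the choice of a counter for tokens and a histogram for scores are assumptions for illustration; only the metric names and the `dimension` tag come from the proposal.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class EvalMetricsSketch
{
    // Hypothetical meter name; the RFC only says a public EvalTelemetry.MeterName would exist.
    public const string MeterName = "Copilot.Evals.Sketch";

    private static readonly Meter EvalMeter = new(MeterName);

    // Instrument kinds are assumed: a monotonic counter for tokens, a histogram for scores.
    private static readonly Counter<long> Tokens =
        EvalMeter.CreateCounter<long>("copilot.evals.tokens");
    private static readonly Histogram<double> Score =
        EvalMeter.CreateHistogram<double>("copilot.evals.score");

    public static void RecordRun(long totalTokens, IReadOnlyDictionary<string, double> scores)
    {
        Tokens.Add(totalTokens);

        // One measurement per dimension, tagged so backends can chart each dimension separately.
        foreach (KeyValuePair<string, double> score in scores)
        {
            Score.Record(score.Value, new KeyValuePair<string, object?>("dimension", score.Key));
        }
    }
}
```

Nothing in this code names an exporter. Measurements are only observed if the host has registered the meter, which is the property that lets the same sink feed Azure Monitor, an OTLP collector, or nothing at all.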

**Orchestration — `EvalHarness`**
- Call `EvalHarness.Attach(session, options)` once per session and `BeginTask` before each prompt; `CompleteRunAsync` would then return a record + insight and write to every sink. Opt-in via `ENABLE_EVALS` or `EvalOptions.Enabled`, and proposed to be an **inert no-op when disabled**, so a production deployment that leaves it off pays no evaluation overhead.
- `EvalOptions` would carry: `Enabled`, `WorkingDirectory`, `CaptureContent`, `Sinks` (+ fluent `AddLocalSink()` / `AddOpenTelemetrySink()`), `Evaluator`, `FromEnvironment()`.
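
The opt-in and no-op behavior could be sketched as follows. The types are hypothetical simplifications: `Attach` here takes only options (the proposed version also takes a session), `ObserveEvent` stands in for the collector's event subscriptions, and the accepted values for `ENABLE_EVALS` (`1` or `true`) are an assumption, since the RFC does not define them.

```csharp
#nullable enable
using System;

public sealed class EvalOptionsSketch
{
    public bool Enabled { get; set; }

    // Assumption: "1" or "true" (any casing) turns evals on; the RFC does not define accepted values.
    // The reader delegate is injectable so the parsing can be tested without touching the process environment.
    public static EvalOptionsSketch FromEnvironment(Func<string, string?>? readVariable = null)
    {
        if (readVariable is null) readVariable = name => Environment.GetEnvironmentVariable(name);
        string? raw = readVariable("ENABLE_EVALS");
        bool enabled = raw == "1" || string.Equals(raw, "true", StringComparison.OrdinalIgnoreCase);
        return new EvalOptionsSketch { Enabled = enabled };
    }
}

public sealed record RunSummary(string TaskLabel, int EventCount);

public sealed class EvalHarnessSketch
{
    private readonly bool _enabled;
    private string _taskLabel = "(none)";
    private int _eventCount;

    private EvalHarnessSketch(bool enabled) => _enabled = enabled;

    public static EvalHarnessSketch Attach(EvalOptionsSketch options) => new(options.Enabled);

    public void BeginTask(string label)
    {
        if (!_enabled) return;   // disabled: record nothing
        _taskLabel = label;
        _eventCount = 0;
    }

    public void ObserveEvent()
    {
        if (_enabled) _eventCount++;
    }

    // Returns null when disabled so callers can tell "no evaluation ran" from "empty run".
    public RunSummary? CompleteRun() => _enabled ? new RunSummary(_taskLabel, _eventCount) : null;
}
```

In a fuller version the disabled path would also skip subscribing to the session at all, so the cost when evals are off reduces to one boolean check per harness call.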

**Analysis sidecar — `EvalAgent` (the proposed "improve" agent)**
- A **separate, optional** agent you could send off to analyze eval data and recommend improvements — decoupled from the eval loop and **observe-only** (never edits instructions/prompts/config).
- Fully customizable: `EvalAgentOptions` (persona/`SystemPrompt`, `Instruction`, `Model`), backed by an `EvalAnalysisRunner` delegate so it can drive a real Copilot session, a different provider, or a test stub.
- A `FromClient(client, configure)` helper would create a session, apply the persona/model, send the instruction + serialized runs, and return the analysis; `AnalyzeAsync(history)` would run that analysis over the journal's run history.
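
The delegate-backed design described above might be shaped like this. The delegate signature, the default prompt strings, and the `RunRow` type are assumptions for illustration; the point is only that the agent serializes history and defers to whatever runner it was given.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical delegate shape: receives persona, instruction, and serialized runs; returns the analysis text.
public delegate Task<string> EvalAnalysisRunner(
    string systemPrompt, string instruction, string serializedRuns, CancellationToken cancellationToken);

public sealed class EvalAgentOptionsSketch
{
    // Placeholder defaults, not proposed wording.
    public string SystemPrompt { get; set; } = "You review evaluation runs and suggest improvements.";
    public string Instruction { get; set; } = "Summarize regressions across these runs.";
}

// Hypothetical, trimmed-down view of a journal entry.
public sealed record RunRow(string RunId, bool Success, long TotalTokens);

public sealed class EvalAgentSketch
{
    private readonly EvalAgentOptionsSketch _options;
    private readonly EvalAnalysisRunner _runner;

    public EvalAgentSketch(EvalAgentOptionsSketch options, EvalAnalysisRunner runner)
    {
        _options = options;
        _runner = runner;
    }

    public Task<string> AnalyzeAsync(IReadOnlyList<RunRow> history, CancellationToken cancellationToken = default)
    {
        // Observe-only: the agent reads run history and returns text; it never writes prompts or config.
        string serializedRuns = JsonSerializer.Serialize(history);
        return _runner(_options.SystemPrompt, _options.Instruction, serializedRuns, cancellationToken);
    }
}
```

In a unit test the runner can be a lambda that echoes its inputs, while a production host would pass a runner that opens a real session; the agent class itself does not change.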

**Batch — `EvalSuite` (across many sessions)**
- Define a dataset of `EvalCase`s (name + prompt, optional per-case `SessionConfig` and success predicate) and run each in **its own fresh session** with bounded concurrency (`MaxConcurrency`, `Timeout`).
- Aggregate into `EvalSuiteResult`: per-case results plus `MeanOverallScore`, `MeanScores()` per dimension, `PassCount`/`FailCount`, `TotalTokens`, `TotalCost` — the offline / CI-gate path, reusing the same evaluator and sinks.
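
Bounded concurrency with a per-case timeout, plus the aggregate fields listed above, can be sketched with a `SemaphoreSlim`. The types below are hypothetical simplifications: `runCase` stands in for "create a fresh session, send the prompt, evaluate", and treating a timeout as a failed case with a zero score is an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed record EvalCase(string Name, string Prompt);

public sealed record CaseResult(string Name, bool Passed, double OverallScore, long Tokens);

public sealed record SuiteResult(IReadOnlyList<CaseResult> Cases)
{
    public int PassCount => Cases.Count(c => c.Passed);
    public int FailCount => Cases.Count - PassCount;
    public long TotalTokens => Cases.Sum(c => c.Tokens);
    public double MeanOverallScore => Cases.Count == 0 ? 0.0 : Cases.Average(c => c.OverallScore);
}

public static class EvalSuiteSketch
{
    public static async Task<SuiteResult> RunAsync(
        IReadOnlyList<EvalCase> cases,
        Func<EvalCase, CancellationToken, Task<CaseResult>> runCase,
        int maxConcurrency,
        TimeSpan timeout)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        Task<CaseResult>[] tasks = cases.Select(async evalCase =>
        {
            // At most maxConcurrency cases get past this point at the same time.
            await gate.WaitAsync();
            try
            {
                // Each case gets its own timeout, started when the case actually begins.
                using var cts = new CancellationTokenSource(timeout);
                return await runCase(evalCase, cts.Token);
            }
            catch (OperationCanceledException)
            {
                // Assumption: a timed-out case counts as a failure with a zero score.
                return new CaseResult(evalCase.Name, false, 0.0, 0);
            }
            finally
            {
                gate.Release();
            }
        }).ToArray();

        // Task.WhenAll preserves input order, so results line up with the dataset.
        return new SuiteResult(await Task.WhenAll(tasks));
    }
}
```

A CI gate could then fail the build on something like `result.FailCount > 0` or a `MeanOverallScore` below an agreed threshold.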

## Scope / open questions for the team

1. Is an eval surface in-scope for the SDKs, or should it be an external companion library?
2. Naming/namespace and GA-vs-preview gating.
3. Cross-language parity plan (Node/Python/Go/.NET/Java/Rust) — suggest landing the shape in one SDK behind preview, then porting.
4. Boundary with the runtime-side `TelemetryConfig` (this would be additive/SDK-side, not a replacement).
5. Whether the analysis agent (`EvalAgent`) and suite runner (`EvalSuite`) belong in an initial cut or a follow-up.

## Non-goals

- No change to agent/runtime behavior; strictly observe-only.
- Not a clone of any vendor's eval catalog — provide the engine and seams, not opinionated graders.

Happy to adjust scope based on the roadmap; a .NET reference prototype can seed the discussion.

