
RFC: SDK-native evaluation harness (local + Azure/OTel sinks, LLM-as-judge, analysis agent, suite runner) #1753

Description

@garrettlondon1

Summary

Proposed: an opt-in, SDK-native evaluation harness that would measure and grade Copilot agent runs directly from the SDK's event stream, score them (operationally and/or with an LLM-as-judge), and fan results out to pluggable destinations — a local file and/or OpenTelemetry backends such as Azure Monitor / Application Insights. The proposal also includes an optional, fully-customizable analysis ("improve") agent and a suite runner for evaluating across many sessions.

Filing per CONTRIBUTING.md ("Please discuss any feature work with us before writing code"). This is a design proposal for team alignment, not a request to merge. A reference prototype exists in the .NET SDK to validate the shape, but everything below is proposed and open to change. If accepted, it would be designed to land across all supported SDKs.

Motivation

The SDK exposes a rich typed event stream (assistant.usage, tool.execution_start/complete, abort, model.call_failure, user.message, assistant.message) and a runtime-side TelemetryConfig, but there is no first-class way to evaluate a run: aggregate tokens/cost/tool-economy/reliability, score output quality, persist a journal, compare runs over time, or batch-run a dataset. Consumers re-implement this per project. An idiomatic, observe-only eval surface would make the SDKs "nicer to use" (an explicitly welcomed contribution category) without changing agent behavior.

Proposed pipeline

EvalCollector (taps events) → EvalRunRecord (operational metrics + optional transcript) → IEvalEvaluator (pluggable scorer) → EvalInsight (generic verdict + named [0,1] scores) → IEvalSink[] (fan-out). Orchestrated by EvalHarness; batched by EvalSuite. (Names are illustrative and open to bikeshedding.)

Proposed functionality

Capture — EvalCollector

  • Would subscribe to a session via the idiomatic On<T> pattern and aggregate per-task buckets (BeginTask) of token usage (input/output/cache/reasoning), cost, per-model usage, tool-call counts + failures, aborts, and model-call failures.
  • Would map tool.execution_complete (which carries only ToolCallId) back to the tool name from tool.execution_start.
  • Optional transcript capture (user/assistant/tool turns) — proposed off by default for privacy, enabled via CaptureContent. This is what would unlock output-quality grading.
  • Designed to be usable standalone (no harness), exposing an immutable EvalMetricsSnapshot.
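
The call-id-to-name mapping described above can be sketched as follows. This is a minimal illustration, not the prototype's code: ToolStart, ToolComplete, ToolStats, and ToolStatsCollector are hypothetical stand-ins for the SDK's tool.execution_start / tool.execution_complete events and for whatever EvalCollector would do internally.

```csharp
using System;
using System.Collections.Generic;

// Demo: two tool calls, one of which fails.
var collector = new ToolStatsCollector();
collector.OnStart(new ToolStart("call-1", "read_file"));
collector.OnStart(new ToolStart("call-2", "run_tests"));
collector.OnComplete(new ToolComplete("call-1", Success: true));
collector.OnComplete(new ToolComplete("call-2", Success: false));

foreach (var (tool, stats) in collector.Snapshot())
    Console.WriteLine($"{tool}: calls={stats.Calls}, failures={stats.Failures}");

// Hypothetical stand-ins for the SDK's typed tool events (not the real event types).
public sealed record ToolStart(string ToolCallId, string ToolName);
public sealed record ToolComplete(string ToolCallId, bool Success);
public sealed record ToolStats(int Calls, int Failures);

public sealed class ToolStatsCollector
{
    private readonly Dictionary<string, string> _nameByCallId = new();
    private readonly Dictionary<string, ToolStats> _statsByTool = new();

    // Remember which tool each call id belongs to.
    public void OnStart(ToolStart e) => _nameByCallId[e.ToolCallId] = e.ToolName;

    public void OnComplete(ToolComplete e)
    {
        // Completion events carry only the call id, so recover the name recorded at start.
        if (!_nameByCallId.Remove(e.ToolCallId, out var name))
            name = "(unknown)";

        var prev = _statsByTool.TryGetValue(name, out var s) ? s : new ToolStats(0, 0);
        _statsByTool[name] = new ToolStats(prev.Calls + 1, prev.Failures + (e.Success ? 0 : 1));
    }

    // Returns a copy so callers cannot mutate the collector's state.
    public IReadOnlyDictionary<string, ToolStats> Snapshot() =>
        new Dictionary<string, ToolStats>(_statsByTool);
}
```

Completions with no matching start are counted under "(unknown)" rather than dropped, which is one possible policy; the RFC does not say how the collector would treat them.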

Records & journal — EvalRunRecord / EvalJournal / EvalJournalEntry

  • One append-only record per run: identity, timing, dominant model, task label, success flag, metrics snapshot, the OTel trace-id (to join runtime-native traces), and the optional transcript.
  • A JSON Lines journal: append-only, stream-friendly, corruption-tolerant reads, and thread-safe concurrent appends so parallel suites don't collide. Serialization would use source-generated JSON so it stays AOT-safe.
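
A rough sketch of the journal behavior listed above (locked appends, reads that skip damaged lines). RunEntry and JsonlJournal are hypothetical names with a trimmed-down shape. For brevity the sketch uses reflection-based System.Text.Json rather than the source-generated serialization the proposal calls for, and the lock only guards appends within a single process.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

var path = Path.Combine(Path.GetTempPath(), $"evals-{Guid.NewGuid():N}.jsonl");
var journal = new JsonlJournal(path);
journal.Append(new RunEntry("run-1", "summarize", true, 1200));
journal.Append(new RunEntry("run-2", "summarize", false, 1850));
File.AppendAllText(path, "{ truncated line\n"); // simulate a partially written record

foreach (var entry in journal.ReadAll())
    Console.WriteLine($"{entry.RunId}: success={entry.Success}, tokens={entry.TotalTokens}");

// Hypothetical, trimmed-down record; the proposed EvalRunRecord would carry more fields.
public sealed record RunEntry(string RunId, string TaskLabel, bool Success, long TotalTokens);

public sealed class JsonlJournal
{
    private readonly string _path;
    private readonly object _gate = new();

    public JsonlJournal(string path) => _path = path;

    public void Append(RunEntry entry)
    {
        // Default serialization is single-line, which is what JSON Lines requires.
        var line = JsonSerializer.Serialize(entry);
        lock (_gate) // serializes appends from parallel runs within this process
        {
            File.AppendAllText(_path, line + "\n");
        }
    }

    public IReadOnlyList<RunEntry> ReadAll()
    {
        var entries = new List<RunEntry>();
        if (!File.Exists(_path)) return entries;

        foreach (var line in File.ReadLines(_path))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            try
            {
                var entry = JsonSerializer.Deserialize<RunEntry>(line);
                if (entry is not null) entries.Add(entry);
            }
            catch (JsonException)
            {
                // Corruption-tolerant read: skip a damaged line, keep the rest.
            }
        }
        return entries;
    }
}
```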

Scoring — IEvalEvaluator / EvalInsight / OperationalEvaluator

  • EvalInsight proposed as generic and unopinionated: Verdict + OverallScore + a named Scores map + Summary + factual Notes. An evaluator may emit any dimensions.
  • A default OperationalEvaluator would be deterministic, free, and model-free, scoring three dimensions: efficiency (token economy vs prior run), tool_economy (tool success rate), and reliability (aborts/model-failures/success), plus a verdict (Improved/Flat/Degraded/Inconclusive) derived from explicit run-over-run signals.
  • Model-performance / LLM-as-judge grading would simply be a custom IEvalEvaluator over the captured transcript. Operational and quality evaluators would coexist and emit the same EvalInsight type.
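
To make the three operational dimensions concrete, here is one way such a scorer could be written. The formulas, the 5% noise threshold, and the 0.25 penalty are assumptions chosen for illustration; the RFC does not specify them. RunMetrics, Insight, and OperationalScorer are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var previous = new RunMetrics(TotalTokens: 2000, ToolCalls: 10, ToolFailures: 2, Aborts: 0, ModelFailures: 0, Success: true);
var current = new RunMetrics(TotalTokens: 1500, ToolCalls: 8, ToolFailures: 0, Aborts: 0, ModelFailures: 0, Success: true);

var insight = OperationalScorer.Evaluate(current, previous);
Console.WriteLine($"{insight.Verdict} overall={insight.OverallScore:F2}");
foreach (var (name, score) in insight.Scores)
    Console.WriteLine($"  {name}={score:F2}");

public enum Verdict { Improved, Flat, Degraded, Inconclusive }

public sealed record RunMetrics(long TotalTokens, int ToolCalls, int ToolFailures, int Aborts, int ModelFailures, bool Success);
public sealed record Insight(Verdict Verdict, double OverallScore, IReadOnlyDictionary<string, double> Scores);

public static class OperationalScorer
{
    public static Insight Evaluate(RunMetrics current, RunMetrics? previous)
    {
        // Assumed formula: 0.5 means "same tokens as the prior run"; fewer tokens scores higher.
        double efficiency = previous is null || previous.TotalTokens <= 0
            ? 0.5
            : Math.Clamp(1.0 - current.TotalTokens / (2.0 * previous.TotalTokens), 0.0, 1.0);

        // Tool success rate; a run with no tool calls is treated as fully economical.
        double toolEconomy = current.ToolCalls == 0
            ? 1.0
            : 1.0 - (double)current.ToolFailures / current.ToolCalls;

        // Assumed formula: a failed run scores 0; otherwise each abort or model failure costs 0.25.
        double reliability = current.Success
            ? Math.Max(0.0, 1.0 - 0.25 * (current.Aborts + current.ModelFailures))
            : 0.0;

        var scores = new Dictionary<string, double>
        {
            ["efficiency"] = efficiency,
            ["tool_economy"] = toolEconomy,
            ["reliability"] = reliability,
        };

        return new Insight(Compare(current, previous), scores.Values.Average(), scores);
    }

    private static Verdict Compare(RunMetrics current, RunMetrics? previous)
    {
        if (previous is null) return Verdict.Inconclusive; // nothing to compare against
        if (current.Success != previous.Success)
            return current.Success ? Verdict.Improved : Verdict.Degraded;

        // Assumed threshold: a token change under 5% counts as noise.
        double delta = (double)(current.TotalTokens - previous.TotalTokens) / Math.Max(1, previous.TotalTokens);
        if (delta <= -0.05) return Verdict.Improved;
        if (delta >= 0.05) return Verdict.Degraded;
        return Verdict.Flat;
    }
}
```

An LLM-as-judge evaluator would return the same Insight shape with different score names, which is what lets both kinds of score share sinks and dashboards.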

Destinations — IEvalSink / IEvalHistorySource (the flexibility)

  • Every completed run would fan out to all configured sinks at once, so one run can land locally and in Azure simultaneously.
  • Local — JsonlEvalSink: zero-infra append-only JSONL journal that also serves run history; proposed as the default when no sink is configured.
  • Azure / OTLP — OpenTelemetryEvalSink: would emit run metrics and scores through public EvalTelemetry instruments, picking no destination itself — wherever the host routes the meter is where signals land:
    // EvalTelemetry is the proposed telemetry surface; the host chooses the exporter.
    builder.Services.AddOpenTelemetry()
        .WithMetrics(m => m.AddMeter(EvalTelemetry.MeterName))            // eval metrics and scores
        .WithTracing(t => t.AddSource(EvalTelemetry.ActivitySourceName))  // per-run spans
        .UseAzureMonitor();          // App Insights; an OTLP or console exporter works the same way

    Proposed metrics: copilot.evals.tokens, copilot.evals.tool_calls, copilot.evals.cost, and copilot.evals.score (tagged by dimension, so operational and quality trends sit side by side), plus a per-run span whose trace-id is stamped onto each record.
  • IEvalSink would be public so consumers can write a database/queue/dashboard sink in a few lines.
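
The fan-out and the "custom sink in a few lines" claim can be illustrated with a small sketch. IRunSink, RunResult, ConsoleSink, InMemorySink, and FanOut are hypothetical; the proposed IEvalSink would presumably receive an EvalRunRecord plus EvalInsight, and its real signature is not given in this RFC.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var memory = new InMemorySink();
IRunSink[] sinks = { new ConsoleSink(), memory };

await FanOut.WriteAllAsync(sinks, new RunResult("run-1", 0.82));
Console.WriteLine($"in-memory sink holds {memory.Results.Count} result(s)");

// Hypothetical, simplified payload standing in for record + insight.
public sealed record RunResult(string RunId, double OverallScore);

public interface IRunSink
{
    Task WriteAsync(RunResult result, CancellationToken ct = default);
}

public sealed class ConsoleSink : IRunSink
{
    public Task WriteAsync(RunResult result, CancellationToken ct = default)
    {
        Console.WriteLine($"[eval] {result.RunId} score={result.OverallScore:F2}");
        return Task.CompletedTask;
    }
}

public sealed class InMemorySink : IRunSink
{
    private readonly List<RunResult> _results = new();

    // Returns a copy taken under the lock so readers never see a list mid-mutation.
    public IReadOnlyList<RunResult> Results { get { lock (_results) return _results.ToArray(); } }

    public Task WriteAsync(RunResult result, CancellationToken ct = default)
    {
        lock (_results) _results.Add(result);
        return Task.CompletedTask;
    }
}

public static class FanOut
{
    // Writes to every sink concurrently; a failing sink is reported but does not block the others.
    public static async Task WriteAllAsync(IEnumerable<IRunSink> sinks, RunResult result, CancellationToken ct = default)
    {
        var writes = sinks.Select(async sink =>
        {
            try { await sink.WriteAsync(result, ct); }
            catch (Exception ex) { Console.Error.WriteLine($"sink {sink.GetType().Name} failed: {ex.Message}"); }
        });
        await Task.WhenAll(writes);
    }
}
```

Isolating sink failures is a design choice made for this sketch; whether a failing sink should fail the run is a question the RFC leaves open.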

Orchestration — EvalHarness

  • EvalHarness.Attach(session, options) once per session; BeginTask before each prompt; CompleteRunAsync returning a record + insight and writing to every sink. Opt-in via ENABLE_EVALS or EvalOptions.Enabled, and proposed to be an inert no-op when disabled (zero prod overhead).
  • EvalOptions would carry: Enabled, WorkingDirectory, CaptureContent, Sinks (+ fluent AddLocalSink() / AddOpenTelemetrySink()), Evaluator, FromEnvironment().

Analysis sidecar — EvalAgent (the proposed "improve" agent)

  • A separate, optional agent you could send off to analyze eval data and recommend improvements — decoupled from the eval loop and observe-only (never edits instructions/prompts/config).
  • Fully customizable: EvalAgentOptions (persona/SystemPrompt, Instruction, Model), backed by an EvalAnalysisRunner delegate so it can drive a real Copilot session, a different provider, or a test stub.
  • A FromClient(client, configure) helper would create a session, apply the persona/model, send the instruction + serialized runs, and return the analysis; AnalyzeAsync(history) would run it over history read from the journal.
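
The delegate seam is what makes the agent swappable, so a short sketch may help. AnalysisRunner, AnalysisOptions, AnalysisAgent, and RunSummary are hypothetical simplifications of the proposed EvalAnalysisRunner, EvalAgentOptions, and EvalAgent; here the runner is a test stub rather than a real Copilot session.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Test stub: no model call, just reports how much input it received.
AnalysisRunner stub = (systemPrompt, input, ct) =>
    Task.FromResult($"stub analysis of {input.Length} chars");

var agent = new AnalysisAgent(stub, new AnalysisOptions(
    SystemPrompt: "You review eval runs and suggest improvements. Never edit anything.",
    Instruction: "Find the most expensive task and suggest one change."));

var history = new[] { new RunSummary("run-1", 1200, true), new RunSummary("run-2", 1850, false) };
Console.WriteLine(await agent.AnalyzeAsync(history));

// Hypothetical, trimmed-down shapes for illustration only.
public sealed record RunSummary(string RunId, long TotalTokens, bool Success);
public sealed record AnalysisOptions(string SystemPrompt, string Instruction);

// The seam: anything that can turn (system prompt, input) into text can back the agent.
public delegate Task<string> AnalysisRunner(string systemPrompt, string input, CancellationToken ct);

public sealed class AnalysisAgent
{
    private readonly AnalysisRunner _runner;
    private readonly AnalysisOptions _options;

    public AnalysisAgent(AnalysisRunner runner, AnalysisOptions options)
    {
        _runner = runner;
        _options = options;
    }

    public Task<string> AnalyzeAsync(IEnumerable<RunSummary> history, CancellationToken ct = default)
    {
        // Observe-only: the agent is handed serialized runs and returns text; it has nothing to write to.
        var input = _options.Instruction + "\n\n" + JsonSerializer.Serialize(history);
        return _runner(_options.SystemPrompt, input, ct);
    }
}
```

A FromClient-style helper would then be a factory that builds the runner delegate around a real session.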

Batch — EvalSuite (across many sessions)

  • Define a dataset of EvalCases (name + prompt, optional per-case SessionConfig and success predicate) and run each in its own fresh session with bounded concurrency (MaxConcurrency, Timeout).
  • Aggregate into EvalSuiteResult: per-case results plus MeanOverallScore, MeanScores() per dimension, PassCount/FailCount, TotalTokens, TotalCost — the offline / CI-gate path, reusing the same evaluator and sinks.
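
Bounded concurrency with a per-case timeout could look roughly like this. SuiteCase, CaseResult, SuiteSummary, and MiniSuite are hypothetical, and the runner delegate is a stub standing in for "open a fresh session and send the prompt"; only the MaxConcurrency/Timeout idea and the aggregate names come from the proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var cases = new[]
{
    new SuiteCase("short", "Summarize the README"),
    new SuiteCase("long", "Explain the build pipeline step by step"),
    new SuiteCase("empty", ""),
};

// Stub runner: a real one would create a session per case and evaluate the run.
Func<SuiteCase, CancellationToken, Task<CaseResult>> runner = async (c, ct) =>
{
    await Task.Delay(10, ct);
    bool passed = c.Prompt.Length > 0;
    return new CaseResult(c.Name, passed, passed ? 1.0 : 0.0);
};

var summary = await MiniSuite.RunAsync(cases, runner, maxConcurrency: 2, timeout: TimeSpan.FromSeconds(5));
Console.WriteLine($"pass={summary.PassCount} fail={summary.FailCount} mean={summary.MeanOverallScore:F2}");

public sealed record SuiteCase(string Name, string Prompt);
public sealed record CaseResult(string Name, bool Passed, double OverallScore);
public sealed record SuiteSummary(IReadOnlyList<CaseResult> Results)
{
    public int PassCount => Results.Count(r => r.Passed);
    public int FailCount => Results.Count - PassCount;
    public double MeanOverallScore => Results.Count == 0 ? 0.0 : Results.Average(r => r.OverallScore);
}

public static class MiniSuite
{
    public static async Task<SuiteSummary> RunAsync(
        IEnumerable<SuiteCase> cases,
        Func<SuiteCase, CancellationToken, Task<CaseResult>> runCase,
        int maxConcurrency,
        TimeSpan timeout)
    {
        using var gate = new SemaphoreSlim(maxConcurrency); // caps the number of cases in flight

        var tasks = cases.Select(async c =>
        {
            await gate.WaitAsync();
            try
            {
                using var cts = new CancellationTokenSource(timeout); // per-case timeout
                return await runCase(c, cts.Token);
            }
            catch (OperationCanceledException)
            {
                // A timed-out case is recorded as a failure; other exceptions propagate.
                return new CaseResult(c.Name, Passed: false, OverallScore: 0.0);
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        return new SuiteSummary(await Task.WhenAll(tasks));
    }
}
```

In a CI gate, the caller would compare PassCount or MeanOverallScore against a threshold and fail the build when it is not met.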

Scope / open questions for the team

  1. Is an eval surface in-scope for the SDKs, or should it be an external companion library?
  2. Naming/namespace and GA-vs-preview gating.
  3. Cross-language parity plan (Node/Python/Go/.NET/Java/Rust) — suggest landing the shape in one SDK behind preview, then porting.
  4. Boundary with the runtime-side TelemetryConfig (this would be additive/SDK-side, not a replacement).
  5. Whether the analysis agent (EvalAgent) and suite runner (EvalSuite) belong in an initial cut or a follow-up.

Non-goals

  • No change to agent/runtime behavior; strictly observe-only.
  • Not a clone of any vendor's eval catalog — provide the engine and seams, not opinionated graders.

Happy to adjust scope based on the roadmap; a .NET reference prototype can seed the discussion.
