Summary
Proposed: an opt-in, SDK-native evaluation harness that would measure and grade Copilot agent runs directly from the SDK's event stream, score them (operationally and/or with an LLM-as-judge), and fan results out to pluggable destinations: a local file and/or OpenTelemetry backends such as Azure Monitor / Application Insights. The proposal also includes an optional, fully customizable analysis ("improve") agent and a suite runner for evaluating across many sessions.
Filing per CONTRIBUTING.md ("Please discuss any feature work with us before writing code"). This is a design proposal for team alignment, not a request to merge. A reference prototype exists in the .NET SDK to validate the shape, but everything below is proposed and open to change. If accepted it would be designed to land across all supported SDKs.
Motivation
The SDK exposes a rich typed event stream (assistant.usage, tool.execution_start/complete, abort, model.call_failure, user.message, assistant.message) and a runtime-side TelemetryConfig, but there is no first-class way to evaluate a run: aggregate tokens/cost/tool-economy/reliability, score output quality, persist a journal, compare runs over time, or batch-run a dataset. Consumers re-implement this per project. An idiomatic, observe-only eval surface would make the SDKs "nicer to use" (an explicitly welcomed contribution category) without changing agent behavior.
Proposed pipeline
EvalCollector (taps events) → EvalRunRecord (operational metrics + optional transcript) → IEvalEvaluator (pluggable scorer) → EvalInsight (generic verdict + named [0,1] scores) → IEvalSink[] (fan-out). Orchestrated by EvalHarness; batched by EvalSuite. (Names are illustrative and open to bikeshedding.)
Proposed functionality
Capture — EvalCollector
- Would subscribe to a session via the idiomatic On<T> pattern and aggregate per-task buckets (BeginTask) of token usage (input/output/cache/reasoning), cost, per-model usage, tool-call counts + failures, aborts, and model-call failures.
- Would map tool.execution_complete (which carries only ToolCallId) back to the tool name from tool.execution_start.
- Optional transcript capture (user/assistant/tool turns), proposed off by default for privacy and enabled via CaptureContent. This is what would unlock output-quality grading.
- Designed to be usable standalone (no harness), exposing an immutable EvalMetricsSnapshot.
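The ToolCallId-to-name mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the ToolStart and ToolComplete records, the Success flag on completion, and the ToolCallCorrelator class are hypothetical stand-ins for whatever the SDK's real event types expose.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event shapes, for illustration only; the real SDK event types may differ.
public sealed record ToolStart(string ToolCallId, string ToolName);
public sealed record ToolComplete(string ToolCallId, bool Success);

// Sketch of the correlation step: remember the name seen at start,
// then attribute the completion to that name.
public sealed class ToolCallCorrelator
{
    private readonly object _gate = new();
    private readonly Dictionary<string, string> _nameByCallId = new();
    private readonly Dictionary<string, (int Calls, int Failures)> _statsByTool = new();

    public void OnStart(ToolStart e)
    {
        lock (_gate) { _nameByCallId[e.ToolCallId] = e.ToolName; }
    }

    public void OnComplete(ToolComplete e)
    {
        lock (_gate)
        {
            // A completion without a matching start is counted under a placeholder name.
            if (!_nameByCallId.Remove(e.ToolCallId, out var name)) name = "(unknown)";
            _statsByTool.TryGetValue(name, out var stats);
            _statsByTool[name] = (stats.Calls + 1, stats.Failures + (e.Success ? 0 : 1));
        }
    }

    // Returns a copy so callers cannot observe later mutation.
    public IReadOnlyDictionary<string, (int Calls, int Failures)> Snapshot()
    {
        lock (_gate) { return new Dictionary<string, (int Calls, int Failures)>(_statsByTool); }
    }
}
```

Returning a copy from Snapshot() is one simple way to get the immutable-snapshot behavior the proposal asks for; a real implementation might prefer immutable collections.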
Records & journal — EvalRunRecord / EvalJournal / EvalJournalEntry
- One append-only record per run: identity, timing, dominant model, task label, success flag, metrics snapshot, the OTel trace-id (to join runtime-native traces), and the optional transcript.
- A JSON Lines journal: append-only, stream-friendly, corruption-tolerant reads, and thread-safe concurrent appends so parallel suites don't collide. Source-generated JSON (AOT-safe).
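As a rough sketch of the journal behavior listed above (append-only, tolerant of corrupt lines, safe for concurrent appends), assuming a hypothetical JournalEntry record and JsonlJournal class. Note that this version uses reflection-based System.Text.Json for brevity, whereas the proposal calls for source-generated serialization, and the lock only protects appends within one process.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical, trimmed-down entry; the proposed record carries many more fields.
public sealed record JournalEntry(string RunId, string Model, long TotalTokens);

public sealed class JsonlJournal
{
    private readonly object _gate = new();
    private readonly string _path;

    public JsonlJournal(string path) => _path = path;

    // One JSON document per line; the lock serializes appends from parallel runs in this process.
    public void Append(JournalEntry entry)
    {
        var line = JsonSerializer.Serialize(entry);
        lock (_gate) { File.AppendAllText(_path, line + "\n"); }
    }

    // Skips blank or unparseable lines instead of failing the whole read.
    public List<JournalEntry> ReadAll()
    {
        var entries = new List<JournalEntry>();
        if (!File.Exists(_path)) return entries;
        foreach (var line in File.ReadLines(_path))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            try
            {
                var entry = JsonSerializer.Deserialize<JournalEntry>(line);
                if (entry is not null) entries.Add(entry);
            }
            catch (JsonException)
            {
                // Torn or corrupt line: ignore it and keep reading.
            }
        }
        return entries;
    }
}
```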
Scoring — IEvalEvaluator / EvalInsight / OperationalEvaluator
EvalInsight is proposed as generic and unopinionated: Verdict + OverallScore + a named Scores map + Summary + factual Notes. An evaluator may emit any dimensions it chooses.
- A default OperationalEvaluator would be deterministic, free, and no-model: scoring efficiency (token economy vs prior run), tool_economy (tool success rate), reliability (aborts/model-failures/success), plus a verdict (Improved/Flat/Degraded/Inconclusive) from explicit run-over-run signals.
- Model-performance / LLM-as-judge grading would simply be a custom IEvalEvaluator over the captured transcript. Operational and quality evals would coexist and emit the same insight.
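To make the operational dimensions concrete, here is one possible deterministic scorer. Everything in it is an assumption for illustration: the proposal does not fix formulas, so the RunMetrics and Insight records, the 5% token tolerance, the neutral 0.5 efficiency without a baseline, and the unweighted mean for the overall score are all hypothetical choices.

```csharp
using System;
using System.Collections.Generic;

public enum Verdict { Improved, Flat, Degraded, Inconclusive }

// Hypothetical, reduced metrics; the proposed snapshot would carry more.
public sealed record RunMetrics(long TotalTokens, int ToolCalls, int ToolFailures, int Aborts, bool Success);

public sealed record Insight(Verdict Verdict, double OverallScore, IReadOnlyDictionary<string, double> Scores);

public static class OperationalScoring
{
    public static Insight Evaluate(RunMetrics current, RunMetrics? previous)
    {
        // tool_economy: share of tool calls that succeeded (assumed 1.0 when no tools ran).
        double toolEconomy = current.ToolCalls == 0
            ? 1.0
            : (double)(current.ToolCalls - current.ToolFailures) / current.ToolCalls;

        // reliability: assumed rule, 0 for a failed run, halved when anything aborted.
        double reliability = !current.Success ? 0.0 : current.Aborts > 0 ? 0.5 : 1.0;

        // efficiency: previous tokens over current tokens, clamped to [0,1]; neutral 0.5 without a baseline.
        double efficiency = previous is null || current.TotalTokens <= 0
            ? 0.5
            : Math.Clamp((double)previous.TotalTokens / current.TotalTokens, 0.0, 1.0);

        Verdict verdict;
        if (previous is null)
        {
            verdict = Verdict.Inconclusive; // nothing to compare against
        }
        else if (current.Success != previous.Success)
        {
            verdict = current.Success ? Verdict.Improved : Verdict.Degraded;
        }
        else
        {
            // Relative token change versus the prior run, with an assumed 5% dead band.
            double change = previous.TotalTokens == 0
                ? 0.0
                : (double)(current.TotalTokens - previous.TotalTokens) / previous.TotalTokens;
            verdict = change < -0.05 ? Verdict.Improved : change > 0.05 ? Verdict.Degraded : Verdict.Flat;
        }

        var scores = new Dictionary<string, double>
        {
            ["efficiency"] = efficiency,
            ["tool_economy"] = toolEconomy,
            ["reliability"] = reliability,
        };
        return new Insight(verdict, (efficiency + toolEconomy + reliability) / 3.0, scores);
    }
}
```

An LLM-as-judge evaluator would return the same Insight shape with its own dimension names, which is what lets both kinds of score share one sink and one dashboard.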
Destinations — IEvalSink / IEvalHistorySource (the flexibility)
- Every completed run would fan out to all configured sinks at once, so one run can land locally and in Azure simultaneously.
- Local: JsonlEvalSink, a zero-infra append-only JSONL journal that also serves run history; proposed as the default when no sink is configured.
- Azure / OTLP: OpenTelemetryEvalSink would emit run metrics and scores through public EvalTelemetry instruments, picking no destination itself; wherever the host routes the meter is where signals land:
builder.Services.AddOpenTelemetry()
.WithMetrics(m => m.AddMeter(EvalTelemetry.MeterName))
.WithTracing(t => t.AddSource(EvalTelemetry.ActivitySourceName))
.UseAzureMonitor(); // App Insights — or .AddOtlpExporter(), console, etc.
Proposed metrics: copilot.evals.tokens, copilot.evals.tool_calls, copilot.evals.cost, copilot.evals.score (tagged by dimension, so operational and quality trends sit side by side), plus a per-run span whose trace-id is stamped onto each record.
IEvalSink would be public so consumers can write a database/queue/dashboard sink in a few lines.
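A custom sink and the fan-out step might look roughly like this. The proposal names IEvalSink but not its members, so the WriteAsync signature, the CompletedRun record, QueueSink, and SinkFanOut are hypothetical placeholders.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shapes; the proposal only names IEvalSink, not its members.
public sealed record CompletedRun(string RunId, double OverallScore);

public interface IEvalSink
{
    Task WriteAsync(CompletedRun run, CancellationToken cancellationToken = default);
}

// A custom destination in a few lines: an in-memory queue standing in for a database or message bus.
public sealed class QueueSink : IEvalSink
{
    public ConcurrentQueue<CompletedRun> Items { get; } = new();

    public Task WriteAsync(CompletedRun run, CancellationToken cancellationToken = default)
    {
        Items.Enqueue(run);
        return Task.CompletedTask;
    }
}

public static class SinkFanOut
{
    // Hands the same run to every sink and completes when all writes have finished.
    public static Task WriteAllAsync(IEnumerable<IEvalSink> sinks, CompletedRun run, CancellationToken ct = default)
        => Task.WhenAll(sinks.Select(s => s.WriteAsync(run, ct)));
}
```

With Task.WhenAll, a failing sink surfaces as an exception after the other writes finish; whether one sink's failure should be isolated from the rest is a design choice left open here.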
Orchestration — EvalHarness
Call EvalHarness.Attach(session, options) once per session and BeginTask before each prompt; CompleteRunAsync would return a record + insight and write to every sink. Opt-in via ENABLE_EVALS or EvalOptions.Enabled, and proposed to be an inert no-op when disabled (the aim is zero production overhead).
EvalOptions would carry: Enabled, WorkingDirectory, CaptureContent, Sinks (+ fluent AddLocalSink() / AddOpenTelemetrySink()), Evaluator, FromEnvironment().
Analysis sidecar — EvalAgent (the proposed "improve" agent)
- A separate, optional agent you could send off to analyze eval data and recommend improvements — decoupled from the eval loop and observe-only (never edits instructions/prompts/config).
- Fully customizable: EvalAgentOptions (persona/SystemPrompt, Instruction, Model), backed by an EvalAnalysisRunner delegate so it can drive a real Copilot session, a different provider, or a test stub.
- A FromClient(client, configure) helper would create a session, apply the persona/model, send the instruction + serialized runs, and return the analysis; AnalyzeAsync(history) would run the analysis over the journal.
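The delegate-backed design could be sketched as below, which shows how a test stub can replace a real session. The proposal names EvalAnalysisRunner but not its parameters, so the signature here, along with the AnalysisOptions and AnalysisAgent types, is assumed for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical signature: the proposal names the delegate but not its parameters.
public delegate Task<string> EvalAnalysisRunner(
    string systemPrompt, string instruction, string serializedRuns, CancellationToken ct);

public sealed record AnalysisOptions(string SystemPrompt, string Instruction);

public sealed class AnalysisAgent
{
    private readonly AnalysisOptions _options;
    private readonly EvalAnalysisRunner _runner;

    public AnalysisAgent(AnalysisOptions options, EvalAnalysisRunner runner)
    {
        _options = options;
        _runner = runner;
    }

    // Serializes the history and delegates to the runner; the agent only reads run data
    // and never edits prompts or config.
    public Task<string> AnalyzeAsync<T>(IReadOnlyList<T> history, CancellationToken ct = default)
        => _runner(_options.SystemPrompt, _options.Instruction, JsonSerializer.Serialize(history), ct);
}
```

Because the runner is just a delegate, a unit test can pass a lambda that echoes its inputs, while production code would pass one that opens a session and sends the instruction.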
Batch — EvalSuite (across many sessions)
- Define a dataset of EvalCases (name + prompt, optional per-case SessionConfig and success predicate) and run each in its own fresh session with bounded concurrency (MaxConcurrency, Timeout).
- Aggregate into EvalSuiteResult: per-case results plus MeanOverallScore, MeanScores() per dimension, PassCount/FailCount, TotalTokens, TotalCost. This is the offline / CI-gate path, reusing the same evaluator and sinks.
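Bounded concurrency and aggregation could be implemented along these lines. This is a sketch only: SuiteCase, CaseResult, and SuiteRunner are hypothetical names, the per-case work is injected as a delegate instead of opening a real session, and the per-case Timeout from the proposal is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shapes; a real case would also carry session config and a success predicate.
public sealed record SuiteCase(string Name, string Prompt);
public sealed record CaseResult(string Name, bool Passed, double OverallScore);

public static class SuiteRunner
{
    // Runs every case through runCase, never more than maxConcurrency at a time.
    public static async Task<IReadOnlyList<CaseResult>> RunAsync(
        IEnumerable<SuiteCase> cases,
        Func<SuiteCase, CancellationToken, Task<CaseResult>> runCase,
        int maxConcurrency,
        CancellationToken ct = default)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = cases.Select(async c =>
        {
            await gate.WaitAsync(ct);
            try { return await runCase(c, ct); }
            finally { gate.Release(); }
        }).ToList(); // materialize so every case is scheduled before awaiting

        // Results come back in the same order as the input cases.
        return await Task.WhenAll(tasks);
    }

    public static double MeanOverallScore(IReadOnlyList<CaseResult> results)
        => results.Count == 0 ? 0.0 : results.Average(r => r.OverallScore);

    public static int PassCount(IReadOnlyList<CaseResult> results)
        => results.Count(r => r.Passed);
}
```

A CI gate could then fail the build when MeanOverallScore drops below a chosen threshold or PassCount is less than the number of cases.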
Scope / open questions for the team
- Is an eval surface in-scope for the SDKs, or should it be an external companion library?
- Naming/namespace and GA-vs-preview gating.
- Cross-language parity plan (Node/Python/Go/.NET/Java/Rust) — suggest landing the shape in one SDK behind preview, then porting.
- Boundary with the runtime-side TelemetryConfig (this would be additive/SDK-side, not a replacement).
- Whether the analysis agent (EvalAgent) and suite runner (EvalSuite) belong in an initial cut or a follow-up.
Non-goals
- No change to agent/runtime behavior; strictly observe-only.
- Not a clone of any vendor's eval catalog — provide the engine and seams, not opinionated graders.
Happy to adjust scope based on the roadmap; a .NET reference prototype can seed the discussion.