Add --run-in-docker to skill-validator to run Copilot CLI in a docker container by caaavik-msft · Pull Request #273 · dotnet/skills

caaavik-msft · 2026-03-06T19:30:32Z

This PR is the same as #176, but based on a non-forked branch.

Summary

This PR adds an optional Docker execution mode to skill-validator so agent runs, judges, and setup commands can execute in an isolated container instead of directly on the host machine.

Motivation

The main use case is local development, but it might also be useful in CI if we want to build on top of it. While building some skills, I found that some weaker models made destructive changes to my host system to accomplish the task (e.g. reinstalling .NET). With this change, agents and judges run inside a container that can access only the files they need, which are bind-mounted from the host machine. This PR does not add any security measures for network isolation.

Implementation

This makes use of the --headless mode for running copilot as described here: https://github.com/github/copilot-sdk/blob/main/docs/guides/setup/backend-services.md.

Docker mode requires a GITHUB_TOKEN, which is passed into the container and used to authenticate to the Copilot API. The README includes an example showing how to get this token with gh auth token. If you have multiple gh accounts (e.g. personal and enterprise), you can use gh auth token --user <name> instead.
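As a rough illustration of the token lookup described above, here is a minimal Python sketch. It is not the validator's code (which is C#); the helper name `resolve_github_token` and the fallback to calling `gh` automatically are assumptions made for the example. The PR itself only requires that GITHUB_TOKEN be set.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional, Sequence


def _run_gh(argv: Sequence[str]) -> str:
    """Run the gh CLI and return its stdout (raises if gh exits non-zero)."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def resolve_github_token(
    env: Mapping[str, str] = os.environ,
    user: Optional[str] = None,
    run: Callable[[Sequence[str]], str] = _run_gh,
) -> str:
    """Hypothetical helper: use GITHUB_TOKEN if set, else ask `gh auth token`."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        return token
    argv = ["gh", "auth", "token"]
    if user:
        # `--user <name>` selects one account when several are logged in.
        argv += ["--user", user]
    token = run(argv).strip()
    if not token:
        raise RuntimeError("no GitHub token available; run `gh auth login` first")
    return token
```

The `run` parameter exists only so the lookup can be exercised without a real `gh` installation.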

A Dockerfile is included in the repo to use as the base image:

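# The .NET SDK image provides the dotnet CLI used below.
FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build

# Copilot SDK version to install; supplied at image build time.
ARG COPILOT_SDK_VERSION
# Restore the SDK package into a throwaway project, copy out the bundled
# copilot binary, then delete the project.
RUN dotnet new console -o /tmp/dl \
    && dotnet add /tmp/dl package GitHub.Copilot.SDK --version $COPILOT_SDK_VERSION \
    && dotnet build /tmp/dl -c Release \
    && cp /tmp/dl/bin/Release/net10.0/runtimes/*/native/copilot /usr/local/bin/copilot \
    && chmod +x /usr/local/bin/copilot \
    && rm -rf /tmp/dl

# Fail the image build early if the binary does not run.
RUN copilot --version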

This ensures that we use the exact same Copilot CLI binary that ships with the SDK. The SDK version is resolved programmatically inside SkillValidator, so the two stay in sync. The Dockerfile places the copilot binary at /usr/local/bin/copilot inside the container.
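To make the version hand-off concrete, the sketch below assembles a `docker build` command line that passes the resolved SDK version as the `COPILOT_SDK_VERSION` build argument. The function name, the image tag `skill-validator-copilot`, and the example version are hypothetical; only the build argument name comes from the Dockerfile above.

```python
from typing import List


def docker_build_argv(
    sdk_version: str,
    dockerfile: str,
    context_dir: str,
    image_tag: str = "skill-validator-copilot",  # hypothetical image name
) -> List[str]:
    """Build the argv for `docker build`, pinning the Copilot SDK version."""
    if not sdk_version:
        raise ValueError("COPILOT_SDK_VERSION must not be empty")
    return [
        "docker", "build",
        # ARG COPILOT_SDK_VERSION in the Dockerfile receives this value.
        "--build-arg", f"COPILOT_SDK_VERSION={sdk_version}",
        # Tagging by version lets a new SDK version trigger a fresh image.
        "-t", f"{image_tag}:{sdk_version}",
        "-f", dockerfile,
        context_dir,
    ]
```

For example, `docker_build_argv("1.2.3", "Docker/Dockerfile", "Docker")` (placeholder version and paths) yields a command that can be passed to `subprocess.run`.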

To handle path translation in Docker mode, all temp/work directories are placed inside a single directory under the TMP folder, and that entire directory is mounted into the container with read-write access. This makes it easy to map paths between their host and container equivalents when needed. Skill directories are mounted into the container with read-only access, and only the directories being evaluated are mounted.
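The single-root layout makes the translation a matter of swapping a path prefix. The following Python sketch shows the idea under stated assumptions: the container mount points `/work` and `/skills` are invented for the example, paths are POSIX-style, and skill directory names are assumed to be unique. The actual logic lives in the C# `DockerCopilotServer` and may differ.

```python
from pathlib import PurePosixPath
from typing import List, Optional

CONTAINER_WORK_ROOT = PurePosixPath("/work")      # hypothetical mount point
CONTAINER_SKILLS_ROOT = PurePosixPath("/skills")  # hypothetical mount point


def map_host_path_to_container(host_path: str, host_root: str) -> str:
    """Translate a host path under host_root to its container equivalent.

    Raises ValueError if host_path is not inside host_root.
    """
    rel = PurePosixPath(host_path).relative_to(PurePosixPath(host_root))
    return str(CONTAINER_WORK_ROOT / rel)


def try_map_container_path_to_host(container_path: str, host_root: str) -> Optional[str]:
    """Translate a container path back to the host, or None if it is unmapped."""
    try:
        rel = PurePosixPath(container_path).relative_to(CONTAINER_WORK_ROOT)
    except ValueError:
        return None
    return str(PurePosixPath(host_root) / rel)


def build_mount_args(host_root: str, skill_dirs: List[str]) -> List[str]:
    """Return `docker run` volume flags: one read-write root, read-only skills."""
    args = ["-v", f"{host_root}:{CONTAINER_WORK_ROOT}"]
    for skill_dir in skill_dirs:
        name = PurePosixPath(skill_dir).name
        # The trailing ":ro" makes the bind mount read-only in the container.
        args += ["-v", f"{skill_dir}:{CONTAINER_SKILLS_ROOT / name}:ro"]
    return args
```

Because every work directory shares one host root, a single read-write mount covers all of them, and the reverse mapping can reject any container path that falls outside that root.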

The container publishes port 4321 on a randomised host port (-p 0:4321), which is resolved later using docker port. The container is always cleaned up after finishing, including on ProcessExit and CancelKeyPress events.
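Resolving the host port amounts to parsing the output of `docker port`. The sketch below is a hypothetical parser, not the PR's implementation; it accepts both the `0.0.0.0:49153` form printed when a container port is specified and the `4321/tcp -> 0.0.0.0:49153` form printed when it is not.

```python
from typing import Optional


def parse_docker_port_output(output: str) -> Optional[int]:
    """Return the first host port found in `docker port` output, else None."""
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "->" in line:
            # Drop the "4321/tcp -> " prefix, keeping the host address part.
            line = line.split("->", 1)[1].strip()
        # Split on the last colon so IPv6 forms such as "[::]:49153" also work.
        _host, sep, port = line.rpartition(":")
        if sep and port.isdigit():
            return int(port)
    return None
```

A caller would run something like `docker port <container> 4321`, feed the captured stdout to this function, and connect to the returned port on localhost.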

Future Extensibility

I have a proof of concept working locally that runs each agent inside its own container, rather than having a single container that runs all agents and judges. I chose not to push it for now to keep this PR simple. It would reduce the risk of one agent modifying the environment and affecting other evaluations, if that sounds desirable, but it does mean that each agent would use a separate CopilotClient rather than a single shared CopilotClient.

Copilot

Pull request overview

This PR adds an optional --run-in-docker flag to the skill-validator tool, enabling agent runs, judges, and setup commands to execute inside a Docker container rather than directly on the host. This provides isolation for local development, protecting the host system from potentially destructive changes made by weaker models. The implementation uses the Copilot SDK's --headless mode, builds a Docker image containing the exact Copilot CLI binary from the SDK, manages container lifecycle with cleanup handlers, and handles path translation between host and container mount points.

Changes:

Adds DockerCopilotServer service that manages Docker container lifecycle, path mapping between host and container, and Docker CLI execution.
Integrates Docker mode throughout the evaluation pipeline (AgentRunner, Judge, PairwiseJudge, ValidateCommand), translating work directories and skill paths when Docker is active.
Adds a Dockerfile, CLI option, configuration model update, documentation, and unit tests for the new functionality.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

File	Description
`eng/skill-validator/src/Services/DockerCopilotServer.cs`	New service managing Docker container lifecycle, skill volume mounts, host↔container path mapping, and copilot CLI startup
`eng/skill-validator/tests/DockerCopilotServerTests.cs`	Unit tests for `BuildSkillMounts`, `MapHostPathToContainer`, `TryMapContainerPathToHost`, and `GetCopilotSdkVersion`
`eng/skill-validator/src/Services/AgentRunner.cs`	Integrates Docker mode for client initialization, work dir setup, permission checking, session config building, and setup command execution
`eng/skill-validator/src/Commands/ValidateCommand.cs`	Adds `--run-in-docker` CLI option, moves skill discovery earlier for mount setup, adds Docker container cleanup
`eng/skill-validator/src/Services/Judge.cs`	Maps work directory to container path for judge sessions
`eng/skill-validator/src/Services/PairwiseJudge.cs`	Maps work directory to container path for pairwise judge sessions
`eng/skill-validator/src/Models/Models.cs`	Adds `RunInDocker` property to `ValidatorConfig`
`eng/skill-validator/src/Docker/Dockerfile`	Dockerfile that installs the Copilot CLI binary from the SDK package
`eng/skill-validator/src/SkillValidator.csproj`	Includes the Dockerfile in build output
`eng/skill-validator/README.md`	Documents the `--run-in-docker` flag and Docker mode requirements

github-actions · 2026-03-06T20:21:27Z

Skill Validation Results

Skill	Scenario	Baseline	With Skill	Δ	Skills Loaded	Overfit	Verdict
dotnet-trace-collect	High CPU in Kubernetes on Linux (.NET 8)	3.5/5	4.5/5	+1.0	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	.NET Framework on Windows without admin privileges	2.0/5	5.0/5	+3.0	✅ dotnet-trace-collect; tools: skill	✅ 0.15	✅
dotnet-trace-collect	.NET 10 on Linux with root access and native call stacks	1.0/5	4.0/5	+3.0	✅ dotnet-trace-collect; tools: skill	✅ 0.15	✅
dotnet-trace-collect	Memory leak on Linux (.NET 8)	3.0/5	3.0/5	0.0	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Slow requests on Windows with PerfView	4.5/5	5.0/5	+0.5	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Excessive GC on Linux (.NET 8)	3.0/5	5.0/5	+2.0	✅ dotnet-trace-collect; tools: skill, bash	✅ 0.15	✅
dotnet-trace-collect	Hang or deadlock diagnosis on Linux	3.0/5	4.0/5	+1.0	✅ dotnet-trace-collect; tools: skill, report_intent, view	✅ 0.15	✅
dotnet-trace-collect	Windows container high CPU with PerfView	1.5/5	5.0/5	+3.5	✅ dotnet-trace-collect; tools: report_intent, skill, view, glob	✅ 0.15	✅
dotnet-trace-collect	Long-running intermittent issue with PerfView triggers	2.5/5	5.0/5	+2.5	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Linux pre-.NET 10 needing native call stacks	2.0/5	5.0/5	+3.0	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Windows modern .NET with admin high CPU	2.0/5	4.0/5	+2.0	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Memory leak on .NET Framework Windows	3.5/5	5.0/5	+1.5	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Kubernetes with console access prefers console tools	5.0/5	4.5/5	-0.5	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Container installation without .NET SDK	3.0/5	4.5/5	+1.5	✅ dotnet-trace-collect; tools: skill	✅ 0.15	✅
dotnet-trace-collect	HTTP 500s from downstream service on Linux (.NET 8)	4.0/5	5.0/5	+1.0	✅ dotnet-trace-collect; tools: skill, bash, report_intent, view, glob	✅ 0.15	✅
dotnet-trace-collect	Networking timeouts on Windows with admin (.NET 8)	2.0/5	5.0/5	+3.0	✅ dotnet-trace-collect; tools: skill, report_intent, view, glob	✅ 0.15	✅
microbenchmarking	Investigate runtime upgrade performance impact	3.5/5	3.0/5	-0.5	✅ microbenchmarking; tools: skill, glob, stop_bash	✅ 0.12	❌
csharp-scripts	Test a C# language feature with a script	3.5/5	5.0/5	+1.5	✅ csharp-scripts; tools: skill, create, edit	🟡 0.32	✅
clr-activation-debugging	Diagnose unexpected FOD dialog from native build tool	1.0/5	5.0/5	+4.0	✅ clr-activation-debugging; tools: skill	✅ 0.12	✅
clr-activation-debugging	Diagnose FOD suppressed but activation still failing	1.0/5	5.0/5	+4.0	✅ clr-activation-debugging; tools: skill	✅ 0.12	✅
clr-activation-debugging	Explain why same binary behaves differently under different launch methods	1.0/5	5.0/5	+4.0	✅ clr-activation-debugging; tools: skill	✅ 0.12	✅
clr-activation-debugging	Analyze healthy managed EXE activation	1.0/5	5.0/5	+4.0	✅ clr-activation-debugging; tools: skill	✅ 0.12	✅
clr-activation-debugging	Identify multiple activation sequences in a single log	1.0/5	5.0/5	+4.0	✅ clr-activation-debugging; tools: skill	✅ 0.12	✅
clr-activation-debugging	Explain useLegacyV2RuntimeActivationPolicy in activation log	2.0/5	3.0/5	+1.0	✅ clr-activation-debugging; tools: skill	✅ 0.12	✅
clr-activation-debugging	Decline non-CLR-activation issue	1.0/5	5.0/5	+4.0	✅ tools: bash	✅ 0.12	✅
thread-abort-migration	Worker thread with abort-based cancellation	5.0/5	5.0/5	0.0	✅ thread-abort-migration; tools: skill	✅ 0.10	❌
thread-abort-migration	Timeout enforcement via Thread.Abort	4.5/5	5.0/5	+0.5	✅ thread-abort-migration; tools: skill	✅ 0.10	✅
thread-abort-migration	Blocking WaitHandle with Thread.Interrupt	3.5/5	4.5/5	+1.0	✅ thread-abort-migration; tools: skill	✅ 0.10	✅
thread-abort-migration	ASP.NET Response.End and Response.Redirect with Thread.Abort	4.0/5	5.0/5	+1.0	✅ thread-abort-migration; tools: skill	✅ 0.10	✅
thread-abort-migration	Thread.Join and Thread.Sleep only — should not migrate	3.0/5	5.0/5	+2.0	✅ thread-abort-migration; tools: skill	✅ 0.10	✅
migrate-nullable-references	Enable NRT in a small library with mixed nullability	5.0/5	5.0/5	0.0	✅ migrate-nullable-references; tools: skill	✅ 0.04	❌
migrate-nullable-references	File-by-file migration: only modify the targeted file	5.0/5	5.0/5	0.0	⚠️ NOT ACTIVATED	✅ 0.04	❌
migrate-nullable-references	Enable NRT in ASP.NET Core Web API with EF Core	3.5/5	3.0/5	-0.5	⚠️ NOT ACTIVATED	✅ 0.04	❌
nuget-trusted-publishing	Set up trusted publishing for a new NuGet library	3.0/5	4.0/5	+1.0	✅ nuget-trusted-publishing; tools: skill, stop_bash	✅ 0.13	✅
nuget-trusted-publishing	Set up NuGet publishing without mentioning trusted publishing	2.0/5	5.0/5	+3.0	✅ nuget-trusted-publishing; tools: skill, report_intent, view, glob, bash, create	✅ 0.13	✅
nuget-trusted-publishing	Migrate existing workflow from API key to trusted publishing	3.0/5	4.5/5	+1.5	✅ nuget-trusted-publishing; tools: skill, view, bash	✅ 0.13	✅
analyzing-dotnet-performance	Detects compiled regex startup budget and regex chain allocations	1.0/5	3.0/5 ⏰ timeout	+2.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Detects CurrentCulture comparer and compiled regex budget in inflection rules	1.0/5	3.5/5 ⏰ timeout	+2.5	✅ analyzing-dotnet-performance; tools: skill, task, glob, grep	✅ 0.13	✅
analyzing-dotnet-performance	Finds per-call Dictionary allocation not hoisted to static	1.0/5	5.0/5 ⏰ timeout	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Catches compound allocations in recursive number converter with ToLower	1.0/5	4.0/5	+3.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Finds StringComparison.Ordinal missing and FrozenDictionary opportunities	1.0/5	5.0/5 ⏰ timeout	+4.0	✅ analyzing-dotnet-performance; tools: skill, grep	✅ 0.13	✅
analyzing-dotnet-performance	Detects Aggregate+Replace chain and struct missing IEquatable	1.0/5	5.0/5	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Finds branched Replace chain in format string manipulation	1.0/5	3.5/5 ⏰ timeout	+2.5	✅ analyzing-dotnet-performance; tools: skill, grep	✅ 0.13	✅
analyzing-dotnet-performance	Catches LINQ on hot-path string processing and All(char.IsUpper)	1.0/5	3.5/5 ⏰ timeout	+2.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Detects LINQ pipeline in TimeSpan formatting and collection processing	1.0/5	4.0/5	+3.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Flags Span inconsistencies and compound method chains in truncation library	1.0/5	4.5/5	+3.5	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
analyzing-dotnet-performance	Identifies unsealed leaf classes and locale hierarchy patterns	1.0/5	5.0/5 ⏰ timeout	+4.0	✅ analyzing-dotnet-performance; tools: skill	✅ 0.13	✅
dotnet-aot-compat	Make Azure.ResourceManager AOT-compatible	1.5/5 ⏰ timeout	3.5/5 ⏰ timeout	+2.0	✅ dotnet-aot-compat; tools: skill, read_agent, bash, create	✅ 0.14	✅
optimizing-ef-core-queries	Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete	4.0/5	5.0/5	+1.0	✅ optimizing-ef-core-queries; tools: skill	🟡 0.20	✅
android-tombstone-symbolication	Symbolicate .NET frames in an Android tombstone	3.5/5	4.0/5	+0.5	✅ android-tombstone-symbolication; tools: skill, stop_bash, glob	✅ 0.18	✅
android-tombstone-symbolication	Recognize tombstone with no .NET frames	5.0/5	5.0/5	0.0	✅ android-tombstone-symbolication; tools: skill, bash	✅ 0.18	✅
android-tombstone-symbolication	Symbolicate CoreCLR frames in an Android tombstone	3.5/5	4.0/5	+0.5	✅ android-tombstone-symbolication; tools: skill, stop_bash, glob	✅ 0.18	✅
android-tombstone-symbolication	Recognize NativeAOT tombstone with app binary and libSystem.Native.so	3.0/5	4.0/5	+1.0	✅ android-tombstone-symbolication; tools: skill, glob, bash	✅ 0.18	✅
android-tombstone-symbolication	Symbolicate multi-thread tombstone	4.0/5	4.5/5	+0.5	✅ android-tombstone-symbolication; tools: skill, stop_bash, read_bash	✅ 0.18	❌
android-tombstone-symbolication	Handle .NET frames with no BuildId metadata	4.0/5	5.0/5	+1.0	✅ android-tombstone-symbolication; tools: skill, glob, bash	✅ 0.18	✅
android-tombstone-symbolication	Symbolicate tombstone with multiple .NET libraries and different BuildIds	3.5/5	4.0/5	+0.5	✅ android-tombstone-symbolication; tools: skill, glob	✅ 0.18	✅
android-tombstone-symbolication	Reject iOS crash log as wrong format	5.0/5	5.0/5	0.0	ℹ️ not activated (expected)	✅ 0.18	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET 8+)	4.5/5	5.0/5	+0.5	✅ dotnet-pinvoke; tools: skill	✅ 0.07	✅
dotnet-pinvoke	Generate LibraryImport declaration from C header (.NET Framework)	3.5/5	5.0/5	+1.5	✅ dotnet-pinvoke; tools: skill	✅ 0.07	✅
dump-collect	Configure automatic crash dumps for CoreCLR app on Linux	4.5/5	5.0/5	+0.5	✅ dump-collect; tools: skill, report_intent, view, glob	✅ 0.16	✅
dump-collect	Set up NativeAOT crash dumps with createdump in Kubernetes	1.5/5 ⏰ timeout	5.0/5	+3.5	✅ dump-collect; tools: skill	✅ 0.16	✅
dump-collect	Recover crash dump from macOS NativeAOT without createdump	4.0/5	4.5/5	+0.5	✅ dump-collect; tools: skill, report_intent, view, glob, bash	✅ 0.16	✅
dump-collect	Configure CoreCLR dump collection in Alpine Docker as non-root	2.0/5	5.0/5	+3.0	✅ dump-collect; tools: skill, report_intent, view, glob, bash	✅ 0.16	✅
dump-collect	Advisory: macOS NativeAOT crash dump recovery steps	4.0/5	4.5/5	+0.5	✅ dump-collect; tools: skill, glob, bash	✅ 0.16	✅
dump-collect	Advisory: CoreCLR Alpine Docker non-root configuration	3.5/5	5.0/5	+1.5	✅ dump-collect; tools: skill, report_intent, view	✅ 0.16	✅
dump-collect	Advisory: NativeAOT Kubernetes dump collection setup	3.0/5	4.0/5	+1.0	✅ dump-collect; tools: skill	✅ 0.16	✅
dump-collect	Detect runtime and configure crash dumps for unknown .NET app on Linux	3.5/5	4.5/5	+1.0	✅ dump-collect; tools: skill, bash	✅ 0.16	✅
dump-collect	Decline dump analysis request	2.0/5	4.0/5	+2.0	ℹ️ not activated (expected)	✅ 0.16	✅
build-parallelism	Analyze build parallelism bottlenecks	1.0/5 ⏰ timeout	1.0/5 ⏰ timeout	0.0	✅ build-parallelism; binlog-generation; tools: skill, task, glob	✅ 0.14	✅
including-generated-files	Diagnose generated file inclusion failure	3.0/5	5.0/5	+2.0	✅ including-generated-files; tools: skill	✅ 0.13	✅
msbuild-antipatterns	Review MSBuild files for anti-patterns and style issues	5.0/5	5.0/5	0.0	✅ msbuild-antipatterns; tools: skill, glob	✅ 0.05	❌
build-perf-baseline	Establish build performance baseline and recommend optimizations	3.0/5	4.5/5	+1.5	✅ build-perf-baseline; build-perf-diagnostics; tools: skill	🟡 0.37	✅
msbuild-modernization	Modernize legacy project to SDK-style	5.0/5	5.0/5	0.0	✅ msbuild-modernization; tools: skill	✅ 0.04	❌
directory-build-organization	Organize build infrastructure for a multi-project repo	3.5/5	5.0/5	+1.5	✅ msbuild-antipatterns; directory-build-organization; tools: skill	✅ 0.18	✅
check-bin-obj-clash	Diagnose bin/obj output path clashes	3.0/5	4.5/5	+1.5	✅ check-bin-obj-clash; binlog-generation; tools: skill, glob	✅ 0.15	✅
incremental-build	Analyze incremental build issues	3.0/5	4.0/5	+1.0	✅ incremental-build; tools: skill	✅ 0.12	✅
eval-performance	Analyze MSBuild evaluation performance issues	4.0/5	4.5/5	+0.5	✅ eval-performance; tools: skill	✅ 0.11	✅
build-perf-diagnostics	Analyze analyzer performance impact on builds	1.0/5 ⏰ timeout	4.0/5 ⏰ timeout	+3.0	✅ binlog-generation; build-perf-diagnostics; binlog-failure-analysis; tools: skill, edit	🟡 0.23	✅
binlog-generation	Build project with /bl flag	1.0/5	5.0/5	+4.0	✅ binlog-generation; tools: skill	✅ 0.00	✅
binlog-generation	Build with /bl in PowerShell	3.5/5	5.0/5	+1.5	✅ binlog-generation; tools: skill	✅ 0.00	✅
binlog-generation	Build multiple configurations with unique binlogs	3.5/5	5.0/5	+1.5	✅ binlog-generation; tools: skill	✅ 0.00	✅
binlog-failure-analysis	Diagnose build failures from binlog only (no source files)	4.0/5	5.0/5	+1.0	✅ binlog-failure-analysis; tools: skill	✅ 0.04	✅

⏰ timeout — run hit the scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

caaavik-msft added 8 commits March 4, 2026 09:17

Add --run-in-docker to run Copilot CLI in docker container

c42609a

Merge remote-tracking branch 'origin/main' into caaavik/docker-copilot-server

f757de2

Address PR comments

c842f7b

Merge remote-tracking branch 'origin/main' into caaavik/docker-copilot-server

f12f8cc

Change temp dir prefix for docker container host dir

bba74a6

Extract path mapping logic to helper function

659a68f

Merge remote-tracking branch 'origin/main' into caaavik/docker-copilot-server

d2480a0

Re-add newline at end of README.md that disappeared in merge.

f547651

caaavik-msft requested a review from a team March 6, 2026 19:30

caaavik-msft requested review from JanKrivanek and ViktorHofer as code owners March 6, 2026 19:30

Copilot AI review requested due to automatic review settings March 6, 2026 19:30

caaavik-msft mentioned this pull request Mar 6, 2026

Add --run-in-docker to skill-validator to run Copilot CLI in a docker container #176

Closed

Copilot started reviewing on behalf of caaavik-msft March 6, 2026 19:30

Copilot AI reviewed Mar 6, 2026

Fix docker exec command

40292f8

caaavik-msft wants to merge 9 commits into main from caaavik/docker-copilot-server
