Add --run-in-docker to skill-validator to run Copilot CLI in a docker container #273

Open
caaavik-msft wants to merge 9 commits into main from caaavik/docker-copilot-server

@caaavik-msft
Contributor

This PR is the same as #176, but based on a non-forked branch.

Summary

This PR adds an optional Docker execution mode to skill-validator so agent runs, judges, and setup commands can execute in an isolated container instead of directly on the host machine.

Motivation

The main use case is local development, though it could also serve as a foundation for running in CI. While building some skills, I found that weaker models sometimes made destructive changes to my host system to accomplish a task (e.g. reinstalling .NET). With this change, agents and judges run inside a container that can access only the host files bind-mounted into it. This does not add network isolation or any other network security measures.

Implementation

This makes use of the --headless mode for running copilot as described here: https://github.com/github/copilot-sdk/blob/main/docs/guides/setup/backend-services.md.

Docker mode requires a GITHUB_TOKEN, which is passed into the container and used to authenticate to the Copilot API. The README includes an example showing how to obtain this token with gh auth token. If you have multiple gh accounts (e.g. personal and enterprise), you can use gh auth token --user <name> instead.
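
As a rough illustration of the token step, the sketch below picks a token before Docker mode is started. The function name `resolve_github_token` and the `GH_USER` variable are hypothetical and not part of the PR; only `gh auth token` and its `--user` option come from the description above.

```shell
# Hypothetical helper: print a token for the container to use.
# Prefers an existing GITHUB_TOKEN and otherwise falls back to the gh CLI.
resolve_github_token() {
    if [ -n "${GITHUB_TOKEN:-}" ]; then
        printf '%s\n' "$GITHUB_TOKEN"
    elif [ -n "${GH_USER:-}" ]; then
        # GH_USER is an assumed variable for choosing one of several gh accounts.
        gh auth token --user "$GH_USER"
    else
        gh auth token
    fi
}
```

One way to use it would be `export GITHUB_TOKEN="$(resolve_github_token)"` in the shell that later runs skill-validator with --run-in-docker.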

A Dockerfile is included in the repo to use as the base image:

FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build

ARG COPILOT_SDK_VERSION
RUN dotnet new console -o /tmp/dl \
    && dotnet add /tmp/dl package GitHub.Copilot.SDK --version $COPILOT_SDK_VERSION \
    && dotnet build /tmp/dl -c Release \
    && cp /tmp/dl/bin/Release/net10.0/runtimes/*/native/copilot /usr/local/bin/copilot \
    && chmod +x /usr/local/bin/copilot \
    && rm -rf /tmp/dl

RUN copilot --version

This ensures that we use the exact same Copilot CLI binary that ships with the SDK. The SDK version is resolved programmatically inside SkillValidator, so the image stays in sync with it. The Dockerfile places the copilot binary at /usr/local/bin/copilot inside the container.
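
For readers building the image by hand, the following sketch shows how the SDK version could be supplied through the COPILOT_SDK_VERSION build argument declared in the Dockerfile. The helper name `build_image_cmd` and the image tag `skill-validator-copilot` are assumptions for illustration, and the version used in the usage line is a placeholder, not a real SDK release. The helper only prints the command so it can be inspected before being run.

```shell
# Hypothetical helper: print a docker build command for a given SDK version.
# $1 is the SDK version; $2 is the directory containing the Dockerfile.
build_image_cmd() {
    sdk_version=$1
    context_dir=${2:-eng/skill-validator/src/Docker}
    # The image tag below is an illustrative assumption.
    printf 'docker build --build-arg COPILOT_SDK_VERSION=%s -t skill-validator-copilot:%s %s\n' \
        "$sdk_version" "$sdk_version" "$context_dir"
}
```

For example, `build_image_cmd 1.2.3` prints a command that tags the image with the same placeholder version it passes as the build argument, which mirrors the idea of keeping the binary and the SDK in sync.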

To simplify path translation in Docker mode, all temp and work directories are placed inside a single directory under the system temp folder, and that entire directory is mounted into the container read-write. This makes it easy to translate paths between the host and the container when needed. Skill directories are mounted read-only, and only the directories being evaluated are mounted.
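
The single-root layout makes translation a prefix swap. The sketch below illustrates that idea under assumed paths: `/tmp/skill-validator-docker` on the host, `/work` and `/skills` in the container, and all three function names are hypothetical rather than taken from the PR. It handles POSIX-style paths only.

```shell
# Assumed layout (illustrative paths, not taken from the PR):
# one host directory under the temp folder, mounted read-write at /work.
HOST_ROOT=/tmp/skill-validator-docker
CONTAINER_ROOT=/work

# Translate a host path under HOST_ROOT to its container equivalent.
# Returns non-zero for paths outside the mounted root.
map_host_to_container() {
    case $1 in
        "$HOST_ROOT")
            printf '%s\n' "$CONTAINER_ROOT" ;;
        "$HOST_ROOT"/*)
            rest=${1#"$HOST_ROOT"}
            printf '%s%s\n' "$CONTAINER_ROOT" "$rest" ;;
        *)
            return 1 ;;
    esac
}

# Reverse translation: container path back to the host path.
map_container_to_host() {
    case $1 in
        "$CONTAINER_ROOT")
            printf '%s\n' "$HOST_ROOT" ;;
        "$CONTAINER_ROOT"/*)
            rest=${1#"$CONTAINER_ROOT"}
            printf '%s%s\n' "$HOST_ROOT" "$rest" ;;
        *)
            return 1 ;;
    esac
}

# Build the read-only mount argument for one skill directory.
skill_mount_arg() {
    name=$(basename "$1")
    printf '%s\n' "-v $1:/skills/$name:ro"
}
```

Because every work directory lives under one root, no per-directory lookup table is needed; a path outside the root simply fails to map, which is a useful signal that it was never mounted.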

The container publishes port 4321 on a randomised host port (-p 0:4321), which is resolved afterwards using docker port. The container is always cleaned up when the run finishes, including on ProcessExit and CancelKeyPress events.
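
The port lookup and cleanup can be approximated in shell as follows. This is a sketch, not the PR's C# implementation: the function names and the `container_id` and `host_port` variables are hypothetical, and shell traps stand in for the ProcessExit and CancelKeyPress handlers. The command that starts copilot inside the container is left to the caller because its exact arguments are not given here.

```shell
# Extract the host port from `docker port` output such as
# "0.0.0.0:49153" or "[::]:49153". Only the first line is used.
parse_host_port() {
    first_line=$(printf '%s\n' "$1" | head -n 1)
    printf '%s\n' "${first_line##*:}"
}

# Remove the container if one was started; safe to call more than once.
cleanup_container() {
    if [ -n "${container_id:-}" ]; then
        docker rm -f "$container_id" >/dev/null 2>&1
        container_id=
    fi
}

# Start a detached container and store the resolved port in host_port.
# $1 is the image; remaining arguments are the command to run inside it.
start_container() {
    image=$1
    shift
    # Clean up on normal exit, and turn Ctrl+C / TERM into an exit
    # so the EXIT trap still runs.
    trap cleanup_container EXIT
    trap 'exit 130' INT
    trap 'exit 143' TERM
    container_id=$(docker run -d -p 0:4321 -e GITHUB_TOKEN "$image" "$@") || return 1
    host_port=$(parse_host_port "$(docker port "$container_id" 4321/tcp)")
}
```

Call `start_container` directly rather than inside `$(...)`: a command substitution runs in a subshell, so the exit trap would fire there and remove the container as soon as the substitution finished.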

Future Extensibility

I have a proof of concept working locally that runs each agent in its own container, rather than using a single container for all agents and judges. I chose not to push it for now to keep this PR simple. It would reduce the risk of one agent modifying the environment and affecting other evaluations, if that sounds desirable, but it does mean that each agent would use a separate CopilotClient rather than a single shared one.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds an optional --run-in-docker flag to the skill-validator tool, enabling agent runs, judges, and setup commands to execute inside a Docker container rather than directly on the host. This provides isolation for local development, protecting the host system from potentially destructive changes made by weaker models. The implementation uses the Copilot SDK's --headless mode, builds a Docker image containing the exact Copilot CLI binary from the SDK, manages container lifecycle with cleanup handlers, and handles path translation between host and container mount points.

Changes:

  • Adds DockerCopilotServer service that manages Docker container lifecycle, path mapping between host and container, and Docker CLI execution.
  • Integrates Docker mode throughout the evaluation pipeline (AgentRunner, Judge, PairwiseJudge, ValidateCommand), translating work directories and skill paths when Docker is active.
  • Adds a Dockerfile, CLI option, configuration model update, documentation, and unit tests for the new functionality.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| eng/skill-validator/src/Services/DockerCopilotServer.cs | New service managing Docker container lifecycle, skill volume mounts, host↔container path mapping, and copilot CLI startup |
| eng/skill-validator/tests/DockerCopilotServerTests.cs | Unit tests for BuildSkillMounts, MapHostPathToContainer, TryMapContainerPathToHost, and GetCopilotSdkVersion |
| eng/skill-validator/src/Services/AgentRunner.cs | Integrates Docker mode for client initialization, work dir setup, permission checking, session config building, and setup command execution |
| eng/skill-validator/src/Commands/ValidateCommand.cs | Adds --run-in-docker CLI option, moves skill discovery earlier for mount setup, adds Docker container cleanup |
| eng/skill-validator/src/Services/Judge.cs | Maps work directory to container path for judge sessions |
| eng/skill-validator/src/Services/PairwiseJudge.cs | Maps work directory to container path for pairwise judge sessions |
| eng/skill-validator/src/Models/Models.cs | Adds RunInDocker property to ValidatorConfig |
| eng/skill-validator/src/Docker/Dockerfile | Dockerfile that installs the Copilot CLI binary from the SDK package |
| eng/skill-validator/src/SkillValidator.csproj | Includes the Dockerfile in build output |
| eng/skill-validator/README.md | Documents the --run-in-docker flag and Docker mode requirements |

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

Skill Validation Results

Skill Scenario Baseline With Skill Δ Skills Loaded Overfit Verdict
dotnet-trace-collect High CPU in Kubernetes on Linux (.NET 8) 3.5/5 4.5/5 +1.0 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect .NET Framework on Windows without admin privileges 2.0/5 5.0/5 +3.0 ✅ dotnet-trace-collect; tools: skill ✅ 0.15
dotnet-trace-collect .NET 10 on Linux with root access and native call stacks 1.0/5 4.0/5 +3.0 ✅ dotnet-trace-collect; tools: skill ✅ 0.15
dotnet-trace-collect Memory leak on Linux (.NET 8) 3.0/5 3.0/5 0.0 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Slow requests on Windows with PerfView 4.5/5 5.0/5 +0.5 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Excessive GC on Linux (.NET 8) 3.0/5 5.0/5 +2.0 ✅ dotnet-trace-collect; tools: skill, bash ✅ 0.15
dotnet-trace-collect Hang or deadlock diagnosis on Linux 3.0/5 4.0/5 +1.0 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.15
dotnet-trace-collect Windows container high CPU with PerfView 1.5/5 5.0/5 +3.5 ✅ dotnet-trace-collect; tools: report_intent, skill, view, glob ✅ 0.15
dotnet-trace-collect Long-running intermittent issue with PerfView triggers 2.5/5 5.0/5 +2.5 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Linux pre-.NET 10 needing native call stacks 2.0/5 5.0/5 +3.0 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Windows modern .NET with admin high CPU 2.0/5 4.0/5 +2.0 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Memory leak on .NET Framework Windows 3.5/5 5.0/5 +1.5 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Kubernetes with console access prefers console tools 5.0/5 4.5/5 -0.5 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Container installation without .NET SDK 3.0/5 4.5/5 +1.5 ✅ dotnet-trace-collect; tools: skill ✅ 0.15
dotnet-trace-collect HTTP 500s from downstream service on Linux (.NET 8) 4.0/5 5.0/5 +1.0 ✅ dotnet-trace-collect; tools: skill, bash, report_intent, view, glob ✅ 0.15
dotnet-trace-collect Networking timeouts on Windows with admin (.NET 8) 2.0/5 5.0/5 +3.0 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.15
microbenchmarking Investigate runtime upgrade performance impact 3.5/5 3.0/5 -0.5 ✅ microbenchmarking; tools: skill, glob, stop_bash ✅ 0.12
csharp-scripts Test a C# language feature with a script 3.5/5 5.0/5 +1.5 ✅ csharp-scripts; tools: skill, create, edit 🟡 0.32
clr-activation-debugging Diagnose unexpected FOD dialog from native build tool 1.0/5 5.0/5 +4.0 ✅ clr-activation-debugging; tools: skill ✅ 0.12
clr-activation-debugging Diagnose FOD suppressed but activation still failing 1.0/5 5.0/5 +4.0 ✅ clr-activation-debugging; tools: skill ✅ 0.12
clr-activation-debugging Explain why same binary behaves differently under different launch methods 1.0/5 5.0/5 +4.0 ✅ clr-activation-debugging; tools: skill ✅ 0.12
clr-activation-debugging Analyze healthy managed EXE activation 1.0/5 5.0/5 +4.0 ✅ clr-activation-debugging; tools: skill ✅ 0.12
clr-activation-debugging Identify multiple activation sequences in a single log 1.0/5 5.0/5 +4.0 ✅ clr-activation-debugging; tools: skill ✅ 0.12
clr-activation-debugging Explain useLegacyV2RuntimeActivationPolicy in activation log 2.0/5 3.0/5 +1.0 ✅ clr-activation-debugging; tools: skill ✅ 0.12
clr-activation-debugging Decline non-CLR-activation issue 1.0/5 5.0/5 +4.0 ✅ tools: bash ✅ 0.12
thread-abort-migration Worker thread with abort-based cancellation 5.0/5 5.0/5 0.0 ✅ thread-abort-migration; tools: skill ✅ 0.10
thread-abort-migration Timeout enforcement via Thread.Abort 4.5/5 5.0/5 +0.5 ✅ thread-abort-migration; tools: skill ✅ 0.10
thread-abort-migration Blocking WaitHandle with Thread.Interrupt 3.5/5 4.5/5 +1.0 ✅ thread-abort-migration; tools: skill ✅ 0.10
thread-abort-migration ASP.NET Response.End and Response.Redirect with Thread.Abort 4.0/5 5.0/5 +1.0 ✅ thread-abort-migration; tools: skill ✅ 0.10
thread-abort-migration Thread.Join and Thread.Sleep only — should not migrate 3.0/5 5.0/5 +2.0 ✅ thread-abort-migration; tools: skill ✅ 0.10
migrate-nullable-references Enable NRT in a small library with mixed nullability 5.0/5 5.0/5 0.0 ✅ migrate-nullable-references; tools: skill ✅ 0.04
migrate-nullable-references File-by-file migration: only modify the targeted file 5.0/5 5.0/5 0.0 ⚠️ NOT ACTIVATED ✅ 0.04
migrate-nullable-references Enable NRT in ASP.NET Core Web API with EF Core 3.5/5 3.0/5 -0.5 ⚠️ NOT ACTIVATED ✅ 0.04
nuget-trusted-publishing Set up trusted publishing for a new NuGet library 3.0/5 4.0/5 +1.0 ✅ nuget-trusted-publishing; tools: skill, stop_bash ✅ 0.13
nuget-trusted-publishing Set up NuGet publishing without mentioning trusted publishing 2.0/5 5.0/5 +3.0 ✅ nuget-trusted-publishing; tools: skill, report_intent, view, glob, bash, create ✅ 0.13
nuget-trusted-publishing Migrate existing workflow from API key to trusted publishing 3.0/5 4.5/5 +1.5 ✅ nuget-trusted-publishing; tools: skill, view, bash ✅ 0.13
analyzing-dotnet-performance Detects compiled regex startup budget and regex chain allocations 1.0/5 3.0/5 ⏰ timeout +2.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Detects CurrentCulture comparer and compiled regex budget in inflection rules 1.0/5 3.5/5 ⏰ timeout +2.5 ✅ analyzing-dotnet-performance; tools: skill, task, glob, grep ✅ 0.13
analyzing-dotnet-performance Finds per-call Dictionary allocation not hoisted to static 1.0/5 5.0/5 ⏰ timeout +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Catches compound allocations in recursive number converter with ToLower 1.0/5 4.0/5 +3.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Finds StringComparison.Ordinal missing and FrozenDictionary opportunities 1.0/5 5.0/5 ⏰ timeout +4.0 ✅ analyzing-dotnet-performance; tools: skill, grep ✅ 0.13
analyzing-dotnet-performance Detects Aggregate+Replace chain and struct missing IEquatable 1.0/5 5.0/5 +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Finds branched Replace chain in format string manipulation 1.0/5 3.5/5 ⏰ timeout +2.5 ✅ analyzing-dotnet-performance; tools: skill, grep ✅ 0.13
analyzing-dotnet-performance Catches LINQ on hot-path string processing and All(char.IsUpper) 1.0/5 3.5/5 ⏰ timeout +2.5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Detects LINQ pipeline in TimeSpan formatting and collection processing 1.0/5 4.0/5 +3.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Flags Span inconsistencies and compound method chains in truncation library 1.0/5 4.5/5 +3.5 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Identifies unsealed leaf classes and locale hierarchy patterns 1.0/5 5.0/5 ⏰ timeout +4.0 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
dotnet-aot-compat Make Azure.ResourceManager AOT-compatible 1.5/5 ⏰ timeout 3.5/5 ⏰ timeout +2.0 ✅ dotnet-aot-compat; tools: skill, read_agent, bash, create ✅ 0.14
optimizing-ef-core-queries Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete 4.0/5 5.0/5 +1.0 ✅ optimizing-ef-core-queries; tools: skill 🟡 0.20
android-tombstone-symbolication Symbolicate .NET frames in an Android tombstone 3.5/5 4.0/5 +0.5 ✅ android-tombstone-symbolication; tools: skill, stop_bash, glob ✅ 0.18
android-tombstone-symbolication Recognize tombstone with no .NET frames 5.0/5 5.0/5 0.0 ✅ android-tombstone-symbolication; tools: skill, bash ✅ 0.18
android-tombstone-symbolication Symbolicate CoreCLR frames in an Android tombstone 3.5/5 4.0/5 +0.5 ✅ android-tombstone-symbolication; tools: skill, stop_bash, glob ✅ 0.18
android-tombstone-symbolication Recognize NativeAOT tombstone with app binary and libSystem.Native.so 3.0/5 4.0/5 +1.0 ✅ android-tombstone-symbolication; tools: skill, glob, bash ✅ 0.18
android-tombstone-symbolication Symbolicate multi-thread tombstone 4.0/5 4.5/5 +0.5 ✅ android-tombstone-symbolication; tools: skill, stop_bash, read_bash ✅ 0.18
android-tombstone-symbolication Handle .NET frames with no BuildId metadata 4.0/5 5.0/5 +1.0 ✅ android-tombstone-symbolication; tools: skill, glob, bash ✅ 0.18
android-tombstone-symbolication Symbolicate tombstone with multiple .NET libraries and different BuildIds 3.5/5 4.0/5 +0.5 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.18
android-tombstone-symbolication Reject iOS crash log as wrong format 5.0/5 5.0/5 0.0 ℹ️ not activated (expected) ✅ 0.18
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET 8+) 4.5/5 5.0/5 +0.5 ✅ dotnet-pinvoke; tools: skill ✅ 0.07
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET Framework) 3.5/5 5.0/5 +1.5 ✅ dotnet-pinvoke; tools: skill ✅ 0.07
dump-collect Configure automatic crash dumps for CoreCLR app on Linux 4.5/5 5.0/5 +0.5 ✅ dump-collect; tools: skill, report_intent, view, glob ✅ 0.16
dump-collect Set up NativeAOT crash dumps with createdump in Kubernetes 1.5/5 ⏰ timeout 5.0/5 +3.5 ✅ dump-collect; tools: skill ✅ 0.16
dump-collect Recover crash dump from macOS NativeAOT without createdump 4.0/5 4.5/5 +0.5 ✅ dump-collect; tools: skill, report_intent, view, glob, bash ✅ 0.16
dump-collect Configure CoreCLR dump collection in Alpine Docker as non-root 2.0/5 5.0/5 +3.0 ✅ dump-collect; tools: skill, report_intent, view, glob, bash ✅ 0.16
dump-collect Advisory: macOS NativeAOT crash dump recovery steps 4.0/5 4.5/5 +0.5 ✅ dump-collect; tools: skill, glob, bash ✅ 0.16
dump-collect Advisory: CoreCLR Alpine Docker non-root configuration 3.5/5 5.0/5 +1.5 ✅ dump-collect; tools: skill, report_intent, view ✅ 0.16
dump-collect Advisory: NativeAOT Kubernetes dump collection setup 3.0/5 4.0/5 +1.0 ✅ dump-collect; tools: skill ✅ 0.16
dump-collect Detect runtime and configure crash dumps for unknown .NET app on Linux 3.5/5 4.5/5 +1.0 ✅ dump-collect; tools: skill, bash ✅ 0.16
dump-collect Decline dump analysis request 2.0/5 4.0/5 +2.0 ℹ️ not activated (expected) ✅ 0.16
build-parallelism Analyze build parallelism bottlenecks 1.0/5 ⏰ timeout 1.0/5 ⏰ timeout 0.0 ✅ build-parallelism; binlog-generation; tools: skill, task, glob ✅ 0.14
including-generated-files Diagnose generated file inclusion failure 3.0/5 5.0/5 +2.0 ✅ including-generated-files; tools: skill ✅ 0.13
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 5.0/5 5.0/5 0.0 ✅ msbuild-antipatterns; tools: skill, glob ✅ 0.05
build-perf-baseline Establish build performance baseline and recommend optimizations 3.0/5 4.5/5 +1.5 ✅ build-perf-baseline; build-perf-diagnostics; tools: skill 🟡 0.37
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 5.0/5 0.0 ✅ msbuild-modernization; tools: skill ✅ 0.04
directory-build-organization Organize build infrastructure for a multi-project repo 3.5/5 5.0/5 +1.5 ✅ msbuild-antipatterns; directory-build-organization; tools: skill ✅ 0.18
check-bin-obj-clash Diagnose bin/obj output path clashes 3.0/5 4.5/5 +1.5 ✅ check-bin-obj-clash; binlog-generation; tools: skill, glob ✅ 0.15
incremental-build Analyze incremental build issues 3.0/5 4.0/5 +1.0 ✅ incremental-build; tools: skill ✅ 0.12
eval-performance Analyze MSBuild evaluation performance issues 4.0/5 4.5/5 +0.5 ✅ eval-performance; tools: skill ✅ 0.11
build-perf-diagnostics Analyze analyzer performance impact on builds 1.0/5 ⏰ timeout 4.0/5 ⏰ timeout +3.0 ✅ binlog-generation; build-perf-diagnostics; binlog-failure-analysis; tools: skill, edit 🟡 0.23
binlog-generation Build project with /bl flag 1.0/5 5.0/5 +4.0 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-generation Build with /bl in PowerShell 3.5/5 5.0/5 +1.5 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-generation Build multiple configurations with unique binlogs 3.5/5 5.0/5 +1.5 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 4.0/5 5.0/5 +1.0 ✅ binlog-failure-analysis; tools: skill ✅ 0.04

⏰ timeout: the run hit the scenario timeout limit; scoring may be affected because model execution was aborted before it could produce its full output.

Model: claude-opus-4.6 | Judge: claude-opus-4.6
