Add --run-in-docker to skill-validator to run Copilot CLI in a docker container #273
caaavik-msft wants to merge 9 commits into main
Pull request overview
This PR adds an optional --run-in-docker flag to the skill-validator tool, enabling agent runs, judges, and setup commands to execute inside a Docker container rather than directly on the host. This provides isolation for local development, protecting the host system from potentially destructive changes made by weaker models. The implementation uses the Copilot SDK's --headless mode, builds a Docker image containing the exact Copilot CLI binary from the SDK, manages container lifecycle with cleanup handlers, and handles path translation between host and container mount points.
Changes:
- Adds `DockerCopilotServer` service that manages Docker container lifecycle, path mapping between host and container, and Docker CLI execution.
- Integrates Docker mode throughout the evaluation pipeline (`AgentRunner`, `Judge`, `PairwiseJudge`, `ValidateCommand`), translating work directories and skill paths when Docker is active.
- Adds a Dockerfile, CLI option, configuration model update, documentation, and unit tests for the new functionality.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `eng/skill-validator/src/Services/DockerCopilotServer.cs` | New service managing Docker container lifecycle, skill volume mounts, host↔container path mapping, and copilot CLI startup |
| `eng/skill-validator/tests/DockerCopilotServerTests.cs` | Unit tests for `BuildSkillMounts`, `MapHostPathToContainer`, `TryMapContainerPathToHost`, and `GetCopilotSdkVersion` |
| `eng/skill-validator/src/Services/AgentRunner.cs` | Integrates Docker mode for client initialization, work dir setup, permission checking, session config building, and setup command execution |
| `eng/skill-validator/src/Commands/ValidateCommand.cs` | Adds `--run-in-docker` CLI option, moves skill discovery earlier for mount setup, adds Docker container cleanup |
| `eng/skill-validator/src/Services/Judge.cs` | Maps work directory to container path for judge sessions |
| `eng/skill-validator/src/Services/PairwiseJudge.cs` | Maps work directory to container path for pairwise judge sessions |
| `eng/skill-validator/src/Models/Models.cs` | Adds `RunInDocker` property to `ValidatorConfig` |
| `eng/skill-validator/src/Docker/Dockerfile` | Dockerfile that installs the Copilot CLI binary from the SDK package |
| `eng/skill-validator/src/SkillValidator.csproj` | Includes the Dockerfile in build output |
| `eng/skill-validator/README.md` | Documents the `--run-in-docker` flag and Docker mode requirements |
Skill Validation Results
Model: claude-opus-4.6 | Judge: claude-opus-4.6 |
Summary
This PR adds an optional Docker execution mode to `skill-validator` so agent runs, judges, and setup commands can execute in an isolated container instead of directly on the host machine.
Motivation
The main use case is local development, but it might also be useful in CI if we want to build on top of it. I was building some skills and found that some weaker models made destructive changes to my host system to accomplish the task (e.g. reinstalling .NET). With this, agents and judges run inside a container that can access only the files they need, which are bind-mounted from the host machine. This does not add any network isolation.
Implementation
This makes use of the `--headless` mode for running copilot as described here: https://github.com/github/copilot-sdk/blob/main/docs/guides/setup/backend-services.md.
It requires a `GITHUB_TOKEN` to be present so it can be passed into the container and used to authenticate to the Copilot API. I have an example in the `README` which explains that you can get this token with `gh auth token`. For people with multiple gh accounts (e.g. personal and enterprise), you can also do `gh auth token --user <name>`.
A Dockerfile is included in the repo, at `eng/skill-validator/src/Docker/Dockerfile`, to use as the base image. This ensures that we use the exact same Copilot CLI binary that is shipped with the SDK. The SDK version is resolved programmatically inside the SkillValidator so it is kept in sync. It places the copilot binary at `/usr/local/bin/copilot` inside the container.
To handle path mapping/translation when running in docker mode, all temp/work directories are placed inside a single directory in the TMP folder, and that entire directory is mounted into the container read-write. This makes it easy to map paths between the host and their container equivalents when needed. Skill directories are also mounted into the container read-only, and only the directories that are being evaluated are mounted.
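As a rough illustration of why a single mounted root keeps translation simple, here is a minimal Python sketch of the two directions of mapping. It is not the C# code in `DockerCopilotServer`; the function names only mirror `MapHostPathToContainer` and `TryMapContainerPathToHost`, and both root paths (`/tmp/skill-validator-docker` and `/work`) are hypothetical values chosen for the example. It also assumes a POSIX-style host path.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical roots: the PR does not state the real host directory or mount point.
HOST_WORK_ROOT = PurePosixPath("/tmp/skill-validator-docker")
CONTAINER_WORK_ROOT = PurePosixPath("/work")


def map_host_path_to_container(host_path: str) -> str:
    """Translate a host path under HOST_WORK_ROOT to its container equivalent.

    Raises ValueError if the path is not inside the mounted root.
    """
    relative = PurePosixPath(host_path).relative_to(HOST_WORK_ROOT)
    return str(CONTAINER_WORK_ROOT / relative)


def try_map_container_path_to_host(container_path: str) -> Optional[str]:
    """Translate a container path back to the host.

    Returns None when the path is not under the mount point.
    """
    try:
        relative = PurePosixPath(container_path).relative_to(CONTAINER_WORK_ROOT)
    except ValueError:
        return None
    return str(HOST_WORK_ROOT / relative)
```

Because every work directory lives under one root, each direction is just a prefix swap. The comparison is done per path component, so a sibling directory whose name merely starts with the same characters is not treated as being inside the mount. The sketch does not normalise `..` segments, which a real implementation would need to consider.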
The container uses a randomised port (`-p 0:4321`), which is resolved later using `docker port`. The container is always cleaned up after finishing, including on `ProcessExit` and `CancelKeyPress` events.
Future Extensibility
I have a proof of concept working locally that runs each agent inside its own container, rather than having a single container that is used to run all agents and judges. I chose not to push it for now to keep this PR simple. It would help reduce the risk of agents modifying the environment and impacting other evaluations, if that sounds desirable, but it does mean that each agent would use a separate `CopilotClient` rather than a single shared `CopilotClient`.