[ARC-DinD] GAW should provide first-class ARC runner support for AWF-backed workflows #30840

@rhardouin

Summary

GAW can compile AWF-backed workflows that target self-hosted ARC runners, but today those workflows may require brittle workflow-authored pre-agent staging when ARC uses Docker-in-Docker (DinD). The underlying blocker is that AWF chroot mode does not yet support split runner/Docker daemon filesystems; this issue asks GAW to provide first-class integration and user-facing validation around that support.

Depends on #30838. If needed, GAW can implement a temporary compiler-side compatibility path, to be used only until AWF supports ARC/DinD natively.

Symptoms

Observed with a workflow using:

  • a self-hosted ARC runner label for the agent job
  • a self-hosted ARC runner label for framework jobs
  • sandbox.agent.id: awf
  • Copilot engine
  • MCP Gateway
  • safe outputs

The workflow only passed after adding workflow-specific pre-agent steps and custom mounts that staged AWF chroot/runtime files into daemon-visible /tmp/gh-aw/... paths, including:

  • /tmp/gh-aw/arc-etc/passwd
  • /tmp/gh-aw/arc-etc/group
  • /tmp/gh-aw/arc-etc/hosts
  • /tmp/gh-aw/arc-tools/bin
  • /tmp/gh-aw/arc-tools/chroot-bin

The workaround also had to prepare /tmp/gh-aw home/cache/config/state directories, MCP payload/log directories, safeoutputs directories, AWF firewall log directories, Copilot MCP config files, runner-visible placeholder mount sources, executable shell wrappers, and tar-stream transfers into the DinD daemon filesystem.

This workaround is too brittle for normal GAW users. A workflow author should not need to understand AWF internal chroot mount expectations, DinD daemon-side path resolution, runner-side validation placeholders, shell heredoc quoting, or Copilot/Node/coreutils staging just to run an AWF-backed workflow on ARC.

The workflow also had to expose infrastructure topology through frontmatter so the generated MCP Gateway step could reach the ARC/DinD Docker daemon:

sandbox:
  mcp:
    env:
      DOCKER_HOST: tcp://localhost:2375

That is an integration leak. MCP Gateway can legitimately consume DOCKER_HOST because it launches stdio MCP servers as containers, but workflow authors should not need to repeat the ARC runner's Docker daemon endpoint in every workflow. GAW should generate the required gateway Docker environment from runtime detection where possible, with an explicit ARC/DinD compatibility flag only as a fallback when detection is unavailable or inconclusive.

After the workflow could start and complete successfully, a later run exposed a second integration symptom: the agent attempted to emit a safeoutputs no-op, but the shell invocation failed inside the AWF chroot because Copilot's PTY/native module path could not load libutil.so.1.

safeoutputs noop --message "..."
Failed to load native module: pty.node
Error: libutil.so.1: cannot open shared object file: No such file or directory

The workflow still concluded successfully, but the safeoutputs artifact stayed empty ({"items":[]}). As a result, the generated detection job was skipped because there were no output types and no patch, and the generated safe_outputs job was skipped because it requires detection to succeed. This is too subtle for users: a green run can hide the fact that the agent failed to emit the intended safe output.
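
The skip chain described above can be modeled in a few lines. This is only an illustrative Python sketch, not GAW's actual gating code (which lives in the Go files listed under "Relevant ownership areas"); the function name, the has_patch parameter, and the assumption that each item carries a "type" key are all hypothetical.

```python
import json


def plan_downstream_jobs(agent_output_json: str, has_patch: bool) -> dict:
    """Model the gating described in this issue.

    detection runs only when there are output types or a patch;
    safe_outputs runs only when detection ran (and, in this model,
    is assumed to have succeeded).
    """
    items = json.loads(agent_output_json).get("items", [])
    # Assumption: each item names its output type under a "type" key.
    output_types = {item.get("type") for item in items if item.get("type")}
    run_detection = bool(output_types) or has_patch
    # A skipped detection job therefore skips safe_outputs as well.
    run_safe_outputs = run_detection
    return {"detection": run_detection, "safe_outputs": run_safe_outputs}
```

With the empty artifact from the failing run ({"items":[]}) and no patch, both jobs come back as not runnable, which is indistinguishable from a genuine no-op unless the failed command is surfaced separately.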

After staging the PTY dependency, a later run exposed a third integration symptom: MCP Gateway started and registered tools successfully, but the agent could not reach it from inside the AWF chroot because host.docker.internal was unresolved. The AWF chroot hosts mount appeared as a directory, so the entrypoint could not add the host-gateway entry:

[entrypoint][WARN] Could not add host.docker.internal to chroot /etc/hosts
/usr/local/bin/entrypoint.sh: line ...: /host/etc/hosts: Is a directory
host.docker.internal unresolved

Manually staging a daemon-visible /tmp/gh-aw/arc-etc/hosts file with the host-gateway mapping and mounting it into the chroot fixed that issue. The successful run then produced a safeoutputs noop item with no errors, and both generated framework jobs (detection and safe_outputs) ran to success.
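
A pre-flight check for the hosts mount source could look roughly like the sketch below. It is written in Python for brevity and is not part of AWF or GAW; the function name and the returned status strings are hypothetical. It only distinguishes the cases seen in this issue: a path that is a directory rather than a file, and a file that lacks a host.docker.internal entry.

```python
import os
import tempfile

HOST_GATEWAY_NAME = "host.docker.internal"


def check_hosts_source(path: str) -> str:
    """Classify a candidate /etc/hosts mount source.

    Returns "missing", "directory", "no-gateway-entry", or "ok".
    """
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        # Matches the symptom above: the mount shows up as a directory,
        # so the entrypoint cannot append an entry to it.
        return "directory"
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Drop trailing comments, then split into address + names.
            fields = line.split("#", 1)[0].split()
            if HOST_GATEWAY_NAME in fields[1:]:
                return "ok"
    return "no-gateway-entry"
```

Note that this check only has diagnostic value if it runs against the filesystem the Docker daemon actually mounts from; on ARC/DinD a file that looks fine from the runner container may still be absent on the daemon side.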

No MCP Gateway (MCPG) issue is filed as part of this report: MCP Gateway started, registered GitHub and safeoutputs routes, and served tools successfully once the AWF chroot filesystem was usable.

Root Cause

GAW correctly emitted AWF-backed agent jobs and framework jobs targeting the configured runner labels, but the compiled workflow did not account for the fact that ARC/DinD separates:

  • the runner container filesystem used by GAW-generated pre-agent/setup steps, and
  • the Docker daemon sidecar filesystem used by AWF bind mounts.
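
One way to observe this split at runtime is a sentinel-file check: write a file from the runner, then ask the Docker daemon to bind-mount it. The Python sketch below is an assumption-laden illustration, not an existing GAW probe. The helper names and the busybox image are hypothetical choices; the only external command used is a plain docker run with a read-only bind mount.

```python
import os
import subprocess
import tempfile
from typing import Callable


def write_sentinel(base_dir: str = "/tmp/gh-aw") -> str:
    """Create an empty sentinel file on the runner filesystem."""
    os.makedirs(base_dir, exist_ok=True)
    fd, path = tempfile.mkstemp(prefix="probe-", dir=base_dir)
    os.close(fd)
    return path


def daemon_sees_file(path: str, image: str = "busybox") -> bool:
    """Ask the Docker daemon whether `path` is a regular file on its side.

    Caveat: a non-zero exit can also mean the daemon is unreachable or the
    image cannot be pulled; a real probe should separate those cases.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{path}:/probe:ro",
         image, "test", "-f", "/probe"],
        capture_output=True,
        check=False,
    )
    return result.returncode == 0


def is_split_filesystem(
    path: str, daemon_sees: Callable[[str], bool] = daemon_sees_file
) -> bool:
    """True when a runner-written file is not visible to the daemon."""
    return not daemon_sees(path)
```

The daemon_sees parameter is injectable so the decision logic can be exercised in compiler or unit tests without a Docker daemon.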

The direct root cause belongs in AWF and is tracked by #30838. GAW still has responsibility for the workflow author experience:

  • The compiler allows an ARC/DinD configuration that can fail at runtime with low-level mount and missing-tool errors.
  • Users are forced to add custom mounts and staging scripts for AWF internals.
  • Users are forced to put runner infrastructure details such as sandbox.mcp.env.DOCKER_HOST: tcp://localhost:2375 in workflow frontmatter so the generated MCP Gateway startup path can reach the DinD daemon.
  • Runner labels are not a reliable signal for this behavior. A label can be any organization-specific string, so GAW should not infer ARC/DinD from runs-on or runs-on-slim.
  • Generated prompts, setup action files, MCP configuration, and engine runtime files must be visible from inside the AWF chroot when the daemon filesystem is separate.
  • Generated MCP Gateway access also depends on chroot-visible host-gateway resolution. In ARC/DinD, GAW-generated MCP configuration can correctly point at host.docker.internal while AWF still fails to provide a usable /etc/hosts view inside the chroot.
  • GAW's downstream job gating treats empty agent output as a normal no-output/no-patch case. That is correct for true no-op runs, but it makes ARC/AWF runtime failures harder to diagnose when the agent attempted to call safeoutputs and the command failed before writing an output item.

Relevant ownership areas:

  • AWF command generation and custom mounts: gh-aw/pkg/workflow/copilot_engine_execution.go, gh-aw/pkg/workflow/awf_config.go, and shared engine helpers
  • Sandbox frontmatter parsing and validation: gh-aw/pkg/workflow/sandbox.go and gh-aw/pkg/workflow/sandbox_validation.go
  • MCP Gateway setup and host-domain generation: gh-aw/pkg/workflow/mcp_setup_generator.go
  • Framework runner selection: gh-aw/pkg/workflow/safe_outputs_runtime.go
  • Safe output job gating and artifact processing: gh-aw/pkg/workflow/compiler_safe_outputs_job.go, gh-aw/pkg/workflow/compiler_safe_outputs_steps.go, and gh-aw/pkg/workflow/noop.go
  • Threat detection and AWF-backed detection data: gh-aw/pkg/workflow/threat_detection.go
  • Compiler validation: gh-aw/pkg/workflow/compiler_validators.go

Expected Behavior

GAW workflows should support ARC runners with AWF-backed engines without private, workflow-specific staging hacks.

GAW should not try to guess ARC/DinD from runner labels. Runner labels are arbitrary user-controlled names, so any ARC/DinD support must be based on runtime evidence from the runner environment. If runtime detection cannot be made reliable, users should opt in through an explicit frontmatter compatibility flag that describes the runner topology, not a private runner group name.
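
The precedence described here (runtime evidence first, explicit flag only as a fallback, runner labels never) can be sketched as follows. This is a minimal Python illustration; the function name, the three-state probe result, and the "default" mode name are hypothetical, while arc-dind is the flag value proposed in this issue.

```python
from typing import Optional


def select_compatibility_mode(
    probe_result: Optional[bool], frontmatter_flag: Optional[str]
) -> str:
    """Pick a compatibility mode.

    probe_result: True  = split runner/daemon filesystem detected,
                  False = local Docker detected,
                  None  = detection unavailable or inconclusive.
    Runner labels are deliberately not an input.
    """
    if probe_result is True:
        return "arc-dind"
    if probe_result is False:
        return "default"
    # Inconclusive probe: only now does the explicit flag matter.
    if frontmatter_flag == "arc-dind":
        return "arc-dind"
    return "default"
```

Whether an explicit flag should also win over a conclusive "local Docker" probe result is left open by this issue; the sketch takes the narrower reading in which the flag is consulted only when detection is inconclusive.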

When the selected AWF version cannot support ARC/DinD safely, GAW should fail early or warn clearly during compile/runtime setup with an actionable message rather than producing obscure AWF mount or missing binary failures.
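
An early version gate might look like the sketch below, written in Python purely for illustration. The minimum version is a caller-supplied placeholder because this issue does not name the AWF release that will carry ARC/DinD support, and the message wording is hypothetical; it follows the three-part error style requested later in this issue ([what is wrong]. [what is expected]. [example]).

```python
def parse_version(text: str) -> tuple:
    """Parse 'v1.2.3' or '1.2.3' into a tuple of ints.

    Limitation: pre-release suffixes such as '-rc1' are not handled.
    """
    return tuple(int(part) for part in text.lstrip("v").split("."))


def check_awf_version(configured: str, minimum: str) -> None:
    """Raise ValueError when the configured AWF version is too old."""
    if parse_version(configured) < parse_version(minimum):
        raise ValueError(
            f"AWF version {configured} does not support ARC/DinD. "
            f"Expected AWF {minimum} or newer. "
            f"Example: set the AWF version to {minimum}."
        )
```

Failing here, before any container starts, replaces the obscure mount and missing-binary errors with a single message the workflow author can act on.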

When an agent attempts to emit safe outputs but the AWF-backed runtime cannot execute the safeoutputs command path, users should get an actionable failure or diagnostic rather than a successful run with skipped detection and safe_outputs jobs.
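
A diagnostic for this case has to separate "the agent never called safeoutputs" from "the agent called it and nothing was recorded". The Python sketch below shows one possible heuristic. It assumes the agent log contains the shell command line in the form shown in the Symptoms section (safeoutputs noop --message "..."); the function name and the status strings are hypothetical.

```python
import re

# Assumption: attempted calls appear in the log as a line starting with
# "safeoutputs <subcommand>", as in the failing run quoted in this issue.
SAFEOUTPUTS_CALL = re.compile(r"^\s*safeoutputs\s+\w+", re.MULTILINE)


def diagnose_empty_output(agent_log: str, items: list) -> str:
    """Classify a run as "ok", "attempted-but-empty", or "genuine-no-output"."""
    if items:
        return "ok"
    if SAFEOUTPUTS_CALL.search(agent_log):
        # The agent tried to emit a safe output but nothing was written.
        return "attempted-but-empty"
    return "genuine-no-output"
```

Only the "attempted-but-empty" result would need to change the run conclusion or raise a warning; the other two keep today's behavior.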

When GAW-generated MCP configuration uses host.docker.internal or an equivalent host-gateway name, the compiled workflow should either rely on an AWF version that makes that name resolvable inside the chroot or fail with a clear ARC/DinD compatibility diagnostic.

GAW should minimize infrastructure detail in workflow frontmatter. ARC/DinD Docker daemon routing for generated MCP Gateway steps should preferably be selected by a generated runtime probe. If GAW cannot reliably detect the topology at runtime, it should fall back to an explicit compatibility mode rather than requiring every workflow to set sandbox.mcp.env.
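
As a rough illustration of generating the gateway's Docker environment instead of asking authors for it, the Python sketch below forwards a TCP DOCKER_HOST that is already present in the runner environment. Whether the runner environment exposes DOCKER_HOST on every supported ARC/DinD setup is an assumption this issue does not confirm, and the function name is hypothetical.

```python
import os
from urllib.parse import urlparse


def gateway_docker_env(environ=None) -> dict:
    """Return the Docker env to inject into the MCP Gateway step.

    Only a tcp:// DOCKER_HOST with a host part is forwarded; anything else
    (unset, unix socket) leaves the generated step unchanged.
    """
    environ = os.environ if environ is None else environ
    docker_host = environ.get("DOCKER_HOST", "")
    parsed = urlparse(docker_host)
    if parsed.scheme == "tcp" and parsed.hostname:
        return {"DOCKER_HOST": docker_host}
    return {}
```

For example, gateway_docker_env({"DOCKER_HOST": "tcp://localhost:2375"}) yields the same mapping that the workaround workflow had to spell out under sandbox.mcp.env, while a unix socket or an unset variable yields an empty mapping.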

Proposed Implementation Plan

Please implement first-class GAW support around AWF ARC/DinD compatibility:

  1. Add an ARC/DinD compatibility model to workflow compilation.
    • Prefer runtime detection over static inference. Generate an early setup/probe step that determines whether the current runner uses ARC/DinD with a split runner/Docker daemon filesystem.
    • Do not infer this mode from runs-on or runs-on-slim; runner labels are arbitrary user-controlled names.
    • The probe should check runtime facts such as Docker daemon connectivity, whether DOCKER_HOST points to a TCP DinD daemon, whether runner-visible sentinel files are daemon-visible, and whether the MCP Gateway Docker startup path can access the same daemon.
    • Store the detected compatibility mode in step outputs or environment variables that later generated steps can consume.
    • If runtime detection cannot be made reliable for all supported ARC/DinD setups, add an explicit frontmatter flag as a fallback override, for example:

      sandbox:
        compatibility: arc-dind

    • Define any fallback flag in gh-aw/pkg/workflow/sandbox.go, validate it in gh-aw/pkg/workflow/sandbox_validation.go, and reject unsupported values with GAW's standard error-message style.
    • Treat MCP Gateway Docker daemon access as part of the detected or configured compatibility model. For the standard supported ARC/DinD runner shape, generated gateway startup should receive the correct DOCKER_HOST or Docker socket configuration without requiring workflow-authored sandbox.mcp.env.
    • If GAW needs to support non-standard DinD endpoints later, prefer a scoped compatibility option such as sandbox.compatibility-options.docker-host over requiring users to place raw infrastructure env vars under sandbox.mcp.env.
  2. Integrate with the AWF fix from #30838 ([ARC-DinD] AWF chroot mode should support ARC/DinD Docker daemon filesystems without manual staging).
    • When an AWF version with native ARC/DinD support is available, compile workflows to rely on AWF's staging/diagnostics instead of emitting user-authored workarounds.
    • If the configured AWF version is too old, emit a clear compiler or setup error explaining the required AWF version.
    • Follow GAW's validation architecture and error-message style: [what is wrong]. [what is expected]. [example].
    • Use pkg/console formatting for user-facing CLI/setup diagnostics where applicable.
  3. Remove the need for manual workflow mounts.
    • Users should not need to declare mounts like /tmp/gh-aw/arc-etc/passwd:/etc/passwd:ro, /tmp/gh-aw/arc-etc/hosts:/etc/hosts:ro, or /tmp/gh-aw/arc-tools/chroot-bin:/bin:ro.
    • GAW should not require workflow authors to stage capsh, bash, Node, Copilot CLI wrappers, common applets, dynamic libraries, firewall log directories, or runner-visible placeholder files for mount validation.
  4. Ensure GAW-generated runtime assets are chroot-visible.
    • Prompts, setup action files, MCP CLI wrappers, MCP Gateway output, safeoutputs files, and agent logs/state must be accessible from inside AWF chroot even when the Docker daemon filesystem is separate.
    • MCP Gateway hostnames generated by GAW, such as host.docker.internal, must be resolvable from inside the AWF chroot through the AWF-supported path.
    • MCP Gateway's own Docker daemon environment should be generated from runtime detection or the fallback explicit compatibility mode. Do not require users to hardcode DOCKER_HOST: tcp://localhost:2375 in workflow frontmatter for the standard ARC/DinD case.
    • Keep paths sanitized and consistent with existing /tmp/gh-aw and ${RUNNER_TEMP}/gh-aw conventions.
  5. Surface empty-output false greens.
    • Preserve the current skip behavior for genuine no-output/no-patch runs.
    • Add a clear diagnostic when the agent log or safeoutputs runtime indicates that a safeoutputs command was attempted but no item was recorded.
    • Consider failing the conclusion job, or at least emitting a prominent warning, when an AWF-backed ARC run has safeoutputs enabled, the agent attempted safeoutputs, and agent_output.json remains empty.
  6. Handle threat detection and framework jobs.
    • Confirm runs-on-slim and safe-outputs.runs-on continue to preserve the user-configured runner labels for framework jobs.
    • Ensure threat detection jobs either receive the same AWF ARC/DinD compatibility behavior or fail with a precise message if unsupported.
  7. Update documentation.
    • Document supported ARC runner requirements for AWF-backed workflows.
    • Remove any need for workflow authors to copy the temporary workaround once AWF support lands.
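
The fallback-flag validation from the first step of the plan can be sketched as below. The real check would live in the Go validation code named in this issue; this Python version only illustrates the accept/reject behavior and the [what is wrong]. [what is expected]. [example]. message shape. The function name, the exact message text, and the idea that arc-dind is the only supported value are assumptions.

```python
SUPPORTED_COMPATIBILITY = ("arc-dind",)


def validate_compatibility(value):
    """Validate an optional sandbox.compatibility value.

    None (flag absent) is accepted and returned unchanged; any value that
    is not in SUPPORTED_COMPATIBILITY raises ValueError.
    """
    if value is None:
        return None
    if value not in SUPPORTED_COMPATIBILITY:
        raise ValueError(
            f"Invalid sandbox.compatibility value {value!r}. "
            f"Expected one of: {', '.join(SUPPORTED_COMPATIBILITY)}. "
            "Example: sandbox.compatibility: arc-dind"
        )
    return value
```

Accepting an absent flag unchanged keeps existing non-ARC workflows on their current compile path.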

Temporary Compatibility Option

If AWF native support cannot land first, GAW may add a temporary compiler-side setup path driven by runtime detection, with an explicit compatibility flag as a fallback override. That temporary path should:

  • Be documented as transitional.
  • Be removed or disabled by default once AWF supports ARC/DinD directly.
  • Prefer reusable setup generation over encouraging users to paste long pre-agent scripts.
  • Avoid requiring users to paste MCP Gateway Docker environment details such as DOCKER_HOST; generate those details from runtime detection or the fallback explicit compatibility flag.

Test Plan

Add tests before implementation is considered complete:

  • Compiler tests:

    • Generated ARC/DinD runtime probe steps are emitted for AWF-backed workflows when ARC/DinD compatibility is supported.
    • Runtime probe outputs feed MCP Gateway Docker daemon configuration and AWF compatibility setup.
    • sandbox.compatibility: arc-dind acts as a fallback override when runtime detection is unavailable or inconclusive.
    • When sandbox.compatibility: arc-dind is absent and the runtime probe reports non-ARC/local Docker, existing non-ARC/local Docker socket workflow output stays unchanged.
    • Invalid sandbox.compatibility values fail validation with a clear message and example.
    • AWF-backed engine configuration either emits the ARC/DinD compatibility path or validates the required AWF version.
    • Manual mounts for /tmp/gh-aw/arc-etc and /tmp/gh-aw/arc-tools are not required in normal workflow frontmatter.
    • Manual sandbox.mcp.env.DOCKER_HOST is not required for the standard supported ARC/DinD runner profile.
    • GAW-generated setup no longer requires workflow authors to add runner-visible placeholder files only to satisfy AWF mount validation.
  • Golden lockfile tests:

    • Generated YAML contains the expected ARC/DinD support behavior.
    • Existing non-ARC/local Docker socket workflow output remains unchanged.
    • Generated heredocs/scripts used for setup are quoted safely and avoid the fragile delimiter escaping that the workflow-level workaround needed.
  • Runtime/integration smoke test:

    • AWF + Copilot + MCP Gateway + safeoutputs on ARC/DinD.
    • Runtime detection identifies the ARC/DinD split filesystem condition before AWF and MCP Gateway setup.
    • MCP Gateway starts and registers GitHub and safeoutputs servers.
    • MCP Gateway can launch containerized stdio MCP servers on ARC/DinD using GAW-generated Docker daemon configuration, without workflow-authored DOCKER_HOST.
    • Copilot runs through the AWF chroot and can access generated prompt/runtime files.
    • Generated MCP config reaches Copilot inside the chroot without workflow-authored wrapper generation.
    • AWF firewall, MCP, safeoutputs, and agent log/state directories are writable where required.
    • The chroot can resolve and connect to the configured MCP Gateway hostname.
    • A shell-issued safeoutputs no-op writes an item with no errors, detection runs when expected, and safe_outputs is not skipped because of an empty artifact caused by runtime failure.
    • Safeoutputs and threat detection behavior are covered when enabled.
  • Run the full project finish target:

make agent-finish
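
For the golden-test item about safely quoted heredocs, one approach is to generate the heredoc with a quoted delimiter that is guaranteed not to occur in the payload. The Python sketch below illustrates the idea; the function name and the GH_AW_EOF delimiter prefix are hypothetical and not taken from GAW's generator.

```python
import secrets
import shlex


def render_heredoc(target_path: str, content: str) -> str:
    """Render a shell snippet that writes `content` to `target_path`.

    The delimiter is single-quoted, so the shell performs no parameter
    expansion or command substitution inside the body.
    """
    delimiter = "GH_AW_EOF"
    # Conservative check: pick a new delimiter if it appears anywhere
    # in the payload, not just on a line of its own.
    while delimiter in content:
        delimiter = f"GH_AW_EOF_{secrets.token_hex(4)}"
    body = content if content.endswith("\n") else content + "\n"
    return (
        f"cat > {shlex.quote(target_path)} << '{delimiter}'\n"
        f"{body}{delimiter}\n"
    )
```

A golden test can then assert that the delimiter line appears exactly once in the generated script, even when the payload itself contains the default delimiter or shell metacharacters.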

Acceptance Criteria

  • GAW users can run AWF-backed workflows on ARC/DinD without hand-authored AWF runtime staging.
  • GAW users can run MCP Gateway-backed workflows on supported ARC/DinD runners without putting Docker daemon topology such as DOCKER_HOST: tcp://localhost:2375 in workflow frontmatter.
  • ARC/DinD behavior is controlled by runtime detection where possible, with an explicit frontmatter compatibility flag only as a fallback; it is not controlled by matching private or organization-specific runner labels.
  • Unsupported AWF versions or unsupported ARC/DinD shapes fail with clear, actionable diagnostics.
  • Green ARC/AWF runs do not hide failed safeoutputs emission behind skipped detection and safe_outputs jobs.
  • MCP Gateway connectivity from the AWF chroot works on ARC/DinD without manual /etc/hosts staging.
  • Documentation describes the supported ARC path and the AWF dependency.
