14 infrastructure bugs found in SkillsBench Trial 1 audit — verifier, permissions, provisioning #192

@xdotli

Summary

8-agent trajectory audit of SkillsBench Opus 4.7 with-skills Trial 1 (84 tasks) found 14 tasks with infrastructure bugs that caused false failures or prevented agent execution. These bugs reduce the measured score from a potential ~50-55% to 40.3%.

Model: claude-opus-4-7 | Agent: claude-agent-acp | Environment: Daytona | Concurrency: 2

Trial job dir: experiments/skillsbench/jobs/opus47-with-skills-t1/2026-04-24__16-00-14/


Category 1: Verifier Broken (5 tasks)

Agent did correct work but verifier failed to evaluate it. These are FALSE FAILs — reward=0.0 when it should be 1.0.

1. suricata-custom-exfil

  • Bug: Verifier pip install pytest installs to a different Python than /usr/bin/python3. Verifier then runs python3 -m pytest, which fails with No module named pytest.
  • Agent work: Wrote correct Suricata rule, tested it, sid:1000001 fires on positive pcap.
  • Trial dir: suricata-custom-exfil__4473b0c0 (retry that ran)
  • Reproduce: Run the task, check verifier log for No module named pytest
  • Fix: Pin Python path in verifier test.sh, or use uv run pytest instead of python3 -m pytest
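
One way to rule out this kind of interpreter mismatch is to route both the install and the test run through the same interpreter. The sketch below assumes the verifier can be driven from a small Python entry point; `verifier_commands` and the example test path are hypothetical names, not part of the existing test.sh.

```python
import sys


def verifier_commands(test_path, python=sys.executable):
    """Build install and run commands that share one interpreter.

    Calling `<python> -m pip` and `<python> -m pytest` means pytest is
    installed into the same environment that later imports it.
    """
    install = [python, "-m", "pip", "install", "pytest"]
    run = [python, "-m", "pytest", test_path]
    return install, run


# Hypothetical usage: both commands start with the same interpreter path.
install_cmd, run_cmd = verifier_commands("/tests/test_outputs.py")
```

Passing each list to `subprocess.run(..., check=True)` would then execute the install followed by the tests. The same idea applies in shell by replacing a bare `pip install` with `python3 -m pip install`.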

2. video-tutorial-indexer

  • Bug: Verifier installs pytest-json-ctrf==0.1.0 but the plugin registers differently — pytest -p ctrf fails with No module named 'ctrf'. Zero tests ran.
  • Agent work: Transcribed video with Whisper, wrote tutorial_index.json with 29 chapters.
  • Trial dir: video-tutorial-indexer__4d31a011
  • Fix: Update ctrf plugin import name in test.sh or remove -p ctrf flag

3. sales-pivot-analysis

  • Bug: Verifier test.sh only installs pytest + pytest-json-ctrf, but test_outputs.py imports openpyxl at line 9. ModuleNotFoundError. Zero tests ran.
  • Agent work: Parsed PDF + XLSX, wrote demographic_analysis.xlsx with pivot tables.
  • Trial dir: sales-pivot-analysis__f7896973
  • Fix: Add openpyxl to verifier test.sh pip install
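
Missing verifier dependencies like this one could be caught before any trial runs. The following is a minimal sketch, assuming a preflight step is run inside the verifier environment; `top_level_imports` and `missing_modules` are hypothetical helpers. It parses a test file's source and reports imports the current interpreter cannot resolve.

```python
import ast
import importlib.util


def top_level_imports(source):
    """Return the top-level module names imported by Python source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as `from . import x`
            names.add(node.module.split(".")[0])
    return names


def missing_modules(source):
    """List imported modules that the current interpreter cannot find."""
    return sorted(
        name
        for name in top_level_imports(source)
        if importlib.util.find_spec(name) is None
    )
```

Run against the text of test_outputs.py in an environment without openpyxl, `missing_modules` would be expected to include `"openpyxl"`. The same check would also flag the missing torch import reported for multilingual-video-dubbing.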

4. pddl-tpp-planning

  • Bug: Verifier validate_plan() opens task01.pkl (reference solution pickle) which was never provisioned into /tests. FileNotFoundError.
  • Agent work: Read PDDL domain/problem files, wrote valid plan files.
  • Trial dir: pddl-tpp-planning__0ad04ccc
  • Fix: Include reference .pkl files in task tests/ directory

5. fix-build-google-auto

  • Bug: Verifier test.sh calls uv which is not installed in the sandbox. Dockerfile comment says "uv is installed" but never installs it. All 5 test lines fail with uv: command not found.
  • Agent work: Identified 4 cascading JDK 11 build issues, wrote 8 patches, got mvn verify -DskipTests=true to BUILD SUCCESS.
  • Trial dir: fix-build-google-auto__dc1de096
  • Fix: Add uv install to Dockerfile
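
A missing command-line tool can be detected the same way, before test.sh ever runs. This is a small sketch using the standard library; `missing_tools` is a hypothetical helper, and the list of required tools is an assumption that would need to come from each task's test.sh.

```python
import shutil


def missing_tools(tools):
    """Return the commands from `tools` that are not on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

For this task, `missing_tools(["uv"])` returning `["uv"]` inside the sandbox image would surface the problem at image-build or preflight time rather than as five `command not found` lines in the verifier log.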

Category 2: Sandbox Permissions (4 tasks)

Agent did correct work but couldn't write to required output paths. sandbox_user=agent can't write to root-owned directories.

6. syzkaller-ppdev-syzlang

  • Bug: /opt/syzkaller/sys/linux/ is owned by root. Agent (uid=1001) gets EACCES writing dev_ppdev.txt. Tried sudo (not installed), chmod (not permitted).
  • Agent work: Correctly identified ppdev ioctl interface, computed ioctl numbers, wrote proper syzlang description.
  • Trial dir: syzkaller-ppdev-syzlang__d928daed
  • Fix: chmod -R 777 /opt/syzkaller/sys/linux/ in Dockerfile or add agent to appropriate group
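
To make the failure mode concrete, here is a simplified model of the classic Unix permission check for creating a file in a directory. It is a sketch only: it ignores ACLs, capabilities, and the root override, and the mode 0o755 used in the usage note is an assumption, since the issue states only that the directory is root-owned. `can_create_in_dir` is a hypothetical helper.

```python
import stat


def can_create_in_dir(mode, owner_uid, owner_gid, uid, gids=()):
    """Return True if `uid` may create files in a directory with these bits.

    Creating a file needs both write and search (execute) permission on
    the directory, taken from exactly one of the owner/group/other triads.
    """
    if uid == owner_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed
```

With a root-owned directory at an assumed mode of 0o755, `can_create_in_dir(0o755, 0, 0, 1001, (1001,))` is False for the agent (uid=1001), which matches the EACCES seen in the trial. Either fix proposed above changes the result: mode 0o777 passes via the "other" bits, and mode 0o775 passes once the agent belongs to the owning group.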

7. taxonomy-tree-merge

  • Bug: (a) Environment setup took 369s (vs typical 30-100s), eating 41% of 900s timeout. (b) /root/output/ owned by root, agent can't write.
  • Agent work: Processed 8,718 records through embedding + clustering pipeline. Completed successfully but hit PermissionError on output.
  • Trial dir: taxonomy-tree-merge__ce812f15
  • Fix: Create output dir with agent permissions in Dockerfile; investigate slow build

8. lean4-proof

  • Bug: Lean toolchain at /root/.elan/bin/ not accessible to sandbox_user=agent. Agent can't run lake or lean to type-check proofs.
  • Agent work: Wrote a Lean 4 proof (blind, couldn't verify).
  • Trial dir: lean4-proof__336be05e
  • Fix: Install Lean toolchain in a shared path or grant agent access to /root/.elan/

9. multilingual-video-dubbing

  • Bug: (a) /outputs directory root-owned, agent can't write. (b) Verifier itself crashes: No module named 'torch' — torch not installed for verifier.
  • Agent work: Generated TTS via Gemini API, did audio processing.
  • Trial dir: multilingual-video-dubbing__eb0c860f
  • Fix: chmod /outputs in Dockerfile; add torch to verifier deps

Category 3: Environment Provisioning (2 tasks)

Input data missing from sandbox — agent has nothing to work with.

10. organize-messy-files

  • Bug: Sandbox only had 3 files (DAMOP.pptx, paper_file_1.docx, paper_file_2.docx). Verifier expects ~100 PDFs across 5+ subject folders.
  • Trial dir: organize-messy-files__7283c7f8
  • Fix: Ensure all PDFs are included in the Docker build context / COPY step

11. azure-bgp-oscillation-route-leak

  • Bug: Dockerfile's inline DATAGEN Python script should generate JSON files into /app/data/. In sandbox, /app/data/ was empty.
  • Trial dir: azure-bgp-oscillation-route-leak__d4273147
  • Fix: Debug the DATAGEN heredoc in Dockerfile — may fail silently during Daytona image build
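
Both provisioning failures went unnoticed until the agent started. A guard like the one sketched below, placed at the end of a data-generation step or in a preflight check, would turn an empty input directory into a loud error. `count_files` and `require_files` are hypothetical helpers, and the glob patterns and minimum counts are assumptions to be set per task.

```python
from pathlib import Path


def count_files(root, pattern):
    """Count regular files under `root` (recursively) matching a glob."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob(pattern) if path.is_file())


def require_files(root, pattern, minimum=1):
    """Raise if fewer than `minimum` matching files were provisioned."""
    found = count_files(root, pattern)
    if found < minimum:
        raise RuntimeError(
            f"{root}: expected at least {minimum} file(s) matching "
            f"{pattern!r}, found {found}"
        )
    return found
```

For azure-bgp-oscillation-route-leak, calling `require_files("/app/data", "*.json")` as the last line of the DATAGEN script would raise on an empty directory; the image build then fails only if the build step propagates that non-zero exit status.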

Category 4: Agent Launch / SDK Failures (3 tasks)

12. fix-build-agentops

  • Bug: Agent binary claude-agent-acp not found (rc=127). bugswarm/cached-images base image has PATH issues for Node.js/npm.
  • Trial dir: fix-build-agentops__a25f005e
  • Fix: Ensure Node.js 22 install works on bugswarm base image

13. latex-formula-extraction

  • Bug: Daytona SDK 0.168.0/0.169.0 SessionCommandLogsResponse pydantic validation error. Returns empty string instead of dict.
  • Trial dir: latex-formula-extraction__266e48be
  • Fix: Upstream Daytona SDK bug — report to Daytona

14. react-performance-debugging

  • Bug: Verifier pytest-playwright page fixture not found. 5 tests SKIPPED (async), 6 tests ERROR. Agent's React optimizations never evaluated.
  • Agent work: 21 tool calls, 7 code edits (parallelized API calls, added memoization, replaced lodash).
  • Trial dir: react-performance-debugging__f89a18bd
  • Fix: Fix conftest.py / pytest-playwright version compatibility in verifier

Additional Issues (not bugs, but score-impacting)

Disk space (2 tasks)

  • seismic-phase-picking and video-filler-word-remover: Sandbox 74% full at start. Agent wastes 5+ minutes clearing pip cache before doing real work. Combined with tight timeouts, causes failures.
  • Fix: Ensure sandbox images have sufficient free disk space
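
A disk headroom check at sandbox start could flag such images early. This is a rough sketch; `used_fraction` and `has_headroom` are hypothetical helpers, the 0.5 threshold is an arbitrary assumption, and the ratio computed here (used / total) can differ slightly from the Use% column that `df` prints.

```python
import shutil


def used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def has_headroom(path="/", max_used=0.5):
    """True if the filesystem is no fuller than the `max_used` fraction."""
    return used_fraction(path) <= max_used
```

A sandbox that starts at 74% full would fail `has_headroom("/", max_used=0.5)`, so the image could be rejected or rebuilt before an agent spends its timeout clearing caches.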

DinD Compose (4 tasks)

  • gh-repo-analytics, pedestrian-traffic-counting, pg-essay-to-audiobook, scheduling-email-assistant all fail with Process closed stdout (rc=None) because SSH pipes break through DinD layers.
  • Fix: PR on benchflow fix/dind-pty-acp branch — uses Daytona PTY WebSocket instead of SSH. Tested successfully on pedestrian-traffic-counting (reward=0.014, 7 tool calls).

Impact

| Category | Tasks | Potential score gain |
|---|---|---|
| Verifier broken (false FAILs) | 5 | +5 passes |
| Sandbox permissions | 4 | +3-4 passes |
| Environment provisioning | 2 | +1-2 passes |
| Agent launch / SDK | 3 | +1-2 passes |
| Total | 14 | +10-13 passes |

Current: 33.4/83 = 40.3%
Estimated if fixed: 43-46/83 = 52-55%

This would put Opus 4.7 above Gemini Flash (48.7%) on the leaderboard.
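
The estimate can be reproduced from the per-category figures. The sketch below sums the low and high ends of each category's gain and converts the result to a percentage of 83 tasks. Truncating the fractional baseline to whole passes is an assumption about how the 43-46 range was derived.

```python
TOTAL_TASKS = 83        # denominator used in the issue
BASELINE_PASSES = 33.4  # current score numerator from the issue

# (low, high) estimated extra passes per category, from the impact figures
GAINS = {
    "verifier broken": (5, 5),
    "sandbox permissions": (3, 4),
    "environment provisioning": (1, 2),
    "agent launch / SDK": (1, 2),
}

low_gain = sum(low for low, _ in GAINS.values())
high_gain = sum(high for _, high in GAINS.values())

# Assumption: the issue truncates 43.4 and 46.4 to whole passes.
low_passes = int(BASELINE_PASSES + low_gain)
high_passes = int(BASELINE_PASSES + high_gain)

low_pct = round(100 * low_passes / TOTAL_TASKS)
high_pct = round(100 * high_passes / TOTAL_TASKS)

print(f"+{low_gain}-{high_gain} passes -> "
      f"{low_passes}-{high_passes}/{TOTAL_TASKS} = {low_pct}-{high_pct}%")
```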
