CLI reference

BenchFlow uses a resource-verb pattern: bench <resource> <verb>.

bench agent

bench agent list

List all registered agents with their protocol and auth requirements.

bench agent list

bench agent show

Show details for a specific agent.

bench agent show gemini

bench eval

bench eval create

Create and run an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.

# From YAML config
bench eval create --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml

# From remote repo
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --concurrency 64 \
  --sandbox-setup-timeout 300

# From local directory
bench eval create --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview

# From a hosted PrimeIntellect / Verifiers environment
bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite

# Single task with mounted skills and the recommended skill nudge
bench eval create \
  --tasks-dir tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --skills-dir tasks/pdf-fix/environment/skills \
  --agent-env BENCHFLOW_SKILL_NUDGE=name

Flag	Default	Description
`--config`	—	YAML config file
`--tasks-dir`	—	Local task dir (single task with task.toml, or parent of many)
`--source-repo`	—	Remote repo as `org/repo` (e.g. `benchflow-ai/skillsbench`)
`--source-path`	—	Subpath within the repo (e.g. `tasks`)
`--source-ref`	—	Branch or tag to clone (e.g. `main`)
`--source-env`	—	Hosted environment source (e.g. `primeintellect/general-agent`)
`--source-env-version`	—	Hosted environment version
`--source-env-arg`	—	Hosted environment argument as `KEY=VALUE`; repeatable
`--source-env-num-examples`	`1`	Number of hosted environment examples
`--source-env-rollouts-per-example`	`1`	Rollouts per hosted environment example
`--source-env-max-tokens`	`1024`	Max tokens for hosted environment model calls
`--source-env-temperature`	`0.0`	Temperature for hosted environment model calls
`--source-env-sampling-arg`	—	Verifiers sampling argument as `KEY=VALUE`; repeatable (for example `reasoning_effort=minimal`)
`--agent`	`claude-agent-acp`	Agent name
`--model`	Agent default	Model ID
`--sandbox`	`docker`	Sandbox: docker, daytona, or modal
`--concurrency`	`4`	Max concurrent tasks (batch mode only)
`--jobs-dir`	`jobs`	Output directory
`--sandbox-user`	`agent`	Sandbox user (null for root)
`--sandbox-setup-timeout`	`120`	Timeout in seconds for sandbox user setup
`--skills-dir`	—	Skills directory to deploy into each task sandbox
`--agent-env`	—	Agent environment variable as `KEY=VALUE`; repeatable

When mounting skills, the recommended docs default is --agent-env BENCHFLOW_SKILL_NUDGE=name. It prepends a short hint telling the agent which skills are available and where to read them. More verbose modes are description and full. Omit the env var to leave BenchFlow's runtime default off.

--source-env is for external hosted environment hubs. The first supported runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity (env_uid, hub_url), installs the versioned package into an isolated local virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task sandbox selector for local/repo task sources; Verifiers source environments own their own harness and sandbox behavior. --model is passed to the Verifiers model endpoint; use a model id available to that provider. Provider-specific sampling options are not inferred; pass them explicitly with --source-env-sampling-arg.

bench eval list

List completed evaluations from a jobs directory.

bench eval list jobs/

bench skills

bench skills eval

Evaluate a skill against its evals.json test cases.

bench skills eval skills/my-skill/ \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona

bench tasks

bench tasks init

Scaffold a new benchmark task.

bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/

bench tasks check

Validate a task directory (Dockerfile, instruction.md, tests/).

bench tasks check tasks/my-task

bench tasks generate

Generate benchmark task directories from real agent traces.

bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50

Flag	Default	Description
`--from-local`	—	Generate from local Claude Code sessions
`--from-file`	—	Generate from a JSONL trace file
`--from-hf`	—	Generate from a HuggingFace dataset ID or alias
`--output`	`tasks`	Output directory for generated tasks
`--projects-dir`	`~/.claude/projects/`	Claude Code projects directory
`--project`	—	Filter local sessions by project path substring
`--format`	`auto`	Trace format override
`--split`	`train`	HuggingFace dataset split
`--max-rows`	`100`	Max rows to download from HuggingFace
`--limit`	`20`	Max traces to process
`--min-steps`	`2`	Minimum steps per trace
`--outcome`	—	Filter by outcome: success, failure, unknown
`--author`	`benchflow-traces`	Author name for generated task metadata
`--dry-run`	`false`	Preview traces without generating tasks

bench environment

bench environment create

Create an environment object from a task directory. This validates environment construction but does not start the sandbox.

bench environment create tasks/my-task --sandbox daytona

bench environment list

List active Daytona sandboxes, or list a hosted hub.

bench environment list
bench environment list --hub primeintellect --owner primeintellect --search general-agent --limit 5

bench environment show

Show hosted environment metadata.

bench environment show primeintellect/general-agent --version 0.1.1

bench environment inspect

Inspect a file from a hosted environment package.

bench environment inspect primeintellect/general-agent --version 0.1.1 --path README.md

bench environment cleanup

Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older than 24 hours; use --dry-run to preview what would be deleted.

bench environment cleanup --dry-run --max-age 1440

YAML Config Format

Batch config with skills and skill nudge

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2

Multi-scene (BYOS skill generation)

Use the Python API for multi-scene experiments. bench eval create --config is for batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built directly in Python.

task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300

scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"

  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLI reference

bench agent

bench agent list

bench agent show

bench eval

bench eval create

bench eval list

bench skills

bench skills eval

bench tasks

bench tasks init

bench tasks check

bench tasks generate

bench environment

bench environment create

bench environment list

bench environment show

bench environment inspect

bench environment cleanup

YAML Config Format

Batch config with skills and skill nudge

Multi-scene (BYOS skill generation)

FilesExpand file tree

cli.md

Latest commit

History

cli.md

File metadata and controls

CLI reference

bench agent

bench agent list

bench agent show

bench eval

bench eval create

bench eval list

bench skills

bench skills eval

bench tasks

bench tasks init

bench tasks check

bench tasks generate

bench environment

bench environment create

bench environment list

bench environment show

bench environment inspect

bench environment cleanup

YAML Config Format

Batch config with skills and skill nudge

Multi-scene (BYOS skill generation)