This directory contains the runnable examples referenced by the BenchFlow docs.
They live under docs/examples/ so examples and docs move together.
Use bench eval create for one task:
bench eval create \
--source-repo benchflow-ai/skillsbench \
--source-path tasks/edit-pdf \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona

Sandboxes are docker, daytona, and modal.
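To compare the three sandboxes on the same task, the command above can be generated once per backend. The following is a minimal sketch, not part of BenchFlow: `print_sandbox_commands` is a hypothetical helper, and it only prints the commands so they can be reviewed before being run. All flags and values are copied from the single-task example.

```shell
# Hypothetical helper: print one `bench eval create` command per sandbox.
# Nothing is executed; each line of output is a command you could run yourself.
print_sandbox_commands() {
  for sandbox in docker daytona modal; do
    echo bench eval create \
      --source-repo benchflow-ai/skillsbench \
      --source-path tasks/edit-pdf \
      --agent gemini \
      --model gemini-3.1-flash-lite-preview \
      --sandbox "$sandbox"
  done
}

print_sandbox_commands
```

Printing instead of executing keeps the sketch safe to paste into any shell; pipe the output to `sh` only after checking that each sandbox is configured on your machine.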
When an example or task has a skills directory, mount it and pass the skill
nudge. The docs use BENCHFLOW_SKILL_NUDGE=name as the default recommendation:
bench eval create \
--tasks-dir tasks/my-task \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--skills-dir tasks/my-task/environment/skills \
--agent-env BENCHFLOW_SKILL_NUDGE=name

Other nudge modes are description and full. Omit
--agent-env BENCHFLOW_SKILL_NUDGE=... to leave BenchFlow's runtime default off.
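The rule "mount the skills directory and pass the nudge only when the task has one" can be sketched as a small wrapper. This is an illustration under assumptions: `build_eval_command` is a hypothetical helper, and the `environment/skills` location is inferred from the example path above rather than from a documented layout. The helper prints the command instead of running it.

```shell
# Hypothetical helper: build a `bench eval create` command for a task directory.
# Usage: build_eval_command TASK_DIR [NUDGE]   (NUDGE: name, description, or full)
build_eval_command() {
  task_dir=$1
  nudge=${2:-name}                            # "name" is the docs' recommended mode
  skills_dir="$task_dir/environment/skills"   # assumed layout, taken from the example

  # Base command: flags copied from the example above.
  set -- bench eval create \
    --tasks-dir "$task_dir" \
    --agent gemini \
    --model gemini-3.1-flash-lite-preview \
    --sandbox daytona

  # Add the skills flags only when the task actually ships a skills directory.
  if [ -d "$skills_dir" ]; then
    set -- "$@" --skills-dir "$skills_dir" \
      --agent-env "BENCHFLOW_SKILL_NUDGE=$nudge"
  fi

  echo "$*"
}

build_eval_command tasks/my-task
```

A task without a skills directory gets the plain command, so the nudge is never passed without skills to point at.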

- coder-reviewer-demo.py runs a single-agent baseline and a coder-reviewer scene against a task directory.
- scene-patterns.md explains single-agent, self-review, specialist-review, and client-advisor scene patterns.
- nanofirm-task/ is a tiny task directory you can use when checking task layout, verifier behavior, and oracle execution.
- user_dogfood.py demonstrates a rule-based FunctionUser progressive-disclosure loop.
- swebench_pro_user_dogfood.py runs the progressive-disclosure pattern on SWE-bench Pro-style tasks.