
feat(eval): cross-harness eval/benchmark framework (cli-anything-eval) #366

Open
omerarslan0 wants to merge 25 commits into HKUDS:main from omerarslan0:feat/eval-benchmark-framework

Conversation

@omerarslan0

Collaborator

Description

Generalizes the previously Audacity-only eval prototype into a shared, installable cli-anything-eval package — a cross-harness eval/benchmark framework. This implements the roadmap item "Benchmark suite for agent task completion rates."

The framework provides a convention-based task contract (TASK + run + optional verify/precheck/requires/prompt), a runner with pass/fail/error/skipped statuses, baseline regression detection, JSON + Markdown reports, a cross-harness leaderboard, and a CLI (cli-anything-eval run|suite). A verify(ctx) grader seam keeps tasks forward-compatible with future agent-driven grading while running deterministically today.
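As a rough illustration of the task contract described above, a task module might look like the sketch below. The `Ctx` class, the `TASK` fields, and the toy `execute` runner are assumptions made for this example and are not taken from the package; only the hook names (`TASK`, `run`, `verify`, `precheck`) and the four statuses come from the description.

```python
"""Hypothetical eval task module; every name here is illustrative."""
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

# Assumed shape of the TASK metadata; the real package's fields may differ.
TASK = {"id": "write_note", "description": "Write a note file and check its contents"}


@dataclass
class Ctx:
    """Stand-in for the context object a runner would pass to each hook."""
    workdir: Path
    artifacts: dict = field(default_factory=dict)


def precheck(ctx):
    """Return False to have the task reported as skipped rather than failed."""
    return ctx.workdir.is_dir()


def run(ctx):
    """Perform the task and record what it produced."""
    out = ctx.workdir / "note.txt"
    out.write_text("hello eval", encoding="utf-8")
    ctx.artifacts["note"] = out


def verify(ctx):
    """Deterministic grader: inspect the artifact that run() recorded."""
    return ctx.artifacts["note"].read_text(encoding="utf-8") == "hello eval"


def execute(ctx):
    """Toy runner that maps hook outcomes onto the four statuses."""
    if not precheck(ctx):
        return "skipped"
    try:
        run(ctx)
        return "pass" if verify(ctx) else "fail"
    except Exception:
        return "error"


with tempfile.TemporaryDirectory() as tmp:
    status = execute(Ctx(workdir=Path(tmp)))
print(status)  # pass
```

Keeping `verify` separate from `run` is what would let a future agent-driven grader replace the deterministic check without touching the task body.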

Adoption in this PR: Audacity is migrated onto the shared package via a backward-compatible shim; seed eval tasks are added for GIMP, Blender, and LibreOffice; a /cli-anything:eval plugin command and a HARNESS.md methodology section are added.

Type of Change

  • New Feature — adds new functionality to an existing harness or the plugin

(Also migrates the Audacity harness and seeds three others onto the new framework.)

Highlights

  • New top-level package cli-anything-eval/ exposing the cli_anything.eval namespace (PEP 420; depends only on click; imports nothing from any harness).
  • Status model: success_rate = pass / (pass + fail + error). Tasks that are skipped (unmet backend requires/precheck) are reported separately and excluded from the denominator. This is a deliberate, documented divergence from the test-suite "fail when backend missing" rule, so that a benchmark's completion rate is not polluted by missing backends.
  • Backward compatible: cli_anything.audacity.eval.runner keeps its public API; the existing cli-anything-audacity eval command and test_eval.py are unchanged in behavior.
  • Harness eval subcommands register behind an except ImportError guard, so a harness still works if the eval package is absent.
  • Discovery works under both normal site-packages and PEP 660 editable installs.
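The ImportError-guarded registration mentioned in the highlights can be sketched roughly as follows. This is a stdlib-only stand-in: the `commands` dict plays the role of a harness's click group, and `register_eval` is a hypothetical helper, not the code in this PR.

```python
import importlib


def register_eval(commands, module_name="cli_anything.eval"):
    """Add an 'eval' entry to `commands` only if the eval package imports.

    `commands` is a plain dict standing in for a CLI command group.
    """
    try:
        eval_pkg = importlib.import_module(module_name)
    except ImportError:
        # Eval package absent: leave the harness CLI exactly as it was.
        return False
    commands["eval"] = eval_pkg
    return True


commands = {"export": lambda: "exported"}
registered = register_eval(commands, module_name="hypothetical_missing_eval_pkg")
print(registered, sorted(commands))  # False ['export']
```

Because a missing module raises `ModuleNotFoundError`, which is a subclass of `ImportError`, the guard covers the "package not installed" case without catching unrelated exceptions.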

For Existing CLI Modifications

  • Audacity eval tests pass (test_eval.py 3/3); the 4 Audacity eval tasks run via the shared runner.
  • No test regressions — Blender unit suite still 165/165 after a Windows-encoding fix in core/render.py.
  • registry.json — not applicable (no version/description/requirement change to registry entries).

General Checklist

  • Code follows existing patterns and conventions
  • --json flag is supported on the new commands
  • Commit messages follow the conventional format (feat:, fix:, docs:, test:)
  • I have tested my changes locally

Test Results

cli-anything-eval package:   29 passed
audacity test_eval.py:        3 passed   (backward-compatibility)
blender unit (test_core.py): 165 passed   (no regression from render.py encoding fix)

cross-harness leaderboard (cli-anything-eval suite):
| Rank | Harness     | Passed/Attempted | Skipped | Success Rate |
| ---- | ----------- | ---------------- | ------- | ------------ |
| 1    | audacity    | 4/4              | 0       | 100%         |
| 2    | blender     | 1/1              | 1       | 100%         |
| 3    | gimp        | 1/1              | 0       | 100%         |
| 4    | libreoffice | 1/1              | 1       | 100%         |

(skipped rows are backend-dependent tasks — blender render_execute, libreoffice export_pdf — correctly skipped when the binary is absent.)
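A minimal sketch of how the success rate in the table follows from the status model, assuming per-task statuses are plain strings. The `success_rate` helper and the choice to return `None` when nothing was attempted are assumptions for illustration, not the package's API.

```python
from collections import Counter


def success_rate(statuses):
    """pass / (pass + fail + error); skipped tasks are left out of the denominator."""
    counts = Counter(statuses)
    attempted = counts["pass"] + counts["fail"] + counts["error"]
    if attempted == 0:
        return None  # assumption: no attempted tasks means no rate to report
    return counts["pass"] / attempted


# The blender row above: one task passed, one was skipped (binary absent).
print(success_rate(["pass", "skipped"]))  # 1.0
# A skipped task never lowers the rate, but a fail or error does.
print(success_rate(["pass", "fail", "skipped"]))  # 0.5
```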

Follow-ups (out of scope for this PR)

  • Optional extras_require["eval"] instead of a hard dependency (the registration is already ImportError-guarded).
  • A CI workflow to run the suite and publish the leaderboard.
  • A real (key-gated) agent driver behind the prompt/verify seam.

yuh-yang and others added 25 commits June 14, 2026 17:27
* feat: CLI-Matrix with multi-approach stages, skill discovery, and matrix search

Introduce CLI-Matrix — curated multi-CLI workflow matrices that agents can
install in one command. The video-creation matrix bundles 11 CLIs across 8
production stages (AI video gen, capture, audio, voice/TTS, music, NLE
editing, captions, thumbnails).

Each stage now exposes a goal, alternative approaches (Python libs, cloud
APIs, native commands), and skill_search_hints that encourage agents to
dynamically discover relevant skills via `npx skills search` rather than
relying on hard-coded tool lists.

Key changes:
- matrix_registry.json: extended stage schema with goal, alternatives,
  skill_search_hints fields
- cli-hub matrix list/search/info/install commands
- matrix_skill.py: renders dynamic SKILL.md with stage tooling overview,
  install status, and aggregated discovery commands
- Fixed brittle parents[2] repo root detection with git-based lookup
- 85 tests passing (10 new)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* update cli-matrix

* feat(cli-matrix): eco-first capability-based matrix, v2 schema + S2-S5 SKILLs

- Add docs/cli-matrix/matrix_registry.schema.md describing v2 capability-based
  registry shape (capabilities[], providers with kind/requires/cost/quality/
  offline, recipes[], known_gaps[], decision rubric, suggest-to-user template).
- Rewrite cli-hub-matrix/video-creation/SKILL.md and matrix_registry.json (S1)
  around capabilities + providers + recipes instead of linear stages.
- Rename Vn -> Sn across cli-matrix-plan.md and test fixtures.
- Reorder scenarios by current completeness; rewrite S2 knowledge-research,
  S3 3d-cad, S4 game-development, S5 image-design in v2 capability form with
  full SKILL.md files.
- Add docs/cli-matrix/test-plans/video-creation.md with 13 long realistic
  end-to-end tasks as checkable todo lists, each exercising 5-9 capabilities.
- Move cli-matrix-plan.md and matrix_registry.schema.md under docs/cli-matrix/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Document preview protocol and Audacity autosave

Add the preview bundle protocol plan, record video matrix review evidence, and make one-shot Audacity project mutations persist to disk with E2E coverage.

* Update CLI matrix skill registry and video workflow

* chore(git): always ignore docs/* — working documents stay local

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-matrix): video-creation skill WIP — sound design, source triage, render doctor, NLE refs, video_doctor script

- SKILL.md: adds sound.design capability, bundled video_doctor.py provider,
  recipe updates, and links to five new reference modules
- new references: art-direction-review, nle-shotcut-kdenlive, render-doctor,
  sound-design, source-triage; captions and story-structure-audio updated
- scripts/video_doctor.py: bundled probe/diagnose helper

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(cli-hub): preflight detects packages by import name or PyPI dist name (P1-3)

_package_available() now tries find_spec as-is, dash->underscore
normalized, then importlib.metadata dist lookup (PEP 503), so registry
entries like edge-tts are detected when installed. All lookup failures
degrade to unavailable instead of crashing preflight. Adds 6 tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-hub): distribute matrix skill content to installed skills, wheels, and Pages (P1-4)

- matrix install renders to ~/.cli-hub/matrix/<name>/SKILL.md and copies
  references/ and scripts/ beside it (pycache excluded, idempotent
  reinstall purges stale files); legacy flat <name>.SKILL.md still read
- content lookup chain: repo checkout -> bundled cli_hub/_matrix_data
  (vendored into sdist/wheel by setup.py build hooks + MANIFEST.in) ->
  published Pages URL -> stub
- new 'matrix install --skill-only' renders skill + assets without
  installing CLIs
- deploy-pages.yml: copy cli-hub-matrix/ into the site after the Jekyll
  build (served verbatim at /matrix/<name>/); triggers remain main-only
- 12 new tests in tests/test_matrix_skill_dist.py (142 total pass)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-matrix): register S2-S5 matrices and resync video-creation registry (P1-1, P1-2)

- add knowledge-research (S2, 12 caps), 3d-cad (S3, 12), game-development
  (S4, 10), image-design (S5, 9) derived from their SKILL.md drafts; full
  v2 shape with capabilities, providers, recipes, known_gaps; clis lists
  cross-checked against registry.json (unresolvable tools represented as
  public-cli/native/python/api/agent-skill providers instead)
- video-creation: add sound.design capability (5 providers, wired into 5
  recipes), register scripts/video_doctor.py as bundled-script provider
  under quality.review, cite the 5 new reference modules in provider
  notes, refresh description
- python provider package strings use import names (cv2, edge_tts,
  ffmpeg, skimage, ...) so preflight detection is robust to dist-name
  variants like opencv-python-headless
- fix homepage URLs to docs/cli-matrix/cli-matrix-plan.md; bump
  meta.updated to 2026-06-11

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-hub): ship the unified Gallery design as the production homepage

Replace docs/hub/index.html with the finalized "Gallery / R2 Flip" main page
(Steel Sky palette default, Newsreader serif hero title, liquid-glass flip
cards, JS masonry catalog, and the unified Matrices layer with bidirectional
stitching). Production-indexable robots meta retained.

Stop tracking docs/cli-matrix/* — the CLI-Matrix working docs stay local and
confidential; add an explicit /docs/cli-matrix/ ignore rule.

* feat: CLI-Matrix command family + Hub docs/demos pages and responsive nav

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions (bot) added the existing-cli-fix and cli-anything-skill labels on Jun 22, 2026

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 792aa264f6


}

comparison = None
if baseline_path and Path(baseline_path).exists():


P2: Do not silently ignore missing baselines

When a caller passes a baseline path that does not exist (for example a typo with --baseline ... --fail-on-regression), this guard leaves comparison as None, so the run succeeds and the regression gate is bypassed instead of surfacing the bad baseline path. Since load_baseline() already raises FileNotFoundError, the existence check should not suppress that error unless this is specifically the update-baseline creation path.
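One possible shape for the suggested fix, as a sketch only: `resolve_baseline` and its `update_baseline` flag are hypothetical names, and the `load_baseline` below is a stand-in with an assumed signature (the review only states that the real one raises FileNotFoundError).

```python
import json
from pathlib import Path


def load_baseline(path):
    """Stand-in loader; open() raises FileNotFoundError for a missing file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def resolve_baseline(baseline_path, update_baseline=False):
    """Return baseline data, or None only when no comparison is expected."""
    if not baseline_path:
        return None  # no baseline requested, so nothing to compare against
    if update_baseline and not Path(baseline_path).exists():
        return None  # first-time creation: the file is allowed to be missing
    # Otherwise a bad path surfaces as FileNotFoundError instead of
    # silently bypassing the regression gate.
    return load_baseline(baseline_path)
```

With this shape, a typo in the baseline path fails the run loudly, while the baseline-creation path still works on a clean checkout.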



def run(ctx):
    scene = create_scene(name="eval", resolution_x=320, resolution_y=240, samples=8)
    add_object(scene, mesh_type="cube", location=[0, 0, 0])


P2: Add an active camera before rendering

When Blender is installed, this task builds a scene with only a cube; create_scene() starts with no cameras and the generated bpy script clears Blender's default scene before adding project cameras, so the render has no active camera and produces no PNG (or reports a Blender render error). The eval task should add an active camera (and ideally lighting) before calling render_scene() so environments with Blender don't fail this benchmark.


@omerarslan0

Collaborator Author

@yuh-yang This PR is on the larger side (48 files) and touches shared/core infrastructure — it adds a new shared cli-anything-eval package and migrates the Audacity harness onto it (plus seeds GIMP/Blender/LibreOffice). Because of that surface area, could you please give it a detailed review rather than a quick pass when you have time? Happy to clarify any design decision or split it into smaller PRs if that helps. Thanks!


Labels

cli-anything-skill: Changes CLI-Anything plugin or skill files
existing-cli-fix: Fixes or improves an existing CLI harness


2 participants