feat(eval): cross-harness eval/benchmark framework (cli-anything-eval) by omerarslan0 · Pull Request #366 · HKUDS/CLI-Anything

omerarslan0 · 2026-06-22T15:10:35Z

Description

Generalizes the previously Audacity-only eval prototype into a shared, installable cli-anything-eval package — a cross-harness eval/benchmark framework. This implements the roadmap item "Benchmark suite for agent task completion rates."

The framework provides a convention-based task contract (TASK + run + optional verify/precheck/requires/prompt), a runner with pass/fail/error/skipped statuses, baseline regression detection, JSON + Markdown reports, a cross-harness leaderboard, and a CLI (cli-anything-eval run|suite). A verify(ctx) grader seam keeps tasks forward-compatible with future agent-driven grading while running deterministically today.
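As a rough illustration of the contract described above, the sketch below shows a hypothetical task and a tiny runner that maps it to one of the four statuses. The shape of TASK, the ctx object, the execute helper, and the rule that a falsy precheck means "skipped" are all assumptions made for illustration; the actual cli_anything.eval API may differ.

```python
from types import SimpleNamespace

# Hypothetical task module: metadata plus run, with optional verify/precheck.
TASK = {"id": "demo_sum", "description": "Add two numbers"}


def precheck(ctx):
    """Return True when the backend the task needs is available."""
    return True


def run(ctx):
    """Perform the work and leave the outcome on the context."""
    ctx.result = 2 + 3


def verify(ctx):
    """Deterministic grader; an agent-driven grader could slot in here later."""
    return ctx.result == 5


def execute(task_run, task_verify=None, task_precheck=None):
    """Run one task and map the outcome to pass/fail/error/skipped."""
    ctx = SimpleNamespace(result=None)
    if task_precheck is not None and not task_precheck(ctx):
        return "skipped"  # backend missing: not counted as an attempt
    try:
        task_run(ctx)
        ok = True if task_verify is None else bool(task_verify(ctx))
    except Exception:
        return "error"  # the task crashed rather than producing a wrong result
    return "pass" if ok else "fail"


print(TASK["id"], execute(run, verify, precheck))
```

Keeping verify separate from run is what lets the grading strategy change later without touching the task body.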

Adoption in this PR: Audacity is migrated onto the shared package via a backward-compatible shim; seed eval tasks are added for GIMP, Blender, and LibreOffice; a /cli-anything:eval plugin command and a HARNESS.md methodology section are added.

Type of Change

New Feature — adds new functionality to an existing harness or the plugin

(Also migrates the Audacity harness and seeds three others onto the new framework.)

Highlights

New top-level package cli-anything-eval/ exposing the cli_anything.eval namespace (PEP 420; depends only on click; imports nothing from any harness).
Status model: success_rate = pass / (pass + fail + error), where each term is the number of tasks with that status. skipped (unmet backend requires/precheck) is reported separately and excluded from the denominator. This is a deliberate, documented divergence from the test-suite "fail when backend missing" rule, so that a benchmark's completion rate is not polluted by missing backends.
Backward compatible: cli_anything.audacity.eval.runner keeps its public API; the existing cli-anything-audacity eval command and test_eval.py are unchanged in behavior.
Harness eval subcommands register behind an except ImportError guard, so a harness still works if the eval package is absent.
Discovery works under both normal site-packages and PEP 660 editable installs.
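The ImportError-guarded registration mentioned in the highlights can be sketched roughly as follows. This is not the PR's code: register_optional, the plain dict standing in for a click group, and the module names are hypothetical, and the stdlib json module stands in for the eval package so the example runs anywhere.

```python
import importlib


def register_optional(commands, module_name, attr_name, command_name="eval"):
    """Attach an optional subcommand; return False if its package is absent."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The optional package is not installed: the harness keeps working,
        # it just does not expose the extra subcommand.
        return False
    commands[command_name] = getattr(module, attr_name)
    return True


commands = {}  # stand-in for a CLI group's command table
register_optional(commands, "json", "dumps")  # stand-in for the eval package
register_optional(commands, "no_such_eval_package", "make_command", "missing")
print(sorted(commands))
```

Because the guard only catches ImportError, a bug inside an installed eval package would still surface instead of being silently swallowed.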

For Existing CLI Modifications

Audacity eval tests pass (test_eval.py 3/3); the 4 Audacity eval tasks run via the shared runner.
No test regressions — Blender unit suite still 165/165 after a Windows-encoding fix in core/render.py.
registry.json — not applicable (no version/description/requirement change to registry entries).

General Checklist

Code follows existing patterns and conventions
--json flag is supported on the new commands
Commit messages follow the conventional format (feat:, fix:, docs:, test:)
I have tested my changes locally

Test Results

cli-anything-eval package:   29 passed
audacity test_eval.py:        3 passed   (backward-compatibility)
blender unit (test_core.py): 165 passed   (no regression from render.py encoding fix)

cross-harness leaderboard (cli-anything-eval suite):
| Rank | Harness     | Passed/Attempted | Skipped | Success Rate |
| ---- | ----------- | ---------------- | ------- | ------------ |
| 1    | audacity    | 4/4              | 0       | 100%         |
| 2    | blender     | 1/1              | 1       | 100%         |
| 3    | gimp        | 1/1              | 0       | 100%         |
| 4    | libreoffice | 1/1              | 1       | 100%         |

(skipped rows are backend-dependent tasks — blender render_execute, libreoffice export_pdf — correctly skipped when the binary is absent.)
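The leaderboard columns follow from the status model: skipped tasks are counted on their own and never enter the denominator. The sketch below recomputes the table from per-task statuses. The function names, the None rate for a harness with no attempted tasks, and the alphabetical tie-break are assumptions; the PR states only that rows are ranked by success rate.

```python
from collections import Counter


def success_rate(statuses):
    """pass / (pass + fail + error); skipped is excluded from the denominator."""
    counts = Counter(statuses)
    attempted = counts["pass"] + counts["fail"] + counts["error"]
    return counts["pass"] / attempted if attempted else None


def leaderboard(results):
    """Build rows per harness, ranked by success rate (ties broken by name)."""
    rows = []
    for harness, statuses in results.items():
        counts = Counter(statuses)
        attempted = counts["pass"] + counts["fail"] + counts["error"]
        rows.append({
            "harness": harness,
            "passed": counts["pass"],
            "attempted": attempted,
            "skipped": counts["skipped"],
            "rate": success_rate(statuses),
        })
    # Harnesses with nothing attempted sort last; higher rates sort first.
    rows.sort(key=lambda r: (r["rate"] is None, -(r["rate"] or 0.0), r["harness"]))
    return rows


results = {
    "libreoffice": ["pass", "skipped"],
    "gimp": ["pass"],
    "blender": ["pass", "skipped"],
    "audacity": ["pass", "pass", "pass", "pass"],
}
for rank, row in enumerate(leaderboard(results), start=1):
    print(rank, row["harness"], f'{row["passed"]}/{row["attempted"]}', row["skipped"])
```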

Follow-ups (out of scope for this PR)

Optional extras_require["eval"] instead of a hard dependency (the registration is already ImportError-guarded).
A CI workflow to run the suite and publish the leaderboard.
A real (key-gated) agent driver behind the prompt/verify seam.

* feat: CLI-Matrix with multi-approach stages, skill discovery, and matrix search

  Introduce CLI-Matrix: curated multi-CLI workflow matrices that agents can install in one command. The video-creation matrix bundles 11 CLIs across 8 production stages (AI video gen, capture, audio, voice/TTS, music, NLE editing, captions, thumbnails). Each stage now exposes a goal, alternative approaches (Python libs, cloud APIs, native commands), and skill_search_hints that encourage agents to dynamically discover relevant skills via `npx skills search` rather than relying on hard-coded tool lists.

  Key changes:
  - matrix_registry.json: extended stage schema with goal, alternatives, skill_search_hints fields
  - cli-hub matrix list/search/info/install commands
  - matrix_skill.py: renders dynamic SKILL.md with stage tooling overview, install status, and aggregated discovery commands
  - Fixed brittle parents[2] repo root detection with git-based lookup
  - 85 tests passing (10 new)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* update cli-matrix

* feat(cli-matrix): eco-first capability-based matrix, v2 schema + S2-S5 SKILLs

  - Add docs/cli-matrix/matrix_registry.schema.md describing v2 capability-based registry shape (capabilities[], providers with kind/requires/cost/quality/offline, recipes[], known_gaps[], decision rubric, suggest-to-user template).
  - Rewrite cli-hub-matrix/video-creation/SKILL.md and matrix_registry.json (S1) around capabilities + providers + recipes instead of linear stages.
  - Rename Vn -> Sn across cli-matrix-plan.md and test fixtures.
  - Reorder scenarios by current completeness; rewrite S2 knowledge-research, S3 3d-cad, S4 game-development, S5 image-design in v2 capability form with full SKILL.md files.
  - Add docs/cli-matrix/test-plans/video-creation.md with 13 long realistic end-to-end tasks as checkable todo lists, each exercising 5-9 capabilities.
  - Move cli-matrix-plan.md and matrix_registry.schema.md under docs/cli-matrix/.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Document preview protocol and Audacity autosave

  Add the preview bundle protocol plan, record video matrix review evidence, and make one-shot Audacity project mutations persist to disk with E2E coverage.

* Update CLI matrix skill registry and video workflow

* chore(git): always ignore docs/*; working documents stay local

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-matrix): video-creation skill WIP: sound design, source triage, render doctor, NLE refs, video_doctor script

  - SKILL.md: adds sound.design capability, bundled video_doctor.py provider, recipe updates, and links to five new reference modules
  - new references: art-direction-review, nle-shotcut-kdenlive, render-doctor, sound-design, source-triage; captions and story-structure-audio updated
  - scripts/video_doctor.py: bundled probe/diagnose helper

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(cli-hub): preflight detects packages by import name or PyPI dist name (P1-3)

  _package_available() now tries find_spec as-is, dash->underscore normalized, then importlib.metadata dist lookup (PEP 503), so registry entries like edge-tts are detected when installed. All lookup failures degrade to unavailable instead of crashing preflight. Adds 6 tests.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-hub): distribute matrix skill content to installed skills, wheels, and Pages (P1-4)

  - matrix install renders to ~/.cli-hub/matrix/<name>/SKILL.md and copies references/ and scripts/ beside it (pycache excluded, idempotent reinstall purges stale files); legacy flat <name>.SKILL.md still read
  - content lookup chain: repo checkout -> bundled cli_hub/_matrix_data (vendored into sdist/wheel by setup.py build hooks + MANIFEST.in) -> published Pages URL -> stub
  - new 'matrix install --skill-only' renders skill + assets without installing CLIs
  - deploy-pages.yml: copy cli-hub-matrix/ into the site after the Jekyll build (served verbatim at /matrix/<name>/); triggers remain main-only
  - 12 new tests in tests/test_matrix_skill_dist.py (142 total pass)

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-matrix): register S2-S5 matrices and resync video-creation registry (P1-1, P1-2)

  - add knowledge-research (S2, 12 caps), 3d-cad (S3, 12), game-development (S4, 10), image-design (S5, 9) derived from their SKILL.md drafts; full v2 shape with capabilities, providers, recipes, known_gaps; clis lists cross-checked against registry.json (unresolvable tools represented as public-cli/native/python/api/agent-skill providers instead)
  - video-creation: add sound.design capability (5 providers, wired into 5 recipes), register scripts/video_doctor.py as bundled-script provider under quality.review, cite the 5 new reference modules in provider notes, refresh description
  - python provider package strings use import names (cv2, edge_tts, ffmpeg, skimage, ...) so preflight detection is robust to dist-name variants like opencv-python-headless
  - fix homepage URLs to docs/cli-matrix/cli-matrix-plan.md; bump meta.updated to 2026-06-11

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-hub): ship the unified Gallery design as the production homepage

  Replace docs/hub/index.html with the finalized "Gallery / R2 Flip" main page (Steel Sky palette default, Newsreader serif hero title, liquid-glass flip cards, JS masonry catalog, and the unified Matrices layer with bidirectional stitching). Production-indexable robots meta retained. Stop tracking docs/cli-matrix/*: the CLI-Matrix working docs stay local and confidential; add an explicit /docs/cli-matrix/ ignore rule.

* feat: CLI-Matrix command family + Hub docs/demos pages and responsive nav

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 792aa264f6


chatgpt-codex-connector · 2026-06-22T15:15:07Z

+    }
+
+    comparison = None
+    if baseline_path and Path(baseline_path).exists():


Do not silently ignore missing baselines

When a caller passes a baseline path that does not exist (for example a typo with --baseline ... --fail-on-regression), this guard leaves comparison as None, so the run succeeds and the regression gate is bypassed instead of surfacing the bad baseline path. Since load_baseline() already raises FileNotFoundError, the existence check should not suppress that error unless this is specifically the update-baseline creation path.
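One way to act on this suggestion is sketched below. It is not the PR's code: resolve_baseline and the update_baseline parameter are hypothetical names, and load_baseline is a simplified stand-in that shares only the documented behavior of raising FileNotFoundError on a missing file.

```python
import json
from pathlib import Path


def load_baseline(path):
    """Simplified stand-in: read a JSON baseline, raising if the file is missing."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def resolve_baseline(baseline_path, update_baseline=False):
    """Return the baseline to compare against, or None when none applies."""
    if not baseline_path:
        return None  # no baseline requested, so there is nothing to compare
    if update_baseline and not Path(baseline_path).exists():
        return None  # first run of the creation path: the file is written later
    # Any other missing path is a caller error (for example a typo), so let
    # FileNotFoundError propagate instead of silently skipping the gate.
    return load_baseline(baseline_path)
```

With this shape, a mistyped --baseline combined with --fail-on-regression fails loudly rather than passing with no comparison performed.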


chatgpt-codex-connector · 2026-06-22T15:15:08Z

+
+def run(ctx):
+    scene = create_scene(name="eval", resolution_x=320, resolution_y=240, samples=8)
+    add_object(scene, mesh_type="cube", location=[0, 0, 0])


Add an active camera before rendering

When Blender is installed, this task builds a scene with only a cube; create_scene() starts with no cameras and the generated bpy script clears Blender's default scene before adding project cameras, so the render has no active camera and produces no PNG (or reports a Blender render error). The eval task should add an active camera (and ideally lighting) before calling render_scene() so environments with Blender don't fail this benchmark.


omerarslan0 · 2026-06-22T15:46:43Z

@yuh-yang This PR is on the larger side (48 files) and touches shared/core infrastructure — it adds a new shared cli-anything-eval package and migrates the Audacity harness onto it (plus seeds GIMP/Blender/LibreOffice). Because of that surface area, could you please give it a detailed review rather than a quick pass when you have time? Happy to clarify any design decision or split it into smaller PRs if that helps. Thanks!

yuh-yang and others added 25 commits June 14, 2026 17:27

feat(eval): scaffold cli-anything-eval package with core contracts

11eaac6

feat(eval): add safe_write_json io helper

5f5875f

feat(eval): add baseline load + regression comparison

4fe44bf

test(eval): strengthen baseline regression assertions

b54ef36

feat(eval): add report markdown + baseline projection + leaderboard

c9ad565

fix(eval): rank leaderboard rows by success rate in the renderer

17506d0

feat(eval): add task discovery, execution and summary

6187507

fix(eval): skip tasks before creating dirs; guard test sys.path insert

9adf2fb

feat(eval): add run_eval orchestration with reports and baseline

349f2c9

style(eval): consolidate imports in runner and runner tests

dae37eb

feat(eval): add cross-harness suite discovery and leaderboard

40f224a

fix(eval): treat empty harness list as run-none in run_suite

7b42a81

feat(eval): add console script and per-harness command factory

038f796

docs(eval): add cli-anything-eval package README

082dd46

refactor(audacity): migrate eval runner onto shared cli-anything-eval

bf038b7

feat(gimp): add eval tasks using shared cli-anything-eval

ef14627

fix(gimp): narrow eval precheck to ImportError

c290506

feat(blender): add eval tasks using shared cli-anything-eval

9c0714c

refactor(blender): place eval registration before __main__ guard

9189414

feat(libreoffice): add eval tasks using shared cli-anything-eval

294f087

docs(eval): add plugin command, HARNESS.md methodology, and roadmap update

ebb9583

fix(eval): align HARNESS.md example, add io encoding, build blender argv directly

58fa575

fix(eval): discover harnesses via distributions for editable installs

c77c29e

test(eval): fix sys.path leak and monkeypatch target in metadata discovery test

792aa26

omerarslan0 requested review from yuh-yang and zhangxilong-43 as code owners June 22, 2026 15:10

github-actions Bot added labels existing-cli-fix (Fixes or improves an existing CLI harness) and cli-anything-skill (Changes CLI-Anything plugin or skill files) Jun 22, 2026

chatgpt-codex-connector Bot reviewed Jun 22, 2026


yuh-yang force-pushed the main branch from 58608d3 to dc73924 June 25, 2026 14:29

omerarslan0 wants to merge 25 commits into HKUDS:main from omerarslan0:feat/eval-benchmark-framework
