feat(eval): cross-harness eval/benchmark framework (cli-anything-eval) #366
omerarslan0 wants to merge 25 commits into
* feat: CLI-Matrix with multi-approach stages, skill discovery, and matrix search

  Introduce CLI-Matrix: curated multi-CLI workflow matrices that agents can install in one command. The video-creation matrix bundles 11 CLIs across 8 production stages (AI video gen, capture, audio, voice/TTS, music, NLE editing, captions, thumbnails). Each stage now exposes a goal, alternative approaches (Python libs, cloud APIs, native commands), and skill_search_hints that encourage agents to dynamically discover relevant skills via `npx skills search` rather than relying on hard-coded tool lists.

  Key changes:
  - matrix_registry.json: extended stage schema with goal, alternatives, skill_search_hints fields
  - cli-hub matrix list/search/info/install commands
  - matrix_skill.py: renders dynamic SKILL.md with stage tooling overview, install status, and aggregated discovery commands
  - Fixed brittle parents[2] repo root detection with git-based lookup
  - 85 tests passing (10 new)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* update cli-matrix

* feat(cli-matrix): eco-first capability-based matrix, v2 schema + S2-S5 SKILLs

  - Add docs/cli-matrix/matrix_registry.schema.md describing v2 capability-based registry shape (capabilities[], providers with kind/requires/cost/quality/offline, recipes[], known_gaps[], decision rubric, suggest-to-user template).
  - Rewrite cli-hub-matrix/video-creation/SKILL.md and matrix_registry.json (S1) around capabilities + providers + recipes instead of linear stages.
  - Rename Vn -> Sn across cli-matrix-plan.md and test fixtures.
  - Reorder scenarios by current completeness; rewrite S2 knowledge-research, S3 3d-cad, S4 game-development, S5 image-design in v2 capability form with full SKILL.md files.
  - Add docs/cli-matrix/test-plans/video-creation.md with 13 long realistic end-to-end tasks as checkable todo lists, each exercising 5-9 capabilities.
  - Move cli-matrix-plan.md and matrix_registry.schema.md under docs/cli-matrix/.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Document preview protocol and Audacity autosave

  Add the preview bundle protocol plan, record video matrix review evidence, and make one-shot Audacity project mutations persist to disk with E2E coverage.

* Update CLI matrix skill registry and video workflow

* chore(git): always ignore docs/*; working documents stay local

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-matrix): video-creation skill WIP (sound design, source triage, render doctor, NLE refs, video_doctor script)

  - SKILL.md: adds sound.design capability, bundled video_doctor.py provider, recipe updates, and links to five new reference modules
  - new references: art-direction-review, nle-shotcut-kdenlive, render-doctor, sound-design, source-triage; captions and story-structure-audio updated
  - scripts/video_doctor.py: bundled probe/diagnose helper

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(cli-hub): preflight detects packages by import name or PyPI dist name (P1-3)

  _package_available() now tries find_spec as-is, dash->underscore normalized, then importlib.metadata dist lookup (PEP 503), so registry entries like edge-tts are detected when installed. All lookup failures degrade to unavailable instead of crashing preflight. Adds 6 tests.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-hub): distribute matrix skill content to installed skills, wheels, and Pages (P1-4)

  - matrix install renders to ~/.cli-hub/matrix/<name>/SKILL.md and copies references/ and scripts/ beside it (pycache excluded, idempotent reinstall purges stale files); legacy flat <name>.SKILL.md still read
  - content lookup chain: repo checkout -> bundled cli_hub/_matrix_data (vendored into sdist/wheel by setup.py build hooks + MANIFEST.in) -> published Pages URL -> stub
  - new 'matrix install --skill-only' renders skill + assets without installing CLIs
  - deploy-pages.yml: copy cli-hub-matrix/ into the site after the Jekyll build (served verbatim at /matrix/<name>/); triggers remain main-only
  - 12 new tests in tests/test_matrix_skill_dist.py (142 total pass)

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-matrix): register S2-S5 matrices and resync video-creation registry (P1-1, P1-2)

  - add knowledge-research (S2, 12 caps), 3d-cad (S3, 12), game-development (S4, 10), image-design (S5, 9) derived from their SKILL.md drafts; full v2 shape with capabilities, providers, recipes, known_gaps; clis lists cross-checked against registry.json (unresolvable tools represented as public-cli/native/python/api/agent-skill providers instead)
  - video-creation: add sound.design capability (5 providers, wired into 5 recipes), register scripts/video_doctor.py as bundled-script provider under quality.review, cite the 5 new reference modules in provider notes, refresh description
  - python provider package strings use import names (cv2, edge_tts, ffmpeg, skimage, ...) so preflight detection is robust to dist-name variants like opencv-python-headless
  - fix homepage URLs to docs/cli-matrix/cli-matrix-plan.md; bump meta.updated to 2026-06-11

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(cli-hub): ship the unified Gallery design as the production homepage

  Replace docs/hub/index.html with the finalized "Gallery / R2 Flip" main page (Steel Sky palette default, Newsreader serif hero title, liquid-glass flip cards, JS masonry catalog, and the unified Matrices layer with bidirectional stitching). Production-indexable robots meta retained. Stop tracking docs/cli-matrix/* (the CLI-Matrix working docs stay local and confidential); add an explicit /docs/cli-matrix/ ignore rule.

* feat: CLI-Matrix command family + Hub docs/demos pages and responsive nav

  ---------

  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
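The lookup order described in the P1-3 commit (import name as-is, dash-to-underscore variant, then distribution metadata) can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: `package_available` is a hypothetical stand-in for `_package_available()`, and the choice to catch every exception is inferred from the phrase "all lookup failures degrade to unavailable".

```python
import importlib.metadata
import importlib.util
import re


def package_available(name: str) -> bool:
    """Return True if `name` resolves as an import name or an installed distribution.

    Hypothetical sketch of the three-step lookup; any failure means "unavailable".
    """
    # Steps 1 and 2: try the name as written, then with dashes turned into underscores.
    for candidate in dict.fromkeys([name, name.replace("-", "_")]):
        try:
            if importlib.util.find_spec(candidate) is not None:
                return True
        except Exception:
            # e.g. ModuleNotFoundError for a dotted name whose parent is missing
            continue

    # Step 3: look up the distribution under its PEP 503 normalized name.
    normalized = re.sub(r"[-_.]+", "-", name).lower()
    try:
        importlib.metadata.distribution(normalized)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print(package_available("json"))
    print(package_available("definitely-not-installed-pkg-xyz123"))
```

Trying the import name first keeps the common case cheap; the metadata lookup only runs when the registry string is a distribution name that differs from the importable module name.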
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 792aa264f6
    }

    comparison = None
    if baseline_path and Path(baseline_path).exists():
Do not silently ignore missing baselines
When a caller passes a baseline path that does not exist (for example a typo with --baseline ... --fail-on-regression), this guard leaves comparison as None, so the run succeeds and the regression gate is bypassed instead of surfacing the bad baseline path. Since load_baseline() already raises FileNotFoundError, the existence check should not suppress that error unless this is specifically the update-baseline creation path.
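One way to apply this suggestion is sketched below, as an illustration rather than the PR's actual fix. `load_baseline()` is named in the review; the body given here is only a stand-in that assumes a JSON baseline file. `resolve_baseline` and the `update_baseline` flag are hypothetical names for the "update-baseline creation path" the comment mentions.

```python
import json
from pathlib import Path
from typing import Optional


def load_baseline(path: str) -> dict:
    """Stand-in for the PR's load_baseline(); open() raises FileNotFoundError if missing."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def resolve_baseline(baseline_path: Optional[str], update_baseline: bool = False) -> Optional[dict]:
    """Return the baseline to compare against, or None when no comparison applies."""
    if not baseline_path:
        return None  # no baseline requested, so nothing to compare
    if update_baseline and not Path(baseline_path).exists():
        return None  # creation path: the baseline file is about to be written
    # Any other missing path is a caller error and should surface, not be skipped.
    return load_baseline(baseline_path)
```

With this shape, a mistyped `--baseline` path fails loudly instead of letting the regression gate pass with nothing to compare against.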
    def run(ctx):
        scene = create_scene(name="eval", resolution_x=320, resolution_y=240, samples=8)
        add_object(scene, mesh_type="cube", location=[0, 0, 0])
Add an active camera before rendering
When Blender is installed, this task builds a scene with only a cube; create_scene() starts with no cameras and the generated bpy script clears Blender's default scene before adding project cameras, so the render has no active camera and produces no PNG (or reports a Blender render error). The eval task should add an active camera (and ideally lighting) before calling render_scene() so environments with Blender don't fail this benchmark.
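A guard along these lines could make the missing-camera case explicit before rendering. This is a sketch under loud assumptions: the scene is modeled as a plain dict with a `cameras` list, and `has_active_camera` / `ensure_active_camera` are hypothetical helpers, not functions from the Blender harness, whose real scene structure and camera API may differ.

```python
def has_active_camera(scene: dict) -> bool:
    """True if any camera in the (assumed) scene dict is flagged active."""
    return any(cam.get("is_active") for cam in scene.get("cameras", []))


def ensure_active_camera(scene: dict, location=(4.0, -4.0, 3.0)) -> dict:
    """Add a default active camera when the scene has none (hypothetical helper)."""
    if not has_active_camera(scene):
        scene.setdefault("cameras", []).append(
            {"name": "eval_camera", "location": list(location), "is_active": True}
        )
    return scene


# Example: a scene holding only a cube gains a camera; a second call changes nothing.
scene = {"objects": [{"mesh_type": "cube", "location": [0, 0, 0]}], "cameras": []}
ensure_active_camera(scene)
ensure_active_camera(scene)
```

The same check could also serve as a `precheck`, so the task reports a clear reason instead of a Blender render error.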
@yuh-yang This PR is on the larger side (48 files) and touches shared/core infrastructure: it adds a new shared
Description
Generalizes the previously Audacity-only eval prototype into a shared, installable `cli-anything-eval` package: a cross-harness eval/benchmark framework. This implements the roadmap item "Benchmark suite for agent task completion rates."

The framework provides a convention-based task contract (`TASK` + `run` + optional `verify`/`precheck`/`requires`/`prompt`), a runner with `pass`/`fail`/`error`/`skipped` statuses, baseline regression detection, JSON + Markdown reports, a cross-harness leaderboard, and a CLI (`cli-anything-eval run|suite`). A `verify(ctx)` grader seam keeps tasks forward-compatible with future agent-driven grading while running deterministically today.

Adoption in this PR: Audacity is migrated onto the shared package via a backward-compatible shim; seed eval tasks are added for GIMP, Blender, and LibreOffice; a `/cli-anything:eval` plugin command and a HARNESS.md methodology section are added.

Type of Change
(Also migrates the Audacity harness and seeds three others onto the new framework.)
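As a rough illustration of the task contract and runner statuses described under Description, the sketch below models tasks as simple objects with `run` and optional `verify`/`requires` hooks. Several details are assumptions rather than the package's real API: `run_task` and `success_rate` are hypothetical names, `requires` is treated as a callable, the result is passed to `verify(ctx)` via `ctx.result`, and a task without `verify` passes whenever `run` does not raise. The rate excludes `skipped` from the denominator, as the PR states under Highlights.

```python
from types import SimpleNamespace


def run_task(task, ctx) -> str:
    """Run one task and return 'pass', 'fail', 'error', or 'skipped'."""
    requires = getattr(task, "requires", None)
    if requires is not None and not requires(ctx):
        return "skipped"  # unmet backend requirement, not counted as a failure
    try:
        ctx.result = task.run(ctx)
        verify = getattr(task, "verify", None)
        passed = True if verify is None else bool(verify(ctx))
    except Exception:
        return "error"  # the task crashed instead of producing a gradable result
    return "pass" if passed else "fail"


def success_rate(statuses):
    """passed / (pass + fail + error); skipped tasks are left out entirely."""
    counted = [s for s in statuses if s != "skipped"]
    if not counted:
        return None  # nothing ran, so no rate is reported
    return counted.count("pass") / len(counted)


def _boom(ctx):
    raise RuntimeError("backend crashed")


# Four toy tasks, one per status.
tasks = [
    SimpleNamespace(TASK={"name": "ok"}, run=lambda ctx: 4, verify=lambda ctx: ctx.result == 4),
    SimpleNamespace(TASK={"name": "wrong"}, run=lambda ctx: 5, verify=lambda ctx: ctx.result == 4),
    SimpleNamespace(TASK={"name": "crash"}, run=_boom),
    SimpleNamespace(TASK={"name": "no_backend"}, run=lambda ctx: 4, requires=lambda ctx: False),
]
statuses = [run_task(t, SimpleNamespace()) for t in tasks]
rate = success_rate(statuses)
```

Here one of the three counted tasks passes, so the rate is one third; the skipped task does not lower it.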
Highlights
- `cli-anything-eval/` exposing the `cli_anything.eval` namespace (PEP 420; depends only on `click`; imports nothing from any harness).
- `success_rate = passed / (pass + fail + error)`; `skipped` (unmet backend `requires`/`precheck`) is reported separately and excluded from the denominator. This is a deliberate, documented divergence from the test-suite "fail when backend missing" rule, so a benchmark's completion rate is not polluted by missing backends.
- `cli_anything.audacity.eval.runner` keeps its public API; the existing `cli-anything-audacity eval` command and `test_eval.py` are unchanged in behavior.
- `except ImportError` guard, so a harness still works if the eval package is absent.

For Existing CLI Modifications
- `test_eval.py` 3/3); the 4 Audacity eval tasks run via the shared runner.
- `core/render.py`.
- `registry.json`: not applicable (no version/description/requirement change to registry entries).

General Checklist
- `--json` flag is supported on the new commands
- `feat:`, `fix:`, `docs:`, `test:`)

Test Results
(`skipped` rows are backend-dependent tasks, `blender render_execute` and `libreoffice export_pdf`, correctly skipped when the binary is absent.)

Follow-ups (out of scope for this PR)
- `extras_require["eval"]` instead of a hard dependency (the registration is already `ImportError`-guarded).
- `prompt`/`verify` seam.
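The `ImportError`-guarded registration mentioned above might look roughly like the following. This is a generic sketch, not the harness code: `register_optional_command` and the plain-dict command table are hypothetical, and the real harnesses presumably register a `click` command rather than storing a module.

```python
import importlib


def register_optional_command(commands: dict, name: str, module_name: str) -> bool:
    """Register `name` only if `module_name` imports; otherwise leave `commands` untouched."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # optional dependency absent: the harness keeps working without it
    commands[name] = module
    return True


commands = {}
# "cli_anything.eval" is the namespace named in the PR; it may or may not be installed here.
eval_registered = register_optional_command(commands, "eval", "cli_anything.eval")
```

Because the guard already tolerates a missing package, moving the dependency into an optional extra would not require changing the registration itself.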