feat(agents): add Vally evaluation agents and prompts #1834
WilliamBerryiii wants to merge 14 commits
- add vally-test-author subagent and content-policy-citation agent
- add evals-import and vally-test-write prompts
✨ - Generated by Copilot
d4901bf to 4afaf12
The base branch was changed.
- align vally-test-author subagent with canonical template sections and JSON report path
- whitelist Vally Test Author in prompt-builder and allow nested subagent calls
- decouple skill paths and unify JSON output path in evals-import and vally-test-write prompts
- drop attribution suffix and set disable-model-invocation on content-policy-citation
🔧 - Generated by Copilot
Eval Execution
❌ Status: Failed, 3 spec(s) block merge
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Coverage Diff | main | #1834 | +/- |
|---|---|---|---|
| Coverage | 81.24% | 81.26% | +0.01% |
| Files | 127 | 127 | |
| Lines | 18831 | 18850 | +19 |
| Branches | 12 | 12 | |
| Hits | 15300 | 15319 | +19 |
| Misses | 3528 | 3528 | |
| Partials | 3 | 3 | |
🔧 - Generated by Copilot
- add Vally Test Author subagent with from-artifact and corpus-import modes
- add vally-test-write and evals-import commands
- enforce tos-violation safety lint coverage in vally-tests
- update collections, plugins, and evals/collections scripts
🧪 - Generated by Copilot
- capture error message in a variable for clarity
- conditionally call Write-CIAnnotation if the command exists
🔧 - Generated by Copilot
…tions
- move content-policy-citation from agent to shared instructions
- wire references across github backlog agents, prompts, and skills
- regenerate collections and plugins; add vally-test-author stimulus
✨ - Generated by Copilot
- tally totals per spec to avoid double-counting and add Specs column
- index only specs declaring a top-level stimuli key
- add regression tests for summary totals and stimulus indexing
🧪 - Generated by Copilot
# Conflicts:
#   .github/instructions/hve-core/prompt-builder.instructions.md
#   evals/behavior-conformance/skill-behavior.eval.yaml
- re-import CIHelpers in Prepare-Extension tests so Write-CIAnnotation resolves
- add deeplink/refus/stimul stems to cspell words
- add community-interaction instruction eval stimulus for coverage
🔧 - Generated by Copilot
- Prompt Evaluator
- Prompt Updater
- Researcher Subagent
- Vally Test Author
Should Prompt Builder have any additional instructions to guide prompt building to use the new Vally Test Author subagent? Also, do we want to add anything for this to the new .github/skills/rpi/prompt-builder skill(s)?
Thanks for the nudge. Prompt Builder now has an optional Vally Conformance Authoring phase that dispatches the Vally Test Author subagent (mode=from-artifact, kind=auto) once Phase 3 converges, then surfaces the routed eval file and appended-stimuli count. On the skill side we are still deciding placement; leaning toward a short pointer from the prompt-builder skill rather than duplicating dispatch logic, and tracking that as a follow-up in this PR.
Update: the skill side is wired up too now. The prompt-builder skill documents an optional Vally conformance authoring step and adds a Vally Test Author row to its orchestration dispatch matrix, and the agent Phase 4 now defers to that skill as the canonical orchestration source (by name, not by path). While here we also pushed the remaining skill-internal mechanics out of the Vally Test Author subagent: routing, corpus template, safety lint, refusal taxonomy, dedupe, and the JSON report shape are all owned by the vally-tests skill now, with the agent referencing the skill by name rather than reaching into its files. Markdown lint passes.
After the safety self-check passes, deduplicate against the target eval file before append:

1. Normalize the prompt text: trim leading and trailing whitespace, lowercase, then collapse all internal whitespace runs to a single space.
2. Compute the SHA-256 hash of the normalized text.
3. Compare the hash against the existing stimulus prompts in the target eval file (after applying the same normalization to each existing prompt).
4. Skip any stimulus whose hash matches an existing entry. Record the skipped hash and source row in the JSON report's `dedupe_results`.
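For readers following the thread, the four steps in this hunk can be sketched in a few lines of Python. This is only an illustration of the described protocol, not the helper scripts from the vally-tests skill: the function names, the `(source_row, prompt)` input shape, and the keys of the skipped records are all hypothetical.

```python
import hashlib


def normalize_prompt(text: str) -> str:
    """Trim, lowercase, then collapse internal whitespace runs to one space."""
    return " ".join(text.strip().lower().split())


def prompt_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized prompt text."""
    return hashlib.sha256(normalize_prompt(text).encode("utf-8")).hexdigest()


def dedupe(candidates, existing_prompts):
    """Split candidates into (kept, skipped) against the existing prompts.

    `candidates` is a list of (source_row, prompt) pairs. That shape and the
    keys of each skipped record are assumptions made for this sketch.
    """
    seen = {prompt_hash(p) for p in existing_prompts}
    kept, skipped = [], []
    for source_row, prompt in candidates:
        digest = prompt_hash(prompt)
        if digest in seen:
            # Would be recorded under the report's `dedupe_results`.
            skipped.append({"hash": digest, "source_row": source_row})
        else:
            kept.append(prompt)
            # Assumption beyond the four steps: also dedupe within the batch.
            seen.add(digest)
    return kept, skipped
```

Hashing the normalized text rather than the raw prompt means that prompts differing only in case or spacing are treated as the same stimulus.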
Do we need these instructions when the dedup protocol has been encoded into the skills that you've mentioned below?
Good catch. We removed them. The dedupe algorithm now lives in the vally-tests skill, so the subagent keeps only a one-line contract and defers to the skill instead of restating the steps.
3. Compare the hash against the existing stimulus prompts in the target eval file (after applying the same normalization to each existing prompt).
4. Skip any stimulus whose hash matches an existing entry. Record the skipped hash and source row in the JSON report's `dedupe_results`.

Helper scripts implement the normalization and hashing; delegate, do not re-implement:
Would it make sense to have instructions like:
Use the `vally-tests` skill to dedup the target eval file before appending.
I don't think we want agents to reach into skills to use their files; instead, that should be up to the skill and its instructions.
Agreed, and done. The dedupe section now delegates to the vally-tests skill (its Helper Script Index) with a one-line behavior contract, rather than hard-coding script paths or re-implementing the SHA-256 normalize-and-hash logic.
Before any write to disk, run the skill-local safety lint against the drafted stimulus YAML:

* PowerShell: `.github/skills/hve-core/vally-tests/scripts/Lint-VallyTestSafety.ps1 -Path <draft.yml>`
* Bash equivalent: `.github/skills/hve-core/vally-tests/scripts/lint-vally-test-safety.sh <draft.yml>`

Honor exit codes verbatim:

* Exit code 0: clean. Proceed to dedupe and append.
* Exit code 1: at least one refusal-taxonomy match. Refuse: do not write, emit the Refusal Template with the matched category substituted, and record the refusal in the JSON report.
* Exit code 2: ambiguous (multiple categories matched or pattern parse error). Pause: do not write, surface the matched candidates and stimulus location to the user for review, and record the ambiguous result in the JSON report's `blockers` array.
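The exit-code contract in this hunk amounts to a three-way dispatch. A minimal Python sketch of that mapping follows, assuming the lint script has already been run and its exit code captured. The `LintAction` enum, the function names, and the `refusals` report key are hypothetical; only the `blockers` array is named in the hunk.

```python
from enum import Enum


class LintAction(Enum):
    PROCEED = "proceed"  # exit 0: clean, continue to dedupe and append
    REFUSE = "refuse"    # exit 1: refusal-taxonomy match, do not write
    PAUSE = "pause"      # exit 2: ambiguous, surface to the user for review


def action_for_exit_code(code: int) -> LintAction:
    """Map a safety-lint exit code to the action the contract describes."""
    mapping = {0: LintAction.PROCEED, 1: LintAction.REFUSE, 2: LintAction.PAUSE}
    if code not in mapping:
        # The hunk defines only 0, 1, and 2; failing loudly on anything else
        # is a choice made for this sketch.
        raise ValueError(f"unexpected lint exit code: {code}")
    return mapping[code]


def record_result(report: dict, action: LintAction, detail: str) -> bool:
    """Record the outcome in the report; return True only if a write may proceed.

    `refusals` is a hypothetical key; `blockers` follows the hunk's wording.
    """
    if action is LintAction.REFUSE:
        report.setdefault("refusals", []).append(detail)
    elif action is LintAction.PAUSE:
        report.setdefault("blockers", []).append(detail)
    return action is LintAction.PROCEED
```

Treating undefined exit codes as errors, rather than as a pass, keeps an unexpected lint failure from turning into a write to disk.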
As mentioned earlier, would it make sense for these types of instructions to just live in the vally-tests skill, and for the agent to instruct the model to use the vally-tests skill to complete safety self-checks?
Makes sense, and applied. The safety self-check now defers to the vally-tests skill: the subagent points at the skill's Safety Refusal Taxonomy and Helper Script Index for the lint scripts and exit-code contract, rather than restating the script paths and codes inline.
- wire Vally Test Author into prompt-builder skill and dispatch matrix
- move routing, safety lint, dedupe, and report ownership into vally-tests skill
- split eval failures into gating vs advisory and surface them in the PR comment
✅ - Generated by Copilot
📐 - Generated by Copilot
…re absent
- reconcile unattributed failures by the spec's overall advisory posture
- guard the exit-code fallback so all-advisory specs never block merge
- parse quoted advisory tag values so a "false" graduates correctly
- add stub fail-noname mode and a Pester case for the empty perStimulus path
🐛 - Generated by Copilot
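One way to read the gating-versus-advisory behavior in this commit is sketched below in Python. The repository's actual data model is not shown in this conversation, so every name and shape here is an assumption: the spec dictionary, its `advisory` and `failures` keys, and the two functions are hypothetical.

```python
def is_advisory(tag_value) -> bool:
    """Parse an advisory tag value, tolerating surrounding quotes.

    A quoted '"false"' parses as False, so the spec is treated as gating.
    """
    return str(tag_value).strip().strip("\"'").strip().lower() == "true"


def spec_blocks_merge(spec: dict, exit_code: int) -> bool:
    """Decide whether one spec's result should block merge.

    Hypothetical spec shape:
      {"advisory": <tag value>, "failures": [{"advisory": <tag value or None>}]}
    """
    spec_advisory = is_advisory(spec.get("advisory", "false"))
    failures = spec.get("failures", [])
    if not failures:
        # Exit-code fallback: no per-stimulus failures were reported, so a
        # non-zero exit blocks only when the spec itself is gating.
        return exit_code != 0 and not spec_advisory
    for failure in failures:
        tag = failure.get("advisory")
        # Unattributed failure: fall back to the spec's overall posture.
        advisory = spec_advisory if tag is None else is_advisory(tag)
        if not advisory:
            return True
    return False
```

Under these assumptions, an all-advisory spec never blocks merge, whether its failures are attributed, unattributed, or visible only through the exit code.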
Pull Request
Description
Added the Vally-facing AI artifacts plus the runner and collection-tooling support behind them.
- […] `tos-violation`).
- […] `-Tag` filtering for `Invoke-VallySpec`.
- […] `Get-CollectionMaturityVocabulary`, `Get-CollectionMaturityRank`, and `Resolve-StrictSafeMaturity`.
Related Issue(s)
Closes #1819
Type of Change
Select all that apply:
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
- […] `prompt-builder` agent and addressed all feedback
- […] (`.github/instructions/*.instructions.md`)
- […] (`.github/prompts/*.prompt.md`)
- […] (`.github/agents/*.agent.md`)
- […] (`.github/skills/*/SKILL.md`)
Other:
- […] (`.ps1`, `.sh`, `.py`)
Sample Prompts (for AI Artifact Contributions)
User Request: "Write Vally tests for the prompt-builder agent."
Execution Flow: The vally-test-write.prompt.md entry point dispatches the vally-test-author subagent, which loads the vally-tests skill, scaffolds stimuli and expectations, selects graders, and runs the safety linter.
Output Artifacts: Stimulus and expectation YAML files under the relevant eval corpus.
Success Indicators: Generated specs pass `Test-EvalSpec.ps1` and the safety linter reports no findings.
Testing
Validated via `npm run lint:all` (exit 0), including `npm run lint:frontmatter`, `npm run lint:ps`, and `npm run lint:ai-artifacts`. PowerShell coverage added and passing via `npm run test:ps`:
- […] `tos-violation` refusal category coverage.
Checklist
Required Checks
AI Artifact Contributions
- […] `/prompt-analyze` to review contribution
- […] `prompt-builder` review
Required Automated Checks
- `npm run lint:md`
- `npm run spell-check`
- `npm run lint:frontmatter`
- `npm run validate:skills`
- `npm run lint:md-links`
- `npm run lint:ps`
- `npm run plugin:generate`
- `npm run docs:test`
Security Considerations
Additional Notes
Seventh PR in the #1637 stack. Base branch: `feat/1637-l5-corpora`. New agent, subagent, and prompts at `stable` maturity. Collection registration and plugin regeneration land in a later PR in this stack (#1821).