feat(agents): add Vally evaluation agents and prompts #1834

Open
WilliamBerryiii wants to merge 14 commits into main from feat/1637-l6-agents
WilliamBerryiii commented Jun 2, 2026

Member

Pull Request

Description

Added the Vally-facing AI artifacts plus the runner and collection-tooling support behind them.

  • The content-policy-citation agent (content-policy-citation.agent.md) provides citation discretion rules for the CI agentic PR-review workflow.
  • The vally-test-author subagent (vally-test-author.agent.md) drives test authoring against the vally-tests skill, supporting from-artifact and corpus-import modes with a seven-category refusal taxonomy (including tos-violation).
  • Two companion prompts, evals-import.prompt.md and vally-test-write.prompt.md, expose import and authoring entry points.
  • The Vally runner (VallyRunner.psm1, Invoke-VallyEvals.ps1) gains tag-aware run plans, spec backlink counting, and -Tag filtering for Invoke-VallySpec.
  • Collection tooling (CollectionHelpers.psm1, Validate-Collections.ps1) gains strict-safe maturity propagation via Get-CollectionMaturityVocabulary, Get-CollectionMaturityRank, and Resolve-StrictSafeMaturity.

Related Issue(s)

Closes #1819

Type of Change

Select all that apply:

Code & Documentation:

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update

Infrastructure & Configuration:

  • GitHub Actions workflow
  • Linting configuration (markdown, PowerShell, etc.)
  • Security configuration
  • DevContainer configuration
  • Dependency update

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot instructions (.github/instructions/*.instructions.md)
  • Copilot prompt (.github/prompts/*.prompt.md)
  • Copilot agent (.github/agents/*.agent.md)
  • Copilot skill (.github/skills/*/SKILL.md)

Other:

  • Script/automation (.ps1, .sh, .py)
  • Other (please describe):

Sample Prompts (for AI Artifact Contributions)

User Request: "Write Vally tests for the prompt-builder agent."

Execution Flow: The vally-test-write.prompt.md entry point dispatches the vally-test-author subagent, which loads the vally-tests skill, scaffolds stimuli and expectations, selects graders, and runs the safety linter.

Output Artifacts: Stimulus and expectation YAML files under the relevant eval corpus.

Success Indicators: Generated specs pass Test-EvalSpec.ps1 and the safety linter reports no findings.

Testing

Validated via npm run lint:all (exit 0), including npm run lint:frontmatter, npm run lint:ps, and npm run lint:ai-artifacts. PowerShell coverage added and passing via npm run test:ps:

  • CollectionHelpers.Tests.ps1: strict-safe maturity vocabulary, rank, and resolution (+267 lines).
  • Invoke-VallyEvals.Tests.ps1: tag-aware run-plan and backlink behavior (+187 lines), with a new stub-vally.ps1 fixture.
  • Lint-VallyTestSafety.Tests.ps1: tos-violation refusal category coverage.

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate
  • Docusaurus tests: npm run docs:test

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

Additional Notes

Seventh PR in the #1637 stack. Base branch: feat/1637-l5-corpora. New agent, subagent, and prompts at stable maturity. Collection registration and plugin regeneration land in a later PR in this stack (#1821).

@WilliamBerryiii WilliamBerryiii requested a review from a team as a code owner June 2, 2026 03:43
rezatnoMsirhC previously approved these changes Jun 8, 2026
Comment thread .github/agents/hve-core/subagents/vally-test-author.agent.md Outdated
Comment thread .github/prompts/hve-core/evals-import.prompt.md Outdated
Comment thread .github/agents/content-policy-citation.agent.md Outdated
Comment thread .github/agents/hve-core/subagents/vally-test-author.agent.md Outdated
Comment thread .github/agents/hve-core/subagents/vally-test-author.agent.md Outdated
Comment thread .github/agents/content-policy-citation.agent.md Outdated
Comment thread .github/prompts/hve-core/evals-import.prompt.md Outdated
Comment thread .github/prompts/hve-core/vally-test-write.prompt.md Outdated
agreaves-ms previously approved these changes Jun 9, 2026
- add vally-test-author subagent and content-policy-citation agent
- add evals-import and vally-test-write prompts

✨ - Generated by Copilot
@WilliamBerryiii WilliamBerryiii changed the base branch from feat/1637-l5-corpora to main June 23, 2026 02:17
@WilliamBerryiii WilliamBerryiii dismissed stale reviews from agreaves-ms and rezatnoMsirhC June 23, 2026 02:17

The base branch was changed.

- align vally-test-author subagent with canonical template sections and JSON report path
- whitelist Vally Test Author in prompt-builder and allow nested subagent calls
- decouple skill paths and unify JSON output path in evals-import and vally-test-write prompts
- drop attribution suffix and set disable-model-invocation on content-policy-citation

🔧 - Generated by Copilot
github-actions Bot commented Jun 23, 2026

Contributor

Eval Execution

Status: Failed — 3 spec(s) block merge

  • Artifacts evaluated: 14
  • Specs run: 14
  • Assertions passed: 38
  • Assertions failed (blocking): 3
  • Assertions failed (advisory): 27
  • Failed specs (merge-blocking): 3
| Artifact | Kind | Status | Specs | Passed | Failed (blocking) | Failed (advisory) |
| --- | --- | --- | --- | --- | --- | --- |
| github-backlog-manager | agent | ❌ fail | 1 | 3 | 1 | 0 |
| prompt-builder | agent | ❌ fail | 1 | 3 | 1 | 0 |
| vally-test-author | agent | ❌ fail | 1 | 3 | 1 | 3 |
| community-interaction | instruction | ⚠️ advisory-fail | 1 | 0 | 0 | 4 |
| github-backlog-planning | instruction | ⚠️ advisory-fail | 1 | 2 | 0 | 2 |
| github-backlog-update | instruction | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| prompt-builder | instruction | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| pull-request | instruction | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| content-policy-citation | instruction | ⚠️ advisory-fail | 1 | 0 | 0 | 7 |
| evals-import | prompt | ⚠️ advisory-fail | 1 | 2 | 0 | 2 |
| pull-request | prompt | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| vally-test-write | prompt | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| prompt-builder | skill | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| vally-tests | skill | ⚠️ advisory-fail | 1 | 7 | 0 | 3 |

Legend — ✅ clean · ⚠️ advisory failures only (non-blocking) · ⏭️ skipped · ❌ merge-blocking failure

Only the Failed specs (merge-blocking) count gates this PR. Advisory assertion failures are signal-quality checks captured during iteration; review them, but they do not block merge and may be acceptable.
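
The gating rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the runner's actual PowerShell: the function names `classify_artifact` and `blocks_merge` and the status strings are assumptions chosen to mirror the table and legend, and the skipped state is not modeled.

```python
from __future__ import annotations


def classify_artifact(failed_blocking: int, failed_advisory: int) -> str:
    """Return a status label mirroring the legend (labels are illustrative)."""
    if failed_blocking > 0:
        return "fail"           # merge-blocking failure
    if failed_advisory > 0:
        return "advisory-fail"  # advisory failures only, non-blocking
    return "clean"


def blocks_merge(rows: list[tuple[str, int, int]]) -> bool:
    """True when any artifact has at least one blocking failure."""
    return any(classify_artifact(blocking, advisory) == "fail"
               for _name, blocking, advisory in rows)


# Rows copied from the table above: (artifact, failed blocking, failed advisory).
rows = [
    ("vally-test-author", 1, 3),
    ("content-policy-citation", 0, 7),
    ("vally-tests", 0, 3),
]
statuses = {name: classify_artifact(b, a) for name, b, a in rows}
```

Under this reading, an artifact with seven advisory failures and no blocking failures still does not gate the PR, while a single blocking failure does.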

codecov-commenter commented Jun 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.26%. Comparing base (44b42d4) to head (6367c6f).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1834      +/-   ##
==========================================
+ Coverage   81.24%   81.26%   +0.01%     
==========================================
  Files         127      127              
  Lines       18831    18850      +19     
  Branches       12       12              
==========================================
+ Hits        15300    15319      +19     
  Misses       3528     3528              
  Partials        3        3              
| Flag | Coverage | Δ |
| --- | --- | --- |
| docusaurus | 61.84% <ø> | (ø) |
| pester | 86.09% <100.00%> | (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage | Δ |
| --- | --- | --- |
| scripts/collections/Modules/CollectionHelpers.psm1 | 99.52% <100.00%> | (+0.56%) ⬆️ |
| scripts/collections/Validate-Collections.ps1 | 93.88% <100.00%> | (ø) |

... and 1 file with indirect coverage changes


WilliamBerryiii and others added 9 commits June 23, 2026 09:59
- add Vally Test Author subagent with from-artifact and corpus-import modes
- add vally-test-write and evals-import commands
- enforce tos-violation safety lint coverage in vally-tests
- update collections, plugins, and evals/collections scripts

🧪 - Generated by Copilot
- capture error message in a variable for clarity
- conditionally call Write-CIAnnotation if the command exists

🔧 - Generated by Copilot
…tions

- move content-policy-citation from agent to shared instructions
- wire references across github backlog agents, prompts, and skills
- regenerate collections and plugins; add vally-test-author stimulus

✨ - Generated by Copilot
- tally totals per spec to avoid double-counting and add Specs column
- index only specs declaring a top-level stimuli key
- add regression tests for summary totals and stimulus indexing

🧪 - Generated by Copilot
# Conflicts:
#	.github/instructions/hve-core/prompt-builder.instructions.md
#	evals/behavior-conformance/skill-behavior.eval.yaml
- re-import CIHelpers in Prepare-Extension tests so Write-CIAnnotation resolves
- add deeplink/refus/stimul stems to cspell words
- add community-interaction instruction eval stimulus for coverage

🔧 - Generated by Copilot
Comment thread .github/instructions/hve-core/prompt-builder.instructions.md
- Prompt Evaluator
- Prompt Updater
- Researcher Subagent
- Vally Test Author

Collaborator

Should Prompt Builder have any additional instructions to guide prompt building to use the new Vally Test Author subagent? Also, do we want to add anything for this to the new .github/skills/rpi/prompt-builder skill(s)?

Member Author

Thanks for the nudge. Prompt Builder now has an optional Vally Conformance Authoring phase that dispatches the Vally Test Author subagent (mode=from-artifact, kind=auto) once Phase 3 converges, then surfaces the routed eval file and appended-stimuli count. On the skill side we are still deciding placement; leaning toward a short pointer from the prompt-builder skill rather than duplicating dispatch logic, and tracking that as a follow-up in this PR.

Member Author

Update: the skill side is wired up too now. The prompt-builder skill documents an optional Vally conformance authoring step and adds a Vally Test Author row to its orchestration dispatch matrix, and the agent Phase 4 now defers to that skill as the canonical orchestration source (by name, not by path). While here we also pushed the remaining skill-internal mechanics out of the Vally Test Author subagent: routing, corpus template, safety lint, refusal taxonomy, dedupe, and the JSON report shape are all owned by the vally-tests skill now, with the agent referencing the skill by name rather than reaching into its files. Markdown lint passes.

Comment on lines +117 to +123
After the safety self-check passes, deduplicate against the target eval file before append:

1. Normalize the prompt text: trim leading and trailing whitespace, lowercase, then collapse all internal whitespace runs to a single space.
2. Compute the SHA-256 hash of the normalized text.
3. Compare the hash against the existing stimulus prompts in the target eval file (after applying the same normalization to each existing prompt).
4. Skip any stimulus whose hash matches an existing entry. Record the skipped hash and source row in the JSON report's `dedupe_results`.
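
The four steps quoted above can be sketched as follows. This is a minimal Python illustration, not the skill's helper scripts: the function names, the UTF-8 encoding choice, and the `hash` and `source_row` field names are assumptions; only `dedupe_results` and the normalize-then-SHA-256 order come from the quoted text.

```python
from __future__ import annotations

import hashlib
import re


def normalize_prompt(text: str) -> str:
    """Trim, lowercase, then collapse internal whitespace runs to one space."""
    return re.sub(r"\s+", " ", text.strip().lower())


def prompt_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized prompt (UTF-8 is an assumption)."""
    return hashlib.sha256(normalize_prompt(text).encode("utf-8")).hexdigest()


def dedupe(candidates: list[str], existing: list[str]) -> tuple[list[str], list[dict]]:
    """Return (prompts to append, skipped entries for dedupe_results)."""
    # Existing prompts get the same normalization before hashing.
    seen = {prompt_hash(p) for p in existing}
    kept: list[str] = []
    skipped: list[dict] = []
    for row, prompt in enumerate(candidates):
        digest = prompt_hash(prompt)
        if digest in seen:
            skipped.append({"hash": digest, "source_row": row})  # hypothetical field names
            continue
        # Assumption beyond the quoted steps: also skip repeats within the same batch.
        seen.add(digest)
        kept.append(prompt)
    return kept, skipped


kept, skipped = dedupe(["  Hello   World ", "New prompt"], ["hello world"])
```

Here the first candidate differs from the existing entry only in case and spacing, so it hashes to the same digest and is skipped.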

Collaborator

Do we need these instructions when the dedup protocol has been encoded into the skills that you've mentioned below?

Member Author

Good catch. We removed them. The dedupe algorithm now lives in the vally-tests skill, so the subagent keeps only a one-line contract and defers to the skill instead of restating the steps.

3. Compare the hash against the existing stimulus prompts in the target eval file (after applying the same normalization to each existing prompt).
4. Skip any stimulus whose hash matches an existing entry. Record the skipped hash and source row in the JSON report's `dedupe_results`.

Helper scripts implement the normalization and hashing — delegate, do not re-implement:

Collaborator

Would it make sense to have instructions like:

Use the `vally-tests` skill to dedup the target eval file before appending.

I don't think we want agents to reach into skills to use their files, instead that should be up to the skill and its instructions.

Member Author

Agreed, and done. The dedupe section now delegates to the vally-tests skill (its Helper Script Index) with a one-line behavior contract, rather than hard-coding script paths or re-implementing the SHA-256 normalize-and-hash logic.

Comment on lines +80 to +89
Before any write to disk, run the skill-local safety lint against the drafted stimulus YAML:

* PowerShell: `.github/skills/hve-core/vally-tests/scripts/Lint-VallyTestSafety.ps1 -Path <draft.yml>`
* Bash equivalent: `.github/skills/hve-core/vally-tests/scripts/lint-vally-test-safety.sh <draft.yml>`

Honor exit codes verbatim:

* Exit code 0 — clean. Proceed to dedupe and append.
* Exit code 1 — at least one refusal-taxonomy match. Refuse: do not write, emit the Refusal Template with the matched category substituted, and record the refusal in the JSON report.
* Exit code 2 — ambiguous (multiple categories matched or pattern parse error). Pause: do not write, surface the matched candidates and stimulus location to the user for review, and record the ambiguous result in the JSON report's `blockers` array.
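
The exit-code contract quoted above amounts to a three-way dispatch. The sketch below is a hypothetical Python rendering for illustration: `LintAction`, `action_for_exit_code`, `record_lint_result`, and the `refusals` report key are assumed names, while `blockers` is the key named in the quoted text. Raising on any other exit code is also an assumption, since the text defines only 0, 1, and 2.

```python
from __future__ import annotations

from enum import Enum


class LintAction(Enum):
    PROCEED = "proceed"  # exit code 0: clean, continue to dedupe and append
    REFUSE = "refuse"    # exit code 1: refusal-taxonomy match, do not write
    PAUSE = "pause"      # exit code 2: ambiguous, hold for user review


def action_for_exit_code(code: int) -> LintAction:
    actions = {0: LintAction.PROCEED, 1: LintAction.REFUSE, 2: LintAction.PAUSE}
    if code not in actions:
        # Assumption: fail loudly on codes the quoted contract does not define.
        raise ValueError(f"unexpected lint exit code: {code}")
    return actions[code]


def record_lint_result(code: int, detail: str, report: dict) -> bool:
    """Update the report and return True only when the draft may be written."""
    action = action_for_exit_code(code)
    if action is LintAction.REFUSE:
        report.setdefault("refusals", []).append(detail)  # hypothetical key
    elif action is LintAction.PAUSE:
        report.setdefault("blockers", []).append(detail)  # key named in the quoted text
    return action is LintAction.PROCEED
```

Keeping the mapping in one place means the write step only needs the boolean result, and a refusal or pause always leaves a trace in the report.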

Collaborator

As mentioned earlier, would it make sense for these types of instructions to just live in the vally-tests skill, and for the agent to instruct the model to use the vally-tests skill to complete safety self-checks?

Member Author

Makes sense, and applied. The safety self-check now defers to the vally-tests skill: the subagent points at the skill Safety Refusal Taxonomy and Helper Script Index for the lint scripts and exit-code contract, rather than restating the script paths and codes inline.

- wire Vally Test Author into prompt-builder skill and dispatch matrix

- move routing, safety lint, dedupe, and report ownership into vally-tests skill

- split eval failures into gating vs advisory and surface them in the PR comment

✅ - Generated by Copilot
…re absent

- reconcile unattributed failures by the spec's overall advisory posture

- guard the exit-code fallback so all-advisory specs never block merge

- parse quoted advisory tag values so a "false" graduates correctly

- add stub fail-noname mode and a Pester case for the empty perStimulus path

🐛 - Generated by Copilot
Development

Successfully merging this pull request may close these issues.

Add Vally evaluation agents and prompts

4 participants