feat(agents): add Vally evaluation agents and prompts by WilliamBerryiii · Pull Request #1834 · microsoft/hve-core

WilliamBerryiii · 2026-06-02T03:43:11Z

Pull Request

Description

Added the Vally-facing AI artifacts plus the runner and collection-tooling support behind them.

The content-policy-citation agent (content-policy-citation.agent.md) provides citation discretion rules for the CI agentic PR-review workflow.
The vally-test-author subagent (vally-test-author.agent.md) drives test authoring against the vally-tests skill, supporting from-artifact and corpus-import modes with a seven-category refusal taxonomy (including tos-violation).
Two companion prompts, evals-import.prompt.md and vally-test-write.prompt.md, expose import and authoring entry points.
The Vally runner (VallyRunner.psm1, Invoke-VallyEvals.ps1) gains tag-aware run plans, spec backlink counting, and -Tag filtering for Invoke-VallySpec.
Collection tooling (CollectionHelpers.psm1, Validate-Collections.ps1) gains strict-safe maturity propagation via Get-CollectionMaturityVocabulary, Get-CollectionMaturityRank, and Resolve-StrictSafeMaturity.

Related Issue(s)

Closes #1819

Type of Change

Select all that apply:

Code & Documentation:

Bug fix (non-breaking change fixing an issue)
New feature (non-breaking change adding functionality)
Breaking change (fix or feature causing existing functionality to change)
Documentation update

Infrastructure & Configuration:

AI Artifacts:

Reviewed contribution with prompt-builder agent and addressed all feedback
Copilot instructions (.github/instructions/*.instructions.md)
Copilot prompt (.github/prompts/*.prompt.md)
Copilot agent (.github/agents/*.agent.md)
Copilot skill (.github/skills/*/SKILL.md)

Other:

Script/automation (.ps1, .sh, .py)
Other (please describe):

Sample Prompts (for AI Artifact Contributions)

User Request: "Write Vally tests for the prompt-builder agent."

Execution Flow: The vally-test-write.prompt.md entry point dispatches the vally-test-author subagent, which loads the vally-tests skill, scaffolds stimuli and expectations, selects graders, and runs the safety linter.

Output Artifacts: Stimulus and expectation YAML files under the relevant eval corpus.

Success Indicators: Generated specs pass Test-EvalSpec.ps1 and the safety linter reports no findings.

Testing

Validated via npm run lint:all (exit 0), including npm run lint:frontmatter, npm run lint:ps, and npm run lint:ai-artifacts. PowerShell coverage added and passing via npm run test:ps:

CollectionHelpers.Tests.ps1 — strict-safe maturity vocabulary, rank, and resolution (+267 lines).
Invoke-VallyEvals.Tests.ps1 — tag-aware run-plan and backlink behavior (+187 lines), with a new stub-vally.ps1 fixture.
Lint-VallyTestSafety.Tests.ps1 — tos-violation refusal category coverage.

Checklist

Required Checks

Documentation is updated (if applicable)
Files follow existing naming conventions
Changes are backwards compatible (if applicable)
Tests added for new functionality (if applicable)

AI Artifact Contributions

Used /prompt-analyze to review contribution
Addressed all feedback from prompt-builder review
Verified contribution follows common standards and type-specific requirements

Required Automated Checks

Markdown linting: npm run lint:md
Spell checking: npm run spell-check
Frontmatter validation: npm run lint:frontmatter
Skill structure validation: npm run validate:skills
Link validation: npm run lint:md-links
PowerShell analysis: npm run lint:ps
Plugin freshness: npm run plugin:generate
Docusaurus tests: npm run docs:test

Security Considerations

This PR does not contain any sensitive or NDA information
Any new dependencies have been reviewed for security issues
Security-related scripts follow the principle of least privilege

Additional Notes

Seventh PR in the #1637 stack. Base branch: feat/1637-l5-corpora. New agent, subagent, and prompts at stable maturity. Collection registration and plugin regeneration land in a later PR in this stack (#1821).

- align vally-test-author subagent with canonical template sections and JSON report path
- whitelist Vally Test Author in prompt-builder and allow nested subagent calls
- decouple skill paths and unify JSON output path in evals-import and vally-test-write prompts
- drop attribution suffix and set disable-model-invocation on content-policy-citation

🔧 - Generated by Copilot

github-actions · 2026-06-23T05:17:37Z

Eval Execution

❌ Status: Failed — 3 spec(s) block merge

Artifacts evaluated: 14
Specs run: 14
Assertions passed: 38
Assertions failed (blocking): 3
Assertions failed (advisory): 27
Failed specs (merge-blocking): 3

| Artifact | Kind | Status | Specs | Passed | Failed (blocking) | Failed (advisory) |
| --- | --- | --- | --- | --- | --- | --- |
| `github-backlog-manager` | agent | ❌ fail | 1 | 3 | 1 | 0 |
| `prompt-builder` | agent | ❌ fail | 1 | 3 | 1 | 0 |
| `vally-test-author` | agent | ❌ fail | 1 | 3 | 1 | 3 |
| `community-interaction` | instruction | ⚠️ advisory-fail | 1 | 0 | 0 | 4 |
| `github-backlog-planning` | instruction | ⚠️ advisory-fail | 1 | 2 | 0 | 2 |
| `github-backlog-update` | instruction | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| `prompt-builder` | instruction | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| `pull-request` | instruction | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| `content-policy-citation` | instruction | ⚠️ advisory-fail | 1 | 0 | 0 | 7 |
| `evals-import` | prompt | ⚠️ advisory-fail | 1 | 2 | 0 | 2 |
| `pull-request` | prompt | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| `vally-test-write` | prompt | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| `prompt-builder` | skill | ⚠️ advisory-fail | 1 | 3 | 0 | 1 |
| `vally-tests` | skill | ⚠️ advisory-fail | 1 | 7 | 0 | 3 |

Legend — ✅ clean · ⚠️ advisory failures only (non-blocking) · ⏭️ skipped · ❌ merge-blocking failure

Only Failed specs (merge-blocking) gates this PR. Advisory assertion failures are signal-quality checks captured during iteration; review them, but they do not block merge and may be acceptable.
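
The gating rule described here can be illustrated with a small sketch. It is written in Python purely for illustration (the repository's runner is PowerShell), and `SpecResult`, `status_for`, and `summarize` are hypothetical names, not functions from VallyRunner.psm1. It assumes a spec blocks merge whenever it has at least one blocking assertion failure and that advisory failures alone never block; the three sample rows are copied from the summary table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecResult:
    """One row of the summary table (hypothetical shape, for illustration)."""
    artifact: str
    kind: str
    passed: int
    failed_blocking: int
    failed_advisory: int


def status_for(spec: SpecResult) -> str:
    """Classify a spec: blocking failures take precedence over advisory ones."""
    if spec.failed_blocking > 0:
        return "fail"
    if spec.failed_advisory > 0:
        return "advisory-fail"
    return "clean"


def summarize(specs: list[SpecResult]) -> dict:
    """Tally totals once per spec so that no assertion is counted twice."""
    statuses = [status_for(s) for s in specs]
    return {
        "specs_run": len(specs),
        "passed": sum(s.passed for s in specs),
        "failed_blocking": sum(s.failed_blocking for s in specs),
        "failed_advisory": sum(s.failed_advisory for s in specs),
        "blocking_specs": statuses.count("fail"),
        "merge_blocked": "fail" in statuses,
    }


# Three rows taken from the summary table.
rows = [
    SpecResult("vally-test-author", "agent", 3, 1, 3),
    SpecResult("content-policy-citation", "instruction", 0, 0, 7),
    SpecResult("vally-tests", "skill", 7, 0, 3),
]
summary = summarize(rows)
print([status_for(r) for r in rows])
print(summary["merge_blocked"])
```

With these rows, only `vally-test-author` is classified as `fail`, so the run is merge-blocked even though the other two specs carry ten advisory failures between them.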

codecov-commenter · 2026-06-23T05:20:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.26%. Comparing base (44b42d4) to head (6367c6f).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1834      +/-   ##
==========================================
+ Coverage   81.24%   81.26%   +0.01%     
==========================================
  Files         127      127              
  Lines       18831    18850      +19     
  Branches       12       12              
==========================================
+ Hits        15300    15319      +19     
  Misses       3528     3528              
  Partials        3        3

| Flag | Coverage Δ | |
| --- | --- | --- |
| docusaurus | `61.84% <ø> (ø)` | |
| pester | `86.09% <100.00%> (+0.03%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| scripts/collections/Modules/CollectionHelpers.psm1 | `99.52% <100.00%> (+0.56%)` | ⬆️ |
| scripts/collections/Validate-Collections.ps1 | `93.88% <100.00%> (ø)` | |

... and 1 file with indirect coverage changes
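
The headline percentages follow from the Hits and Lines rows of the coverage diff. The sketch below (Python, for illustration only) recomputes them. That the displayed figures are truncated rather than rounded to two decimals is an assumption inferred from the numbers in this report, not something the report states, and `truncate_pct` is a hypothetical helper.

```python
from fractions import Fraction


def truncate_pct(value: Fraction) -> str:
    """Format a ratio as a percentage truncated (not rounded) to 2 decimals."""
    hundredths = int(value * 10_000)  # int() truncates toward zero
    sign = "-" if hundredths < 0 else ""
    hundredths = abs(hundredths)
    return f"{sign}{hundredths // 100}.{hundredths % 100:02d}%"


base = Fraction(15300, 18831)  # hits / lines on base (44b42d4)
head = Fraction(15319, 18850)  # hits / lines on head (6367c6f)

print(truncate_pct(base))         # 81.24%
print(truncate_pct(head))         # 81.26%
print(truncate_pct(head - base))  # 0.01%
```

The delta is taken from the exact ratios before truncation; subtracting the two displayed percentages would give 0.02% instead of the reported +0.01%.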

- add Vally Test Author subagent with from-artifact and corpus-import modes
- add vally-test-write and evals-import commands
- enforce tos-violation safety lint coverage in vally-tests
- update collections, plugins, and evals/collections scripts

🧪 - Generated by Copilot

…tions

- move content-policy-citation from agent to shared instructions
- wire references across github backlog agents, prompts, and skills
- regenerate collections and plugins; add vally-test-author stimulus

✨ - Generated by Copilot

agreaves-ms · 2026-06-25T22:41:51Z

  - Prompt Evaluator
  - Prompt Updater
  - Researcher Subagent
+  - Vally Test Author


Should Prompt Builder have any additional instructions to guide prompt building to use the new Vally Test Author subagent? Also, do we want to add anything for this to the new .github/skills/rpi/prompt-builder skill(s)?

Thanks for the nudge. Prompt Builder now has an optional Vally Conformance Authoring phase that dispatches the Vally Test Author subagent (mode=from-artifact, kind=auto) once Phase 3 converges, then surfaces the routed eval file and appended-stimuli count. On the skill side we are still deciding placement; leaning toward a short pointer from the prompt-builder skill rather than duplicating dispatch logic, and tracking that as a follow-up in this PR.

Update: the skill side is wired up too now. The prompt-builder skill documents an optional Vally conformance authoring step and adds a Vally Test Author row to its orchestration dispatch matrix, and the agent Phase 4 now defers to that skill as the canonical orchestration source (by name, not by path). While here we also pushed the remaining skill-internal mechanics out of the Vally Test Author subagent: routing, corpus template, safety lint, refusal taxonomy, dedupe, and the JSON report shape are all owned by the vally-tests skill now, with the agent referencing the skill by name rather than reaching into its files. Markdown lint passes.

agreaves-ms · 2026-06-25T22:43:31Z

+After the safety self-check passes, deduplicate against the target eval file before append:
+
+1. Normalize the prompt text: trim leading and trailing whitespace, lowercase, then collapse all internal whitespace runs to a single space.
+2. Compute the SHA-256 hash of the normalized text.
+3. Compare the hash against the existing stimulus prompts in the target eval file (after applying the same normalization to each existing prompt).
+4. Skip any stimulus whose hash matches an existing entry. Record the skipped hash and source row in the JSON report's `dedupe_results`.
+


Do we need these instructions when the dedup protocol has been encoded into the skills that you've mentioned below?

Good catch. We removed them. The dedupe algorithm now lives in the vally-tests skill, so the subagent keeps only a one-line contract and defers to the skill instead of restating the steps.
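
The normalize-and-hash steps quoted in the hunk above can be sketched as follows. This is an illustrative Python version, not the helper scripts owned by the vally-tests skill. The function names, the UTF-8 encoding, and the `hash` and `source_row` keys are assumptions; the hunk only says that the skipped hash and source row are recorded under `dedupe_results`.

```python
import hashlib


def normalize_prompt(text: str) -> str:
    """Trim, lowercase, then collapse internal whitespace runs to one space."""
    return " ".join(text.strip().lower().split())


def prompt_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized prompt (UTF-8 is an assumption)."""
    return hashlib.sha256(normalize_prompt(text).encode("utf-8")).hexdigest()


def dedupe(candidates: list[str], existing: list[str]):
    """Split candidates into prompts to append and skipped duplicates."""
    # Existing prompts get the same normalization before comparison.
    seen = {prompt_hash(p) for p in existing}
    kept, skipped = [], []
    for row, prompt in enumerate(candidates):
        digest = prompt_hash(prompt)
        if digest in seen:
            # Hypothetical record shape for the report's dedupe_results.
            skipped.append({"hash": digest, "source_row": row})
            continue
        # Design choice in this sketch: a kept prompt counts as existing
        # for the rest of the batch, so repeats within a batch are skipped.
        seen.add(digest)
        kept.append(prompt)
    return kept, skipped


kept, skipped = dedupe(
    ["  Write   Vally TESTS ", "New prompt", "new   PROMPT"],
    ["write vally tests"],
)
print(kept)
print([s["source_row"] for s in skipped])
```

Here only `"New prompt"` is kept: row 0 matches the existing prompt after normalization, and row 2 matches row 1.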

agreaves-ms · 2026-06-25T22:45:44Z

+3. Compare the hash against the existing stimulus prompts in the target eval file (after applying the same normalization to each existing prompt).
+4. Skip any stimulus whose hash matches an existing entry. Record the skipped hash and source row in the JSON report's `dedupe_results`.
+
+Helper scripts implement the normalization and hashing — delegate, do not re-implement:


Would it make sense to have instructions like:

Use the `vally-tests` skill to dedup the target eval file before appending.

I don't think we want agents to reach into skills to use their files; that should be up to the skill and its instructions.

Agreed, and done. The dedupe section now delegates to the vally-tests skill (its Helper Script Index) with a one-line behavior contract, rather than hard-coding script paths or re-implementing the SHA-256 normalize-and-hash logic.

agreaves-ms · 2026-06-25T22:47:13Z

+Before any write to disk, run the skill-local safety lint against the drafted stimulus YAML:
+
+* PowerShell: `.github/skills/hve-core/vally-tests/scripts/Lint-VallyTestSafety.ps1 -Path <draft.yml>`
+* Bash equivalent: `.github/skills/hve-core/vally-tests/scripts/lint-vally-test-safety.sh <draft.yml>`
+
+Honor exit codes verbatim:
+
+* Exit code 0 — clean. Proceed to dedupe and append.
+* Exit code 1 — at least one refusal-taxonomy match. Refuse: do not write, emit the Refusal Template with the matched category substituted, and record the refusal in the JSON report.
+* Exit code 2 — ambiguous (multiple categories matched or pattern parse error). Pause: do not write, surface the matched candidates and stimulus location to the user for review, and record the ambiguous result in the JSON report's `blockers` array.


As mentioned earlier, would it make sense for these types of instructions to just live in the vally-tests skill and for the agent to instruct the model to use the vally-tests skill to complete safety self-checks?

Makes sense, and applied. The safety self-check now defers to the vally-tests skill: the subagent points at the skill Safety Refusal Taxonomy and Helper Script Index for the lint scripts and exit-code contract, rather than restating the script paths and codes inline.
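
The exit-code contract quoted in the hunk above amounts to a three-way dispatch. The sketch below is a hedged Python illustration of that mapping only; it does not invoke the lint scripts, and `LintOutcome`, `plan_for_exit_code`, and the returned dictionary keys are hypothetical names. The hunk does not say what happens on any other exit code, so the sketch raises instead of guessing.

```python
from enum import Enum


class LintOutcome(Enum):
    CLEAN = 0      # no refusal-taxonomy match
    REFUSAL = 1    # at least one refusal-taxonomy match
    AMBIGUOUS = 2  # multiple categories matched or pattern parse error


def plan_for_exit_code(code: int) -> dict:
    """Map a safety-lint exit code to the next step described in the hunk."""
    try:
        outcome = LintOutcome(code)
    except ValueError:
        # Behavior for other codes is not specified in the quoted hunk.
        raise ValueError(f"unspecified lint exit code: {code}") from None

    if outcome is LintOutcome.CLEAN:
        return {"write": False if False else True, "action": "dedupe-and-append"}
    if outcome is LintOutcome.REFUSAL:
        # Do not write; emit the Refusal Template and record the refusal.
        return {"write": False, "action": "refuse"}
    # Do not write; surface candidates and record under `blockers`.
    return {"write": False, "action": "pause", "report_field": "blockers"}


for code in (0, 1, 2):
    print(code, plan_for_exit_code(code))
```

Only exit code 0 permits a write; both non-zero outcomes stop before anything reaches disk, which matches the "before any write to disk" framing of the hunk.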

- wire Vally Test Author into prompt-builder skill and dispatch matrix
- move routing, safety lint, dedupe, and report ownership into vally-tests skill
- split eval failures into gating vs advisory and surface them in the PR comment

✅ - Generated by Copilot

…re absent

- reconcile unattributed failures by the spec's overall advisory posture
- guard the exit-code fallback so all-advisory specs never block merge
- parse quoted advisory tag values so a "false" graduates correctly
- add stub fail-noname mode and a Pester case for the empty perStimulus path

🐛 - Generated by Copilot

WilliamBerryiii requested a review from a team as a code owner June 2, 2026 03:43

rezatnoMsirhC previously approved these changes Jun 8, 2026

agreaves-ms reviewed Jun 9, 2026

Comment thread .github/agents/hve-core/subagents/vally-test-author.agent.md Outdated

agreaves-ms reviewed Jun 9, 2026

Comment thread .github/agents/content-policy-citation.agent.md Outdated

agreaves-ms reviewed Jun 9, 2026

Comment thread .github/prompts/hve-core/evals-import.prompt.md Outdated

agreaves-ms reviewed Jun 9, 2026

Comment thread .github/prompts/hve-core/vally-test-write.prompt.md Outdated

agreaves-ms previously approved these changes Jun 9, 2026

feat(agents): add Vally evaluation agents and prompts

4afaf12

- add vally-test-author subagent and content-policy-citation agent
- add evals-import and vally-test-write prompts

✨ - Generated by Copilot

WilliamBerryiii force-pushed the feat/1637-l6-agents branch from d4901bf to 4afaf12 Compare June 23, 2026 02:17

WilliamBerryiii changed the base branch from feat/1637-l5-corpora to main June 23, 2026 02:17

WilliamBerryiii and others added 9 commits June 23, 2026 09:59

fix(agents): correct user-invokable typo to user-invocable in schema

d8304f0

🔧 - Generated by Copilot

Merge branch 'main' into feat/1637-l6-agents

ba2ee67

Merge branch 'main' into feat/1637-l6-agents

2ec7df8

fix(plugins): improve error handling in plugin generation process

90156b2

- capture error message in a variable for clarity
- conditionally call Write-CIAnnotation if the command exists

🔧 - Generated by Copilot

fix(evals): count eval summary totals by unique spec runs

f246469

- tally totals per spec to avoid double-counting and add Specs column
- index only specs declaring a top-level stimuli key
- add regression tests for summary totals and stimulus indexing

🧪 - Generated by Copilot

Merge remote-tracking branch 'origin/main' into feat/1637-l6-agents

ffb84c2

# Conflicts:
#	.github/instructions/hve-core/prompt-builder.instructions.md
#	evals/behavior-conformance/skill-behavior.eval.yaml

fix(ci): resolve PR validation and eval build failures

97ef1fa

- re-import CIHelpers in Prepare-Extension tests so Write-CIAnnotation resolves
- add deeplink/refus/stimul stems to cspell words
- add community-interaction instruction eval stimulus for coverage

🔧 - Generated by Copilot

agreaves-ms reviewed Jun 25, 2026

Comment thread .github/instructions/hve-core/prompt-builder.instructions.md

agreaves-ms reviewed Jun 25, 2026

agreaves-ms approved these changes Jun 25, 2026

WilliamBerryiii added 3 commits June 25, 2026 22:03

style(skills): align markdown tables to satisfy table-format check

d5aa4a3

📐 - Generated by Copilot
