feat(agents): add named research subagents to Task Researcher #2165

Draft
engineersamuel wants to merge 47 commits into microsoft:main from engineersamuel:lucid-veil
Conversation

@engineersamuel

Description

This PR gives Task Researcher a set of named, parallel research subagents ("lanes") plus a single public control to turn them on or off. When enabled, the agent fans out to specialized read-only subagents in parallel and then synthesizes their findings into one consolidated research document, so the user still receives a single .copilot-tracking/research artifact backed by deeper, broader investigation.

The change also tightens the untrusted-content boundary so lane and web findings are always treated as data, adds a deterministic comparison harness that scores research quality with subagents disabled versus enabled, and regenerates collections and plugin outputs so the agent and all five subagents install together.

Named research lanes

Four read-only, chat-only subagents return structured findings for the parent to synthesize; they write no files of their own.

  • Added the Codebase Locator, which maps where relevant code, tests, config, docs, schemas, and generated artifacts live.
  • Added the Codebase Analyzer, which traces behavior, data flow, state, error handling, and side effects with file:line evidence.
  • Added the Codebase Pattern Finder, which surfaces analogous implementations, conventions, reusable helpers, and anti-patterns.
  • Added the Web Search Researcher, which gathers external docs, SDK, API, and standards facts only when external uncertainty exists, behind a FAR (factual, actionable, relevant) gate and the untrusted-content boundary.

Single public control

The slash command in .github/prompts/hve-core/task-research.prompt.md exposes one knob and keeps the internal lane vocabulary out of the user interface:

/task-research topic="..." [chat={true|false}] [subagents={auto|true|false}]
  • auto (default) lets the agent's Lane Trigger Matrix pick the lightest sufficient path.
  • true forces the named lanes to fan out in parallel.
  • false keeps research direct unless that makes the request impossible.

The earlier mode={auto|focused|lanes} input was removed, so lane selection is now internal orchestration. The redundant Mode Selection block in task-researcher.agent.md was dropped, leaving the Lane Trigger Matrix as the single selector.
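The three values of the knob can be read as a small decision function. The sketch below is illustrative only: `Subagents`, `parse_subagents`, `should_fan_out`, and both boolean parameters are hypothetical names, and the real selection is carried out by the agent's prompt and Lane Trigger Matrix rather than by Python code.

```python
from __future__ import annotations

from enum import Enum


class Subagents(Enum):
    """User-facing values of the subagents knob."""

    AUTO = "auto"
    TRUE = "true"
    FALSE = "false"


def parse_subagents(raw: str | None) -> Subagents:
    """Parse the knob; a missing or blank value falls back to auto (the default)."""
    if raw is None or raw.strip() == "":
        return Subagents.AUTO
    try:
        return Subagents(raw.strip().lower())
    except ValueError:
        raise ValueError(f"subagents must be auto, true, or false; got {raw!r}") from None


def should_fan_out(
    mode: Subagents,
    matrix_wants_lanes: bool,
    direct_is_possible: bool = True,
) -> bool:
    """Decide whether the named lanes run.

    matrix_wants_lanes stands in for the Lane Trigger Matrix verdict;
    direct_is_possible models the "unless that makes the request
    impossible" escape hatch. Both are assumptions made for illustration.
    """
    if mode is Subagents.TRUE:
        return True  # forced fan-out
    if mode is Subagents.FALSE:
        return not direct_is_possible  # stay direct whenever direct research can work
    return matrix_wants_lanes  # auto defers to the matrix


if __name__ == "__main__":
    mode = parse_subagents("auto")
    print(should_fan_out(mode, matrix_wants_lanes=True))
```

Keeping the parse step separate from the decision step mirrors the PR's intent that the user sees only one knob while lane selection stays internal.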

Lane subagent dispatch fix

During validation the lanes were not firing. A model: field on a subagent definition silently dropped it from the dispatchable agent-type registry. Removing the model: arrays from the four lane subagents restored registration and dispatch, which was confirmed end-to-end in a clean local install.
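A regression of this kind could be caught with a small frontmatter check. The following is a minimal sketch, not part of this PR: `has_model_field` is a hypothetical helper, and it assumes subagent definitions are markdown files whose YAML frontmatter is delimited by `---` lines.

```python
import re

# Leading frontmatter block: text between the first pair of '---' lines.
FRONTMATTER = re.compile(r"\A---\r?\n(.*?)\r?\n---", re.DOTALL)


def has_model_field(agent_markdown: str) -> bool:
    """Return True when the frontmatter declares a top-level model: key.

    Only unindented keys are considered, so a nested key or a mention of
    'model:' in the body does not count.
    """
    match = FRONTMATTER.match(agent_markdown)
    if match is None:
        return False
    return any(
        re.match(r"model\s*:", line) for line in match.group(1).splitlines()
    )


if __name__ == "__main__":
    sample = "---\nname: Codebase Locator\nmodel:\n  - example-model\n---\nBody text\n"
    print(has_model_field(sample))  # True: this definition would be flagged
```

A check like this could run over the four lane subagent files in CI; the file locations are left out here because they depend on the repository layout.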

Safety, evals, and packaging

The remaining work hardened the surrounding boundary, evaluation, and packaging surface.

  • Extended the untrusted-content boundary instruction to .copilot-tracking/research/**, so lane and web findings are synthesized as data rather than executed as instructions.
  • Added a deterministic comparison harness under scripts/evals/task-researcher-comparison that scores Task Researcher output with subagents disabled and enabled, using lexical regression signals plus an optional DeepEval LLM-judge mode, along with advisory agent-behavior stimuli for Task Researcher.
  • Regenerated collections and plugin outputs so the agent and all five subagents ship together wherever the agent is installed.
  • Added a local plugin installer at scripts/plugins/Install-LocalCopilotPlugin.sh with plugin-id validation, path-deletion guards, a generated-subagent existence check, and shell-injection removal in the capture runner.
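The shell-injection removal in the capture runner follows a common pattern: read the runner command as a JSON array and execute it as an argv list instead of a shell string. The sketch below illustrates that pattern under assumptions. The names `runner_argv_from_env` and `TASK_RESEARCHER_RUNNER_ARGV` come from the commit messages in this PR, but the function bodies, the `run_capture` helper, and the choice to raise on an empty value are guesses, not the harness's actual code.

```python
from __future__ import annotations

import json
import os
import subprocess
import sys
from collections.abc import Mapping

ENV_VAR = "TASK_RESEARCHER_RUNNER_ARGV"


def runner_argv_from_env(environ: Mapping[str, str] | None = None) -> list[str]:
    """Parse the runner command from a JSON array of strings."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR, "").strip()
    if not raw:
        # Assumption: an empty value is an error in this sketch.
        raise ValueError(f"{ENV_VAR} is not set")
    try:
        argv = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{ENV_VAR} is not valid JSON: {exc}") from exc
    if not isinstance(argv, list) or not argv or not all(isinstance(a, str) for a in argv):
        raise ValueError(f"{ENV_VAR} must be a non-empty JSON array of strings")
    return argv


def run_capture(prompt: str, environ: Mapping[str, str] | None = None) -> str:
    """Run the configured runner with the prompt as one extra argument.

    Passing a list (and leaving shell=False, the default) means the prompt
    is never interpreted by a shell, so ';' or '$(...)' stay literal text.
    """
    argv = runner_argv_from_env(environ)
    completed = subprocess.run(
        [*argv, prompt], check=True, capture_output=True, text=True
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in runner that echoes its argument back.
    env = {ENV_VAR: json.dumps([sys.executable, "-c", "import sys; print(sys.argv[1])"])}
    print(run_capture("topic; echo injected", env))
```

Because `check=True` raises `subprocess.CalledProcessError` on a nonzero exit, a caller can catch it alongside `ValueError` and report the failing scenario and variant, as the commit history describes.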

Related Issue(s)

None

Type of Change

Select all that apply:

Code & Documentation:

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update

Infrastructure & Configuration:

  • GitHub Actions workflow
  • Linting configuration (markdown, PowerShell, etc.)
  • Security configuration
  • DevContainer configuration
  • Dependency update

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot instructions (.github/instructions/*.instructions.md)
  • Copilot prompt (.github/prompts/*.prompt.md)
  • Copilot agent (.github/agents/*.agent.md)
  • Copilot skill (.github/skills/*/SKILL.md)

Note for AI Artifact Contributors:

  • Agents: Research, indexing/referencing other projects (using standard VS Code GitHub Copilot/MCP tools), planning, and general implementation agents likely already exist. Review .github/agents/ before creating new ones.
  • Skills: Must include both bash and PowerShell scripts. See Skills.
  • Model Versions: Only contributions targeting the latest Anthropic and OpenAI models will be accepted. Older model versions (e.g., GPT-3.5, Claude 3) will be rejected.
  • See Agents Not Accepted and Model Version Requirements.

Other:

  • Script/automation (.ps1, .sh, .py)
  • Other (please describe):

Sample Prompts (for AI Artifact Contributions)

User Request:

/task-research topic="Compare how the collection validators and plugin generator
handle missing subagent references across collections, tests, and packaging." subagents=auto

Execution Flow:

  1. Task Researcher reads the topic and evaluates it against the Lane Trigger Matrix.
  2. Because the request spans multiple subsystems, auto selects lane mode and fans out to the Codebase Locator, Codebase Analyzer, and Codebase Pattern Finder in parallel; Web Search Researcher is omitted because the request is local-only.
  3. Each lane returns structured, read-only findings to the parent without writing files.
  4. The parent synthesizes the findings into one research document under .copilot-tracking/research/<date>/.

Output Artifacts:

A single research markdown document, for example:

# Research: Collection validators and missing subagent references

## Current behavior
- Collection YAML lists subagents under each agent's `agents:` block ...

## Where it lives (Codebase Locator)
- collections/*.collection.yml ...

## Risks and validation steps
- ...

Success Indicators:

  • Session events show subagent.started entries for the expected lanes.
  • The final research document cites file:line evidence and reflects multiple subsystems.
  • For local-only topics, Web Search Researcher is not dispatched.

Testing

  • npm run eval:task-researcher:compare ran the comparison harness tests: 17 passed, 1 skipped (the DeepEval LLM-judge test skips unless DEEPEVAL_RUN_LLM=1).
  • npm run lint:md, npm run spell-check, npm run lint:frontmatter, npm run validate:skills, npm run lint:md-links, and npm run lint:ps all passed.
  • ruff check on the comparison harness reported no issues, and npm run lint:ai-artifacts reported no issues.
  • Lane dispatch was confirmed in a clean local install by inspecting subagent.started session events for the four named lanes.
  • A headless A/B comparison ran the same prompts with subagents=true and subagents=false in isolated instances and verified fan-out at the event level (see Additional Notes).

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate
  • Docusaurus tests: npm run docs:test

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

Additional Notes

  • One consolidated output, by design. The named lanes do not write separate artifacts; they return structured findings that the parent merges into a single .copilot-tracking/research/... document.
  • subagents is the only user-facing knob. The direct, focused, and lane behaviors are internal orchestration driven by the Lane Trigger Matrix.
  • A/B evidence. Across four paired headless runs with fan-out verified from session events, the subagents-enabled arm scored higher on a fine-grained research-quality rubric (about 46.75 versus 42.0 out of 50), at roughly 7.7x the AI credits and 2.4x the wall-clock time. Every subagents-enabled run fanned out the expected local lanes and correctly omitted Web Search Researcher for local-only prompts. With only four pairs, the sample shows a stable direction rather than a tight confidence interval. Inline runs remained strong on narrow, single-file questions, which is the split auto is designed to route.
  • New eval-only dependencies. The comparison harness adds deepeval and pytest as dev dependencies under scripts/evals/task-researcher-comparison; the DeepEval LLM-judge path is opt-in via DEEPEVAL_RUN_LLM=1.
  • Docusaurus tests not applicable. No files under docs/ changed in this PR, so npm run docs:test was not run.
  • Pre-existing, out of scope. A StrictMode fix to scripts/evals/Test-EvalSpec.ps1 makes an unrelated eval-coverage gap visible (several agents lack stimulus partials). That gap is not gated by CI and is tracked separately.
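The trade-off in the A/B evidence bullet can be made concrete with a little arithmetic. The snippet below only restates the approximate figures quoted there (46.75 versus 42.0 out of 50, about 7.7x credits, about 2.4x wall-clock time); `summarize_ab` is a hypothetical helper, and no new measurements are introduced.

```python
def summarize_ab(
    with_score: float,
    without_score: float,
    scale: float,
    credit_multiple: float,
    time_multiple: float,
) -> dict[str, float]:
    """Express a paired rubric result as absolute and relative gains."""
    gain = with_score - without_score
    return {
        "absolute_gain": gain,                    # rubric points
        "gain_share_of_scale": gain / scale,      # fraction of the 50-point scale
        "relative_gain": gain / without_score,    # improvement over the direct arm
        "credit_multiple": credit_multiple,       # cost figures passed through unchanged
        "time_multiple": time_multiple,
    }


if __name__ == "__main__":
    summary = summarize_ab(46.75, 42.0, 50.0, credit_multiple=7.7, time_multiple=2.4)
    for key, value in summary.items():
        print(f"{key}: {value:.3f}")
```

On these figures the subagents arm gains 4.75 rubric points, which is 9.5% of the scale and roughly an 11% relative improvement, for about 7.7x the credits. That asymmetry is why the default is auto rather than true.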

Samuel Mendenhall and others added 30 commits June 25, 2026 15:42
Add lane orchestration to Task Researcher with research lanes for codebase locator, analyzer, pattern finder, and external research. Includes lane trigger matrix, prompt templates, and synthesis rules. Preserves Researcher Subagent as the only default child agent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add optional research lane parameter to inputs
- Support lane names: Codebase locator, Codebase analyzer, Codebase pattern finder, External research
- Add lane-specific output requirements for each lane
- Update response format to include lane and status
- Record research lane in document using 'Research lane: <lane name>' format

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add mode and subagents command inputs with documentation for trigger-based
selection, focused vs lane-enabled research modes, and subagent fan-out
preferences. Extends argument-hint with new optional parameters and adds
requirement for trigger matrix evaluation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add task-researcher-lane-fanout stimulus for research lane fan-out behavior
- Add task-researcher-focused-mode stimulus for lightweight research mode
- Add task-researcher-generic-subagent stimulus for subagent preservation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…de quality

- Remove committed hve_core_task_researcher_comparison.egg-info directory
- Add *.egg-info/ to .gitignore to prevent future packaging artifacts
- Tighten build_metrics() return type from list[LazyGEval | GEval] to list[LazyGEval]
- Add comprehensive docstrings to public functions: require_deepeval_llm_enabled, build_comparison_test_case, build_metrics
- Fix docstring whitespace issues (ruff W293)

🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add explicit error handling to catch subprocess.CalledProcessError when the runner command fails, providing context about scenario and variant with user-friendly error messages to stderr and returning nonzero exit code.

🐛 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…collection

- Fix npm run eval:task-researcher:compare to explicitly specify tests path
- Prevents pytest from collecting 2979 unrelated repository tests
- Tests now pass cleanly: 6 passed, 1 skipped
- Resolves review finding: root command now passes deterministically

🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🧪 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔌 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- add locator, analyzer, pattern, and web research agents
- keep each agent scoped to a single research lane

🤖 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update Task Researcher to invoke named subagents in lane mode.

🧭 - Generated by Copilot
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- update task-research prompt fan-out guidance
- package codebase and web research subagents in hve-core

🧭 - Generated by Copilot
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- update stimulus, fixtures, and static scoring for named lanes
- refresh comparison README and fixture outputs

🔍 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- restore pre-existing model arrays for unrelated agents
- add headers to task-researcher comparison Python files
- clarify fixtures are synthetic and live verification is separate

🛠️ - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Named lane subagents no longer create per-lane artifact files
- Remove edit tools from codebase-locator, codebase-analyzer, codebase-pattern-finder, web-search-researcher
- Update output sections to return structured findings in chat response
- Remove file-writing prerequisites; replace with read-only setup steps
- Remove lane-specific output requirements from Researcher Subagent
- Update Task Researcher delegation and synthesis rules for read-only lanes
- Add synthesize-into-main-doc rule to task-research prompt

🔬 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔧 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add one focused-mode return sentence to keep the delegation section symmetric without reintroducing lane artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

🔧 - Generated by Copilot
🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace names-all-four-subagents grader with per-lane and conditional web-search graders in task-researcher.yml stimuli
- Update codebase-lane required_evidence to list three local subagent files; remove Web Search Researcher
- Update external-api required_evidence to include web-search-researcher subagent path and Web Search Researcher
- Replace NAMED_SUBAGENT_MARKERS with LOCAL_LANE_MARKERS + WEB_LANE_MARKER constants
- Rewrite _score_mode_compliance: codebase-lane penalizes web search, focused penalizes any lane fanout, external requires both lanes and web search plus FAR evidence
- Update fixtures to match new scoring: remove web lane from codebase-lane, add FAR note and web lane to external-api, strip lane markers from focused-local
- Add test_focused_case_penalizes_any_lane_fanout test

🔬 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- add runner_argv_from_env() parsing TASK_RESEARCHER_RUNNER_ARGV as JSON array
- replace shell=True subprocess with safe argv list execution
- add ValueError handler for malformed TASK_RESEARCHER_RUNNER_ARGV
- add test_capture.py covering build_prompt and runner_argv_from_env

🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Samuel Mendenhall and others added 17 commits June 25, 2026 15:49
- catch invalid runner argv before subprocess execution
- align empty env handling and add parser coverage
- restore README note about live captures

🛠️ - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ath guards

- add validate_plugin_id() to reject unsafe plugin ID slugs
- add safe_installed_path() to refuse delete targets outside INSTALL_ROOT
- verify named Task Researcher subagents in verify_source_plugin()
- update scripts/plugins/README.md with security description

🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep dry-run from mutating the install root while preserving path safety checks.

🔒 - Generated by Copilot
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nstalled_path

- safe_installed_path() now resolves both INSTALL_ROOT and candidate
  parent via pwd -P; dry-run falls back to lexical check when dirs
  do not exist
- add mkdir -p INSTALL_ROOT and INSTALL_ROOT/_direct in non-dry-run
  mode before safe path checks, satisfying spec requirement

🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…bagents

- regenerate plugins/hve-core and plugins/hve-core-all with codebase-analyzer,
  codebase-locator, codebase-pattern-finder, and web-search-researcher subagents
- regenerate extension package.json and README artifacts for named subagents
- update evals/README.md ms.date to 2026-06-24
- rename evals/agent-behavior/stimuli/adr-creator.yml to adr-creation.yml so
  the agent backlink tag resolves to adr-creation.agent.md
- regenerate evals/agent-behavior/eval.yaml from corrected partials
- fix Test-EvalSpec.ps1 null-safe array wrap on Test-EvalSpecCompliance return

🔄 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔗 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phrases like "unrelated" and "not relevant" describe intentional
exclusions, which signal good noise control. Scoring 0 on their presence
inverted the metric. Remove the word-match guard; rely on word-count only.

🔬 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔒 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A model field silently de-registers a subagent from the dispatchable agent-type registry, so the named Task Researcher lanes never fanned out. Removing it aligns source with the already-regenerated plugin copies that omit it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Lane Trigger Matrix is the sole lane selector; the separate Mode Selection block duplicated it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match report recommendation logic to the documented coverage/actionability/noise rubric, label the static scorers as lexical signal and proxy heuristics, and add recommendation boundary tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both terms are introduced by the Task Researcher comparison harness and lane stimuli.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Main's vally schema rejects boolean tag values, so the task-researcher advisory tags become strings. Regenerate the agent-behavior eval spec and hve-core plugin readme to include the named lanes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@engineersamuel

Author

@engineersamuel please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"
