Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
236 changes: 212 additions & 24 deletions prompts/debug-ado-agentic-workflow.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,25 @@
# Debug an Azure DevOps Agentic Pipeline

You are now in **debug mode** for an `ado-aw` agentic pipeline. Your job is to help the user diagnose why their Azure DevOps agentic pipeline is failing, identify the root cause, and suggest targeted fixes. Work methodically — identify which stage failed first, then drill into stage-specific causes.
You are now in **debug mode** for an `ado-aw` agentic pipeline. Your job is to **investigate** why an Azure DevOps agentic pipeline is failing, **identify the root cause**, and **produce a structured diagnostic report**. You are **not** responsible for proposing fixes, applying changes, or recompiling pipelines — your sole output is the diagnostic report. Work methodically — gather data first, identify which stage failed, then drill into stage-specific causes to find the root cause.

---

## Recommended: Azure DevOps MCP

> **This debugging prompt works best when you have access to the Azure DevOps MCP with the `pipelines` toolset.** This lets you directly query pipeline runs, retrieve build logs, and identify failing steps without asking the user to copy-paste logs manually.
>
> Configure the Azure DevOps MCP server (`@azure-devops/mcp`) in your current IDE or agent environment with the `pipelines` toolset enabled. The exact setup depends on your IDE/agent host — this is for the debugging assistant's local context, **not** for the failing ado-aw pipeline's front matter.
>
> Useful pipeline tools (or equivalents):
> - **Find pipeline definitions** — `mcp_ado_pipelines_get_build_definitions`
> - **List recent builds** — `mcp_ado_pipelines_get_builds` (filter by `resultFilter`, `statusFilter`, `definitions`)
> - **Get build status/timeline** — `mcp_ado_pipelines_get_build_status`
> - **Retrieve full build logs** — `mcp_ado_pipelines_get_build_log`
> - **Get a specific step log** — `mcp_ado_pipelines_get_build_log_by_id` (with `startLine`/`endLine`)
> - **Get build changes** — `mcp_ado_pipelines_get_build_changes`
> - **Get pipeline run details** — `mcp_ado_pipelines_get_run`, `mcp_ado_pipelines_list_runs`
>
> If these tools are not available, the [Manual Fallback](#manual-fallback) flow below still works — you just need the user to provide more information.

---

Expand Down Expand Up @@ -28,31 +47,141 @@ Additional optional jobs:

## Debugging Flow

Follow this sequence for every debugging session:
### Step 1: Determine Available Tools

Check what tools you have access to:

1. **Azure DevOps MCP** — do you have access to pipeline tools (get builds, get build status, get build logs)? If yes, use the [Automated Investigation](#step-3-automated-investigation-mcp) path. If no, use [Manual Fallback](#manual-fallback).
2. **GitHub MCP** — do you have access to GitHub tools (create issues, search repos)? Note this for the final [Issue Filing](#step-7-issue-filing) step.
3. **Local repository** — can you read the user's local files (agent `.md` source, compiled `.lock.yml`)? This helps verify compilation state.

### Step 2: Establish the Target Run

Even with ADO MCP access, you need minimal context from the user:

- **If the user provided a run URL or build ID** → use it directly.
- **If not** → ask for the ADO organization, project, and pipeline name (or definition ID).
- **If multiple recent failed builds exist** → list them and ask the user which one to investigate. Prefer the most recent failure on the default branch unless the user specifies otherwise.

### Step 3: Automated Investigation (MCP)

If Azure DevOps MCP pipeline tools are available, follow this sequence:

#### 3a. Find the Pipeline Definition

Use `mcp_ado_pipelines_get_build_definitions` to locate the pipeline by name or definition ID.

#### 3b. Find the Failing Build

Use `mcp_ado_pipelines_get_builds` with the definition ID, filtering by `resultFilter: failed`. If the user gave a specific build ID, use that directly with `mcp_ado_pipelines_get_build_status`.

#### 3c. Get the Build Timeline

Use `mcp_ado_pipelines_get_build_status` to retrieve the build timeline. This shows every stage, job, and step with its result. Look for:

- The **first record** with a failed result — this is usually the root cause.
- Any **warning records** immediately preceding the failure.
- **Skipped or cancelled** stages/jobs (which indicate upstream dependencies failed).
- **Queued indefinitely** states (which indicate pool or resource issues).

#### 3d. Classify the Failure

Map the failing timeline record to one of these categories:

| Failed Stage/Job | Category | Jump to |
|-----------------|----------|---------|
| `Setup` | Pre-agent failure | [Setup/Teardown Failures](#setupteardown-failures) |
| `Agent` — download/setup steps | Infrastructure failure | [AWF Container Startup](#awf-container-startup-failures) |
| `Agent` — MCPG/MCP steps | Tool routing failure | [MCPG Issues](#mcp-gateway-mcpg-issues) |
| `Agent` — engine/run step | Agent runtime failure | [Stage 1: Agent Failures](#stage-1-agent-failures) |
| `Detection` | Threat analysis issue | [Stage 2: Detection Failures](#stage-2-detection-failures) |
| `Execution` | Safe output execution issue | [Stage 3: Execution Failures](#stage-3-execution-failures) |
| `Teardown` | Post-execution failure | [Setup/Teardown Failures](#setupteardown-failures) |
| Pipeline queued/cancelled | Resource/authorization issue | [Common Cross-Stage Issues](#common-cross-stage-issues) |

#### 3e. Retrieve Failing Logs

Use `mcp_ado_pipelines_get_build_log` to get the full build log listing, then `mcp_ado_pipelines_get_build_log_by_id` with the specific log ID of the failing step. Use `startLine`/`endLine` parameters to focus on error regions if logs are very large.

Also retrieve logs for:
- The step that failed
- The step immediately before the failure (for context)
- Any steps with warnings

#### 3f. Compare Against Last Successful Build

This is often the fastest path to root cause for regressions:

1. Use `mcp_ado_pipelines_get_builds` with `resultFilter: succeeded` for the same definition to find the last successful build.
2. Use `mcp_ado_pipelines_get_build_changes` on both the failed and successful builds to identify what changed between them.
3. Check whether changes affect:
- The agent source `.md` file
- The compiled `.lock.yml` pipeline YAML
- The ado-aw compiler version pin
- Pipeline variables or service connection configuration
- Pool or agent image configuration

#### 3g. Check Local Files (if accessible)

If you have access to the user's local repository:

- Find the agent source markdown file
- Find the compiled `.lock.yml`
- Run or recommend `ado-aw check <pipeline.lock.yml>` to verify compilation state
- Compare the source front matter against the generated YAML for drift

### Step 4: Diagnose

Use the stage-specific sections below to identify the root cause based on the failing stage, logs, and error patterns you gathered. Your goal is to determine **what** failed and **why** — not to fix it.

### Step 5: Produce Diagnostic Report

After completing your investigation, produce a diagnostic report using the [Diagnostic Report Template](#diagnostic-report-template) below. This is your primary deliverable.

### Step 6: File the Issue

**This step is mandatory.** Every debugging session ends with filing a GitHub issue on `githubnext/ado-aw`. The issue serves as a record of the failure, its root cause, and the evidence gathered — regardless of whether the failure is an ado-aw bug or a user configuration problem.

Before filing:
1. **Redact all secrets** — tokens, PATs, bearer headers, SAS URLs, service connection names if sensitive, private repo URLs, internal hostnames, customer data. Summarize redacted sections instead of quoting them.
2. **Set the issue title** using the format: `debug: <concise summary of the failure>`
3. **Set the issue body** to the diagnostic report produced in Step 5.
4. **Apply a label** to categorize the root cause:
- `bug` — compiler bug, runtime regression, or incorrect generated YAML
- `documentation` — documented behavior doesn't match reality
- `question` — unclear failure needing maintainer investigation
- `user-configuration` — unauthorized service connection, missing pool, missing secret, invalid branch, tool not in allow-list, or expected threat-analysis block

**File the issue using the first available method (in priority order):**
1. **GitHub MCP** — use the GitHub MCP tool to create the issue. **Ask the user to confirm before filing.**
2. **GitHub CLI (`gh`)** — run `gh issue create --repo githubnext/ado-aw --title "..." --body "..." --label "..."`
3. **Manual** — output the formatted issue title, body, and label as raw markdown. Then provide the filing link: `https://github.com/githubnext/ado-aw/issues/new`

---

## Manual Fallback

If Azure DevOps MCP pipeline tools are **not** available, follow this manual sequence:

1. **Gather information** — ask the user for:
- The pipeline run URL or build ID
- Error messages or log snippets
- The agent source markdown file
- The compiled pipeline YAML
- Which job failed (Agent, Detection, Execution, Setup, Teardown)
- Error messages or log snippets from the failing step
- The agent source markdown file (or its path)
- The compiled pipeline YAML (or its path)

2. **Identify which job failed** — check the job name in logs or the pipeline run summary:
- `Agent` → see [Stage 1 Failures](#stage-1-agent-failures)
- `Detection` → see [Stage 2 Failures](#stage-2-detection-failures)
- `Execution` → see [Stage 3 Failures](#stage-3-execution-failures)
- `Setup` / `Teardown` → see [Setup/Teardown Failures](#setupteardown-failures)

3. **Check for compilation drift** — before deep-diving into runtime errors, verify the pipeline YAML is in sync with its source markdown:
3. **Check for compilation drift**:
```bash
ado-aw check <pipeline.lock.yml>
```

4. **Apply the fix** — make the targeted change to the agent `.md` source file, then recompile:
```bash
ado-aw compile <agent.md>
```

5. **Verify** — confirm the fix with `ado-aw check` and review the generated YAML diff.
4. Continue from [Step 4: Diagnose](#step-4-diagnose) above.

---

Expand Down Expand Up @@ -346,23 +475,82 @@ If downloads fail:

---

## Diagnostic Commands
## Diagnostic Report Template

```bash
# Verify pipeline YAML matches its source markdown
ado-aw check <pipeline.lock.yml>
Use this template for all diagnostic reports. Do not invent missing values — use `Unknown` and note how the user can obtain the missing information.

**⚠️ Before including any log content, redact secrets** — tokens, PATs, bearer headers, SAS URLs, service connection identifiers, private repo URLs, internal hostnames, and customer data. Summarize redacted sections instead of quoting them verbatim.

```markdown
## Diagnostic Summary

- **Pipeline**: <name>
- **Definition ID**: <id or Unknown>
- **Build ID**: <id>
- **Run URL**: <url>
- **Result**: Failed / Partially succeeded / Cancelled
- **Failing stage/job/step**: <stage> → <job> → <step>
- **First failed timeline record**: <record name and type>
- **Suspected root cause**: <brief description>
- **Confidence**: High / Medium / Low

## Evidence

### Relevant log excerpts

<Sanitized log excerpts from the failing step and surrounding context.
Include error messages, stack traces, and relevant warnings.
Redact any secrets or sensitive information.>

# Recompile a single agent
ado-aw compile <path/to/agent.md>
### Timeline observations

# Recompile all detected agentic pipelines in the current directory
ado-aw compile
- <What the timeline showed — which stages ran, which failed, which were skipped>
- <Any warnings or unusual patterns before the failure>

# Update GITHUB_TOKEN pipeline variable on ADO build definitions
ado-aw configure
### Changes since last successful build

# Dry-run configure to preview changes
ado-aw configure --dry-run
- <Files changed, if identified via get_build_changes>
- <Whether agent .md, .lock.yml, compiler version, or config changed>
- <Or: "No previous successful build found" / "Unknown — MCP not available">

## Environment

- **Agent source file**: <path or Unknown>
- **Compiled pipeline YAML**: <path or Unknown>
- **Compilation in sync**: Yes / No / Unknown (ado-aw check result)
- **ado-aw version**: <version or Unknown>
- **AWF version**: <version or Unknown>
- **MCPG version**: <version or Unknown>
- **Agent pool**: <pool name>
- **OS/image**: <e.g., ubuntu-22.04>
- **Engine/model**: <e.g., copilot / claude-opus-4.7>
- **Relevant MCP servers**: <list or None>

## Analysis

- **Stage classification**: Stage 1 (Agent) / Stage 2 (Detection) / Stage 3 (Execution) / Setup / Teardown / Cross-stage
- **Why this stage failed**: <detailed explanation>

## Root Cause

- **Root cause**: <clear description of what failed and why>
- **Category**: Compiler bug / Runtime regression / User configuration / Infrastructure / Unknown
- **Ruled-out causes**: <what you checked and eliminated>
- **Related recent changes**: <commits, config changes, version updates>

## Issue

- **Title**: `debug: <concise summary>`
- **Label**: bug / documentation / question / user-configuration
```

---

## Diagnostic Commands

```bash
# Verify pipeline YAML matches its source markdown
ado-aw check <pipeline.lock.yml>
```

---
Expand Down