Add live golden-path (Tier 2) pipeline for azd ai agent extension#8758
Add live golden-path (Tier 2) pipeline for azd ai agent extension#8758v1212 wants to merge 29 commits into
Conversation
Adds eng/pipelines/ext-azure-ai-agents-live.yml, an on-demand/weekly Azure DevOps pipeline that drives the real 'azd ai agent' CLI through tmux against live Azure (TME), exercising init -> provision -> deploy -> invoke -> down for both code and container deploy modes. This is the live counterpart to the PR-gate checks (Tier 0 offline + Tier 1 recording/playback in #8754). Per Azure SDK EngSys / SFI guidance, live access stays out of the automatic PR pipeline (trigger: none) and runs only via '/azp run ext-azure-ai-agents-live' or the weekly schedule. The Tier 2 tmux driver (test_full_e2e.py, test_tier2.py) is migrated from the #8692 prototype; CI auth detection is extended to recognize Azure DevOps (TF_BUILD) and an explicit E2E_USE_AZ_CLI_AUTH override.
…zSessionActive The azure-sdk-tests service connection uses Workload Identity Federation, whose az session is isolated to the task's private AZURE_CONFIG_DIR and expires after ~10 min. Running the ~50 min golden-path test (and the cleanup) as plain bash steps after a separate login step would fail auth on both counts. Run them inside AzureCLI@2 with keepAzSessionActive:true (matching build-cli.yml) so the session stays refreshed and reaches azd (auth.useAzCliAuth) through tmux, which inherits AZURE_CONFIG_DIR. Subscription/tenant are now read in-script via az account show instead of cross-step pipeline variables.
test_tier2.py always ran sequentially, but kept a tautological if-condition (len==1 or len>1), an unused concurrent.futures import, a no-op --serial flag, and a docstring/print claiming parallel execution. Simplify to an explicit sequential loop and update the docstring to match. Also fix test_full_e2e.py's module docstring to point at README.md (LOCAL-TEST-GUIDE.md does not exist).
There was a problem hiding this comment.
Pull request overview
This PR adds a dedicated Azure DevOps pipeline and supporting Python drivers/docs to run the Tier 2 live golden-path E2E for the azure.ai.agents azd extension (init → provision → deploy → invoke → down) against real Azure resources, outside of the GitHub PR gate.
Changes:
- Added an on-demand + weekly ADO pipeline (
ext-azure-ai-agents-live) to run Tier 2 live E2E for bothcodeandcontainerdeploy modes. - Added a tmux-driven Python Tier 2 runner (
test_tier2.py) and a full golden-path driver (test_full_e2e.py) adapted for ADO CI auth detection. - Added documentation for running the live Tier 2 tests locally and in CI.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| eng/pipelines/ext-azure-ai-agents-live.yml | New ADO pipeline wiring to build azd + extension, run Tier 2 live E2E under AzureCLI@2, and publish logs/cleanup. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/test_tier2.py | Tier 2 orchestrator: runs code + container golden paths sequentially with isolation and timeout handling. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/test_full_e2e.py | tmux-driven end-to-end golden path driver (init/provision/deploy/invoke/down) with CI/local auth switching. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/README.md | Docs for CI setup (ADO registration/service connection/secrets) and local WSL execution. |
- Use the ambient azure-sdk org secret `azuresdk-github-pat` for GH_TOKEN instead of an empty `GitHubPat` placeholder variable (mirrors eval-waza.yml); removes a misleading masked variable and the need for admin PAT setup. - Harden the AzureCLI@2 inline script: `set -euo pipefail` and assign-then-verify subscription/tenant so an `az account show` failure fails fast (a plain `export X=$(...)` would have masked the error from set -e). - Reword the extension-install comment to be self-contained (it no longer inaccurately claims to mirror lint-ext-azure-ai-agents.yml). - Clarify the test_full_e2e.py auth prerequisite: only local WSL runs leave auth.useAzCliAuth unset; CI auto-enables az CLI auth. - Clear tmux scrollback after env setup so the exported GH token cannot leak into capture() output on failures/timeouts. - _cleanup_leaked_resources now checks azd down's return code and reports failures instead of always printing "Cleanup complete".
- Stream child E2E output live with a watchdog-enforced hard timeout instead of buffering everything via capture_output - Shell-escape the GitHub token (shlex.quote) before exporting in tmux - Clean up the per-mode AZD_CONFIG_DIR temp copy unless E2E_KEEP_ARTIFACTS - Use sha256 instead of md5 for the agent-name uniqueness suffix - Derive the agent binary arch from uname -m instead of hard-coding amd64
- Shell-escape HOME/PATH/TENANT, the cd target, and the agent name with shlex.quote() (consistent with the earlier token fix) - On Tier 2 timeout, kill the child's detached tmux server so reused CI agents do not accumulate orphaned tmux sockets
jongio
left a comment
There was a problem hiding this comment.
The Go rewrite (commits c62a696 and 4d27988) directly addresses @vhvb1989's direction request. The interactive driver now uses go-expect + vt10x + creack/pty with event-driven synchronization (waitForQuiet) instead of fixed-interval polling. This is solid work.
What looks good:
- Event-driven sync:
promptQuiet/listSettlewaits for the UI to stop emitting before dispatching, rather than sleeping a fixed interval. Correct approach for flake-free interactive tests. - Timeout hierarchy: per-mode
runTimeout(60m) <go test -timeout(125m) < ADO step (130m) < job (150m). Each layer has room for graceful teardown before the next one kills it. - Cleanup safety:
assertSafeTestDirblocks dangerousos.RemoveAllon protected paths;t.Cleanup(r.teardown)guaranteesazd down --force --purgeruns even on failure; thecondition: always()pipeline step is a crash-path backstop. - Platform-portable assert logic:
assert.gobuilds on all platforms (no build tag), with the regex-free standalone-"4" detection explained clearly. - Pipeline arch detection:
uname -mmapping to GOARCH means the step won't break if the pool moves off amd64.
One observation for the conversation with @vhvb1989:
The latest commits already deliver what was requested. The old Python/tmux files are gone, the Go driver is in place, and it uses exactly the libraries and approach described in the direction-change comment. The driver is local to the extension's tests/e2e-live/ for now (as suggested), ready to lift into the shared module when #8795 lands.
Minor notes (non-blocking):
tier2_live_test.goat 859 lines is large for a single file, but splitting a sequential test flow (init/provision/deploy/invoke/down + all helpers) across multiple files would hurt readability. Acceptable.- The
dispatchPromptsource-line annotations (e.g.,init.go:722) are useful for maintenance but will become stale as the extension evolves. The README already documents this trade-off ("changes there can require updatingdispatchPrompt"). Consider a lint or generator if drift becomes a maintenance burden. findServiceNameuses line-by-line string matching to parseazure.yaml. This is pragmatic for the known scaffolded output format and avoids pulling in a YAML library for one test helper. Fine as-is.
|
Thanks @vhvb1989 — done. Ported the interactive Per "one thing to validate" note: I did a source-grounded spike and calibrated the prompt dispatch to the real |
|
Thanks @jongio. The driver changed — the Go rewrite removed the original Python/tmux code, so those first-review points are gone. The three notes on the new code are acknowledged trade-offs (the |
jongio
left a comment
There was a problem hiding this comment.
My prior analysis (two rounds against this same HEAD) still holds. The Go rewrite using go-expect + vt10x + creack/pty is solid: event-driven synchronization via waitForQuiet, clean timeout hierarchy, three-layer resource cleanup, and well-tested assert logic.
One code modernization note below. Also: the PR currently has merge conflicts that need resolving before merge.
No blocking issues from my side.
There was a problem hiding this comment.
The Go rewrite from tmux+Python to a PTY-based Go driver is a significant improvement: same toolchain as the rest of the extension, no Python/tmux dependency on the agent, and the event-driven dispatch loop (wait for survey UI to settle, then match the rendered prompt) is more robust than fixed polling.
A few observations on the final state:
-
findServiceNamenaive YAML walk (tier2_live_test.go): The line-by-line parser assumes a specific indentation structure. Sinceazd ai agent initalways scaffolds the same format this is fine today, but if the template output changes it will silently break. Consider a targetedgrep -m1approach or importing a minimal YAML decoder if the function gets reused. -
Timeout cascading is well-designed: per-mode 60m > sum-of-phases 50m (10m teardown margin), step 130m > go-test 125m, job 150m > step. This ensures graceful teardown always runs before hard kills, preventing resource leaks. The commit that adjusted these (db53b06) explains the invariant clearly.
-
responseHasExpectedAnswerrejects "4.0" by design (the4followed by.then digit rule). Worth a brief code comment noting this is intentional, since a future maintainer might wonder why a valid math answer is rejected. The existing unit tests cover this implicitly via the "version" case ("4.1") but an explicit "4.0" case in the table would make the design intent obvious. -
Dependency versions (
go-expect,creack/pty,vt10x) are the same 2022 pseudo-versions that AlecAivazis/survey uses in its own test suite. They are proven to work well together for this exact use case. No action needed, but worth noting in case a future dep audit flags them as stale.
Per AGENTS.md Go 1.26 guidance, switch exitCode() from errors.As to errors.AsType[*exec.ExitError]. Add an explicit "4.0" case to the assert table and document why a decimal answer is deliberately rejected.
# Conflicts: # cli/azd/extensions/azure.ai.agents/cspell.yaml
jongio
left a comment
There was a problem hiding this comment.
One minor nit on the safety-net cleanup step: the 2>/dev/null on line 213 swallows all diagnostic output from azd down. If a crash-path cleanup fails, the pipeline log won't show why. Since the task already sets continueOnError: true (so a non-zero exit can't fail the build), consider 2>&1 || true instead to keep diagnostic output visible for post-mortem investigation while still preventing the set -e from aborting the loop.
Not blocking, the current pattern is defensible for a quiet safety net.
Use 2>&1 instead of 2>/dev/null in the always() cleanup loop so a failed force-purge is visible for debugging a resource leak; keep || true so set -e doesn't abort the loop.
jongio
left a comment
There was a problem hiding this comment.
The Go rewrite of the Tier 2 driver is well-structured: event-driven prompt dispatch, proper timeout hierarchy (per-mode 60m < go test 125m < step 130m < job 150m), and LIFO cleanup ordering so teardown always runs before the config dir it depends on is removed.
One minor observation below on findServiceName; everything else looks solid after 12+ rounds of iteration.
jongio
left a comment
There was a problem hiding this comment.
The crash-path cleanup now surfaces stderr via 2>&1 instead of swallowing it, making failed force-purge attempts visible in the pipeline log. The || true guard still prevents the loop from stopping on individual azd down failures.
findServiceName (tier2_live_test.go:649) parses azure.yaml by string matching rather than a YAML library. This works for the predictable output of azd ai agent init, but could silently break on edge-case YAML (a comment containing services:, or a multi-line string value). Low risk since the generated files are predictable, but worth noting if the test starts failing on valid projects.
Replace the indentation/colon line scan with a minimal struct unmarshal (gopkg.in/yaml.v3, already a direct dep) so the service-name lookup tolerates comments and indentation changes in scaffolded azure.yaml.
jongio
left a comment
There was a problem hiding this comment.
One suggestion on context propagation in the test runner.
jongio
left a comment
There was a problem hiding this comment.
One suggestion on context propagation in the test runner.
jongio
left a comment
There was a problem hiding this comment.
Reviewed the full diff (8 files, +1550/-1). My prior feedback from yesterday was addressed in the latest commits (errors.AsType, stderr visibility in cleanup, struct-based YAML parsing, t.Context() propagation).
Fresh pass on the current state: the test architecture is well-designed. The event-driven prompt loop (go-expect waitForQuiet instead of fixed-interval polling), the bounded ring buffer for diagnostics, the LIFO cleanup ordering, and the tiered timeout hierarchy (per-phase < per-mode < go-test < ADO step < ADO job) are all solid engineering.
One non-blocking note for future maintenance: the dispatchPrompt cases are tightly coupled to the extension's prompt strings (each case annotated with source file:line). The README documents this, but consider adding a CI step or a grep-based check that flags when prompt-emitting lines in the extension source change without a corresponding update here. That would catch desync earlier than a full Tier 2 run.
jongio
left a comment
There was a problem hiding this comment.
Checked the Go PTY test driver, answer matcher logic, console helpers, pipeline YAML, and README.
Prior feedback (errors.AsType, t.Context(), yaml struct parsing, stderr redirect) is all addressed. CI is green across all checks.
The timeout hierarchy, LIFO cleanup ordering, assertSafeTestDir guard, and event-driven prompt loop with loop detection are correctly implemented. The standalone-4 checker handles edge cases well (model names, versions, status codes). Dependencies were already in go.sum as indirect deps; the go.mod change just promotes them to direct.
One non-blocking suggestion on the dispatchPrompt default case.
There was a problem hiding this comment.
Reviewed all 8 files (1554 additions). My earlier comments (errors.AsType, t.Context(), yaml.Unmarshal for service name parsing, unhandled-prompt logging) are all addressed in the current revision.
Checked specifically:
- Timeout hierarchy math: per-mode 60 min runTimeout x2 + teardown fits inside the
go test -timeout 125mcap, and the ADO step (130 min) and job (150 min) caps sit above that. Thecondition: always()cleanup step provides a second safety net if the process is force-killed mid-teardown. - Pipeline security: no hardcoded secrets, proper WIF auth via service connection, the GH_TOKEN comes from an ambient org secret. The
set -euo pipefail+ assign-then-verify pattern for subscription/tenant avoids theexport X=$(cmd)pitfall. - PTY wiring:
Setsid: true, Setctty: truewith stdin as the pty slave correctly gives the child a controlling terminal. The event-drivenwaitForQuietapproach with the loop detector (repeat >= 3/5) handles both slow CI and stuck prompts without hanging. - Cleanup ordering: LIFO
t.Cleanupensuresazd downruns before the copiedAZD_CONFIG_DIRis deleted. Thecontext.Background()in teardown is correct (must survive test-context cancellation). assertSafeTestDirguard: protects against destructiveRemoveAllon protected paths.
No blocking issues. The test-only dependencies (go-expect, vt10x, creack/pty) are at the same versions used by the survey/expect ecosystem and are already present in go.sum as indirect deps (no go.sum diff needed).
jongio
left a comment
There was a problem hiding this comment.
Prior feedback addressed: errors.AsType, t.Context(), yaml.Unmarshal for service name parsing, default-case logging for unhandled prompts.
One maintainability note on the prompt dispatch (inline).
# Conflicts: # cli/azd/extensions/azure.ai.agents/cspell.yaml
jongio
left a comment
There was a problem hiding this comment.
One minor nit on the location dispatch case for consistency with the other tightened prompt matches.
Merging main shifted prompt_service.go, leaving the cited line numbers (143/190/226) pointing at the wrong lines. Drop the brittle file:line reference and rely on the function name plus the quoted picker messages, which the case clauses already match exactly.
jongio
left a comment
There was a problem hiding this comment.
Previous feedback (errors.AsType, t.Context(), yaml.Unmarshal for service name parsing, tightening prompt dispatch matches, logging default-case fall-throughs) all addressed in recent commits. Code quality is solid.
One observation: the PR description mentions a condition: always() AzureCLI@2 step as a "second safety net" for crash-path cleanup, but the pipeline YAML only has PublishPipelineArtifact@1 with condition: always(). The Go .Cleanup(r.teardown) approach is sufficient given the timeout hierarchy (job 150m > step 130m > go test 125m > per-mode 60m), so this isn't a gap in practice, but the description is slightly overstated. Consider updating the PR body to reflect the actual cleanup mechanism (t.Cleanup + LIFO ordering) or adding the explicit cleanup step if you want the belt-and-suspenders guarantee for cases where the Go process is killed before t.Cleanup runs.
No blocking issues. The event-driven prompt loop design, the timeout hierarchy, and the �ssertSafeTestDir guard are all well thought out.
ab890b1 tightened the subscription case to has("select an azure subscription"), but that string is the extension's fmt.Println preamble, not the survey "?" line activePrompt reads. ensureSubscription passes an empty PromptSubscriptionRequest, so the picker renders azd-core's default message "Select subscription" (prompt_service.go:507). The tightened match missed it, sending the prompt to the default Enter and selecting the wrong subscription in a live run. Match "select subscription" instead.
jongio
left a comment
There was a problem hiding this comment.
Incremental commit fixes the subscription dispatch to match azd-core's actual picker message (Select subscription from prompt_service.go:507) rather than the extension's preamble text. Verified: activePrompt lowercases before dispatch, so the case-sensitive has() correctly matches.
Minor note for future awareness: select subscription (singular) is also a prefix/substring of select subscriptions to include in the filter (config_sub_filter.go:188). This won't cause false matches since that multi-subscription filter prompt doesn't appear in the agent init flow, but worth keeping in mind if future init steps add subscription filtering.
RickWinter
left a comment
There was a problem hiding this comment.
This adds a Tier 2 live golden-path test (init/provision/deploy/invoke/down) for the azure.ai.agents extension, driven through a real PTY with go-expect + vt10x, plus an on-demand/weekly Azure DevOps pipeline that runs it against the TME subscription. The shape is right: it is gated behind AZURE_AI_AGENTS_E2E_LIVE, kept out of the PR pipeline, runs the two deploy modes sequentially, and registers an azd down teardown with a condition: always() safety net so a crash does not leak billable resources. go.sum already carries the three new direct dependencies, so the extension module still builds, and the answer matcher is well-factored and covered on its own by assert_test.go.
One thing is worth tightening before a green run is trusted as signal: the invoke assertion matches a standalone "4" against the entire combined CLI output rather than the agent's actual response, so incidental azd chatter (an elapsed-seconds token, a count, a step number) could turn the check green while the agent is broken. That is the difference between a live test that verifies the agent and one that mostly verifies the pipeline ran, so I would scope it to the response payload.
The rest are non-blocking. The numerous file:line source citations will rot silently the first time those files are edited, and the pipeline hand-writes the extension manifest into config.json where it can drift from the real one. Nothing here runs in PR CI, so the blast radius is bounded to the manual or scheduled live run. None of it blocks merge; I would fix the invoke matcher scope before relying on a green run and treat the other two as follow-ups.
| continue | ||
| } | ||
|
|
||
| if !responseHasExpectedAnswer(out) { |
There was a problem hiding this comment.
responseHasExpectedAnswer here runs against out, which is the full combined stdout+stderr of azd ai agent invoke, not just the agent's response. The matcher is careful to reject gpt-4o-mini, 4.1, and 404, but it still accepts any standalone 4, and azd's own output can emit one (think "4 seconds", "4 warnings", a step counter). If that happens the invoke check goes green even when the model never answered, which quietly defeats the assertion. Can we scope the match to the agent's actual response (parse the invoke output field) instead of the whole stream?
| // Deploy mode is NOT an interactive prompt in the template/--agent-name | ||
| // flow: init auto-resolves it to "container" when a manifest is provided | ||
| // (init_from_code.go:1373), so it must be chosen via the --deploy-mode flag | ||
| // (init.go:1306). r.mode is exactly "container" or "code". |
There was a problem hiding this comment.
nit: these file:line anchors (init.go:1306 here, init_from_code.go:1373 above, and the dozen more in dispatchPrompt) are genuinely helpful context, but they rot the first time someone edits those files and nothing fails when they do. Could we anchor to the function name or the message string/constant instead of a line number, so the reference survives unrelated edits above it?
| "azure.ai.agents": { | ||
| "id": "azure.ai.agents", | ||
| "namespace": "ai.agent", | ||
| "capabilities": ["custom-commands", "lifecycle-events", "mcp-server", "service-target-provider", "metadata"], |
There was a problem hiding this comment.
nit: this inlines the extension's manifest (capabilities, displayName, description, usage) into the pipeline. It drifts from the real manifest the moment that changes, and a stale capabilities list here could either mask a capability-resolution regression or break the run for a reason unrelated to the test. Since we just built the binary one step up, can we let azd install/register it (or read the real manifest) rather than hand-writing config.json?
jongio
left a comment
There was a problem hiding this comment.
Confirmed RickWinter's invoke assertion concern (replied on his thread with details). responseHasExpectedAnswer runs on the full combined stdout+stderr stream, and patterns like "4 seconds" or "step 4:" pass the standalone-four check since space/colon aren't word runes. Consider adding -o raw to the invoke args so the assertion is scoped to the server response payload only.
No other new findings. The event-driven dispatch loop, layered timeout hierarchy (per-mode 60m < go test 125m < ADO step 130m < job 150m), and LIFO cleanup ordering are correctly structured. Prior review feedback has been addressed in the current HEAD.
Fixes #8759
Summary
eng/pipelines/ext-azure-ai-agents-live.yml: an on-demand / weekly Azure DevOps pipeline that runs the Tier 2 live golden path (init → provision → deploy → invoke → down) for theazure.ai.agentsextension against a real Azure (TME) subscription, for bothcodeandcontainerdeploy modes.tier2_live_test.go,console_test.go,assert.go) undercli/azd/extensions/azure.ai.agents/tests/e2e-live/, drivingazd ai agentthrough a pseudo-terminal withgo-expect+vt10x+creack/pty(replacing the original tmux/Python prototype from test: add static E2E tests for azure.ai.agents extension #8692), with CI auth detection for Azure DevOps (TF_BUILD/E2E_USE_AZ_CLI_AUTH).README.mddocumenting how to run (CI + local) and the one-time ADO setup.Together with the PR-gate tests in #8754 (Tier 0 offline + Tier 1 recording/playback), this covers Tier 0/1/2 from the original prototype (#8692).
At a glance
What it tests — the full live golden path
init → provision → deploy → invoke → downofazd ai agent, driving a real CLI over a pseudo-terminal (Gogo-expect+vt10x+creack/pty), for bothcode(zip) andcontainer(ACR build) deploy modes, run sequentially. The interactiveinitphase is an event-driven prompt loop (wait for the survey UI to settle, read the rendered screen, dispatch on the verbatim prompt strings) rather than a fixed script, so it tolerates the live model/region/project branches. Theinvokestep asserts a live model answers "what is 2+2?" with a standalone4/four.How to trigger —
trigger: none/pr: none, so it never runs on PR push. Run it via: a PR comment/azp run ext-azure-ai-agents-live(requires repo write), the weekly Mon 07:00 UTC schedule, or a manual ADO queue (pickdeployModes=both/code/container).Identity / subscription / what it does — authenticates as the
azure-sdk-testsADO service connection (Workload Identity Federation / OIDC) into the shared TME test subscription, regioneastus2. It creates a fresh Foundry project +gpt-4o-minideployment (+ an ACR in container mode) and the agent service, exercises them, then tears them down.azlogs in via OIDC and azd reuses that session (auth.useAzCliAuth+keepAzSessionActive: true, so the WIF token survives the multi-minute run).Leftover resources? — No, by design. Each mode registers a
t.Cleanupthat runsazd down --force --purge, and a teardown failure fails the run (never green while leaking). Acondition: always()AzureCLI@2step is a second safety net: it force-purges any project dirs left under/tmp/e2e-tests/tier2-*/after a crash or timeout. (Crash-path cleanup is best-effort, so the TME subscription is worth an occasional spot-check.)Runs on Azure DevOps, not GitHub Actions — the YAML lives in this GitHub repo (
eng/pipelines/…) but is an ADO pipeline (extends: 1es-redirect,task: AzureCLI@2,$(Build.*),##vso[...]), executed by theazure-sdkADO project. GitHub CI for this PR stays green/inert; the live run only happens in ADO.Post-merge admin action — needs an Azure DevOps admin on the
azure-sdkproject (Azure SDK EngSys) — not a GitHub admin — to: (a) register the pipeline asext-azure-ai-agents-live(the exact name/azp runuses) against this repo + YAML path; (b) confirm theazure-sdk-testsservice connection maps to the TME subscription with RBAC to create Foundry projects and deploy models (Contributor + Azure AI Developer + Cognitive Services Contributor, or equivalent); (c) kick the first/azp runto validate thekeepAzSessionActiveauth path. Merging the PR itself only needs a GitHub maintainer approval. TheinitGitHub token is already wired via the ambientazuresdk-github-patorg secret, so no extra secret setup is needed.Testing
/azp run(the test is gated behindAZURE_AI_AGENTS_E2E_LIVE=1and never runs in PR CI).gofmt,go vet,golangci-lint(ubuntu + windows), and ago test -ccompile all clean on thelinuxbuild tag; the no-tagassert.goanswer-matching logic is covered byassert_test.go(runs on the host);README.mdpassescspell.t.Context()as the run-context parent sogo test -timeoutcancellation propagates; LIFOt.Cleanupordering so teardown runs before the copiedAZD_CONFIG_DIRis removed; a destructive-RemoveAllguardrail on the test dir; and a robust standalone-token check for the invoke result (accepts4orfour, ignores incidental4s such asgpt-4o-mini/4.1/404).runTimeout(60 min, acontextdeadline) > the sum of the per-phase budgets (init/provision/deploy/invoke ≈ 32 min) so a slow-but-healthy run is never preempted mid-phase; both modes fit undergo test -timeout 125m, and the ADOAzureCLI@2step (130 min) and job (150 min) caps sit strictly above that, so a run can never be hard-killed mid-azd down(which would leak live resources).