feat: unify azure yaml, agent & resource definitions, $ref includes, and IaC-less init#8818
feat: unify azure yaml, agent & resource definitions, $ref includes, and IaC-less init#8818huimiu wants to merge 24 commits into
Conversation
* feat(agents): add microsoft.foundry azure.yaml schema Schema-only scaffolding for unifying Microsoft Foundry agent config in azure.yaml (design spec PR #8590, section 2.3). - Add the host: microsoft.foundry conditional to schemas/v1.0 and schemas/alpha azure.yaml.json, composing the extension schema at the service level via allOf and turning off project/runtime/docker/image/config. - Add microsoft.foundry to the host examples list. - Publish the Foundry extension schemas under cli/azd/extensions/azure.ai.agents/schemas/: microsoft.foundry.json plus per-resource files (Agent, Skill, Routine, Connection, Toolbox, Deployment, FileRef), ported from the PM preview repo with $id rewritten to the azure-dev path and relative $refs preserved. - microsoft.foundry.json uses additionalProperties: true at the project level (deliberate deviation from the preview's false) so future Foundry resource types do not break the schema, per the brief and design spec section 2.3. Authoring-only: no service-target wiring, provider registration, or alpha-feature gating (those are later PRs). * fix(agents): relax PromptAgent to accept skill without instructions PromptAgent now requires name + kind plus at least one of instructions or skill (anyOf), instead of always requiring instructions. A prompt agent backed by a skill (which supplies the instructions) no longer fails schema validation. This fixes the complex sample's summarizer-agent validation failure. * fix(agents): enforce Foundry schema dependencies Require project when hosted agents define docker or runtime settings, and enforce routine trigger-specific required fields for schedule, webhook, and event triggers. * fix(agents): align Foundry schema constraints Use conditional schema constraints for hosted-agent project requirements and routine trigger-specific fields. * feat: target foundry azure.yaml schema at integration branch with validatable samples * fix: normalize foundry azure.yaml schema URLs to short raw.githubusercontent form
…undry config (#8627) * feat: resolve $ref file includes with overlay overrides in Foundry agents config * fix: modernize $ref cycle check with slices and trim whitespace in include paths
* feat(agents): add --image flag to agent init command Add --image flag to `azd ai agent init` to allow users to specify a pre-built container image directly during init. When provided: - Writes the image URL to the `image` field in agent.yaml - Skips ACR connection prompts (user manages their own registry) - Validates the image URL contains registry/image format - Returns error if combined with --deploy-mode code This enables non-interactive init workflows for VNet scenarios where users bring their own private ACR with pre-built images. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(agents): short-circuit template/language selection when --image is set When --image is provided without --manifest, synthesize a minimal hosted container manifest and route through the manifest flow, skipping the init-mode/template/language prompts and code scaffolding (a pre-built image has no source to scaffold). Requires --agent-name; validates the image early before any project/template init. The image is written to agent.yaml and ACR is skipped via the existing --image handling. * fix(agents): skip ACR connection prompt for --image in interactive mode The interactive ACR connection discovery/prompt (configureAcrConnection via configureFoundryProjectEnv) was gated on a.isCodeDeploy, which is false for a pre-built --image (container deploy). Pass a.skipACR() to the four selectFoundryProject call sites so an image-based hosted agent skips ACR setup in interactive mode too, matching the AZD_AGENT_SKIP_ACR env behavior. The hosted-agent region filter now also applies to --image, which is correct since it runs as a hosted agent. * fix(agents): skip startup-command prompt for pre-built image agents A pre-built container image runs its own entrypoint, so no startup command applies. Skip resolveStartupCommandForInit when the agent template references a pre-built image (agentUsesPreBuiltImage), covering both --image (synthesized manifest) and a -m manifest that already specifies an image. Without this, the container-deploy path prompted "Enter the command to start your agent" because the image src dir has no main.py to auto-detect. * fix(agents): persist AZD_AGENT_SKIP_ACR in the no-model-resources path configureModelChoice only wrote AZD_AGENT_SKIP_ACR in the deferred-headless and main model-config paths. A completing no-model-resources flow (e.g. a pre-built --image agent selected interactively) returned without setting it, so provision would still create an ACR despite --image. Set it via skipACR() before returning, matching the other two call sites. * fix(agents): make --image deployments use the pre-built image end-to-end Real init/provision/deploy validation showed that --image init correctly wrote image: and AZD_AGENT_SKIP_ACR=true, but deploy still defaulted to the Dockerfile build path in --no-prompt mode. Treat AZD_AGENT_SKIP_ACR=true as an explicit BYO-image signal for hosted container agents so Package returns a remote pre-built image artifact and Publish is skipped. The same E2E also found the synthesized manifest used responses protocol version v1, while invoke requires 1.0.0. Generate 1.0.0 for --image manifests. * fix(agents): tighten --image reference validation Replace the slash-only validation with a stricter fully-qualified image reference regex that requires an explicit registry host, repository path, and optional tag or sha256 digest. Add coverage for digest, localhost/port registry, and malformed refs such as URL schemes, namespace-only refs, missing repository, short digest, and uppercase repository names. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat(agents): bicep-less init for azd ai agent init `azd ai agent init` now produces a project without an on-disk `infra/` directory and stamps `infra.provider: microsoft.foundry` on azure.yaml so provisioning is routed to this extension's provider. Init UX changes: - After the starter scaffold, the `infra/` tree is removed. - `azure.yaml` gets `infra.provider: microsoft.foundry` written and `infra.path: ./infra` removed. - ACR and Application Insights connection prompts are skipped during init; the extension's provisioning provider owns those resources. - The "no infra/ directory" warning for existing projects is suppressed when the provider is already declared. Core changes: - Relax the `infra.provider` JSON Schema (v1.0 and alpha) from a fixed enum to `pattern: "^[a-z0-9.]+$"` + `examples`, keeping typo catching while allowing extension-registered providers. New code: - internal/project/provisioning_provider.go: `FoundryProviderName` constant, single source of truth. - internal/cmd/init.go: `writeFoundryProvider`, `hasFoundryProviderDeclared` helpers. - internal/cmd/init_foundry_resources_helpers.go: `bicepless` parameter on `configureFoundryProjectEnv` and `selectFoundryProject` that short-circuits connection discovery. Tests: - TestHasFoundryProviderDeclared covers the warning-suppression predicate. - TestConfigureFoundryProjectEnv_BicepLessShortCircuits verifies the short-circuit fires before any Foundry data-plane call (uses a nil credential to make a regression crash loudly). * feat(agents): microsoft.foundry provisioning provider Adds an in-memory Bicep synthesizer and a provisioning provider that implements the full azdext provisioning gRPC surface for projects declaring `infra.provider: microsoft.foundry`. The provider generates, compiles, and deploys an ARM template on demand without writing Bicep to disk. Synthesizer (internal/synthesis): - Embeds the templates/main.bicep + modules/acr.bicep and the precompiled main.arm.json via `go embed`. - `Synthesize(Input)` returns the parameterized ARM JSON plus a manifest describing the synthesized resources. - `AcceptedHosts` input controls which agent host kinds the synthesizer will materialize (currently `azure.ai.agent`). Provider (internal/project): - `FoundryProvisioningProvider` implements all 9 azdext provisioning methods (Initialize, State, Deploy, Destroy, Preview, Parameters, Outputs, Resources, GetDeployment). - `ensureResourceGroup` creates the RG on demand. - Synthesizes from agent.yaml + host shapes, threads outputs back through the env, surfaces typed errors via exterrors. Service-target integration: - `AgentServiceTargetProvider.Initialize` is split into a cheap stub and lazy `ensureDeployContext` / `ensureEnv` calls so registration no longer requires a deploy context. - All 5 deploy-time entry points (Package, Publish, Deploy, Endpoints, GetTargetResource) gate on the lazy helpers. Registration: - `extension.yaml` declares the `provisioning-provider` capability with `microsoft.foundry`. - `internal/cmd/listen.go` registers `WithProvisioningProvider(FoundryProviderName, ...)`. Errors: - `internal/exterrors/codes.go` adds provisioning op names and error codes used by the provider. * feat(agents): add `--infra` eject flag to `azd ai agent init` Implements the bicepless-foundry spec's eject command: `azd ai agent init --infra` writes the synthesized Bicep templates from azure.yaml to ./infra/ on disk. Behavior follows RFC #8065 and the spec (spec/bicepless-foundry/spec.md §Eject Command): | Trigger | Behavior | | --------------------------------------------- | ------------------------------------------------------------------------------------------- | | empty dir + --infra | Run normal init flow, then write ./infra/ | | existing foundry project, no ./infra/, --infra | Standalone eject: write ./infra/; no init prompts, no agent-code change, azure.yaml unmutated | | ./infra/ already exists + --infra | Refuse with CodeInfraEjectExists, suggest delete + retry | | not an azd agent project + --infra | Refuse with CodeInfraEjectNoFoundryService | New code: - internal/cmd/init_infra.go: `ejectInfra` walks `synthesis.TemplatesFS()` to ./infra/, writes `main.parameters.json` from the synthesizer's parameter map, and prints the spec's success block. Atomic cleanup of partial writes on error. - internal/cmd/init.go: `--infra` flag plus top-of-RunE short-circuit for the standalone-eject case and a post-init hook for the empty-dir case. The standalone branch validates that no init-driving inputs (-m, --src, positional arg) were also passed; honoring them would silently drop the user's argument. The post-init chain skips silently when init returned without producing a foundry service so the user never sees "nothing to eject" after init succeeded. Provider lifecycle hardening (reviewer findings, all critical/major): - `FoundryProvisioningProvider.Initialize` is now cheap: tenant lookup and credential construction moved into a lazy `ensureCredential` helper called on-demand by `deploymentsClient` / `ensureResourceGroup`. azd-core may call Initialize on providers it never deploys with (env refresh, multi-provider projects); making it cheap avoids needless RPCs and lets metadata-only calls (Parameters, PlannedOutputs) succeed without auth. - `Destroy` no longer silently leaks resources. With `Force=true` it deletes the entire resource group (the previous behavior only removed the deployment record). Without `--force` it returns a structured `CodeDestroyRequiresForce` naming the RG and pointing at `azd down --force`. - `armOutputsToProto` and `armInputsToProto` JSON-encode non-string values via a shared `encodeParamValue` helper. Prior behavior collapsed arrays/objects via Go's default %v formatter (`["a","b"]` -> `[a b]`) which is unparseable downstream. - `Parameters` and `armParameters` are nil-safe on `synthResult` so a programming error elsewhere returns a structured Internal error instead of panicking. - `Preview` returns `Compatibility` (feature not implemented in this version) instead of `Validation`; the user supplied no invalid input. - `findFoundryService` and its eject equivalent return `Dependency` instead of `Validation` for missing-service errors; a missing service is a missing dependency, not malformed input. Multiple-services error split into its own `CodeInfraEjectMultipleFoundryServices` so telemetry can differentiate "none" from "too many". Tests (all passing under `go test ./... -short`): - `init_infra_test.go`: 10 test cases covering refusals (azure.yaml missing, ./infra/ exists, ./infra/ is a file, no foundry service, multiple foundry services, conflicting args with -m/--src/positional) and happy paths (file shape, parameters shape, no-docker omits ACR param). - `foundry_provisioning_provider_test.go`: `TestEncodeParamValue` (10 sub-cases for the JSON-encode helper), `TestArmOutputsToProto_JSONEncodesNonStrings`, `TestArmInputsToProto_JSONEncodesNonStrings`, `TestParameters_NilSafeOnMissingSynthResult`, `TestArmParameters_NilSafeOnMissingSynthResult`, `TestDestroy_RefusesWithoutForce`, `TestFindFoundryService_DependencyCategory`, `TestPreview_NotImplemented` updated to assert `Compatibility` category. End-to-end verified via `azd x build` + manual smoke tests against the installed extension: standalone eject succeeds, three conflicting- args combinations refuse cleanly with the structured error, refuse- when-./infra/-exists and refuse-when-multiple-services produce the expected output. Deliberate spec deviations (documented elsewhere, deferred): - Spec example lists `infra/modules/foundry-project.bicep`; our `main.bicep` is monolithic so the eject writes only the modules that actually exist (`main.bicep`, `modules/acr.bicep`, `abbreviations.json`, `main.parameters.json`). - Spec uses `--force` to overwrite an existing ./infra/; we follow the spec's later guidance to ask the user to delete and re-run. - Provider's on-disk reuse path (`Deploy` reads ./infra/main.bicep when present) is not implemented; the provider always uses the embedded ARM JSON. `--infra` is therefore inspection-only today; edits to the ejected `main.bicep` are not honored on the next `azd provision`. Tracked as separate work. - Telemetry fields `init.infra_flag` and `provision.synthesis_source` not yet emitted. * feat(agents): on-disk Bicep path + Preview via ARM what-if Adds support for projects that have an on-disk `./infra/` directory (typically the output of `azd ai agent init --infra`), and implements the provisioning Preview operation via ARM what-if for both inline and on-disk paths. On-disk path: - internal/project/ondisk_template.go detects `./infra/main.bicep`, compiles it via the bicep CLI, and feeds the resulting ARM JSON through the same deploy + outputs pipeline as the inline path. - `Parameters()` returns host-derived values regardless of source. Preview: - internal/project/preview_helpers.go renders an ARM what-if change summary (Create / Modify / Delete / NoChange / Ignore). - `Preview()` calls `ensureResourceGroup` first so what-if has a scope to query against. Robustness: - preprovision tolerates a missing agent.yaml on the inline-agents path (the synthesizer can still run from host shapes alone). - Deployment output names are canonicalized case-insensitively so ARM output lookups survive casing drift in the template. * chore(agents): polish: lint, drop starter clone, RG fix, comment trim Cross-cutting cleanup pass after the provider feature stabilized. - Satisfy golangci-lint and cspell on the bicepless feature. - Drop the `Azure-Samples/azd-ai-starter-basic` clone from init. `azd ai agent init` now invokes `azd init -t <emptyStagedDir> <targetDir>` to scaffold the project shape directly. Net -474 LoC. - Ensure the resource group exists before ARM what-if runs in Preview. - Trim verbose design-commentary from the bicepless code; comments now describe what code does and why, not the historical rationale. * fix(agents): purge soft-deleted Cognitive Services accounts on `azd down --force --purge` Foundry's CognitiveServices (AIServices) accounts soft-delete on RG removal and reserve the name for ~48h. The provider previously ignored options.GetPurge(), so the next `azd provision` failed with InvalidTemplateDeployment / FlagMustBeSetForRestore. Destroy now mirrors azd-core's built-in bicep provider purge flow: enumerate live accounts in the RG before BeginDelete (capturing name + location), delete the RG (soft-deleting the accounts), then purge each via DeletedAccountsClient.BeginPurge. - Hard-fails on purge / enumeration errors (silent skip would just reproduce the same bug on the next provision). - Skips the purge path when the RG is already gone at Destroy time (idempotent re-run). Documented in the Destroy docstring; users with a leftover from a prior incomplete cleanup can purge it manually via `az cognitiveservices account purge`. - Without --purge, the account remains soft-deleted (auto-expires per Azure retention) - unchanged behavior from prior commits. New OpCognitiveAccountList / OpCognitiveAccountPurge op-names give telemetry a way to distinguish the enumeration vs the purge step. Extracts collectPurgeableAccounts as a pure helper for unit testing; the SDK pager + poller calls stay inline in Destroy (consistent with the rest of the provider's untestable Azure SDK call sites). * fix(agents): address PR review feedback on bicepless feature - foundry_provisioning_provider.go: thread the caller's context into envValues so the Environment().GetValues gRPC call honors cancellation/timeouts instead of using context.Background(). - init_infra.go: correct the ejectInfra doc comment; it claimed infra.provider "stays azure.ai.agents" but the provider is microsoft.foundry. Now states the declared provider is left unchanged. - init.go: the missing-infra warning pointed users at running `azd ai agent init` in an empty directory, which no longer produces an infra/ directory now that init is bicep-less. Point to `azd ai agent init --infra` instead. * feat(provisioning): use subscription-scoped template for preview parity Switch to subscription-scoped ARM deployment so `azd provision --preview` runs what-if without pre-creating the resource group. The RG now appears in the what-if output as a Create operation, matching built-in bicep provider behavior. * feat(provisioning): render extension preview changes and stream progress Add a structured changes list to the provisioning preview proto and map it into the core preview renderer so `azd provision --preview` shows a colored resource diff for extension providers (previously only a string summary was sent and dropped). Route Deploy/Preview/Destroy progress to the console spinner via an injected console instead of debug-only logs, finishing the existing TODOs. * fix(agents): emit preview changes, clean up model deployments on down, fix ACR connection Preview now returns structured what-if changes so the core renderer shows a colored resource diff. Destroy deletes model deployments before the resource group so the Cognitive Services account delete no longer rolls back with CannotDeleteAccountWithDeployments. The ACR project connection uses the project identity client id plus the 2025-04-01-preview api-version, matching a working Foundry hosted-agent connection (fixes empty-credential and MissingApiVersionParameter failures). Includes a temporary go.mod replace pointing at the in-tree azd core; the core proto change must land and be published before this merges.
) * feat(agents): support Terraform as an IaC option for azd ai agent init Add `--infra=terraform` to `azd ai agent init` so it ejects a Terraform module (azurerm + azapi) to ./infra/ instead of Bicep, provisioning the same resources and emitting the same output contract, then stamps `infra.provider: terraform` so azd-core's built-in Terraform provider handles provisioning. Bicep paths are unchanged. - `--infra` flag changed from bool to string with NoOptDefVal=bicep: bare `--infra` ejects Bicep, `--infra=terraform` ejects Terraform, `--infra=bicep` is explicit, absent stays bicepless default. - Embed templates/terraform/*.tf; generate main.tfvars.json at eject time. - Terraform path stamps infra.provider: terraform in azure.yaml (the one place eject mutates it); Bicep path leaves azure.yaml untouched. - foundry_project_name defaults to env-name derivation when AZURE_AI_PROJECT_NAME is unset, mirroring the Bicep provider's sanitizeFoundryName. Refs #8705 * feat(agents): omit acr.tf and ACR outputs when no agent uses docker On the Terraform eject path, only emit ACR when at least one agent declares a docker: block: - acr.tf is written only in the docker case (it owns container_registry_name); main.tf has no ACR references otherwise. - outputs.tf is rendered from outputs.tf.tmpl (text/template) and includes the three AZURE_CONTAINER_REGISTRY_* / ACR connection outputs only when acr.tf is present, omitting them entirely otherwise (Terraform resolves resource references statically, so they cannot reference absent resources). - include_acr is no longer a module variable or a main.tfvars.json key -- the presence of acr.tf is the include-ACR decision. - main.tf also derives resource_group_name as rg-{env} and the project name from environment_name when those env vars are unset, so the bicepless-default -> Terraform flow provisions without extra setup. - Trimmed the ejected .tf comments to drop azd-internal references. Accepts asymmetry with the Bicep path (which always emits acr.bicep). Refs #8705 * docs(agents): fix misleading tfvars comment in writeTfvarsFile The comment claimed an 'ordered slice of lines' but the code uses a map[string]any. json.MarshalIndent sorts map keys, so the generated main.tfvars.json is deterministic; the comment now says so. Addresses a Copilot review note on PR #8756. The two other Copilot notes (replace() regex on main.tf:33, coalesce() empty-string fallback on main.tf:12) are false positives: Terraform's replace() treats a /.../-wrapped substring as a regex, and coalesce() skips empty strings, both verified against the HashiCorp docs and a live azd provision --preview.
) * feat(agents): private networking for host: microsoft.foundry Add a declarative network: block to the Foundry service in azure.yaml and teach the bicep-less synthesizer to provision a VNet-bound (network-secured) account from it. Additive: an absent block yields today's public account. - schema: network: surface (mode byo|managed, byo vnet/subnets tri-state, managed isolationMode, dns create-or-reference) on microsoft.foundry.json - synthesizer: decode network:, resolve ${VAR}, validate (mode coherence, vnet ARM id, subnet tri-state/CIDR, DNS rg/sub), emit network params + NetworkMode for telemetry - templates: new modules/network.bicep, subnet.bicep, private-endpoint-dns.bicep; resources.bicep/main.bicep guard the network path on enableNetworkIsolation (publicNetworkAccess Disabled, networkAcls Deny, agent networkInjections, account private endpoint + 3 AI DNS zones); main.arm.json regenerated - provider: pass azd env for ${VAR} resolution, emit provision.network_mode, warn that network: is ignored when endpoint: (brownfield) is set - docs/tests: synthesizer network tests, eject module assertions, extension README network section, telemetry-data.md provision.network_mode * fix(agents): preserve network ${VAR} refs through eject The existing on-disk provision flow resolves ${VAR} in main.parameters.json from the azd environment at provision time. Eject must therefore keep ${VAR} references verbatim instead of resolving them eagerly from the process env and freezing a literal into the ejected file. - synthesis.Input gains PreserveVarRefs; when set, byo.vnet.id and dns.subscription pass through verbatim and the format checks that cannot run on an unexpanded placeholder are skipped (concrete-but-malformed still fails) - eject (init --infra) sets PreserveVarRefs so the ejected main.parameters.json stays environment-portable; the provision path still resolves and validates - tests: synthesizer preserve-mode (pass-through + concrete validation) and an eject e2e asserting ${AZURE_VNET_ID}/${AZURE_DNS_SUBSCRIPTION_ID} survive * test(agents): add Foundry private-networking E2E harness Bash E2E harness validating host: microsoft.foundry private networking, designed to minimize Azure resource-operation time: - ONE real network account is provisioned (create+own matrix cell) with a BYO --image agent, then deploy + invoke prove the agent works under the VNet. - Scenario 1 (bicep-less) and the other 3 matrix cells (subnet create/reference x DNS own/reference) are verified with 'azd provision --preview' (ARM what-if), which creates nothing. - Scenario 2 (eject) is verified against the live account: eject -> what-if reports no changes (idempotent), proving the on-disk template + provision-time ${VAR} resolution reproduces the in-memory topology; a manual infra/ edit then surfaces as the only delta. Guards the ${VAR}-preservation fix end-to-end. - A shared BYO VNet (+ reference subnets / external DNS zones) is created once and reused across cells. Files: run-network-e2e.sh (phases 0-6 orchestrator), assert-resources.sh (live az topology checks: publicNetworkAccess Disabled, account PE groupIds, 3 AI DNS zones, agent-subnet delegation), lib.sh (logging/assert/azure.yaml mutation), README.md (cost rationale, prerequisites, cleanup). Westus account region per requirement; AcrPull granted to the project MI on the ABAC registry. * test(agents): run network E2E phases 0-4 without --image Decouple the private-networking E2E from the BYO-image init UX (PR 8689) so it runs against the current branch today: - Replace 'azd ai agent init --image' with a hand-authored azure.yaml fixture (foundry service + network: block + agent image:), created via 'azd env new'. image: yields includeAcr=false, matching BYO image, so no ACR at provision. Verified the fixture synthesizes: mode=byo, enableNetworkIsolation=true, includeAcr=false, ${VAR} resolved. - Gate phase 5 (deploy + invoke) behind RUN_DEPLOY=true: the headless BYO-image deploy needs the AZD_AGENT_SKIP_ACR short-circuit from PR 8689, otherwise deploy defaults to build and fails. Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision, eject idempotency) validate all the networking code without it. - Fix the ABAC registry role: grant 'Container Registry Repository Reader' (ABAC-aware) instead of AcrPull; move the grant into the gated deploy phase. - Drop the --image preflight; README updated (scenario table, prerequisites, RUN_DEPLOY usage, role). * fix(agents): correct network bicep preflight + network-mode output casing Two product bugs surfaced by live E2E provisioning (ARM what-if does not catch either; both require a real deployment): 1. networkInjections preflight failure. The account and the network module deploy in the same template, so subnetArmId: network!.outputs.agentSubnetId compiled to an unresolved reference() at the CognitiveServices RP preflight, which then failed to convert networkInjections to its typed contract (InvalidResourceProperties). Build the subnet ARM id as a deterministic string from the concrete vnetId param instead, and add an explicit dependsOn so the subnet still exists before injection. Recompiled main.arm.json. 2. AZURE_FOUNDRY_NETWORK_MODE missing from canonicalOutputNames. ARM mangles output-name casing (AZURE_..._MODE -> azurE_..._MODE); without the canonical remapping the env key was stored mis-cased and azd env get-value AZURE_FOUNDRY_NETWORK_MODE returned empty. Added it to the restore list and a regression case to TestArmOutputsToProto_RepairsMangledKeyCase. Validated end-to-end: real westus network-isolated Foundry account provisions green with all topology assertions passing (publicNetworkAccess Disabled, networkAcls Deny, private endpoint, agent-subnet delegation, 3 AI DNS zones, network mode byo), across the full subnet create/reference x DNS own/reference matrix, plus eject idempotency (what-if reports no changes). * test(agents): make network E2E harness run green end-to-end Fixes found while running the harness against live Azure (phases 0-4): - Hand-authored project must include an agent.yaml (kind: hosted + image:) alongside azure.yaml; the foundry provider requires an agent definition file. - setup_project now sets AZURE_RESOURCE_GROUP (the subscription-scoped template creates the RG but the provider needs the name) and AZD_AGENT_SKIP_ACR=true (BYO-image deploy signal). - Phase 0 refreshes the dev extension from current source (build -> pack -> publish -> install) so the run tests local code, registering the provisioning-provider capability + microsoft.foundry provider. Gated by SKIP_EXT_REFRESH. - What-if matrix gates on a successful ARM what-if (exit 0) rather than grepping a summary-only preview; this still validates reference-mode subnet/zone existence and delegation against the real VNet. - Idempotent private-dns zone creation (reruns no longer fail on existing zones). - Add MAX_PHASE to stop early while iterating. - ACR grant uses the ABAC-aware Container Registry Repository Reader role. - Fix set -u unbound-variable crash in the phase-4 assert message. - .gitignore the transient per-run log directories. Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision + topology assertions, eject idempotency) pass green. Phase 5 (deploy + invoke) stays gated behind RUN_DEPLOY=true and needs a reachable BYO agent image. * test(agents): build phase-5 image in ABAC-enabled ACR Update the Foundry private-network E2E harness so phase 5 can build the ~/agents/echo-dual image itself instead of requiring a prebuilt external image. - Add BUILD_IMAGE=true, ECHO_DUAL_DIR, ACR_NAME/ACR_RG, IMAGE_REPO/IMAGE_TAG. - Create the target ACR with --role-assignment-mode rbac-abac and reject reuse of non-ABAC registries. - Grant the caller Container Registry Repository Writer before the ACR Task push. Resolve the caller object id from the ARM token oid claim to avoid Microsoft Graph / Conditional Access failures. - Build with the required `az acr build --source-acr-auth-id [caller]` form. - Keep the project MI grant on the ABAC-aware Container Registry Repository Reader role for image pull. - Add TARGET_RG support so investigation runs can keep VNet, DNS, ACR, and the real Foundry env in a single RG. Live validation: the harness created an ABAC ACR, granted caller writer, built and pushed ~/agents/echo-dual with --source-acr-auth-id [caller], provisioned a private-networked Foundry account, and granted the project MI Repository Reader. The subsequent deploy failed from this public runner with the expected private endpoint 403, which is documented. * test(agents): grant private ACR pull to Foundry project identity Live phase-5 validation showed hosted-agent image pull uses the Foundry project managed identity, not the parent account identity. Update the network E2E harness to resolve AZURE_AI_PROJECT_ID via ARM and grant the project MI the ABAC-aware Container Registry Repository Reader role on the BYO ACR, falling back to the account MI only for older API shapes. Also persist AZURE_TENANT_ID in the azd env so postdeploy hooks do not fail on VM/managed-identity runners after deploy succeeds. * docs(agents): add BYO image VNet cheatsheet Add a concise README cheatsheet for initializing, provisioning, deploying, and invoking a hosted Foundry agent with a BYO container image under VNet private networking. Include ACR requirements for ABAC and private-only registries. * docs(agents): move private networking guide to docs Keep the extension README concise by moving the detailed Foundry private networking schema, requirements, and BYO-image VNet cheatsheet into `docs/private-networking.md`, with a short README pointer. * fix(agents): surface managed-network isolation output Live managed-network provisioning showed that the resources module emitted AZURE_FOUNDRY_MANAGED_ISOLATION_MODE but the subscription-scoped main template never forwarded it, so azd env only received AZURE_FOUNDRY_NETWORK_MODE. Forward the output from main.bicep, add it to the provider canonical output-name restore list, and cover ARM casing repair with a regression test. Also document the managed VNet provisioning scenario in the private-networking guide. Live validation: provisioned network.mode=managed in westus and verified the account had publicNetworkAccess Disabled, networkAcls Deny, networkInjections with useMicrosoftManagedNetwork=true, AZURE_FOUNDRY_NETWORK_MODE=managed, and AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound. * fix(agents): keep managed-network data plane reachable Live managed-network deploy validation showed that managed mode configures the hosted-agent runtime to use a Microsoft-managed network but does not create a customer private endpoint for the Foundry data plane. Disabling public access in that mode makes azd deploy/invoke fail with `403 Public access is disabled`. Keep public data-plane access enabled for managed mode while preserving BYO mode behavior (public access disabled + private endpoint). Update the private networking guide with managed deploy/invoke guidance. Live validation: provisioned managed mode, converted the test ACR to ABAC, built the echo-dual image with `az acr build --source-acr-auth-id [caller]`, granted the Foundry project MI `Container Registry Repository Reader`, deployed successfully, and invoked the hosted agent successfully. * feat(agents): secure-by-default private networking for Foundry services Realign the azure.yaml `network:` surface to the natural Azure resource shape and make a network-bound Foundry account private in every mode. Reverses the prior managed-mode regression that flipped the account's publicNetworkAccess back to Enabled. Service sample 18 confirms managed mode supports a private data plane (customer private endpoint + the Microsoft-managed egress network), so declaring `network:` now always disables public data-plane access. Config: flat `network:` block with two orthogonal axes, no `mode` enum. - peSubnet (required) -> account private endpoint; omitting it while `network:` is declared is an error, never a silent public fallback. - agentSubnet (optional) -> present injects the agent into a customer subnet (BYO egress); absent uses the Microsoft-managed network (managed egress), where isolationMode becomes valid. Synthesizer/templates: - derive egress from agentSubnet presence (useManagedEgress); replace the networkMode param with a useManagedEgress bool. - disablePublicDataPlaneAccess = enableNetworkIsolation (always private). - add a managedNetworks/default child resource carrying isolationMode for managed egress. - validate peSubnet-required, isolationMode-managed-only, and single-VNet. Docs/tests/e2e: - rewrite docs/private-networking.md (host: azure.ai.agent, the value the provision provider actually accepts on this branch). - add synthesizer unit tests + a compiled-ARM regression guard. - add a live E2E harness (8-cell what-if matrix, BYO + managed-iso real provisions, eject idempotency) with an automatic jumpbox SOCKS tunnel so deploy/invoke can reach the private data plane; assert real account network-injection state rather than azd's echoed output. * docs(agents): use 'azd ai agent init --image' and assert real state in private-networking cheatsheet Managed-egress cheatsheet now scaffolds the agent via 'azd ai agent init --image' (writes agent.yaml) instead of assuming a hand-authored manifest, matching the BYO cheatsheet. Replace the env-output 'Expected outputs' block (azd echoing its own AZURE_FOUNDRY_* classification) with real resource-state validation: account publicNetworkAccess=Disabled and the managedNetworks/default isolationMode, with the invoke echo response as the end-to-end proof. * docs(agents): order init before azure.yaml in managed-egress cheatsheet 'azd ai agent init --image' scaffolds azure.yaml/agent.yaml, so it must come before the network: block the reader adds to the generated service. Matches the actual timing and the BYO cheatsheet ordering. * docs(agents): simplify private-networking cheatsheets - Drop the redundant 'azd env set AZD_AGENT_SKIP_ACR true': passing --image to 'azd ai agent init' already derives skipACR() and writes the env var into the environment init creates/selects. Reuse that env (no separate 'azd env new') so init and provision share one environment. - BYO cheatsheet: remove the export-variable indirection; inline placeholders directly into 'azd env set', matching the managed-egress cheatsheet. - Managed-egress cheatsheet: remove the weak 'validate via az show' / env-output block; a successful invoke over the private endpoint is the end-to-end proof. * test(agents): fix jumpbox fallback for DNS-reference deploy In peered fallback mode, populate JB_HOST and wait for SSH before writing /etc/hosts. DNS-reference accounts can also assign different PE IPs per privatelink zone, so pin each account FQDN from the private DNS A record instead of mapping every FQDN to the first PE NIC IP. * test(agents): exclude private-networking e2e harness from PR Remove the ad-hoc live validation harness from the committed diff while keeping it available locally for investigation runs. * test(agents): ignore bicep generator metadata in ARM drift check Compare normalized ARM JSON instead of raw bytes so CI/dev Bicep version metadata does not cause false stale-template failures while still catching semantic template drift. * feat(agents): validate distinct subnet names; drop debug log * feat(agents): refuse Terraform eject when network: is declared The Terraform module has no VNet/private-endpoint/DNS/networkInjections resources, so ejecting it for a network: service would silently drop the config and provision a public account. Fail fast with a clear error instead, preserving the network: secure-by-default contract. Bicep eject remains the supported path for private networking. Also fixes a merge artifact (single-arg ejectInfra call in a test). * docs+test(agents): document Bicep eject customization; verify full network param set - docs: add 'Advanced: eject the Bicep and customize it' section to the private-networking cheatsheet (eject -> manual edit -> provision -> deploy -> invoke), with two worked Bicep edits and a Terraform-unsupported note. - test: assert the complete network parameter set lands in the ejected main.parameters.json for BYO egress, and extend the managed-egress eject test with the full param contract. * docs(agents): merge networking Requirements/Known limitations into one Limitations section * docs(agents): rename networking quick guides to parallel Scenario 1-3 headings
* feat(agents): move Foundry networking to project service * fix(agents): reject conflicting Foundry network schema * fix(agents): align split Foundry schemas and ACR scan * fix(agents): preserve pre-split Foundry provisioning fallback * docs(agents): split complex Foundry schema example
jongio
left a comment
There was a problem hiding this comment.
Large feature-branch merge bringing Foundry azure.yaml modeling into main. Each constituent feature was separately reviewed on the feature branch.
Observations:
-
foundry_provisioning_provider.gois 1,341 lines. While justified by the complexity of the full provisioning lifecycle, consider whether the on-disk template logic (currently delegated toondisk_template.go) and the ARM deployment orchestration could be further separated if this file grows. -
CI shows 37 failing/pending checks. Worth confirming these aren't regressions from this merge before final approval.
-
The lazy initialization refactor in
service_target_agent.go(ensureDeployContext/ensureEnv) is a solid pattern that avoids unnecessary credential construction on metadata-only calls. ThedeployContextReadyfield interacting with theagentDefinitionPathshort-circuit could be documented with a brief invariant comment explaining when each is set. -
The unified agent definition shape with deprecation warnings (
WarnLegacyAgentShape) and theLoadAgentDefinitionhelper centralizing resolution across inline, config-nested, and on-disk sources is well-designed for the migration period. -
Test coverage is thorough across the new code (provisioning provider, synthesizer, init infra, resource services, on-disk templates, preview helpers).
📋 Prioritization NoteThanks for the contribution! The linked issue isn't in the current milestone yet. |
jongio
left a comment
There was a problem hiding this comment.
Incremental review of the 3 commits since my earlier pass (ecc8206...8a5859d):
- cspell.yaml additions (myprivacr, privatelink, cheatsheet, aerr/rerr/perr): fine.
- golangci G204/G304 exclusions: acceptable for this extension since bicep build invocations and agent file reads use paths validated at entry points.
- Go 1.26 modernization: maps.Copy replacing manual for-range loops,
ew("str") replacing strPtr helpers, strings.SplitSeq range usage, import reordering. All correct idiomatic upgrades.
No new issues found in the delta. My prior observations (foundry_provisioning_provider.go size, the deployContextReady invariant documentation, and the collectProjectDeployments/collectConnections brownfield handling) still apply as low-priority improvement suggestions for follow-up.
CI is still running (57 in-progress, 8 queued at time of review). Worth confirming the full suite passes before merging given the 133-file scope.
jongio
left a comment
There was a problem hiding this comment.
Incremental review of 1 commit since my last pass (8a5859d...74f99af):
The test fixture update in helpers_test.go adds kind: hosted and name: my-agent to the resolveAgentProtocol test YAML, aligning fixtures with the AgentDefinition struct shape (yaml.go:155-157). Correct and consistent with how service_target_agent_test.go already models its fixtures.
CI concern: The ext-azure-ai-agents-test workflow is failing on Test_AIAgent_Init_NoPrompt_WithProject. The test expects agent.yaml at pr-gate-test-agent/src/pr-gate-test-agent/agent.yaml but the init command doesn't produce it there. The DNS lookup warning for the AI endpoint is expected in playback mode, but the missing output file suggests the init flow's output directory structure may have shifted as part of this feature branch. Worth verifying whether the test assertion or the init path logic needs updating before merge.
My prior observations (foundry_provisioning_provider.go size, deployContextReady invariant documentation) remain applicable as low-priority follow-up items.
jongio
left a comment
There was a problem hiding this comment.
Incremental review of 1 commit (74f99af...e5f0601).
ext-azure-ai-agents-test still fails on Test_AIAgent_Init_NoPrompt_WithProject, now persistent across two consecutive commits. The test expects agent.yaml at pr-gate-test-agent/src/pr-gate-test-agent/agent.yaml but init doesn't produce it there. The init path resolution or the test assertion needs updating before merge.
| // Whether the project contains a Foundry agent service (host: azure.ai.agent) | ||
| // still using the deprecated config-nested shape (a populated `config:` block) | ||
| // instead of service-level properties. Tracks migration of older Foundry | ||
| // projects off the legacy shape. | ||
| FoundryAgentLegacyConfigKey = AttributeKey{ | ||
| Key: attribute.Key("foundry.agent.legacy_config_shape"), | ||
| Classification: SystemMetadata, | ||
| Purpose: FeatureInsight, | ||
| } | ||
| ) |
There was a problem hiding this comment.
Agreed. This puts Foundry-specific detection logic (checking for azure.ai.agent host + populated Config) directly into the core project.Load() path. It couples the project loader to extension domain concepts. Deferring until #8158 clarifies the extension telemetry story makes sense. The field definition in fields.go is also core-level surface area that should wait.
There was a problem hiding this comment.
Removed the telemetry changes, thanks!
jongio
left a comment
There was a problem hiding this comment.
No new code since my last review (still at e5f0601). The one item I'd want resolved before merge: the core project loader in project.go shouldn't have Foundry-specific detection logic (azure.ai.agent host + Config check) until the extension telemetry story (#8158) is sorted. The fields.go addition is the same concern. The pkg/foundry package location question is non-blocking and can move later with just import path updates.
jongio
left a comment
There was a problem hiding this comment.
Reviewed the 1-commit delta (d3b70fe) since my last pass. The FoundryAgentLegacyConfigKey removal is complete: field definition, detection logic in project.Load(), test subcase, and all three doc references removed consistently. Verified no dangling references remain in the tree.
PR currently shows merge conflicts that need resolution before the requested reviewers can assess the final merged state.
…-yaml-to-main # Conflicts: # cli/azd/extensions/azure.ai.agents/internal/cmd/listen.go # cli/azd/internal/grpcserver/external_provisioning_provider_test.go
jongio
left a comment
There was a problem hiding this comment.
Incremental review of 11 commits since my last pass (d3b70fe...fc327e6).
Delta summary:
- CI workflow refactored to use
ci-build.ps1 -BuildRecordMode, producing separate production and record binaries. Tier 0 tests now run against the production binary; the record binary is swapped in only for Tier 1 playback. Cleaner separation than the previous GOFLAGS approach. - Test reorganization (#8800): tests moved from catch-all coverage files into source-matched files. Mechanical change, low risk.
- AGENTS.md reference updated to match the renamed test file.
- Merge from main brings dependency bumps, CodeQL enablement, and extension improvements that were reviewed via their own PRs.
CI concerns:
- 2 checks failing (
test,azure-dev - ext - azure.ai.toolboxes CrossBuild LinuxARM64), 33 still in progress. Worth confirming these pass before merge, especially given the 128-file scope. - The
Test_AIAgent_Init_NoPrompt_WithProjectfailure I flagged in my earlier review may be resolved by the workflow changes here (the new ci-build.ps1 flow alters how the extension binary is built and installed). If theext-azure-ai-agents-testjob passes on this HEAD, that concern is cleared.
Prior observations (low-priority follow-ups, not blocking):
foundry_provisioning_provider.goat 1,341 linesdeployContextReadyinvariant could use a brief doc commentnetworkEnvMapsilent nil return on errorcollectProjectDeployments/collectConnectionsbrownfield error handling
jongio
left a comment
There was a problem hiding this comment.
Incremental review of 5 commits (fc327e6...f0ef5e9).
The delta defaults dependency_resolution to remote_build to prevent Foundry 400s, defaults the resource group to rg-<env> to match azd convention, emits AZURE_TENANT_ID in provisioning outputs so postdeploy hooks don't fail, writes endpoint: on the project service for brownfield reuse, stops declaring existing deployments in azure.yaml, and derives the project service key from the Foundry project name.
CI: test check is still failing. Test_AIAgent_Init_NoPrompt_Defer and Test_AIAgent_Init_NoPrompt_WithProject can't find agent.yaml at the expected paths. Persistent across multiple commits. The test cassettes or path assertions need updating for the changed init output structure.
| if len(keys) == 0 { | ||
| return "" | ||
| } | ||
| sort.Strings(keys) |
There was a problem hiding this comment.
sort.Strings(keys) can be slices.Sort(keys) per Go 1.26 conventions. Minor, but keeps new code consistent with the modernization work in this PR.
| progress("Using existing Foundry project (endpoint set); skipping provisioning") | ||
| // Best-effort tenant lookup so AZURE_TENANT_ID is still surfaced for the | ||
| // existing-project path (no resources are provisioned here). | ||
| _ = p.ensureCredential(ctx) |
There was a problem hiding this comment.
The error from ensureCredential is discarded silently. If the user's login has expired, AZURE_TENANT_ID won't be populated and downstream steps (e.g. postdeploy hooks) will fail with a confusing "AZURE_TENANT_ID is not set" message instead of a clear auth error.
Consider logging it so the debug trace shows what happened:
| _ = p.ensureCredential(ctx) | |
| if err := p.ensureCredential(ctx); err != nil { | |
| log.Printf("[debug] best-effort tenant lookup for brownfield deploy: %v", err) | |
| } |
jongio
left a comment
There was a problem hiding this comment.
Covering areas my earlier passes didn't reach (synthesizer, init infra, deployment naming).
Issues to address:
- synthesizer.go:182 - raw azure.yaml parsed without resolving
$refincludes first - foundry_provisioning_provider.go:306 -
rg-<env>default used by Destroy without checking azd ownership - foundry_provisioning_provider.go:1104 - deployment name
azd-foundry-<env>can collide across projects in the same subscription - init_infra.go:162 - partial eject failure leaves
./infrain an unretryable state - init_infra.go:37 -
--imageflag silently ignored when--infratargets an existing project
| } | ||
|
|
||
| var root projectFile | ||
| if err := yaml.Unmarshal(in.RawAzureYAML, &root); err != nil { |
There was a problem hiding this comment.
[HIGH] Synthesize unmarshals the raw azure.yaml bytes directly. If a service uses $ref file includes (the new feature in this PR), those refs won't be resolved here, so the decoded service body will contain {"$ref": "some-file.yaml"} objects instead of the actual content. That produces zero-valued deployment params and likely an invalid ARM template.
Run the $ref resolver (from pkg/foundry) over the raw YAML before passing it to Synthesize, or have Synthesize accept pre-resolved service config instead of raw bytes.
| // defaultResourceGroupName returns the resource group azd provisions into when | ||
| // AZURE_RESOURCE_GROUP is unset, matching azd's standard rg-<env> convention. | ||
| func defaultResourceGroupName(envName string) string { | ||
| return "rg-" + envName |
There was a problem hiding this comment.
[HIGH] Destroy uses this default (rg-<env>) when AZURE_RESOURCE_GROUP isn't set. If someone runs azd down --force before ever provisioning (or after env state loss), this targets a resource group the provider never created. With common env names like dev, that could be an unrelated RG.
Consider failing closed when the env doesn't carry a resource group that this provider wrote (e.g. check for an azd-managed tag or a persisted deployment record), rather than falling back to a convention-based name for a destructive operation.
|
|
||
| // deploymentName is stable per azd env so re-runs update one record. | ||
| func (p *FoundryProvisioningProvider) deploymentName() string { | ||
| return deploymentNamePrefix + p.envName |
There was a problem hiding this comment.
[MEDIUM] The deployment name is azd-foundry-<env>. Two different projects in the same subscription that share an env name (e.g. dev) will write the same ARM deployment record. State() would then read the other project's outputs.
Adding a stable project discriminator (short hash of the project path or the foundry project name) to the deployment name would avoid the collision.
| // ejectBicep writes the embedded Bicep tree plus the synthesized | ||
| // main.parameters.json into infraDir and prints the summary. It does not | ||
| // modify azure.yaml; the declared infra.provider is left unchanged. | ||
| func ejectBicep(infraDir string, params map[string]any) error { |
There was a problem hiding this comment.
[MEDIUM] If ejectBicep succeeds (writes templates to ./infra) but a later step like writeParametersFile or writeOutputsFile fails, the partial ./infra directory is left behind. Because ejectInfra refuses to run when ./infra already exists (line 111), the command isn't safely retryable until the user manually deletes the directory.
Wrapping the full eject in a deferred cleanup on error (remove ./infra if any step after MkdirAll fails) would make retries work without manual intervention.
| // standalone-eject branch would silently drop. `--infra` on an existing | ||
| // project runs eject only; honoring a positional path, -m, or --src would | ||
| // falsely imply the input was acted upon. | ||
| func validateStandaloneEjectArgs(args []string, flags *initFlags) error { |
There was a problem hiding this comment.
[MEDIUM] validateStandaloneEjectArgs rejects positional args, --manifest, and --src when --infra targets an existing project, but --image passes through without error. The existing-project fast path returns before image validation runs, so the flag is silently dropped. That's a confusing UX: a user who passes --image expects it to take effect.
Either reject --image here alongside the other init-driving flags, or document that --image is a no-op in standalone eject mode.
Azure Dev CLI Install InstructionsInstall scriptsMacOS/Linux
bash: pwsh: WindowsPowerShell install MSI install Standalone Binary
MSI
Documentationlearn.microsoft.com documentationtitle: Azure Developer CLI reference
|
| @@ -391,6 +391,16 @@ Emitted on `azd provision` / `azd up` to measure adoption and safety of `infra.l | |||
| | `provision.layer.explicit_dependson_count` | measurement | Layers using the explicit `infra.layers[].dependsOn` override | | |||
There was a problem hiding this comment.
Should this be removed along with the other telemetry stuff? Or is this still applicable?
There was a problem hiding this comment.
This doc entry is still applicable. The removed changes (in fields.go / project.go) were Foundry-detection fields that coupled core project.Load() to extension logic, per JeffreyCA's comment about #8158. This provision.network_mode attribute is separate: the extension emits it directly via attribute.String("provision.network_mode", networkMode) in foundry_provisioning_provider.go during provisioning template preparation, so it doesn't depend on any core field definition. Should stay.
jongio
left a comment
There was a problem hiding this comment.
Re-examined the core changes after re-request. No new commits since my last review (still at f0ef5e9).
Responded to @trangevi's question on telemetry-data.md: the provision.network_mode doc entry is independent of the removed fields.go changes and should stay, since the extension emits the attribute directly via OTel in foundry_provisioning_provider.go.
Verified during this pass:
- Proto addition (
ProvisioningDeploymentPreviewChange, field 4) is forward-compatible; existing extensions won't break. NewExternalProvisioningProviderFactorysignature change (addinginput.Console) is wired correctly through IoC inprovisioning_service.go.pkg/foundry/includes.gohas proper cycle detection, depth bounds (32), and path rebasing.slices.Clone(chain)before append prevents parent chain mutation.- Schema change from
enumtopattern+examplesfor theproviderfield correctly allows extension-registered providers.
Previous findings (synthesizer raw-yaml parse order, destroy default RG, deployment name collision, partial eject, silently ignored --image) still apply and should be addressed before merge.
| // toPtr returns a pointer to its arg, for non-addressable expressions where | ||
| // new() can't be used directly. | ||
| // | ||
| //go:fix inline | ||
| func toPtr[T any](v T) *T { return new(v) } | ||
|
|
There was a problem hiding this comment.
This seems inaccurate - we should be able to just use new() inline
There was a problem hiding this comment.
Review was primarily performed with Agent assistance due to the size of the PR. I've reviewed the suggestions manually and believe they make sense, but please also review accordingly.
I've marked as "request changes" primarily due to the schema item, if that is deemed to be fine by the azd core folks and I don't notice before this PR needs merged, my review can be dismissed
|
|
||
| kebabAuth := normalizeAuthType(strings.TrimSpace(cfg.AuthType)) | ||
| key, customKeys := connectionCredentialArgs(kebabAuth, cfg.Credentials, expand) | ||
| body, err := buildConnectionBody( |
There was a problem hiding this comment.
buildConnectionBody only handles api-key, custom-keys, and none, but agent init can emit connections with OAuth2 / AgenticIdentityToken auth. That means a generated azure.yaml can provision fine and then fail at azd deploy. Consider supporting the same auth types the init/schema path can emit, or skipping the deploy-time upsert for auth types provisioned elsewhere.
(flagged by azd-code-reviewer)
| } | ||
| // Source directory is the service's directory. Fall back to the directory of | ||
| // a legacy on-disk agent.yaml when the service path was not resolved. | ||
| srcDir := p.servicePath |
There was a problem hiding this comment.
When an explicit AGENT_DEFINITION_PATH override is set, loadContainerAgentDefinition lets that definition win, but here we prefer p.servicePath and only fall back to the override's directory when servicePath is empty. For a code-deploy agent whose override file lives outside the service path, this zips the wrong source tree and can miss the entry point. Consider preferring filepath.Dir(p.agentDefinitionPath) when the override is set, or tracking the resolved definition source dir separately.
(flagged by azd-code-reviewer)
| // of silently overwriting each other -- AddService overwrites by name. | ||
| // Seed it with the agent service name, which the caller adds before this | ||
| // runs, so a resource colliding with the agent is caught too. | ||
| usedNames := map[string]string{} |
There was a problem hiding this comment.
usedNames is only seeded with the agent service name, not the project's existing services. Since AddService overwrites by name, a resource whose sanitized key collides with an unrelated existing service (e.g. a hand-authored api service) will silently replace it instead of failing fast. Consider seeding usedNames from the current project services up front, and only allowing the intended existing azure.ai.project key to be reused.
(flagged by azd-code-reviewer)
There was a problem hiding this comment.
Confirmed. usedNames at line 68 is seeded only with agentServiceName (line 69-71). It doesn't account for services that already exist in the project's azure.yaml. If a resource name sanitizes to something like api and the user already has a hand-authored api service, AddService will silently overwrite it.
The fix should seed usedNames from the project's existing services before emitting resource services. Something like fetching current services via azdClient.Project().ListServices() and adding each name to the map.
| if err := projectpkg.UpsertAgentEnvVars(svc, envUpdates); err != nil { | ||
| return fmt.Errorf("failed to update agent definition: %w", err) | ||
| } | ||
| if _, err := azdClient.Project().AddService(ctx, &azdext.AddServiceRequest{Service: svc}); err != nil { |
There was a problem hiding this comment.
AddService replaces the whole service from the extension proto shape, which doesn't carry core fields like uses:. Since this PR writes resource dependencies via SetServiceConfigValue("uses"), this round-trip can drop the agent's dependency edges. Consider persisting only the changed env-var properties with a path-based update, or re-reading and reapplying uses: before saving. I've filed an issue for it here, but we'll need to address it in the extension for now.
(flagged by azd-code-reviewer)
There was a problem hiding this comment.
Agreed. Line 183 calls azdClient.Project().AddService() with the service object that went through UpsertAgentEnvVars. The azdext.AddServiceRequest{Service: svc} replaces the entire service definition. If svc came from the extension proto shape (which doesn't carry core azure.yaml fields like uses:), then the SetServiceConfigValue("uses") calls made earlier for dependency edges would be lost on this round-trip.
A targeted update (patching only the changed env vars) instead of a full service replacement would avoid the problem.
| "description": "Optional. Provides additional configuration for Azure AI Agent deployment." | ||
| }, | ||
| "image": false, | ||
| "config": false, |
There was a problem hiding this comment.
Setting config: false here means existing valid v1.0 azure.yaml files that put agent settings under services.<name>.config will now fail validation against the published v1.0 schema — a breaking change for anyone already on v1.0. Could we keep the old config shape accepted in v1.0 (allowing both shapes) and reserve the stricter service-level-only contract for alpha / a new schema version? Flagging this one to discuss with the azd core team before merging.
(flagged by azd-code-reviewer)
There was a problem hiding this comment.
@tg-msft, @JeffreyCA, @vhvb1989, would appreciate your weigh in here
There was a problem hiding this comment.
Worth noting that config: false at line 393 is inside a conditional block that only fires when host equals azure.ai.agent (line 384). Since azure.ai.agent is a new host type introduced in this PR, there shouldn't be existing v1.0 azure.yaml files in the wild using it. The breaking-change risk is limited to anyone who adopted the feature branch schema early.
If that early-adopter group is non-empty, a deprecation path (accept but warn) would be safer than a hard schema rejection.
There was a problem hiding this comment.
I'm leaning towards just having both schemas support both, and after some reasonable time remove the older config
I believe the yaml json schema also supports "deprecated": true so we don't need to outright remove the older shape
There was a problem hiding this comment.
"Since azure.ai.agent is a new host type introduced in this PR, there shouldn't be existing v1.0 azure.yaml files in the wild using it. The breaking-change risk is limited to anyone who adopted the feature branch schema early."
I don't think this is true, unless I'm not understanding what you mean. azure.ai.agent has been the host type we've been using the whole time.
| } | ||
|
|
||
| hasUnsetEnvVar := false | ||
| substituted, err := envsubst.Eval(string(enc), func(varName string) string { |
There was a problem hiding this comment.
This marshals the parameter entry to JSON first, then substitutes raw env values into the encoded string. Env values that are themselves valid azd values — containing ", backslashes (Windows paths), or newlines — will produce invalid JSON after substitution. Consider substituting into the string values before marshaling, or JSON-escaping replacements when injecting them into the encoded parameter.
(flagged by azd-code-reviewer)
There was a problem hiding this comment.
Confirmed. The pattern at line 243-268 is: marshal to JSON, substitute env vars into the JSON string, then unmarshal back. If an env value contains characters that are significant in JSON (double quotes, backslashes, newlines), the substituted string produces invalid JSON.
For example, an env var MY_PATH=C:\Users\foo would inject a raw backslash into the JSON string, and the unmarshal at line 268 would fail or misinterpret the escape.
The fix: JSON-escape each env value before substitution. Something like:
escaped, _ := json.Marshal(v) // produces "C:\\Users\\foo"
return string(escaped[1:len(escaped)-1]) // strip surrounding quotes| } | ||
| properties: { | ||
| adminUserEnabled: false | ||
| publicNetworkAccess: 'Enabled' |
There was a problem hiding this comment.
This ACR is always created with publicNetworkAccess: 'Enabled' and no private endpoint/DNS, even when enableNetworkIsolation disables the Foundry account data plane. For Docker-backed agents in a network-isolated project that leaves a dependency outside the isolation boundary, and it can break pulls in locked-down egress scenarios. Consider passing the isolation context into this module and private-linking the registry (privatelink.azurecr.io), or explicitly blocking unsupported network-isolated Docker scenarios.
(flagged by azd-code-reviewer)
There was a problem hiding this comment.
Confirmed. Line 53 hardcodes publicNetworkAccess: 'Enabled' with no parameter to toggle it. The template also has no private endpoint or private DNS zone resource. When enableNetworkIsolation is set on the Foundry project, this ACR sits outside the isolation boundary.
At minimum, the template should accept an enableNetworkIsolation parameter and conditionally set publicNetworkAccess: 'Disabled' with a private endpoint when isolation is requested.
| }, nil | ||
| } | ||
|
|
||
| return &templateSource{ |
There was a problem hiding this comment.
On the on-disk path, Initialize doesn't load p.armTemplate. If infra/main.bicep / main.bicepparam exists at initialize time but disappears before resolveTemplate, this fallback returns templateModeEmbedded with armTemplate: p.armTemplate — which is nil on the on-disk init path — turning a handled race into a later ARM deploy failure with an empty template. Consider loading the embedded template before falling back here, or returning a validation error.
(flagged by azd-code-reviewer)
There was a problem hiding this comment.
The code at line 587-590 already handles this race explicitly (logging "on-disk template disappeared"), but the fallback at line 607-611 uses p.armTemplate which is never populated on the on-disk init path. Initialize (line 134) skips synthesis entirely when on-disk Bicep is detected, so p.armTemplate stays nil.
The race is unlikely (user deletes infra/main.bicep between init and provision), but when it happens the deployment silently uses a nil template. Returning an error instead of falling through would be safer for this edge case.
jongio
left a comment
There was a problem hiding this comment.
Incremental pass after @trangevi's review. No new commits since my last review; focused on verifying and engaging with the new findings.
Confirmed concerns (with added evidence/suggestions):
resource_services.go:68-usedNamesdoesn't seed from existing project services, so resource names that collide with hand-authored services silently overwrite them.optimize_apply.go:183-AddServicefull-replaces the service, which can drop core fields likeuses:that were set viaSetServiceConfigValueearlier in the flow.ondisk_template.go:252- JSON marshal then env-var substitution then unmarshal. Env values with JSON-significant characters (quotes, backslashes, newlines) break the round-trip. Suggested ajson.Marshalescape approach in the thread.acr.bicep:53-publicNetworkAccessis hardcoded to'Enabled'with no private endpoint, which breaks the isolation boundary whenenableNetworkIsolationis set.foundry_provisioning_provider.go:607- The on-disk-to-embedded fallback usesp.armTemplatewhich is nil on the on-disk init path. Unlikely race, but should return an error instead of silently using a nil template.
Nuanced:
azure.yaml.json:393-config: falseapplies only whenhostisazure.ai.agent(new host type in this PR). Breaking-change risk is limited to early feature-branch adopters.
The $ref resolver gap I flagged earlier (Synthesize doesn't resolve $ref includes) and the usedNames collision issue are the two highest-impact items for correctness. The JSON substitution injection is also a concrete data-corruption vector worth fixing before merge.
Summary
Brings the Foundry azure.yaml feature work from
huimiu/foundry-azure-yamlintomain.This branch is based on commit
ecc820695(the latestMerge branch 'main' into feature/foundry-azure-yaml), which is intentionally before the two test-only commits on the feature branch that repointed schema$id/$ref/$schemaURLs at the feature branch. As a result, all schema URLs here still point atmain(raw.githubusercontent.com/Azure/azure-dev/main/...), which is what we want for a merge intomain. No changes were made on top of the feature work — it's the feature content with the main schema URLs preserved.The branch already merged the current
maintip, so it is up to date withmain(0 behind).What's included
Models Foundry projects, agents, and related resources directly in
azure.yaml, with$reffile includes, secure-by-default networking, and Bicep-less / Terraform init paths.azure.yaml schema & resource modeling
microsoft.foundryazure.yaml schema (feat(agents): add microsoft.foundry azure.yaml schema #8603)$reffile includes$reffile includes with overlay overrides in Foundry config (feat(agents): resolve $ref file includes with overlay overrides in Foundry config #8627)$refresolver + shared$ref-aware azure.yaml edit helper (feat: generalize $ref resolver + shared $ref-aware azure.yaml edit helper #8777)Init & IaC
azd ai agent(feat(agents): bicep-less init for azd ai agent init #8643)azd ai agent(feat(agents): support Terraform as an IaC option for azd ai agent #8756)--imageflag to the agent init command (feat(agents): add --image flag to agent init command #8689)Networking
Notes