Harden Azure provisioning recovery by davidfowl · Pull Request #15697 · microsoft/aspire

davidfowl · 2026-03-30T02:11:16Z

Description

This PR introduces AzureProvisioningController, a serialized control loop that coordinates all run-mode Azure provisioning operations. It replaces the inline provisioning logic that previously lived in AzureProvisioner with a channel-based queue that serializes startup provisioning, dashboard commands, CLI commands, and background drift detection through a single processing loop.

Controller architecture

The controller uses a Channel<QueuedOperation> with a single reader. Every operation — provision, reprovision, reset, change-location, change-context, delete, drift-check — is modeled as a typed intent record that gets enqueued and processed one at a time. This eliminates races between concurrent dashboard commands, CLI commands, and the periodic drift monitor.
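
As a rough illustration of the single-reader pattern described above, here is a minimal Python sketch that uses asyncio.Queue as a stand-in for Channel<QueuedOperation>. It is not the PR's C# code: the Controller class, the field names, and the sleep that simulates an ARM call are all assumptions made for the example.

```python
import asyncio
import contextlib
from dataclasses import dataclass


@dataclass
class QueuedOperation:
    """Hypothetical intent record: what to do, for which resource, and a result slot."""
    kind: str              # e.g. "reprovision", "change-location", "drift-check"
    resource: str
    done: asyncio.Future   # completed by the processing loop


class Controller:
    """Illustrative only; construct it inside a running event loop."""

    def __init__(self) -> None:
        self._queue = asyncio.Queue()
        self.log = []

    async def enqueue(self, kind: str, resource: str) -> str:
        # Callers (dashboard, CLI, drift timer) never do the work themselves;
        # they queue an intent and wait for the single reader to finish it.
        done = asyncio.get_running_loop().create_future()
        await self._queue.put(QueuedOperation(kind, resource, done))
        return await done

    async def run(self) -> None:
        # Single reader: operations are handled strictly one at a time.
        while True:
            op = await self._queue.get()
            self.log.append(f"start {op.kind} {op.resource}")
            await asyncio.sleep(0.01)  # stand-in for the real Azure work
            self.log.append(f"end {op.kind} {op.resource}")
            op.done.set_result(f"{op.kind}:{op.resource}")


async def demo() -> list:
    controller = Controller()
    reader = asyncio.create_task(controller.run())
    # Three "concurrent" callers; the loop still processes them one by one.
    await asyncio.gather(
        controller.enqueue("reprovision", "storage"),
        controller.enqueue("change-location", "cache"),
        controller.enqueue("drift-check", "*"),
    )
    reader.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await reader
    return controller.log
```

Running asyncio.run(demo()) yields a log in which every "start" entry is immediately followed by its matching "end" entry, which is the property the serialized queue is meant to provide. Error handling is omitted to keep the sketch short.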

Within a provisioning pass, individual resources fan out concurrently but are ordered by dependency. Each resource gets a per-resource ProvisioningTaskCompletionSource that downstream resources await before starting their own deployment. The TCS is completed through exactly two paths (CompleteProvisioning / FailProvisioning), so dependents unblock as soon as their prerequisites finish rather than waiting for the entire batch.
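
The dependency-ordered fan-out can be sketched the same way. In this Python approximation, one asyncio future per resource plays the role of the per-resource ProvisioningTaskCompletionSource, and set_result / set_exception stand in for the CompleteProvisioning / FailProvisioning paths. The provision_all function, the deps shape, and the choice to fail dependents when a prerequisite fails are assumptions of the sketch; the PR description does not say how the real controller treats dependents of a failed resource.

```python
import asyncio


async def provision_all(deps: dict, failing: frozenset = frozenset()) -> list:
    """deps maps each resource name to the names it depends on (hypothetical shape).

    Returns (name, status) pairs in the order the resources finished.
    A dependency cycle would deadlock; cycle detection is omitted.
    """
    loop = asyncio.get_running_loop()
    # One completion future per resource.
    completion = {name: loop.create_future() for name in deps}
    finished = []

    async def provision(name: str) -> None:
        try:
            # Wait only for this resource's own prerequisites, not the whole batch.
            for dep in deps[name]:
                await completion[dep]
            await asyncio.sleep(0.01)  # stand-in for the actual deployment
            if name in failing:
                raise RuntimeError(f"deployment of {name} failed")
        except Exception as exc:
            finished.append((name, "failed"))
            completion[name].set_exception(exc)  # the "fail" path
        else:
            finished.append((name, "succeeded"))
            completion[name].set_result(None)    # the "complete" path

    # All resources start concurrently; ordering comes from awaiting the futures.
    await asyncio.gather(*(provision(name) for name in deps))
    # Observe every future so failures nobody awaited are not reported as unretrieved.
    await asyncio.gather(*completion.values(), return_exceptions=True)
    return finished
```

For example, with deps = {"app": ["db", "cache"], "db": ["vnet"], "cache": [], "vnet": []}, the resources vnet and cache deploy in parallel, db starts as soon as vnet finishes, and app starts once both db and cache are done.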

What the provisioning stack can do now

Resource commands (per-resource):

Reprovision — clears cached deployment state for a resource and its children/role-assignments, then redeploys
Change location — prompts for a new Azure region, deletes the existing ARM resource if it conflicts, sets a per-resource location override, and reprovisions
Forget state — clears cached deployment state without reprovisioning

Environment commands (all resources):

Reset provisioning state — wipes all cached deployment state and resets the environment to NotStarted
Change Azure context — re-prompts for subscription/tenant/resource-group/location, then reprovisions all resources with the new context
Reprovision all — clears and redeploys all Azure resources while preserving location overrides
Delete Azure resources — deletes the resource group and resets state

Background drift detection:

Periodic timer probes ARM to verify each running resource still exists
Marks missing resources as "Missing in Azure" and the environment as "Drifted"
Non-overlapping — queues at most one check through the same serialized channel
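
One way to get the "at most one queued check" behavior listed above is a pending flag that the timer sets and the processing loop clears. The Python sketch below assumes that design; DriftMonitor, mark_missing, and the "Running" state string are hypothetical names, and the PR may implement the guard differently. Only the "Missing in Azure" label comes from the description.

```python
import asyncio


class DriftMonitor:
    """Illustrative guard that keeps at most one drift check outstanding."""

    def __init__(self, queue: asyncio.Queue) -> None:
        self._queue = queue
        self._pending = False  # True while a check is queued or running

    def tick(self) -> bool:
        """Called on each timer tick; returns True if a check was enqueued."""
        if self._pending:
            return False
        self._pending = True
        self._queue.put_nowait("drift-check")
        return True

    def completed(self) -> None:
        """Called by the processing loop once the check has finished."""
        self._pending = False


def mark_missing(states: dict, existing_in_azure: set) -> bool:
    """Flag running resources that the probe could not find; True means drift."""
    drifted = False
    for name, state in states.items():
        if state == "Running" and name not in existing_in_azure:
            states[name] = "Missing in Azure"
            drifted = True
    return drifted


async def demo():
    queue = asyncio.Queue()
    monitor = DriftMonitor(queue)
    # The timer fires three times before the loop gets to the first check.
    ticks = [monitor.tick(), monitor.tick(), monitor.tick()]
    queued = queue.qsize()
    queue.get_nowait()  # the processing loop dequeues the single check
    states = {"storage": "Running", "cache": "Running"}
    drifted = mark_missing(states, existing_in_azure={"storage"})
    monitor.completed()
    ticks.append(monitor.tick())  # the next tick is accepted again
    return ticks, queued, states, drifted
```

Because the check travels through the same queue as every other operation, it cannot run while a reprovision or delete is in progress.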

Azure resource metadata:

Both fresh and cached-state resources now expose: azure.subscription.id, azure.resource.group, azure.tenant.id, azure.tenant.domain, azure.location, and resource.source (full ARM deployment id)
Failed resources stamp the predicted deployment id before the ARM call, so agents and tools can still query Azure even when provisioning fails

Location overrides:

Per-resource overrides are persisted in deployment state and survive resets/reprovisioning
When changing location, the controller deletes the existing Azure resource first to avoid ARM InvalidResourceLocation conflicts
Stale overrides are cleared when the effective location changes

Other changes

BicepProvisioner — hardened checksum reuse validation, unified Azure identity metadata across fresh/cached paths, predicted deployment-id stamping for failed resources
RunModeProvisioningContextProvider — refactored Azure context acquisition and interactive prompting
AzureResourcePreparer — wires per-resource commands into the app model
Only registers AzureProvisioningController in run mode (fixes DI failures in publish/test scenarios)

Test coverage

Controller regression tests covering: reprovision, reset, change-location, change-context, delete, drift detection, dependency ordering, command state management, location override preservation
Provisioner regression tests covering: checksum validation, cached-state identity properties, stale location overrides, failed-resource metadata stamping

github-actions · 2026-03-30T02:11:31Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15697

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15697"

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The controller depends on IAzureProvisioningOptionsManager, which is only registered in run mode. Moving the controller registration inside the run-mode block fixes the DI resolution failure in publish/test scenarios.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…r fallback

AzureProvisioner is resolved in all modes as an eventing subscriber but depends on AzureProvisioningController. Register the controller and a no-op IAzureProvisioningOptionsManager unconditionally so DI succeeds in publish/test mode. In run mode, RunModeProvisioningContextProvider overrides the no-op via AddSingleton (registered before TryAdd).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-03-30T05:33:34Z

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

Tests / Hosting-1 / Hosting-1 (windows-latest) - Job-level runner or infrastructure failure matched the transient allowlist.

davidfowl · 2026-03-30T07:41:00Z

@copilot fix the failing test

Copilot · 2026-03-30T08:34:38Z

I investigated all Azure test classes locally and all tests pass (BicepProvisionerTests, EnvironmentResourceExtensionsTests, ResourcePreparerTests, ProvisioningContextProviderTests, DeployerTests, BicepResourceTests — 178+ tests total). I was unable to access the CI logs to identify the specific failing test due to DNS restrictions in this environment. The previous two commits (95026c9, 7bbaf1f) addressed DI resolution failures for AzureProvisioningController in publish/test mode — could you point to the specific failing test or share the CI run URL so I can investigate further?

Warning

Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:

azcliprod.blob.core.windows.net (dns block)
- Triggering command: /usr/bin/../../opt/az/bin/python3 -Im azure.cli account get-access-token --output json --resource REDACTED
https://api.github.com/graphql (http block)
- Triggering command: /usr/bin/gh gh pr list --repo microsoft/aspire --head davidfowl/azure-provisioning-refactor --json number,title
https://api.github.com/repos/microsoft/aspire/actions/runs (http block)
- Triggering command: /usr/bin/gh gh run list --repo microsoft/aspire --branch davidfowl/azure-provisioning-refactor --limit 5 --json databaseId,status,conclusion,name
- Triggering command: /usr/bin/gh gh run list --repo microsoft/aspire --branch davidfowl/azure-provisioning-refactor --limit 3
https://api.github.com/repos/microsoft/aspire/commits/7bbaf1f11e709efb1696d30fc76e2795e57bd97a/check-runs (http block)
- Triggering command: /usr/bin/gh gh api /repos/microsoft/aspire/commits/7bbaf1f11e709efb1696d30fc76e2795e57bd97a/check-runs?per_page=100
https://api.github.com/repos/microsoft/aspire/pulls (http block)
- Triggering command: /usr/bin/gh gh api repos/microsoft/aspire/pulls?head=microsoft:davidfowl/azure-provisioning-refactor
- Triggering command: /usr/bin/gh gh api repos/microsoft/aspire/pulls?head=microsoft:davidfowl/azure-provisioning-refactor&state=open
https://api.github.com/repos/microsoft/aspire/statuses/7bbaf1f11e709efb1696d30fc76e2795e57bd97a (http block)
- Triggering command: /usr/bin/gh gh api /repos/microsoft/aspire/statuses/7bbaf1f11e709efb1696d30fc76e2795e57bd97a

If you need me to access, download, or install something from one of these locations, you can either:

- Configure Actions setup steps to set up my environment, which run before the firewall is enabled
- Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Remove explicit References.Add for role assignment resources; the dependency is already discovered through bicep template parameters. The extra reference caused duplicate parameter keys in the manifest.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-03-30T17:42:22Z

🎬 CLI E2E Test Recordings — 52 recordings uploaded (commit 73db241)

View recordings

Test	Recording
AddPackageInteractiveWhileAppHostRunningDetached	▶️ View Recording
AddPackageWhileAppHostRunningDetached	▶️ View Recording
AgentCommands_AllHelpOutputs_AreCorrect	▶️ View Recording
AgentInitCommand_DefaultSelection_InstallsSkillOnly	▶️ View Recording
AgentInitCommand_MigratesDeprecatedConfig	▶️ View Recording
AspireAddPackageVersionToDirectoryPackagesProps	▶️ View Recording
AspireUpdateRemovesAppHostPackageVersionFromDirectoryPackagesProps	▶️ View Recording
Banner_DisplayedOnFirstRun	▶️ View Recording
Banner_DisplayedWithExplicitFlag	▶️ View Recording
Banner_NotDisplayedWithNoLogoFlag	▶️ View Recording
CertificatesClean_RemovesCertificates	▶️ View Recording
CertificatesTrust_WithNoCert_CreatesAndTrustsCertificate	▶️ View Recording
CertificatesTrust_WithUntrustedCert_TrustsCertificate	▶️ View Recording
ConfigSetGet_CreatesNestedJsonFormat	▶️ View Recording
CreateAndRunAspireStarterProject	▶️ View Recording
CreateAndRunAspireStarterProjectWithBundle	▶️ View Recording
CreateAndRunEmptyAppHostProject	▶️ View Recording
CreateAndRunJavaEmptyAppHostProject	▶️ View Recording
CreateAndRunJsReactProject	▶️ View Recording
CreateAndRunPythonReactProject	▶️ View Recording
CreateAndRunTypeScriptEmptyAppHostProject	▶️ View Recording
CreateAndRunTypeScriptStarterProject	▶️ View Recording
CreateJavaAppHostWithViteApp	▶️ View Recording
CreateStartAndStopAspireProject	▶️ View Recording
CreateTypeScriptAppHostWithViteApp	▶️ View Recording
DescribeCommandResolvesReplicaNames	▶️ View Recording
DescribeCommandShowsRunningResources	▶️ View Recording
DetachFormatJsonProducesValidJson	▶️ View Recording
DoctorCommand_DetectsDeprecatedAgentConfig	▶️ View Recording
DoctorCommand_WithSslCertDir_ShowsTrusted	▶️ View Recording
DoctorCommand_WithoutSslCertDir_ShowsPartiallyTrusted	▶️ View Recording
GlobalMigration_HandlesCommentsAndTrailingCommas	▶️ View Recording
GlobalMigration_HandlesMalformedLegacyJson	▶️ View Recording
GlobalMigration_PreservesAllValueTypes	▶️ View Recording
GlobalMigration_SkipsWhenNewConfigExists	▶️ View Recording
GlobalSettings_MigratedFromLegacyFormat	▶️ View Recording
InvalidAppHostPathWithComments_IsHealedOnRun	▶️ View Recording
LogsCommandShowsResourceLogs	▶️ View Recording
PsCommandListsRunningAppHost	▶️ View Recording
PsFormatJsonOutputsOnlyJsonToStdout	▶️ View Recording
PublishWithDockerComposeServiceCallbackSucceeds	▶️ View Recording
RestoreGeneratesSdkFiles	▶️ View Recording
RunWithMissingAwaitShowsHelpfulError	▶️ View Recording
SecretCrudOnDotNetAppHost	▶️ View Recording
SecretCrudOnTypeScriptAppHost	▶️ View Recording
StagingChannel_ConfigureAndVerifySettings_ThenSwitchChannels	▶️ View Recording
StopAllAppHostsFromAppHostDirectory	▶️ View Recording
StopAllAppHostsFromUnrelatedDirectory	▶️ View Recording
StopNonInteractiveMultipleAppHostsShowsError	▶️ View Recording
StopNonInteractiveSingleAppHost	▶️ View Recording
StopWithNoRunningAppHostExitsSuccessfully	▶️ View Recording
TypeScriptAppHostWithProjectReferenceIntegration	▶️ View Recording

📹 Recordings uploaded automatically from CI run #23757560248

davidfowl · 2026-03-31T06:31:33Z

/deployment-test

github-actions · 2026-03-31T06:32:04Z

🚀 Deployment tests starting on PR #15697...

This will deploy to real Azure infrastructure. Results will be posted here when complete.

View workflow run

github-actions · 2026-03-31T06:55:49Z

❌ Deployment E2E Tests failed — 21 passed, 8 failed, 0 cancelled

View test results and recordings

View workflow run

Test	Result	Recording
Deployment.EndToEnd-VnetSqlServerConnectivityDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-VnetKeyVaultConnectivityDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-VnetSqlServerInfraDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AcaCompactNamingDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-VnetKeyVaultInfraDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-VnetStorageBlobConnectivityDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureAppConfigDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AcaStarterDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureStorageDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureServiceBusDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureEventHubsDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AksStarterDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AcaDeploymentErrorOutputTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AksStarterWithRedisDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureContainerRegistryDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AcaExistingRegistryDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureKeyVaultDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-VnetStorageBlobInfraDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AzureLogAnalyticsDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-AuthenticationTests	✅ Passed
Deployment.EndToEnd-AppServiceReactDeploymentTests	✅ Passed	▶️ View Recording
Deployment.EndToEnd-TypeScriptVnetSqlServerInfraDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-AcaManagedRedisDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-TypeScriptExpressDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-AcaCustomRegistryDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-AcrPurgeTaskDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-PythonFastApiDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-AcaCompactNamingUpgradeDeploymentTests	❌ Failed	▶️ View Recording
Deployment.EndToEnd-AppServicePythonDeploymentTests	❌ Failed	▶️ View Recording

karolz-ms

Provided some comments--hope this helps.

The hardest part of writing a Kubernetes-like controller is dealing with both model changes and real-world changes. Each can drift independently, and each might be inconsistent with the request you have just dequeued (e.g. you might receive a startup request for a resource that is already running). But I think the AzureProvisioningController is fortunately much simpler: the model is static, and the primary goal of the controller is to serialize asynchronous operations. I have no major concerns about this approach.

karolz-ms · 2026-04-02T19:30:44Z