
fix(ecs): retain ECR images on resource deletion #6378

Open
ekaya97 wants to merge 1 commit into anomalyco:dev from ekaya97:fix/retain-ecr-images-on-delete

@ekaya97
Contributor

ekaya97 commented Feb 2, 2026

Problem

The docker-build provider deletes images from ECR when the Image resource is destroyed. This causes ECS services to fail with CannotPullContainerError when:

  1. A rolling deployment replaces the Image resource
  2. The old ECR image is deleted before tasks fully roll over
  3. Fargate Spot interrupts a task, and ECS cannot restart it because the image no longer exists

Fixes #6377

Solution

Add retainOnDelete: true to the Image resource to prevent automatic image deletion. This ensures ECS task definitions can always pull their referenced image digests.

Users should configure ECR lifecycle policies to clean up old untagged images periodically.
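
For the cleanup mentioned above, here is a minimal sketch of a lifecycle policy document, assuming untagged images should expire a fixed number of days after they were pushed. The helper name `untaggedExpiryPolicy` is hypothetical and not part of this PR; the JSON shape follows ECR's lifecycle policy format.

```typescript
// Hypothetical helper (not part of this PR): builds an ECR lifecycle policy
// document that expires untagged images a given number of days after push.
interface LifecycleRule {
  rulePriority: number;
  description: string;
  selection: {
    tagStatus: "untagged";
    countType: "sinceImagePushed";
    countUnit: "days";
    countNumber: number;
  };
  action: { type: "expire" };
}

function untaggedExpiryPolicy(days: number): string {
  if (!Number.isInteger(days) || days < 1) {
    throw new Error("days must be a positive integer");
  }
  const rule: LifecycleRule = {
    rulePriority: 1,
    description: `Expire untagged images older than ${days} days`,
    selection: {
      tagStatus: "untagged",
      countType: "sinceImagePushed",
      countUnit: "days",
      countNumber: days,
    },
    action: { type: "expire" },
  };
  // ECR expects the policy as a JSON string with a top-level "rules" array.
  return JSON.stringify({ rules: [rule] }, null, 2);
}
```

The returned string could be used as the policy text of an ECR lifecycle policy resource. The retention window is a judgment call: a running task may still reference an old untagged digest, so the window should probably be longer than any task definition revision is expected to stay in service.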

Related

@vimtor
Collaborator

vimtor commented Feb 2, 2026

hey @ekaya97

i'll try to reproduce this later today

maybe it's related to your expire image config you mentioned here? #6296

@ekaya97
Contributor Author

ekaya97 commented Feb 2, 2026

@vimtor

no, there are no lifecycle policies defined. this was a change from pulumi v3/v4.

Edit: this actually prepares the repo config mentioned in #6296 - by disabling the delete operation by pulumi, we can safely apply lifecycle policies without side effects.

@vimtor vimtor self-assigned this Feb 2, 2026
@vimtor
Collaborator

vimtor commented Feb 2, 2026

so you think this has been happening for a long time then? @ekaya97

@ekaya97
Contributor Author

ekaya97 commented Feb 2, 2026

honestly, in my other project i didn't have this issue, this is new. i am not sure (and can't look up) what sst version that project was using.
the fact that touching the dockerfile and forcing rebuild fixes the issue pretty much confirms that @pulumi/docker-build runs the delete operation automatically.

Edit:
pulumi/pulumi#15982

That's the issue. Pulumi runs Create and Delete on the exact same digest, which is why the image is deleted when we update a service without touching the Dockerfile.
The auto-scaling/Spot failure may just be a timing issue on my part.
I suppose the PR will not actually solve this, because retainOnDelete doesn't apply in the replacement scenario.
replaceOnChanges or deleteBeforeReplace may solve this:
https://www.pulumi.com/docs/reference/pkg/nodejs/pulumi/pulumi/interfaces/CustomResourceOptions.html#replaceonchanges

Otherwise, it would make sense to completely decouple the Image component from Service parent.

Edit2:
retainOnDelete does apply: it simply doesn't run the delete operation against the provider. Digest collisions are also not expected, since ECR is idempotent.
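
The sequence described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not Pulumi's engine or the docker-build provider: the `Registry` class and `replaceImage` function are hypothetical, the registry is modeled as a set of digests, and a replacement is modeled as "create the new resource, then delete the old one".

```typescript
// Toy model (hypothetical, not Pulumi code): a registry keyed by image digest.
class Registry {
  private images = new Set<string>();

  // Pushing an existing digest is a no-op, which models idempotent pushes.
  push(digest: string): void {
    this.images.add(digest);
  }

  delete(digest: string): void {
    this.images.delete(digest);
  }

  has(digest: string): boolean {
    return this.images.has(digest);
  }
}

// Models a replacement: create the new image first, then delete the old one.
// With retainOnDelete, the delete call against the registry is skipped.
function replaceImage(
  registry: Registry,
  oldDigest: string,
  newDigest: string,
  retainOnDelete: boolean,
): void {
  registry.push(newDigest);
  if (!retainOnDelete) {
    registry.delete(oldDigest);
  }
}

// Dockerfile untouched, so old and new digests are identical (made-up value).
const digest = "sha256:example";

const withoutRetain = new Registry();
withoutRetain.push(digest);
replaceImage(withoutRetain, digest, digest, false);
console.log(withoutRetain.has(digest)); // false: the delete removed the only copy

const withRetain = new Registry();
withRetain.push(digest);
replaceImage(withRetain, digest, digest, true);
console.log(withRetain.has(digest)); // true: the image survives the replacement
```

In this model the failure only appears when the old and new digests are the same, which matches the observation that touching the Dockerfile and forcing a rebuild made the problem go away.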

@mkilp
Contributor

mkilp commented Feb 2, 2026

+1 for this change. I also just updated my PR with the lifecycle policy example and helper function. IMO both are important changes for $$$ and reliability.

@ekaya97 ekaya97 force-pushed the fix/retain-ecr-images-on-delete branch from 7808d8a to fca6347 Compare February 5, 2026 12:05
@vimtor
Collaborator

vimtor commented Feb 5, 2026

did you run bun install @ekaya97?

bun.lock should be updated

@mkilp
Contributor

mkilp commented Feb 5, 2026

@vimtor wouldn't this be a good reason to maybe check out #6292? That was the GitHub Actions PR that should automate test builds, which could be important once we start adding/updating dependencies.

The docker-build provider v0.0.8 had a bug causing unnecessary
delete/replace cycles on Image resources, which deleted ECR images
that running ECS tasks still referenced by digest.

Upgrading to v0.0.14 fixes the root cause
(pulumi/pulumi-docker-build#606) instead of working around it
with retainOnDelete.

Fixes anomalyco#6377

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ekaya97 ekaya97 force-pushed the fix/retain-ecr-images-on-delete branch from 50efa0f to 7068dea Compare February 5, 2026 12:36
@vimtor
Collaborator

vimtor commented Feb 5, 2026

@mkilp yep, i have planned to check that one with the core team and provide better contribution guidelines

Collaborator

vimtor left a comment


worked like a charm

proof of it working

tried deploying the following examples:

  • aws-task
  • aws-nextjs-container
  • aws-python-container
  • aws-hono-container

[5 screenshots attached]

thank you @ekaya97 @mkilp @jamesgibbons92


Development

Successfully merging this pull request may close these issues.

ECR Container is removed after deployment - production fault

3 participants