Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deletion of Guacamole VM in stopped state #4245

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

marrobi
Copy link
Member

@marrobi marrobi commented Jan 2, 2025

Draft - what think about this approach with a script to ensure child resources are deleted? Need to do other VMs.

Add a configuration option to Guacamole VM templates to skip shutdown and force deletion, resolving the issue where VMs in a stopped state cannot be deleted. Update relevant Terraform and Porter files to implement this change and increment version numbers accordingly.

Fixes #4135

Fixes microsoft#4135

Add `skip_shutdown_and_force_delete` configuration to Guacamole VM templates to fix deletion issue.

* Modify `main.tf` files in `templates/workspace_services/guacamole/user_resources/guacamole-azure-export-reviewvm/terraform`, `guacamole-azure-import-reviewvm/terraform`, `guacamole-azure-linuxvm/terraform`, and `guacamole-azure-windowsvm/terraform` to include `skip_shutdown_and_force_delete = true` under the `virtual_machine` block in the `features` section.
* Update `porter.yaml` files in `templates/workspace_services/guacamole/user_resources/guacamole-azure-export-reviewvm`, `guacamole-azure-import-reviewvm`, `guacamole-azure-windowsvm`, and `guacamole-azure-linuxvm` to increment the version numbers.
* Update `CHANGELOG.md` to include the fix for the deletion issue.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/AzureTRE/issues/4135?shareId=XXXX-XXXX-XXXX-XXXX).
@marrobi
Copy link
Member Author

marrobi commented Jan 2, 2025

@tamirkamara @jonnyry attempt number 2, thoughts on this approach?

@marrobi
Copy link
Member Author

marrobi commented Jan 3, 2025

@dram1964 fancy testing this one?

@dram1964
Copy link

dram1964 commented Jan 3, 2025

Hi @marrobi, I did test the original fix, but didn't realise that this left the os_disk deployed. Happy to test out the new code, but I was wondering whether this is the best approach.
Given that the virtual_machine {skip_shutdown_and_force_delete = true } feature is designed to preserve sub-resources of the VM, the additional script is a 'fix for the fix'.
Would it be better to add an az vm start operation to the 'delete' step, since the 'delete' operation appears to expect the vm to be started?

@marrobi
Copy link
Member Author

marrobi commented Jan 3, 2025

Hi @marrobi, I did test the original fix, but didn't realise that this left the os_disk deployed. Happy to test out the new code, but I was wondering whether this is the best approach. Given that the virtual_machine {skip_shutdown_and_force_delete = true } feature is designed to preserve sub-resources of the VM, the additional script is a 'fix for the fix'. Would it be better to add an az vm start operation to the 'delete' step, since the 'delete' operation appears to expect the vm to be started?

Agree, that's another option, just worry about turning a VM on that has been intentionally stopped. Imagine it has a start-up script that does stuff, or you turn it off as it has some malware. I'd prefer to just get rid of it.

@tamirkamara
Copy link
Collaborator

tamirkamara commented Jan 5, 2025

Wouldn't it be easier to use azure cli to delete the VM if it's not on and then let TF do its thing without any other modification?
We already have the VM resource ID available so just need to query if it exists in the right state and delete.
p.s. there seem to be a PR on AzureRM for this. So probably should add a comment to any fix here that can be removed when will move to version 4.14 (maybe?)

@marrobi
Copy link
Member Author

marrobi commented Jan 6, 2025

Ah, hadn't seen that - making a note to follow to hashicorp/terraform-provider-azurerm#28412

I used this approach as thought it was generic and could be reused for other services where we get lingering resources. Providing they are tagged they will get deleted.

If it's going to be reverted, don't mind, as long as it's fixed :)

@dram1964
Copy link

dram1964 commented Jan 9, 2025

@dram1964 fancy testing this one?

Hi @marrobi - sorry I'm having trouble testing this. I've copied your changes into my repo, but can't build the linuxdsvm image (make user_resource_bundle BUNDLE=guacamole-azure-linuxvm WORKSPACE_SERVICE=guacamole).

The problem seems to be caused by this line in the Dockerfile.tmpl:

COPY --from=scripts --link . ${BUNDLE_DIR}/scripts/

And the error is:

failed to resolve source metadata for docker.io/library/scripts:latest

I have added the porter-build-context.env and created the templates\scripts\delete_azure_resources.sh - but I don't think the scripts env var is actually being set.

I guess this is probably because I'm on an earlier version of the TRE (v0.19.1), with older versions of linuxdsvm (1.0.3)?

@marrobi
Copy link
Member Author

marrobi commented Jan 9, 2025

@dram1964 fancy testing this one?

Hi @marrobi - sorry I'm having trouble testing this. I've copied your changes into my repo, but can't build the linuxdsvm image (make user_resource_bundle BUNDLE=guacamole-azure-linuxvm WORKSPACE_SERVICE=guacamole).

The problem seems to be caused by this line in the Dockerfile.tmpl:

COPY --from=scripts --link . ${BUNDLE_DIR}/scripts/

And the error is:

failed to resolve source metadata for docker.io/library/scripts:latest

I have added the porter-build-context.env and created the templates\scripts\delete_azure_resources.sh - but I don't think the scripts env var is actually being set.

I guess this is probably because I'm on an earlier version of the TRE (v0.19.1), with older versions of linuxdsvm (1.0.3)?

@dram1964 ah, yes, could be the new porter support for build contexts has only recently been added so need to be on the latest version of porter. The latest is in the main branch devcontainer.

@jonnyry
Copy link
Collaborator

jonnyry commented Jan 12, 2025

@marrobi which resources are left lingering after the delete has applied? Is it just the osdisk?

If so delete_os_disk_on_deletion should do the trick according to this comment:

features {
        virtual_machine {
          skip_shutdown_and_force_delete = true
          delete_os_disk_on_deletion = true
        }
      }

@marrobi
Copy link
Member Author

marrobi commented Jan 13, 2025

Hmm, might be the nic too. Think you the one that hit the issue with the initial attempt in the E2E tests?

@jonnyry
Copy link
Collaborator

jonnyry commented Jan 15, 2025

Hmm, might be the nic too. Think you the one that hit the issue with the initial attempt in the E2E tests?

Just had a look through the runs but unfortunately the logs have been purged already 🫤

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to delete a Guacamole User Resource if it is in the 'stopped' state
4 participants