From 5621c3ebd62f4d44c272c7bfb5cac03547c3b3f1 Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Mon, 18 Aug 2025 16:51:10 -0700 Subject: [PATCH 1/8] Add docs from platform operations --- CONTRIBUTING.md | 59 +++++++++++ template-only-docs/troubleshooting.md | 142 ++++++++++++++++++++++++++ 2 files changed, 201 insertions(+) create mode 100644 template-only-docs/troubleshooting.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1aea0589b..f22aef7b2 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,7 +1,66 @@ # Contributing to the infrastructure template +## Getting started + If you are looking to contribute, get started by reading the following docs. - Read about the [infrastructure's module architecture](/docs/infra/module-architecture.md) to learn how the architecture of the infrastructure code is designed and how the modules interact with each other. - Read the [template development workflow](/template-only-docs/template-development-workflow.md) to understand how to develop and test changes to the template because working on the platform templates is unlike working on most other applications. - Read the [infrastructure style guide](/docs/infra/style-guide.md) to understand best practices for Terraform and shell scripts. + +## Isolate PRs to a single infrastructure layer and make each individual PR backwards compatible + +Each PR should only touch one of the infrastructure layers at a time. The infrastructure layers are: + +- Accounts layer (infra/accounts) +- Network layer (infra/networks) +- Build repository layer (infra/app/build-repository) +- Database layer (infra/app/database) +- Service layer (infra/app/service) + +**If you need to make a change that affects multiple layers, break it down into multiple steps so that each step can be done in a backwards compatible manner that affects one layer at a time.** + +For example, if you want to change the name of the ECR image repository, you should break the change down into the following steps involving three PRs. + +1. Create PR #1 that adds a new image repository with a new name (This PR only modifies the build repository layer). +2. After merging the PR #1, manually apply the changes to the platform test repos (platform-test, platform-test-nextjs, platform-test-flask) since changes to the build-repository layer aren’t automatically applied as part of the CD workflow. +3. Create PR #2 that updates the publish-release.sh script to use the new image repository and modifies the task definition to pull from the new image repository (This PR only modifies the service layer). After merging PR #2, deploys should automatically be publishing builds to the new build repository and the service tasks should be using images from the new build repository +4. Create PR #3 that removes the old build repository(This PR only modifies the build repository layer) +When creating a release that includes a breaking change such as this one, include migration instructions on how to apply the changes without incurring downtime. In this example, the migration notes will look something like: + + Step 1: Do a targeted apply to create the new build repository + Step 2: Commit and apply the changes to the service layer + Step 3: Apply the rest of the changes to the build repository layer to remove the old build repository + +## Pay attention to testing and rollout process when reviewing PRs + +When reviewing template PRs, in addition to the usual things you look for, pay particular attention to: + +### Manual testing + +Unlike application development, the automated test suite for infrastructure has much less coverage, so it is more important than usual to review test plans and evidence of successful testing to demonstrate that things work. Ask yourself the following questions: +What evidence would I need to see to be confident that things are working as intended? +In what ways could things be working differently as intended under the hood but still look the same based on the evidence provided? + +### Rollout process + +Sometimes template changes do not propagate cleanly to the platform test repos. See Platform test repo(s) do not have the latest changes from template-infra. + +Also, unlike application changes, infrastructure changes aren’t always automatically applied. Make sure to think about how the changes will be applied before merging and make sure the changes get applied after merge. Double check by making sure the latest deploys (including in platform-test-nextjs and platform-test-flask test repos) completed successfully and that the terraform plans on main show no configuration changes. + +```bash +platform-test$ git pull +platform-test$ make infra-update-app-database APP_NAME=app ENVIRONMENT=dev # should show no configuration changes +platform-test$ make infra-update-app-service APP_NAME=app ENVIRONMENT=dev # should show no configuration changes +``` + +## Make note of breaking changes + +If your PR will introduce a breaking change, then after the PR is approved, but before you merge it into main: + +1. Prefix the commit title with ⚠️. This indicates to the Platform Admins who will make the next release that there is a breaking change included in the release. +2. Add a section in the commit description for “Release notes” and indicate what needs to be included in the release notes on how to handle the breaking change. + +## Troubleshooting + +See the [troubleshooting guide](/template-only-docs/troubleshooting.md) for common issues and how to resolve them. diff --git a/template-only-docs/troubleshooting.md b/template-only-docs/troubleshooting.md new file mode 100644 index 000000000..3d19649e2 --- /dev/null +++ b/template-only-docs/troubleshooting.md @@ -0,0 +1,142 @@ +# Troubleshooting Guide for Template Infra + +## Template CI Infra Checks fails on main + +If the [Template CI Infra Checks (template-only-ci-infra.yml)](https://github.com/navapbc/template-infra/actions/workflows/template-only-ci-infra.yml) workflow fails on the main branch and there isn’t a bug in the code, it may mean that a prior run of the workflow did not properly clean up account resources. + +### Preventing the problem from getting worse + +If you notice Template CI Infra Checks failing on main, tell people to pause on doing anything that would trigger a Template CI Infra Check run, since further runs will just create more issues you have to look into and more things you have to clean up. + +Things that trigger Template CI Infra Checks runs include: + +* Pushes to main branch +* Opening PRs or updating PRs with new commits on template-infra + +### Diagnosing the immediate problem + +Look in the GitHub logs for the Template CI Infra check that failed. The logs are very long and therefore are collapsed into groups. + +Errors that may indicate a problem with cleanup include: + +* “OIDC provider already exists” during the SetUpAccount step +* “IAM role already exists” during the SetUpDevEnvironment step +* “SNS topic already exists” during the SetUpDevEnvironment step + +### Diagnosing the root cause + +If you have good reason to believe this is a one time thing, then you can skip this step and proceed to clean up the AWS account to unblock the Template CI Infra Checks workflow. Otherwise, it is important to find out what caused the test to not properly clean up and fix that first so that you don’t end up repeating the problem. + +Look at the GitHub logs for previous runs of the Template CI Infra Checks workflow that also failed, starting from the one you were initially looking into. + +Look in the following Teardown\* steps for errors: + +* TeardownAccount +* TeardownBuildRepository +* TeardownDevEnvironment + +Causes for errors in these steps may include: + +* Inability to delete non-empty buckets. In order to delete non-empty buckets, you first need to set force\_destroy \= true and prevent\_destroy \= false for the bucket and run a terraform apply before running terraform destroy. +* Bugs in the template\_infra\_test.go file +* Bugs in template-only-bin/destroy-\* scripts + +### Clean up the AWS account + +Login to the [nava-platform AWS account](https://nava-platform.signin.aws.amazon.com/console) and check for the following to clean up: + +* [ECS clusters](https://us-east-1.console.aws.amazon.com/ecs/v2/getStarted?region=us-east-1) for the service +* [Load balancers](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#LoadBalancers:) for the service +* [SNS topics](https://us-east-1.console.aws.amazon.com/sns/v3/home?region=us-east-1#/homepage) for monitoring alerts +* [IAM roles](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/roles) for GitHub actions, the service, and others +* [S3 buckets](https://s3.console.aws.amazon.com/s3/get-started?region=us-east-1) for terraform state file, terraform logs, and load balancer access logs +* [DynamoDB tables](https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1#service) for terraform state locks +* [Identity providers](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/identity_providers) for the GitHub OIDC provider + +Note: The Template CI Infra does not currently spin up any databases. + +Note: Loren has a branch called `lorenyu/clean` with two scripts that you can use: + +* `template-only-bin/clean-account.sh` +* `template-only-bin/destroy-vpc.sh` + +### Verify solution + +Re-run Template CI Infra Checks on main branch + +## Platform test repo(s) do not have the latest changes from template-infra + +If the platform test repo does not have the latest changes from template-infra check the infra template’s [Template Deploy](https://github.com/navapbc/template-infra/actions/workflows/template-only-cd.yml) workflow to see if there was any failure in deploying the template to the platform-test, platform-test-flask, or platform-test-nextjs repos. + +### Failure due to merge conflict with files that have changed on the project + +If the Template Deploy failed during the “git apply” command with a “patch does not apply” error, it probably means that there was a merge conflict and you will have to apply the change manually. This is because template-infra has files that will have changed on the platform test repos, and therefore a patch in those files might not apply cleanly. Files that might cause merge conflicts include: + +* [.github/workflows/cd-app.yml](https://github.com/navapbc/template-infra/blob/main/.github/workflows/cd-app.yml#L7-L14) – push trigger is commented out in template-infra but uncommented in platform-test\* repos +* [.github/workflows/ci-infra-service.yml](https://github.com/navapbc/template-infra/blob/main/.github/workflows/ci-infra-service.yml#L4-L16) – push trigger is commented out in template-infra but uncommented in platform-test\* repos +* [infra/project-config/main.tf](https://github.com/navapbc/template-infra/blob/main/infra/project-config/main.tf) – placeholders in template-infra’s version will be replaced with actual values for the platform-test\* repos +* [infra/app/app-config/main.tf](https://github.com/navapbc/template-infra/blob/main/infra/app/app-config/main.tf#L6) – on platform-test and platform-test-flask (but not platform-test-nextjs), has\_database will be set to true, but it defaults to false in template-infra + +### Failure due to bugs in update-template.sh script + +It’s also possible that there is a failure in the [update-template.sh script](https://github.com/navapbc/template-infra/blob/main/template-only-bin/update-template.sh). If so, make sure to fix the bug and make sure that all changes from the templates have been propagated to the platform test repos so that they stay in sync with the templates and so that future changes can continue to propagate successfully. + +### Deploy template changes to platform test repos + +Sometimes changes to templates need to be manually applied to the platform test repos ([platform-test](https://github.com/navapbc/platform-test), [platform-test-flask](https://github.com/navapbc/platform-test-flask), [platform-test-nextjs](https://github.com/navapbc/platform-test-nextjs)). + +Broadly speaking there are three methods to manually update the platform test repos from the template: + +Step 0\. First, make sure you’re on the main branch of each repo and that you pull the latest changes of the template and of the platform test repo: + +```bash +platform-test$ git checkout main +platform-test@main$ git pull +template-infra$ git checkout main +template-infra@main$ git pull +``` + +#### **Method 1: Try running update-template.sh again and see what fails** + +Run a version of template-infra/template-only-bin/update-template.sh that you have downloaded locally from the platform-test repo’s root directory + +```bash +platform-test@main$ /template-only-bin/update-template.sh +``` + +If it fails due to a merge conflict, it will indicate which file(s) it fails on. The patch will still live in a newly checked out template-infra folder in \`./template-infra/update.patch\`. You can re-apply this patch, excluding the failed files: + +```bash +platform-test@main$ git apply template-infra/update.patch --exclude= +``` + +If that succeeds, you can then manually apply the changes in the excluded file by copying over the file and examining the diff and updating accordingly. + +Once you’re done, note that the .template-version file will be updated the latest commit hash of template-infra’s main branch. + +#### **Method 2: Run install-template.sh again and manually inspect the diff** + +You can run through the installation instructions again: + +```bash +platform-test@main$ curl https://raw.githubusercontent.com/navapbc/template-infra/main/template-only-bin/download-and-install-template.sh | bash -s +``` + +Then manually inspect the diff to make sure you don’t revert things like the uncommented blocks in cd-app.yml or ci-infra-service.yml or the filled in project-config/main.tf file. + +**Deleted files:** This method will not properly handle deleted files or renamed files. You will need to manually remove those. + +Once you’re done, note that the .template-version file will be updated with the latest commit hash of template-infra’s main branch. + +#### **Method 3: Manually copy over the changes** + +This method involves manually viewing the changed/renamed/moved files and manually copying the changes over. + +Once you’re done, update the .template-version file with the latest commit hash of template-infra’s main branch. + +```bash +platform-test@main$ cd template-infra +platform-test@main$ TEMPLATE_VERSION=$(git rev-parse HEAD) +platform-test@main$ cd .. +platform-test@main$ echo "$TEMPLATE_VERSION" > .template-version +``` From df1e90ae4b640da95745a01636e2526871940017 Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Wed, 20 Aug 2025 08:17:17 -0700 Subject: [PATCH 2/8] Move "isolate PRs to one layer" guidance to template dev workflow doc --- CONTRIBUTING.md | 24 ------------------- .../template-development-workflow.md | 22 ++++++++++++++++- 2 files changed, 21 insertions(+), 25 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f22aef7b2..10179f205 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -8,30 +8,6 @@ If you are looking to contribute, get started by reading the following docs. - Read the [template development workflow](/template-only-docs/template-development-workflow.md) to understand how to develop and test changes to the template because working on the platform templates is unlike working on most other applications. - Read the [infrastructure style guide](/docs/infra/style-guide.md) to understand best practices for Terraform and shell scripts. -## Isolate PRs to a single infrastructure layer and make each individual PR backwards compatible - -Each PR should only touch one of the infrastructure layers at a time. The infrastructure layers are: - -- Accounts layer (infra/accounts) -- Network layer (infra/networks) -- Build repository layer (infra/app/build-repository) -- Database layer (infra/app/database) -- Service layer (infra/app/service) - -**If you need to make a change that affects multiple layers, break it down into multiple steps so that each step can be done in a backwards compatible manner that affects one layer at a time.** - -For example, if you want to change the name of the ECR image repository, you should break the change down into the following steps involving three PRs. - -1. Create PR #1 that adds a new image repository with a new name (This PR only modifies the build repository layer). -2. After merging the PR #1, manually apply the changes to the platform test repos (platform-test, platform-test-nextjs, platform-test-flask) since changes to the build-repository layer aren’t automatically applied as part of the CD workflow. -3. Create PR #2 that updates the publish-release.sh script to use the new image repository and modifies the task definition to pull from the new image repository (This PR only modifies the service layer). After merging PR #2, deploys should automatically be publishing builds to the new build repository and the service tasks should be using images from the new build repository -4. Create PR #3 that removes the old build repository(This PR only modifies the build repository layer) -When creating a release that includes a breaking change such as this one, include migration instructions on how to apply the changes without incurring downtime. In this example, the migration notes will look something like: - - Step 1: Do a targeted apply to create the new build repository - Step 2: Commit and apply the changes to the service layer - Step 3: Apply the rest of the changes to the build repository layer to remove the old build repository - ## Pay attention to testing and rollout process when reviewing PRs When reviewing template PRs, in addition to the usual things you look for, pay particular attention to: diff --git a/template-only-docs/template-development-workflow.md b/template-only-docs/template-development-workflow.md index 132aa87d6..6a86081a0 100644 --- a/template-only-docs/template-development-workflow.md +++ b/template-only-docs/template-development-workflow.md @@ -12,7 +12,27 @@ For most infrastructure changes, you will need an environment to work with. Sinc If you need an AWS IAM user for the AWS account associated with any of the platform test repos, contact @lorenyu. -## Developing infrastructure changes +## Planning infrastructure changes + +### Isolate PRs to a single infrastructure layer and make each individual PR backwards compatible + +Each PR should only touch one of [the infrastructure layers](/infra/README.md#infrastructure-layers) at a time. + +**If you need to make a change that affects multiple layers, break it down into multiple steps so that each step can be done in a backwards compatible manner that affects one layer at a time.** + +For example, if you want to change the name of the ECR image repository, you should break the change down into the following steps involving three PRs. + +1. Create PR #1 that adds a new image repository with a new name (This PR only modifies the build repository layer). +2. After merging the PR #1, manually apply the changes to the platform test repos (platform-test, platform-test-nextjs, platform-test-flask) since changes to the build-repository layer aren’t automatically applied as part of the CD workflow. +3. Create PR #2 that updates the publish-release.sh script to use the new image repository and modifies the task definition to pull from the new image repository (This PR only modifies the service layer). After merging PR #2, deploys should automatically be publishing builds to the new build repository and the service tasks should be using images from the new build repository +4. Create PR #3 that removes the old build repository(This PR only modifies the build repository layer) +When creating a release that includes a breaking change such as this one, include migration instructions on how to apply the changes without incurring downtime. In this example, the migration notes will look something like: + + Step 1: Do a targeted apply to create the new build repository + Step 2: Commit and apply the changes to the service layer + Step 3: Apply the rest of the changes to the build repository layer to remove the old build repository + +## Developing and testing infrastructure changes This is the most common workflow: From 9fd375b8e007fd72d1680f6794420f6018f5015e Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Wed, 20 Aug 2025 08:48:16 -0700 Subject: [PATCH 3/8] Replace outdated content with link to Confluence --- template-only-docs/troubleshooting.md | 75 +-------------------------- 1 file changed, 1 insertion(+), 74 deletions(-) diff --git a/template-only-docs/troubleshooting.md b/template-only-docs/troubleshooting.md index 3d19649e2..4e0c2d742 100644 --- a/template-only-docs/troubleshooting.md +++ b/template-only-docs/troubleshooting.md @@ -66,77 +66,4 @@ Re-run Template CI Infra Checks on main branch ## Platform test repo(s) do not have the latest changes from template-infra -If the platform test repo does not have the latest changes from template-infra check the infra template’s [Template Deploy](https://github.com/navapbc/template-infra/actions/workflows/template-only-cd.yml) workflow to see if there was any failure in deploying the template to the platform-test, platform-test-flask, or platform-test-nextjs repos. - -### Failure due to merge conflict with files that have changed on the project - -If the Template Deploy failed during the “git apply” command with a “patch does not apply” error, it probably means that there was a merge conflict and you will have to apply the change manually. This is because template-infra has files that will have changed on the platform test repos, and therefore a patch in those files might not apply cleanly. Files that might cause merge conflicts include: - -* [.github/workflows/cd-app.yml](https://github.com/navapbc/template-infra/blob/main/.github/workflows/cd-app.yml#L7-L14) – push trigger is commented out in template-infra but uncommented in platform-test\* repos -* [.github/workflows/ci-infra-service.yml](https://github.com/navapbc/template-infra/blob/main/.github/workflows/ci-infra-service.yml#L4-L16) – push trigger is commented out in template-infra but uncommented in platform-test\* repos -* [infra/project-config/main.tf](https://github.com/navapbc/template-infra/blob/main/infra/project-config/main.tf) – placeholders in template-infra’s version will be replaced with actual values for the platform-test\* repos -* [infra/app/app-config/main.tf](https://github.com/navapbc/template-infra/blob/main/infra/app/app-config/main.tf#L6) – on platform-test and platform-test-flask (but not platform-test-nextjs), has\_database will be set to true, but it defaults to false in template-infra - -### Failure due to bugs in update-template.sh script - -It’s also possible that there is a failure in the [update-template.sh script](https://github.com/navapbc/template-infra/blob/main/template-only-bin/update-template.sh). If so, make sure to fix the bug and make sure that all changes from the templates have been propagated to the platform test repos so that they stay in sync with the templates and so that future changes can continue to propagate successfully. - -### Deploy template changes to platform test repos - -Sometimes changes to templates need to be manually applied to the platform test repos ([platform-test](https://github.com/navapbc/platform-test), [platform-test-flask](https://github.com/navapbc/platform-test-flask), [platform-test-nextjs](https://github.com/navapbc/platform-test-nextjs)). - -Broadly speaking there are three methods to manually update the platform test repos from the template: - -Step 0\. First, make sure you’re on the main branch of each repo and that you pull the latest changes of the template and of the platform test repo: - -```bash -platform-test$ git checkout main -platform-test@main$ git pull -template-infra$ git checkout main -template-infra@main$ git pull -``` - -#### **Method 1: Try running update-template.sh again and see what fails** - -Run a version of template-infra/template-only-bin/update-template.sh that you have downloaded locally from the platform-test repo’s root directory - -```bash -platform-test@main$ /template-only-bin/update-template.sh -``` - -If it fails due to a merge conflict, it will indicate which file(s) it fails on. The patch will still live in a newly checked out template-infra folder in \`./template-infra/update.patch\`. You can re-apply this patch, excluding the failed files: - -```bash -platform-test@main$ git apply template-infra/update.patch --exclude= -``` - -If that succeeds, you can then manually apply the changes in the excluded file by copying over the file and examining the diff and updating accordingly. - -Once you’re done, note that the .template-version file will be updated the latest commit hash of template-infra’s main branch. - -#### **Method 2: Run install-template.sh again and manually inspect the diff** - -You can run through the installation instructions again: - -```bash -platform-test@main$ curl https://raw.githubusercontent.com/navapbc/template-infra/main/template-only-bin/download-and-install-template.sh | bash -s -``` - -Then manually inspect the diff to make sure you don’t revert things like the uncommented blocks in cd-app.yml or ci-infra-service.yml or the filled in project-config/main.tf file. - -**Deleted files:** This method will not properly handle deleted files or renamed files. You will need to manually remove those. - -Once you’re done, note that the .template-version file will be updated with the latest commit hash of template-infra’s main branch. - -#### **Method 3: Manually copy over the changes** - -This method involves manually viewing the changed/renamed/moved files and manually copying the changes over. - -Once you’re done, update the .template-version file with the latest commit hash of template-infra’s main branch. - -```bash -platform-test@main$ cd template-infra -platform-test@main$ TEMPLATE_VERSION=$(git rev-parse HEAD) -platform-test@main$ cd .. -platform-test@main$ echo "$TEMPLATE_VERSION" > .template-version -``` +See :lock: [template changes fail to apply](https://navasage.atlassian.net/wiki/spaces/tss/pages/2011922659/Platform+Ecosystem#template-*-changes-fail-to-apply) for help troubleshooting this issue. From 47a4078652e0f984d017284450c5e5d887af267c Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Tue, 16 Sep 2025 09:04:09 -0700 Subject: [PATCH 4/8] Update CONTRIBUTING.md Use regular quotes Co-authored-by: Tanner Doshier --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 10179f205..0cf775384 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -35,7 +35,7 @@ platform-test$ make infra-update-app-service APP_NAME=app ENVIRONMENT=dev # shou If your PR will introduce a breaking change, then after the PR is approved, but before you merge it into main: 1. Prefix the commit title with ⚠️. This indicates to the Platform Admins who will make the next release that there is a breaking change included in the release. -2. Add a section in the commit description for “Release notes” and indicate what needs to be included in the release notes on how to handle the breaking change. +2. Add a section in the commit description for "Release notes" and indicate what needs to be included in the release notes on how to handle the breaking change. ## Troubleshooting From 64fcf0a054f0e568bc78721cc19db5011021455d Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Tue, 16 Sep 2025 09:17:36 -0700 Subject: [PATCH 5/8] Replace curly quotes with straight quotes --- CONTRIBUTING.md | 4 ++-- docs/code-reviews.md | 2 +- ...2-10-05-use-custom-implementation-of-github-oidc.md | 2 +- ...aform-backend-configs-into-separate-config-files.md | 2 +- ...nfra-config-from-tfvars-files-into-config-module.md | 2 +- docs/e2e/e2e-checks.md | 2 +- infra/{{app_name}}/database/main.tf | 2 +- template-only-docs/template-development-workflow.md | 4 ++-- template-only-docs/troubleshooting.md | 10 +++++----- 9 files changed, 15 insertions(+), 15 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 10179f205..faa02f200 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -22,7 +22,7 @@ In what ways could things be working differently as intended under the hood but Sometimes template changes do not propagate cleanly to the platform test repos. See Platform test repo(s) do not have the latest changes from template-infra. -Also, unlike application changes, infrastructure changes aren’t always automatically applied. Make sure to think about how the changes will be applied before merging and make sure the changes get applied after merge. Double check by making sure the latest deploys (including in platform-test-nextjs and platform-test-flask test repos) completed successfully and that the terraform plans on main show no configuration changes. +Also, unlike application changes, infrastructure changes aren't always automatically applied. Make sure to think about how the changes will be applied before merging and make sure the changes get applied after merge. Double check by making sure the latest deploys (including in platform-test-nextjs and platform-test-flask test repos) completed successfully and that the terraform plans on main show no configuration changes. ```bash platform-test$ git pull @@ -35,7 +35,7 @@ platform-test$ make infra-update-app-service APP_NAME=app ENVIRONMENT=dev # shou If your PR will introduce a breaking change, then after the PR is approved, but before you merge it into main: 1. Prefix the commit title with ⚠️. This indicates to the Platform Admins who will make the next release that there is a breaking change included in the release. -2. Add a section in the commit description for “Release notes” and indicate what needs to be included in the release notes on how to handle the breaking change. +2. Add a section in the commit description for "Release notes" and indicate what needs to be included in the release notes on how to handle the breaking change. ## Troubleshooting diff --git a/docs/code-reviews.md b/docs/code-reviews.md index 78748afe5..60f53e53b 100644 --- a/docs/code-reviews.md +++ b/docs/code-reviews.md @@ -11,7 +11,7 @@ Remember to highlight things that you like and appreciate while reading through and to make any other feedback clearly actionable by indicating if it is an optional preference, an important consideration, or an error. Don't be afraid to comment with a question, or to ask for clarification, or provide a suggestion, -whenever you don’t understand what is going on at first glance — or if you think an approach or decision can be improved. +whenever you don't understand what is going on at first glance — or if you think an approach or decision can be improved. Suggestions on how to split a large PR into smaller chunks can also help move things along. Code reviews give us a chance to learn from one another, and to reflect, iterate on, and document why certain decisions are made. diff --git a/docs/decisions/infra/2022-10-05-use-custom-implementation-of-github-oidc.md b/docs/decisions/infra/2022-10-05-use-custom-implementation-of-github-oidc.md index 738b77972..e3e98bbb0 100644 --- a/docs/decisions/infra/2022-10-05-use-custom-implementation-of-github-oidc.md +++ b/docs/decisions/infra/2022-10-05-use-custom-implementation-of-github-oidc.md @@ -33,6 +33,6 @@ Cons of unfunco/oidc-github: - Dependency on an external module in the Terraform registry has negative security implications. Furthermore, the module isn't published by an "official" organization. It is maintained by a single developer, further increasing the security risk. - The module includes extra unnecessary options that make the code more difficult to read and understand - In particular, the module includes the option to attach the `AdminstratorAccess` policy to the GitHub actions IAM role, which isn't necessary and could raise concerns in an audit. -- ~~The module hardcodes the GitHub OIDC Provider thumbprint, which isn't as elegant as the method in the [Initial setup for CD draft PR #43](https://github.com/navapbc/template-infra/pull/43) from @shawnvanderjagt which simply pulls the thumbprint via:~~ (Update: July 12, 2023) Starting July 6, 2023, AWS began securing communication with GitHub’s OIDC identity provider (IdP) using GitHub's library of trusted root Certificate Authorities instead of using a certificate thumbprint to verify the IdP’s server certificate. This approach ensures that the GitHub OIDC configuration behaves correctly without disruption during future certificate rotations and changes. With this new validation approach in place, your legacy thumbprint(s) are longer be needed for validation purposes. +- ~~The module hardcodes the GitHub OIDC Provider thumbprint, which isn't as elegant as the method in the [Initial setup for CD draft PR #43](https://github.com/navapbc/template-infra/pull/43) from @shawnvanderjagt which simply pulls the thumbprint via:~~ (Update: July 12, 2023) Starting July 6, 2023, AWS began securing communication with GitHub's OIDC identity provider (IdP) using GitHub's library of trusted root Certificate Authorities instead of using a certificate thumbprint to verify the IdP's server certificate. This approach ensures that the GitHub OIDC configuration behaves correctly without disruption during future certificate rotations and changes. With this new validation approach in place, your legacy thumbprint(s) are longer be needed for validation purposes. Forking the module to the @navapbc organization gets rid of the security issue, but the other issues remain. diff --git a/docs/decisions/infra/2023-05-09-separate-terraform-backend-configs-into-separate-config-files.md b/docs/decisions/infra/2023-05-09-separate-terraform-backend-configs-into-separate-config-files.md index a7087801d..d90ca29a1 100644 --- a/docs/decisions/infra/2023-05-09-separate-terraform-backend-configs-into-separate-config-files.md +++ b/docs/decisions/infra/2023-05-09-separate-terraform-backend-configs-into-separate-config-files.md @@ -6,7 +6,7 @@ ## Context -Up until now, most projects adopted an infrastructure module architecture that is structured as follows: Each application environment (prod, staging, etc) is a separate root module that calls a template module. The template module defines all the application infra resources needed for an environment. Things that could be different per environment (e.g. desired ECS task count) are template variables, and each environment can have local vars (or somewhat equivalently, a tfvars file) that customizes those variables. Importantly, each environment has its own backend tfstate file, and the backend config is stored in the environment module’s `main.tf`. +Up until now, most projects adopted an infrastructure module architecture that is structured as follows: Each application environment (prod, staging, etc) is a separate root module that calls a template module. The template module defines all the application infra resources needed for an environment. Things that could be different per environment (e.g. desired ECS task count) are template variables, and each environment can have local vars (or somewhat equivalently, a tfvars file) that customizes those variables. Importantly, each environment has its own backend tfstate file, and the backend config is stored in the environment module's `main.tf`. An alternative approach exists to managing the backend configs. Rather than saving the backend config directly in `main.tf`, `main.tf` could contain a [partial configuration](https://developer.hashicorp.com/terraform/language/backend#partial-configuration), and the rest of the backend config would be passed in during terraform init with a command like `terraform init --backend-config=prod.s3.tfbackend`. There would no longer be a need for separate root modules for each environment. What was previously the template module would instead act as the root module, and engineers would work with different environments solely through separate tfbackend files and tfvar files. Doing this would greatly simplify the module architecture at the cost of some complexity when executing terraform commands due to the extra command line parameters. To manage the extra complexity of running terraform commands, a wrapper script (such as with Makefile commands) can be introduced. diff --git a/docs/decisions/infra/2023-09-07-consolidate-infra-config-from-tfvars-files-into-config-module.md b/docs/decisions/infra/2023-09-07-consolidate-infra-config-from-tfvars-files-into-config-module.md index bdaf7824d..b29cd8867 100644 --- a/docs/decisions/infra/2023-09-07-consolidate-infra-config-from-tfvars-files-into-config-module.md +++ b/docs/decisions/infra/2023-09-07-consolidate-infra-config-from-tfvars-files-into-config-module.md @@ -8,7 +8,7 @@ Technical Story: [Replace configure scripts with project/app config variables #3 ## Context -Currently, application infrastructure configuration is split across config modules (see `infra//app-config/`) as well as .tfvars files in each of the application's infra layers - infra//build-repository, infra//database, and infra//service. As @kyeah pointed out, it’s easy to make mistakes when configuration is spread across multiple files, and expressed a desire to manage tfvars across environments all in a single file the way that some applications do for application configuration. Also, as @acouch [pointed out](https://github.com/navapbc/template-infra/pull/282#discussion_r1219930653), there is a lot of duplicate code with the configure scripts (setup-current-account.sh, configure-app-build-repository.sh, configure-app-database.sh, configure-app-service.sh) that configure the backend config and variable files for each infrastructure layer, which increases the burden of maintaining the configuration scripts. +Currently, application infrastructure configuration is split across config modules (see `infra//app-config/`) as well as .tfvars files in each of the application's infra layers - infra//build-repository, infra//database, and infra//service. As @kyeah pointed out, it's easy to make mistakes when configuration is spread across multiple files, and expressed a desire to manage tfvars across environments all in a single file the way that some applications do for application configuration. Also, as @acouch [pointed out](https://github.com/navapbc/template-infra/pull/282#discussion_r1219930653), there is a lot of duplicate code with the configure scripts (setup-current-account.sh, configure-app-build-repository.sh, configure-app-database.sh, configure-app-service.sh) that configure the backend config and variable files for each infrastructure layer, which increases the burden of maintaining the configuration scripts. ## Overview diff --git a/docs/e2e/e2e-checks.md b/docs/e2e/e2e-checks.md index d466e23cc..80517e096 100644 --- a/docs/e2e/e2e-checks.md +++ b/docs/e2e/e2e-checks.md @@ -60,7 +60,7 @@ make e2e-test-native APP_NAME=app ### Run tests in UI mode -When developing or debugging tests, it’s often helpful to see them running in real-time. You can achieve this by running the e2e tests in UI mode: +When developing or debugging tests, it's often helpful to see them running in real-time. You can achieve this by running the e2e tests in UI mode: ``` make e2e-test-native-ui APP_NAME=app diff --git a/infra/{{app_name}}/database/main.tf b/infra/{{app_name}}/database/main.tf index d35dd3733..14d9bf1a2 100644 --- a/infra/{{app_name}}/database/main.tf +++ b/infra/{{app_name}}/database/main.tf @@ -1,6 +1,6 @@ locals { # The prefix key/value pair is used for Terraform Workspaces, which is useful for projects with multiple infrastructure developers. - # By default, Terraform creates a workspace named “default.” If a non-default workspace is not created this prefix will equal “default”, + # By default, Terraform creates a workspace named "default." If a non-default workspace is not created this prefix will equal "default", # if you choose not to use workspaces set this value to "dev" prefix = terraform.workspace == "default" ? "" : "${terraform.workspace}-" diff --git a/template-only-docs/template-development-workflow.md b/template-only-docs/template-development-workflow.md index 6a86081a0..aea6f6587 100644 --- a/template-only-docs/template-development-workflow.md +++ b/template-only-docs/template-development-workflow.md @@ -23,7 +23,7 @@ Each PR should only touch one of [the infrastructure layers](/infra/README.md#in For example, if you want to change the name of the ECR image repository, you should break the change down into the following steps involving three PRs. 1. Create PR #1 that adds a new image repository with a new name (This PR only modifies the build repository layer). -2. After merging the PR #1, manually apply the changes to the platform test repos (platform-test, platform-test-nextjs, platform-test-flask) since changes to the build-repository layer aren’t automatically applied as part of the CD workflow. +2. After merging the PR #1, manually apply the changes to the platform test repos (platform-test, platform-test-nextjs, platform-test-flask) since changes to the build-repository layer aren't automatically applied as part of the CD workflow. 3. Create PR #2 that updates the publish-release.sh script to use the new image repository and modifies the task definition to pull from the new image repository (This PR only modifies the service layer). After merging PR #2, deploys should automatically be publishing builds to the new build repository and the service tasks should be using images from the new build repository 4. Create PR #3 that removes the old build repository(This PR only modifies the build repository layer) When creating a release that includes a breaking change such as this one, include migration instructions on how to apply the changes without incurring downtime. In this example, the migration notes will look something like: @@ -44,7 +44,7 @@ On the [platform-test](https://github.com/navapbc/platform-test) repo, you'll do 2. Create a terraform workspace that you will use for [developing and testing your infrastructure changes](/docs/infra/develop-and-test-infrastructure-in-isolation-using-workspaces.md). Using a workspace avoids conflicting with other developers and avoids CD overwriting any changes you've applied while developing. 3. Develop and test your infrastructure changes using the `dev` environment 4. Create a pull request -5. Iterate until all CI checks pass on your PR and you’ve also done additional testing that you need to validate. _Do not merge the PR._ +5. Iterate until all CI checks pass on your PR and you've also done additional testing that you need to validate. _Do not merge the PR._ ### 2. Create a pull request on infra template repo diff --git a/template-only-docs/troubleshooting.md b/template-only-docs/troubleshooting.md index 4e0c2d742..54fd63b15 100644 --- a/template-only-docs/troubleshooting.md +++ b/template-only-docs/troubleshooting.md @@ -2,7 +2,7 @@ ## Template CI Infra Checks fails on main -If the [Template CI Infra Checks (template-only-ci-infra.yml)](https://github.com/navapbc/template-infra/actions/workflows/template-only-ci-infra.yml) workflow fails on the main branch and there isn’t a bug in the code, it may mean that a prior run of the workflow did not properly clean up account resources. +If the [Template CI Infra Checks (template-only-ci-infra.yml)](https://github.com/navapbc/template-infra/actions/workflows/template-only-ci-infra.yml) workflow fails on the main branch and there isn't a bug in the code, it may mean that a prior run of the workflow did not properly clean up account resources. ### Preventing the problem from getting worse @@ -19,13 +19,13 @@ Look in the GitHub logs for the Template CI Infra check that failed. The logs ar Errors that may indicate a problem with cleanup include: -* “OIDC provider already exists” during the SetUpAccount step -* “IAM role already exists” during the SetUpDevEnvironment step -* “SNS topic already exists” during the SetUpDevEnvironment step +* "OIDC provider already exists" during the SetUpAccount step +* "IAM role already exists" during the SetUpDevEnvironment step +* "SNS topic already exists" during the SetUpDevEnvironment step ### Diagnosing the root cause -If you have good reason to believe this is a one time thing, then you can skip this step and proceed to clean up the AWS account to unblock the Template CI Infra Checks workflow. Otherwise, it is important to find out what caused the test to not properly clean up and fix that first so that you don’t end up repeating the problem. +If you have good reason to believe this is a one time thing, then you can skip this step and proceed to clean up the AWS account to unblock the Template CI Infra Checks workflow. Otherwise, it is important to find out what caused the test to not properly clean up and fix that first so that you don't end up repeating the problem. Look at the GitHub logs for previous runs of the Template CI Infra Checks workflow that also failed, starting from the one you were initially looking into. From a309634c1372e5c3c1c9337f94bc273907a17cf7 Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Tue, 16 Sep 2025 10:08:25 -0700 Subject: [PATCH 6/8] Point to Template Deploy workflow as definitive list of test repos --- CONTRIBUTING.md | 2 +- template-only-docs/template-development-workflow.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index faa02f200..1ed741473 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -22,7 +22,7 @@ In what ways could things be working differently as intended under the hood but Sometimes template changes do not propagate cleanly to the platform test repos. See Platform test repo(s) do not have the latest changes from template-infra. -Also, unlike application changes, infrastructure changes aren't always automatically applied. Make sure to think about how the changes will be applied before merging and make sure the changes get applied after merge. Double check by making sure the latest deploys (including in platform-test-nextjs and platform-test-flask test repos) completed successfully and that the terraform plans on main show no configuration changes. +Also, unlike application changes, infrastructure changes aren't always automatically applied. Make sure to think about how the changes will be applied before merging and make sure the changes get applied after merge. Double check by making sure [the latest deploys](https://github.com/navapbc/template-infra/actions/workflows/template-only-cd.yml) completed successfully and that the terraform plans on main show no configuration changes. ```bash platform-test$ git pull diff --git a/template-only-docs/template-development-workflow.md b/template-only-docs/template-development-workflow.md index aea6f6587..9369c47ab 100644 --- a/template-only-docs/template-development-workflow.md +++ b/template-only-docs/template-development-workflow.md @@ -23,7 +23,7 @@ Each PR should only touch one of [the infrastructure layers](/infra/README.md#in For example, if you want to change the name of the ECR image repository, you should break the change down into the following steps involving three PRs. 1. Create PR #1 that adds a new image repository with a new name (This PR only modifies the build repository layer). -2. After merging the PR #1, manually apply the changes to the platform test repos (platform-test, platform-test-nextjs, platform-test-flask) since changes to the build-repository layer aren't automatically applied as part of the CD workflow. +2. After merging the PR #1, manually apply the changes to the platform's test repos (see [Template Deploy workflow for the list of repos that receive template changes](/.github/workflows/template-only-cd.yml)) since changes to the build-repository layer aren't automatically applied as part of the CD workflow. 3. Create PR #2 that updates the publish-release.sh script to use the new image repository and modifies the task definition to pull from the new image repository (This PR only modifies the service layer). After merging PR #2, deploys should automatically be publishing builds to the new build repository and the service tasks should be using images from the new build repository 4. Create PR #3 that removes the old build repository(This PR only modifies the build repository layer) When creating a release that includes a breaking change such as this one, include migration instructions on how to apply the changes without incurring downtime. In this example, the migration notes will look something like: From 680da66a70ae9235a8d2fc43e3c73f1114507c79 Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Tue, 16 Sep 2025 10:08:33 -0700 Subject: [PATCH 7/8] Fix typo --- template-only-docs/troubleshooting.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/template-only-docs/troubleshooting.md b/template-only-docs/troubleshooting.md index 54fd63b15..6ce97a92a 100644 --- a/template-only-docs/troubleshooting.md +++ b/template-only-docs/troubleshooting.md @@ -6,7 +6,7 @@ If the [Template CI Infra Checks (template-only-ci-infra.yml)](https://github.co ### Preventing the problem from getting worse -If you notice Template CI Infra Checks failing on main, tell people to pause on doing anything that would trigger a Template CI Infra Check run, since further runs will just create more issues you have to look into and more things you have to clean up. +If you notice Template CI Infra Checks failing on main, tell people to pause on doing anything that would trigger a Template CI Infra Checks run, since further runs will just create more issues you have to look into and more things you have to clean up. Things that trigger Template CI Infra Checks runs include: From 1c55854df7dca32f4b2875447386dacf861dfad0 Mon Sep 17 00:00:00 2001 From: Loren Yu Date: Tue, 16 Sep 2025 10:10:20 -0700 Subject: [PATCH 8/8] Address PR feedback --- template-only-docs/troubleshooting.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/template-only-docs/troubleshooting.md b/template-only-docs/troubleshooting.md index 6ce97a92a..89d8875d4 100644 --- a/template-only-docs/troubleshooting.md +++ b/template-only-docs/troubleshooting.md @@ -10,8 +10,10 @@ If you notice Template CI Infra Checks failing on main, tell people to pause on Things that trigger Template CI Infra Checks runs include: -* Pushes to main branch -* Opening PRs or updating PRs with new commits on template-infra +* Pushes to `main` branch +* Opening PRs (or updating PRs with new commits) that touch infrastructure/test code + +See [Template CI Infra Checks workflow](/.github/workflows/template-only-ci-infra.yml) for full list of triggers. ### Diagnosing the immediate problem