Add platform operations docs #950
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,42 @@ | ||
| # Contributing to the infrastructure template | ||
|
|
||
| ## Getting started | ||
|
|
||
| If you are looking to contribute, get started by reading the following docs. | ||
|
|
||
| - Read about the [infrastructure's module architecture](/docs/infra/module-architecture.md) to learn how the architecture of the infrastructure code is designed and how the modules interact with each other. | ||
| - Read the [template development workflow](/template-only-docs/template-development-workflow.md) to understand how to develop and test changes to the template, since working on the platform templates is unlike working on most other applications. | ||
| - Read the [infrastructure style guide](/docs/infra/style-guide.md) to understand best practices for Terraform and shell scripts. | ||
|
|
||
| ## Pay attention to testing and rollout process when reviewing PRs | ||
|
|
||
| When reviewing template PRs, in addition to the usual things you look for, pay particular attention to: | ||
|
|
||
| ### Manual testing | ||
|
|
||
| Because the automated test suite for infrastructure has much less coverage than a typical application test suite, it is more important than usual to review test plans and evidence of successful testing. Ask yourself the following questions: | ||
| - What evidence would I need to see to be confident that things are working as intended? | ||
| - In what ways could things be working differently than intended under the hood but still look the same based on the evidence provided? | ||
|
|
||
| ### Rollout process | ||
|
|
||
| Sometimes template changes do not propagate cleanly to the platform test repos. See "Platform test repo(s) do not have the latest changes from template-infra" in the [troubleshooting guide](/template-only-docs/troubleshooting.md). | ||
|
|
||
| Also, unlike application changes, infrastructure changes aren't always applied automatically. Before merging, think about how the changes will be applied, and after merging, make sure they actually get applied. Double-check by confirming that [the latest deploys](https://github.com/navapbc/template-infra/actions/workflows/template-only-cd.yml) completed successfully and that the Terraform plans on main show no configuration changes. | ||
|
|
||
| ```bash | ||
| platform-test$ git pull | ||
| platform-test$ make infra-update-app-database APP_NAME=app ENVIRONMENT=dev # should show no configuration changes | ||
| platform-test$ make infra-update-app-service APP_NAME=app ENVIRONMENT=dev # should show no configuration changes | ||
| ``` | ||
|
|
||
| ## Make note of breaking changes | ||
|
|
||
| If your PR will introduce a breaking change, then after the PR is approved, but before you merge it into main: | ||
|
|
||
| 1. Prefix the commit title with ⚠️. This indicates to the Platform Admins who will make the next release that there is a breaking change included in the release. | ||
| 2. Add a section in the commit description for "Release notes" and indicate what needs to be included in the release notes on how to handle the breaking change. | ||
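As an illustration of the two steps above, here is a minimal sketch of a helper that assembles such a commit message. The function name and the exact layout of the "Release notes" section are assumptions; the docs only require the ⚠️ prefix and a release notes section.

```bash
# Hypothetical helper: print a commit message for a breaking change.
# $1 = commit title (without the warning prefix), $2 = release notes text.
breaking_commit_message() {
  local title="$1"
  local notes="$2"
  # Line 1: the title prefixed with the warning sign.
  # Then a blank line, a "Release notes:" section header, and the notes.
  printf '%s %s\n\n%s\n%s\n' "⚠️" "$title" "Release notes:" "$notes"
}
```

The output can be pasted into the merge dialog, or passed to a command that accepts a message, when merging the PR.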
|
|
||
| ## Troubleshooting | ||
|
|
||
| See the [troubleshooting guide](/template-only-docs/troubleshooting.md) for common issues and how to resolve them. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| # Troubleshooting Guide for Template Infra | ||
|
|
||
| ## Template CI Infra Checks fails on main | ||
|
|
||
| If the [Template CI Infra Checks (template-only-ci-infra.yml)](https://github.com/navapbc/template-infra/actions/workflows/template-only-ci-infra.yml) workflow fails on the main branch and there isn't a bug in the code, it may mean that a prior run of the workflow did not properly clean up account resources. | ||
|
|
||
| ### Preventing the problem from getting worse | ||
|
|
||
| If you notice Template CI Infra Checks failing on main, tell people to pause on doing anything that would trigger a Template CI Infra Checks run, since further runs will just create more issues you have to look into and more things you have to clean up. | ||
|
|
||
| Things that trigger Template CI Infra Checks runs include: | ||
|
|
||
| * Pushes to `main` branch | ||
| * Opening PRs (or updating PRs with new commits) that touch infrastructure/test code | ||
|
|
||
| See [Template CI Infra Checks workflow](/.github/workflows/template-only-ci-infra.yml) for the full list of triggers. | ||
|
|
||
| ### Diagnosing the immediate problem | ||
|
|
||
| Look in the GitHub logs for the Template CI Infra check that failed. The logs are very long and therefore are collapsed into groups. | ||
|
|
||
| Errors that may indicate a problem with cleanup include: | ||
|
|
||
| * "OIDC provider already exists" during the SetUpAccount step | ||
| * "IAM role already exists" during the SetUpDevEnvironment step | ||
| * "SNS topic already exists" during the SetUpDevEnvironment step | ||
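Because the logs are long, it can help to save them to a file and search for these messages. The sketch below assumes the log has already been saved locally, for example with `gh run view <run-id> --log-failed > run.log`. The function name is hypothetical, and the actual wording in the logs may differ from the quoted messages, so the pattern only looks for "already exists".

```bash
# Hypothetical helper: print log lines (with line numbers) that suggest a
# resource was left behind by an earlier run.
# Exit status follows grep: 0 if at least one line matched, 1 if none did.
find_cleanup_errors() {
  local log_file="$1"
  # -n prefixes each match with its line number; -i ignores case.
  grep -n -i 'already exists' "$log_file"
}
```

Usage: `find_cleanup_errors run.log`, then open the matching step group in the GitHub logs for context.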
|
|
||
| ### Diagnosing the root cause | ||
|
|
||
| If you have good reason to believe this is a one-time occurrence, you can skip this step and proceed to clean up the AWS account to unblock the Template CI Infra Checks workflow. Otherwise, it is important to find out what prevented the test from cleaning up properly and fix that first, so that the problem does not recur. | ||
|
|
||
| Look at the GitHub logs for previous runs of the Template CI Infra Checks workflow that also failed, starting from the one you were initially looking into. | ||
|
|
||
| Look in the following `Teardown*` steps for errors: | ||
|
|
||
| * TeardownAccount | ||
| * TeardownBuildRepository | ||
| * TeardownDevEnvironment | ||
|
|
||
| Causes for errors in these steps may include: | ||
|
|
||
| * Inability to delete non-empty buckets. To delete a non-empty bucket, first set `force_destroy = true` and `prevent_destroy = false` for the bucket and run `terraform apply`, then run `terraform destroy`. | ||
| * Bugs in the `template_infra_test.go` file | ||
| * Bugs in the `template-only-bin/destroy-*` scripts | ||
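For the non-empty bucket case, the order of operations matters: the new `force_destroy` value has to be applied before the destroy can remove the bucket. The following is a rough sketch under stated assumptions, not the template's own tooling. The function name, the `TERRAFORM` override variable, and the resource name `example` are all hypothetical.

```bash
# Prerequisite (manual edit in the module that defines the bucket, shown as
# HCL in comments; "example" is a placeholder resource name):
#   resource "aws_s3_bucket" "example" {
#     force_destroy = true
#     lifecycle {
#       prevent_destroy = false
#     }
#   }
#
# Hypothetical helper: apply the edited configuration, then destroy.
# $1 = directory of the Terraform root module.
# Set TERRAFORM=echo to preview the commands instead of running them.
destroy_with_nonempty_bucket() {
  local module_dir="$1"
  local tf="${TERRAFORM:-terraform}"
  # Apply first so the updated bucket settings take effect.
  "$tf" -chdir="$module_dir" apply || return 1
  # Destroy can now delete the bucket together with its objects.
  "$tf" -chdir="$module_dir" destroy
}
```

Both Terraform commands prompt for confirmation by default, which is a reasonable safeguard here.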
|
|
||
| ### Clean up the AWS account | ||
|
|
||
| Log in to the [nava-platform AWS account](https://nava-platform.signin.aws.amazon.com/console) and check for the following to clean up: | ||
|
|
||
| * [ECS clusters](https://us-east-1.console.aws.amazon.com/ecs/v2/getStarted?region=us-east-1) for the service | ||
| * [Load balancers](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#LoadBalancers:) for the service | ||
| * [SNS topics](https://us-east-1.console.aws.amazon.com/sns/v3/home?region=us-east-1#/homepage) for monitoring alerts | ||
| * [IAM roles](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/roles) for GitHub actions, the service, and others | ||
| * [S3 buckets](https://s3.console.aws.amazon.com/s3/get-started?region=us-east-1) for terraform state file, terraform logs, and load balancer access logs | ||
| * [DynamoDB tables](https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1#service) for terraform state locks | ||
| * [Identity providers](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/identity_providers) for the GitHub OIDC provider | ||
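The same checklist can be walked from the command line. This is a minimal sketch that assumes the AWS CLI is installed and authenticated against the account, and that resources live in `us-east-1` as the console links suggest. The function name and the `AWS_CLI` override variable are hypothetical. It only lists resources; deciding what to delete is left to the reader.

```bash
# Hypothetical helper: list each resource type from the checklist above.
# $1 = AWS region (defaults to us-east-1).
# Set AWS_CLI=echo to preview the commands instead of running them.
list_leftover_resources() {
  local aws="${AWS_CLI:-aws}"
  local region="${1:-us-east-1}"
  # Regional services
  "$aws" ecs list-clusters --region "$region"
  "$aws" elbv2 describe-load-balancers --region "$region"
  "$aws" sns list-topics --region "$region"
  "$aws" dynamodb list-tables --region "$region"
  # Services listed without a region flag
  "$aws" s3api list-buckets
  "$aws" iam list-roles
  "$aws" iam list-open-id-connect-providers
}
```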
|
|
||
| Note: The Template CI Infra Checks workflow does not currently spin up any databases. | ||
|
|
||
| Note: Loren has a branch called `lorenyu/clean` with two scripts that you can use: | ||
|
|
||
| * `template-only-bin/clean-account.sh` | ||
| * `template-only-bin/destroy-vpc.sh` | ||
|
Comment on lines +60 to +63
Any reason we shouldn't have the scripts in
Good question. My surface reason is that the scripts are kinda poor quality / kinda hacked together, don't have a great DevEx, error handling, etc, so it personally felt awkward to merge them to main alongside code that I feel is much higher quality. That said, curious for your opinion as a neutral third party who didn't write the scripts.
||
|
|
||
| ### Verify solution | ||
|
|
||
| Re-run Template CI Infra Checks on the main branch and confirm that the run succeeds. | ||
|
|
||
| ## Platform test repo(s) do not have the latest changes from template-infra | ||
|
|
||
| See :lock: [template changes fail to apply](https://navasage.atlassian.net/wiki/spaces/tss/pages/2011922659/Platform+Ecosystem#template-*-changes-fail-to-apply) for help troubleshooting this issue. | ||
Re-reading this sentence, I think I meant it as the lowercase "check". I think it's actually `Template CI Infra Checks` check.