Skip to content
35 changes: 35 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,42 @@
# Contributing to the infrastructure template

## Getting started

If you are looking to contribute, get started by reading the following docs.

- Read about the [infrastructure's module architecture](/docs/infra/module-architecture.md) to learn how the architecture of the infrastructure code is designed and how the modules interact with each other.
- Read the [template development workflow](/template-only-docs/template-development-workflow.md) to understand how to develop and test changes to the template because working on the platform templates is unlike working on most other applications.
- Read the [infrastructure style guide](/docs/infra/style-guide.md) to understand best practices for Terraform and shell scripts.

## Pay attention to testing and rollout process when reviewing PRs

When reviewing template PRs, in addition to the usual things you look for, pay particular attention to:

### Manual testing

Unlike application development, the automated test suite for infrastructure has much less coverage, so it is more important than usual to review test plans and evidence of successful testing to demonstrate that things work. Ask yourself the following questions:
What evidence would I need to see to be confident that things are working as intended?
In what ways could things be working differently as intended under the hood but still look the same based on the evidence provided?

### Rollout process

Sometimes template changes do not propagate cleanly to the platform test repos. See Platform test repo(s) do not have the latest changes from template-infra.

Also, unlike application changes, infrastructure changes aren't always automatically applied. Make sure to think about how the changes will be applied before merging and make sure the changes get applied after merge. Double check by making sure [the latest deploys](https://github.com/navapbc/template-infra/actions/workflows/template-only-cd.yml) completed successfully and that the terraform plans on main show no configuration changes.

```bash
platform-test$ git pull
platform-test$ make infra-update-app-database APP_NAME=app ENVIRONMENT=dev # should show no configuration changes
platform-test$ make infra-update-app-service APP_NAME=app ENVIRONMENT=dev # should show no configuration changes
```

## Make note of breaking changes

If your PR will introduce a breaking change, then after the PR is approved, but before you merge it into main:

1. Prefix the commit title with ⚠️. This indicates to the Platform Admins who will make the next release that there is a breaking change included in the release.
2. Add a section in the commit description for "Release notes" and indicate what needs to be included in the release notes on how to handle the breaking change.

## Troubleshooting

See the [troubleshooting guide](/template-only-docs/troubleshooting.md) for common issues and how to resolve them.
2 changes: 1 addition & 1 deletion docs/code-reviews.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Remember to highlight things that you like and appreciate while reading through
and to make any other feedback clearly actionable by indicating if it is an optional preference, an important consideration, or an error.

Don't be afraid to comment with a question, or to ask for clarification, or provide a suggestion,
whenever you dont understand what is going on at first glance — or if you think an approach or decision can be improved.
whenever you don't understand what is going on at first glance — or if you think an approach or decision can be improved.
Suggestions on how to split a large PR into smaller chunks can also help move things along.
Code reviews give us a chance to learn from one another, and to reflect, iterate on, and document why certain decisions are made.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,6 @@ Cons of unfunco/oidc-github:
- Dependency on an external module in the Terraform registry has negative security implications. Furthermore, the module isn't published by an "official" organization. It is maintained by a single developer, further increasing the security risk.
- The module includes extra unnecessary options that make the code more difficult to read and understand
- In particular, the module includes the option to attach the `AdminstratorAccess` policy to the GitHub actions IAM role, which isn't necessary and could raise concerns in an audit.
- ~~The module hardcodes the GitHub OIDC Provider thumbprint, which isn't as elegant as the method in the [Initial setup for CD draft PR #43](https://github.com/navapbc/template-infra/pull/43) from @shawnvanderjagt which simply pulls the thumbprint via:~~ (Update: July 12, 2023) Starting July 6, 2023, AWS began securing communication with GitHubs OIDC identity provider (IdP) using GitHub's library of trusted root Certificate Authorities instead of using a certificate thumbprint to verify the IdPs server certificate. This approach ensures that the GitHub OIDC configuration behaves correctly without disruption during future certificate rotations and changes. With this new validation approach in place, your legacy thumbprint(s) are longer be needed for validation purposes.
- ~~The module hardcodes the GitHub OIDC Provider thumbprint, which isn't as elegant as the method in the [Initial setup for CD draft PR #43](https://github.com/navapbc/template-infra/pull/43) from @shawnvanderjagt which simply pulls the thumbprint via:~~ (Update: July 12, 2023) Starting July 6, 2023, AWS began securing communication with GitHub's OIDC identity provider (IdP) using GitHub's library of trusted root Certificate Authorities instead of using a certificate thumbprint to verify the IdP's server certificate. This approach ensures that the GitHub OIDC configuration behaves correctly without disruption during future certificate rotations and changes. With this new validation approach in place, your legacy thumbprint(s) are longer be needed for validation purposes.

Forking the module to the @navapbc organization gets rid of the security issue, but the other issues remain.
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@

## Context

Up until now, most projects adopted an infrastructure module architecture that is structured as follows: Each application environment (prod, staging, etc) is a separate root module that calls a template module. The template module defines all the application infra resources needed for an environment. Things that could be different per environment (e.g. desired ECS task count) are template variables, and each environment can have local vars (or somewhat equivalently, a tfvars file) that customizes those variables. Importantly, each environment has its own backend tfstate file, and the backend config is stored in the environment modules `main.tf`.
Up until now, most projects adopted an infrastructure module architecture that is structured as follows: Each application environment (prod, staging, etc) is a separate root module that calls a template module. The template module defines all the application infra resources needed for an environment. Things that could be different per environment (e.g. desired ECS task count) are template variables, and each environment can have local vars (or somewhat equivalently, a tfvars file) that customizes those variables. Importantly, each environment has its own backend tfstate file, and the backend config is stored in the environment module's `main.tf`.

An alternative approach exists to managing the backend configs. Rather than saving the backend config directly in `main.tf`, `main.tf` could contain a [partial configuration](https://developer.hashicorp.com/terraform/language/backend#partial-configuration), and the rest of the backend config would be passed in during terraform init with a command like `terraform init --backend-config=prod.s3.tfbackend`. There would no longer be a need for separate root modules for each environment. What was previously the template module would instead act as the root module, and engineers would work with different environments solely through separate tfbackend files and tfvar files. Doing this would greatly simplify the module architecture at the cost of some complexity when executing terraform commands due to the extra command line parameters. To manage the extra complexity of running terraform commands, a wrapper script (such as with Makefile commands) can be introduced.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ Technical Story: [Replace configure scripts with project/app config variables #3

## Context

Currently, application infrastructure configuration is split across config modules (see `infra/<APP_NAME>/app-config/`) as well as .tfvars files in each of the application's infra layers - infra/<APP_NAME>/build-repository, infra/<APP_NAME>/database, and infra/<APP_NAME>/service. As @kyeah pointed out, its easy to make mistakes when configuration is spread across multiple files, and expressed a desire to manage tfvars across environments all in a single file the way that some applications do for application configuration. Also, as @acouch [pointed out](https://github.com/navapbc/template-infra/pull/282#discussion_r1219930653), there is a lot of duplicate code with the configure scripts (setup-current-account.sh, configure-app-build-repository.sh, configure-app-database.sh, configure-app-service.sh) that configure the backend config and variable files for each infrastructure layer, which increases the burden of maintaining the configuration scripts.
Currently, application infrastructure configuration is split across config modules (see `infra/<APP_NAME>/app-config/`) as well as .tfvars files in each of the application's infra layers - infra/<APP_NAME>/build-repository, infra/<APP_NAME>/database, and infra/<APP_NAME>/service. As @kyeah pointed out, it's easy to make mistakes when configuration is spread across multiple files, and expressed a desire to manage tfvars across environments all in a single file the way that some applications do for application configuration. Also, as @acouch [pointed out](https://github.com/navapbc/template-infra/pull/282#discussion_r1219930653), there is a lot of duplicate code with the configure scripts (setup-current-account.sh, configure-app-build-repository.sh, configure-app-database.sh, configure-app-service.sh) that configure the backend config and variable files for each infrastructure layer, which increases the burden of maintaining the configuration scripts.

## Overview

Expand Down
2 changes: 1 addition & 1 deletion docs/e2e/e2e-checks.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ make e2e-test-native APP_NAME=app

### Run tests in UI mode

When developing or debugging tests, its often helpful to see them running in real-time. You can achieve this by running the e2e tests in UI mode:
When developing or debugging tests, it's often helpful to see them running in real-time. You can achieve this by running the e2e tests in UI mode:

```
make e2e-test-native-ui APP_NAME=app
Expand Down
2 changes: 1 addition & 1 deletion infra/{{app_name}}/database/main.tf
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
locals {
# The prefix key/value pair is used for Terraform Workspaces, which is useful for projects with multiple infrastructure developers.
# By default, Terraform creates a workspace named default. If a non-default workspace is not created this prefix will equal default,
# By default, Terraform creates a workspace named "default." If a non-default workspace is not created this prefix will equal "default",
# if you choose not to use workspaces set this value to "dev"
prefix = terraform.workspace == "default" ? "" : "${terraform.workspace}-"

Expand Down
24 changes: 22 additions & 2 deletions template-only-docs/template-development-workflow.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,27 @@ For most infrastructure changes, you will need an environment to work with. Sinc

If you need an AWS IAM user for the AWS account associated with any of the platform test repos, contact @lorenyu.

## Developing infrastructure changes
## Planning infrastructure changes

### Isolate PRs to a single infrastructure layer and make each individual PR backwards compatible

Each PR should only touch one of [the infrastructure layers](/infra/README.md#infrastructure-layers) at a time.

**If you need to make a change that affects multiple layers, break it down into multiple steps so that each step can be done in a backwards compatible manner that affects one layer at a time.**

For example, if you want to change the name of the ECR image repository, you should break the change down into the following steps involving three PRs.

1. Create PR #1 that adds a new image repository with a new name (This PR only modifies the build repository layer).
2. After merging the PR #1, manually apply the changes to the platform's test repos (see [Template Deploy workflow for the list of repos that receive template changes](/.github/workflows/template-only-cd.yml)) since changes to the build-repository layer aren't automatically applied as part of the CD workflow.
3. Create PR #2 that updates the publish-release.sh script to use the new image repository and modifies the task definition to pull from the new image repository (This PR only modifies the service layer). After merging PR #2, deploys should automatically be publishing builds to the new build repository and the service tasks should be using images from the new build repository
4. Create PR #3 that removes the old build repository(This PR only modifies the build repository layer)
When creating a release that includes a breaking change such as this one, include migration instructions on how to apply the changes without incurring downtime. In this example, the migration notes will look something like:

Step 1: Do a targeted apply to create the new build repository
Step 2: Commit and apply the changes to the service layer
Step 3: Apply the rest of the changes to the build repository layer to remove the old build repository

## Developing and testing infrastructure changes

This is the most common workflow:

Expand All @@ -24,7 +44,7 @@ On the [platform-test](https://github.com/navapbc/platform-test) repo, you'll do
2. Create a terraform workspace that you will use for [developing and testing your infrastructure changes](/docs/infra/develop-and-test-infrastructure-in-isolation-using-workspaces.md). Using a workspace avoids conflicting with other developers and avoids CD overwriting any changes you've applied while developing.
3. Develop and test your infrastructure changes using the `dev` environment
4. Create a pull request
5. Iterate until all CI checks pass on your PR and youve also done additional testing that you need to validate. _Do not merge the PR._
5. Iterate until all CI checks pass on your PR and you've also done additional testing that you need to validate. _Do not merge the PR._

### 2. Create a pull request on infra template repo

Expand Down
71 changes: 71 additions & 0 deletions template-only-docs/troubleshooting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Troubleshooting Guide for Template Infra

## Template CI Infra Checks fails on main

If the [Template CI Infra Checks (template-only-ci-infra.yml)](https://github.com/navapbc/template-infra/actions/workflows/template-only-ci-infra.yml) workflow fails on the main branch and there isn't a bug in the code, it may mean that a prior run of the workflow did not properly clean up account resources.

### Preventing the problem from getting worse

If you notice Template CI Infra Checks failing on main, tell people to pause on doing anything that would trigger a Template CI Infra Checks run, since further runs will just create more issues you have to look into and more things you have to clean up.

Things that trigger Template CI Infra Checks runs include:

* Pushes to `main` branch
* Opening PRs (or updating PRs with new commits) that touch infrastructure/test code

See [Template CI Infra Checks workflow](/.github/workflows/template-only-ci-infra.yml) for full list of triggers.

### Diagnosing the immediate problem

Look in the GitHub logs for the Template CI Infra check that failed. The logs are very long and therefore are collapsed into groups.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Look in the GitHub logs for the Template CI Infra check that failed. The logs are very long and therefore are collapsed into groups.
Look in the GitHub logs for the Template CI Infra Checks that failed. The logs are very long and therefore are collapsed into groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading this sentence, I think I meant it as the lower case "check". I think it's actually Template CI Infra Checks check


Errors that may indicate a problem with cleanup include:

* "OIDC provider already exists" during the SetUpAccount step
* "IAM role already exists" during the SetUpDevEnvironment step
* "SNS topic already exists" during the SetUpDevEnvironment step

### Diagnosing the root cause

If you have good reason to believe this is a one time thing, then you can skip this step and proceed to clean up the AWS account to unblock the Template CI Infra Checks workflow. Otherwise, it is important to find out what caused the test to not properly clean up and fix that first so that you don't end up repeating the problem.

Look at the GitHub logs for previous runs of the Template CI Infra Checks workflow that also failed, starting from the one you were initially looking into.

Look in the following Teardown\* steps for errors:

* TeardownAccount
* TeardownBuildRepository
* TeardownDevEnvironment

Causes for errors in these steps may include:

* Inability to delete non-empty buckets. In order to delete non-empty buckets, you first need to set force\_destroy \= true and prevent\_destroy \= false for the bucket and run a terraform apply before running terraform destroy.
* Bugs in the template\_infra\_test.go file
* Bugs in template-only-bin/destroy-\* scripts

### Clean up the AWS account

Login to the [nava-platform AWS account](https://nava-platform.signin.aws.amazon.com/console) and check for the following to clean up:

* [ECS clusters](https://us-east-1.console.aws.amazon.com/ecs/v2/getStarted?region=us-east-1) for the service
* [Load balancers](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#LoadBalancers:) for the service
* [SNS topics](https://us-east-1.console.aws.amazon.com/sns/v3/home?region=us-east-1#/homepage) for monitoring alerts
* [IAM roles](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/roles) for GitHub actions, the service, and others
* [S3 buckets](https://s3.console.aws.amazon.com/s3/get-started?region=us-east-1) for terraform state file, terraform logs, and load balancer access logs
* [DynamoDB tables](https://us-east-1.console.aws.amazon.com/dynamodbv2/home?region=us-east-1#service) for terraform state locks
* [Identity providers](https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/identity_providers) for the GitHub OIDC provider

Note: The Template CI Infra does not currently spin up any databases.

Note: Loren has a branch called `lorenyu/clean` with two scripts that you can use:

* `template-only-bin/clean-account.sh`
* `template-only-bin/destroy-vpc.sh`
Comment on lines +60 to +63
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we shouldn't have the scripts in main?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. My surface reason is that the scripts are kinda poor quality / kinda hacked together, don't have a great DevEx, error handling, etc, so it personally felt awkward to merge them to main alongside code that I feel is much higher quality. That said, curious for your opinion as a neutral third party who didn't write the scripts.


### Verify solution

Re-run Template CI Infra Checks on main branch

## Platform test repo(s) do not have the latest changes from template-infra

See :lock: [template changes fail to apply](https://navasage.atlassian.net/wiki/spaces/tss/pages/2011922659/Platform+Ecosystem#template-*-changes-fail-to-apply) for help troubleshooting this issue.
Loading