Add Terraform example for AWS deployment #13

Open
sambiggins-aws wants to merge 10 commits into EpicGames:main from sambiggins-aws:feat/examples-aws

@sambiggins-aws commented Jun 17, 2026

Self-contained Terraform configuration at examples/aws/ that deploys a Lore primary + edge topology on ECS Fargate with durable S3/DynamoDB storage.

Creates

  • VPC (2 AZs, public/private subnets, NAT, S3+DynamoDB gateway endpoints)
  • S3 bucket (fragments)
  • 4 DynamoDB tables (fragments, metadata, mutable, locks — with GSIs)
  • ECS Fargate primary (S3/DynamoDB storage, serves replication)
  • ECS Fargate edge (replicated immutable store, remote mutable store)
  • Cloud Map private DNS (edge → primary discovery)
  • TLS CA + server certificate (inter-node trust via Secrets Manager)
  • IAM roles (least-privilege, scoped to tables + bucket + GSI indexes)
  • Security groups (41337 TCP+UDP, 41339 TCP, 41340 UDP internal)
  • CloudWatch log group

Usage

See README.md for more information

cd examples/aws
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply

Deploys loreserver on ECS Fargate with S3/DynamoDB storage.
DynamoDB schemas and IAM permissions verified against lore-aws source.

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
- Explain that the Dockerfile build auto-registers lore-aws plugin
- Document that the task runs in private subnets (VPC access required)
- Add ingress to the Customize section for production paths

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
@sambiggins-aws marked this pull request as draft June 17, 2026 23:16
- Add s3:DeleteObjectVersion (required for versioned bucket cleanup)
- Add edge pod service with replicated+remote stores via Cloud Map
- Add Cloud Map private DNS for edge→primary discovery
- Add internal SG rules for node-to-node QUIC+gRPC

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
- Generate CA + server cert via tls provider (SAN: primary.lore.internal)
- Store certs in Secrets Manager, provision via init containers
- Primary: enables quic_internal:41340 with cert for edge replication
- Edge: trusts primary CA via SSL_CERT_FILE, connects replicated+remote
- Both services confirmed running in deployment test

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
@sambiggins-aws marked this pull request as ready for review June 17, 2026 23:52
Signed-off-by: Sam Biggins <sabiggin@amazon.com>
Signed-off-by: Sam Biggins <sabiggin@amazon.com>
@ragnarula

Collaborator

Hey @sambiggins-aws, this is awesome, thanks for contributing this.

Is it possible to add some sort of integration test just to make sure it's not going stale with changes in tf versions, aws resources or Lore itself (where there are dependencies)?

Validates resource schemas, variable wiring, and service configuration
without AWS credentials. Catches breakage from Terraform/provider
version upgrades or changes to the Lore AWS plugin config contract.

Run: cd examples/aws && terraform init && terraform test
Signed-off-by: Sam Biggins <sabiggin@amazon.com>
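
For readers unfamiliar with `terraform test`, a credential-free check of this kind might look roughly like the sketch below. The file name, the resource address `aws_dynamodb_table.locks`, and the assertion are hypothetical and not taken from this PR, and the sketch assumes the configuration sets `billing_mode` explicitly. Only `mock_provider`, `run`, and `assert` are standard Terraform test syntax (mock providers need Terraform 1.7 or later).

```hcl
# tests/example.tftest.hcl (hypothetical file name)

# Replace the real AWS provider with a mock so no credentials are needed.
mock_provider "aws" {}

run "locks_table_uses_on_demand_billing" {
  # Plan only: nothing is created, but configured attributes can be checked.
  command = plan

  assert {
    # Hypothetical resource address and attribute, for illustration only.
    condition     = aws_dynamodb_table.locks.billing_mode == "PAY_PER_REQUEST"
    error_message = "The locks table is expected to use on-demand billing."
  }
}
```

A test like this fails at plan time if a provider upgrade renames or removes the attribute, which is the kind of staleness the commit message describes.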
@sambiggins-aws

Author

@ragnarula let me know if this is what you had in mind, or if you are looking for something more along the lines of a GitHub Action?

@anupddas left a comment

👍

@ragnarula

Collaborator

A GH action to run your test when those files change sounds like a good idea. Thanks!

Replace Fargate with ECS on EC2 to demonstrate Lore's core value
proposition: NVMe-cached edge nodes with high-throughput serving.

- c8gd.8xlarge default (32 vCPU, 64GB, 1.9TB NVMe, 25Gbps)
- Composite store: local NVMe cache + S3 durable (primary)
- Composite store: local NVMe cache + replicated durable (edge)
- Separate IAM roles (primary has S3+DDB, edge has none)
- Cloud Map for both primary and edge (client-facing DNS)
- TLS cert SANs include both primary and edge DNS names
- HMAC key via Secrets Manager for presigned URLs
- Health check grace periods (120s primary, 300s edge)
- DynamoDB PITR on all tables, S3 lifecycle for multipart cleanup
- GSI key_schema (provider 6.x), runtime_platform ARM64
- Cache sized to 80% of NVMe (1.52TB on c8gd.8xlarge)
- e2e test script (scripts/e2e-test.sh) for post-deploy validation
- Full Lore CLI workflow documented in README

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
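
The 80% cache sizing in the list above is simple arithmetic: 1.9 TB of NVMe is 1,900,000,000,000 bytes, and 80% of that is 1,520,000,000,000 bytes (1.52 TB). A minimal sketch of deriving the value in Terraform rather than hard-coding it follows. The local names `nvme_bytes`, `cache_percent`, and `cache_max_size` are illustrative assumptions, not names from this PR.

```hcl
locals {
  # Instance-store capacity in decimal bytes
  # (1.9 TB on c8gd.8xlarge, per the commit message).
  nvme_bytes = 1900000000000

  # Share of the NVMe volume given to the local cache (assumed name).
  cache_percent = 80

  # 1900000000000 * 80 / 100 = 1520000000000 bytes (1.52 TB).
  # Integer percent avoids multiplying by a fractional literal such as 0.8.
  cache_max_size = local.nvme_bytes * local.cache_percent / 100
}
```

If the value is passed to a container as an environment variable, it would likely need wrapping in `tostring(...)`, since those values are strings.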
Runs terraform fmt, validate, and test on changes to examples/aws/.
Uses mock providers (no AWS credentials needed).

- hashicorp/setup-terraform@v4 pinned to 1.15.3
- Concurrency group cancels superseded runs
- Self-triggering path filter for workflow changes

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
@sambiggins-aws

Author

@ragnarula added. cheers!

run:
working-directory: examples/aws
steps:
- uses: actions/checkout@v4

Contributor

Hi @sambiggins-aws - can you pin all actions used by this workflow to use explicit version hashes, similar to how the existing workflows do it?

@duncangrist done 😄

Pin actions/checkout to v6.0.3 and hashicorp/setup-terraform to v4.0.1
using explicit commit SHAs, matching the convention in dco.yml and lint.yml.

Signed-off-by: Sam Biggins <sabiggin@amazon.com>
@duncangrist

Contributor

Hi @sambiggins-aws, just reviewing the rest of this PR and was wondering whether you've actually spun up the infrastructure in this PR and tested it end-to-end?

If you haven't already, can you do that and detail exactly what you've proven as working in the PR description under a "Test Plan" heading?
It's just that there's a lot of complexity here, and it's almost impossible to know you've got everything correct unless you've tried it out already. Given it's an example, it's important to get right :)

Thanks

Comment thread examples/aws/compute.tf
{ name = "LORE__IMMUTABLE_STORE__COMPOSITE__LOCAL__LOCAL__MAX_SIZE", value = "1520000000000" },
{ name = "LORE__IMMUTABLE_STORE__COMPOSITE__LOCAL__LOCAL__FLUSH_DELAY_SECONDS", value = "10" },
{ name = "LORE__IMMUTABLE_STORE__COMPOSITE__DURABLE__MODE", value = "replicated" },
{ name = "LORE__IMMUTABLE_STORE__COMPOSITE__DURABLE__REPLICATED__REMOTE_URL", value = "lore://primary.${local.name}.internal:${local.port_replication}" },

Contributor

The ReplicatedStore is based on a QUIC client, so this should be quic:// or quics://, preferably the latter.
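
Applied to the line under review, the suggestion might look like the following sketch. It assumes the host name and the port local stay as they are and only the scheme changes. Whether `quics://` needs any further TLS settings is not stated in this thread.

```hcl
# Sketch of the reviewer's suggestion: same host and port, with the
# quics:// scheme the reviewer prefers in place of lore://.
{ name = "LORE__IMMUTABLE_STORE__COMPOSITE__DURABLE__REPLICATED__REMOTE_URL", value = "quics://primary.${local.name}.internal:${local.port_replication}" },
```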

5 participants