AWS Fargate gRPC cluster example #5

Open · wants to merge 3 commits into main
12 changes: 12 additions & 0 deletions aws-fargate/Dockerfile
@@ -0,0 +1,12 @@
FROM authzed/spicedb:v1.12.0

ARG SPICEDB_GRPC_PRESHARED_KEY
ENV SPICEDB_GRPC_PRESHARED_KEY=${SPICEDB_GRPC_PRESHARED_KEY}
ARG SPICEDB_DATASTORE_ENGINE
ENV SPICEDB_DATASTORE_ENGINE=${SPICEDB_DATASTORE_ENGINE}
ARG SPICEDB_DATASTORE_CONN_URI
ENV SPICEDB_DATASTORE_CONN_URI=${SPICEDB_DATASTORE_CONN_URI}
Comment on lines +3 to +8

Contributor:
🤔 It's not clear to me why this has to be added. Why can't those be specified as environment variables that get injected into the container via the CloudFormation template? This container definition does not seem to provide anything the upstream couldn't provide.

Author:
Sorry, I don't understand the comment. I don't know the project as well as you probably do. Is the goal of your question to understand whether my approach is the only approach possible? Please clarify.

Contributor:
Let me try again 😄 What I'm trying to say is that, theoretically, you wouldn't need to define your own Dockerfile and could instead use the images we, the Authzed team, push to public registries like Docker Hub, Quay, or GitHub Container Registry. My goal is to see if we can remove this Dockerfile and instead adjust the CloudFormation template to inject the corresponding environment variables. Does that make sense?
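
A minimal sketch of what that could look like in the template's `ContainerDefinitions`, assuming the upstream image is used directly; `DatastoreEngine` and `PresharedKeySecretArn` are hypothetical template parameters, not part of this PR:

```yaml
ContainerDefinitions:
  - Name: !Ref ServiceName
    # Upstream image pulled straight from a public registry; no custom Dockerfile.
    Image: authzed/spicedb:v1.12.0
    Environment:
      # Non-secret configuration injected as plain environment variables.
      - Name: SPICEDB_DATASTORE_ENGINE
        Value: !Ref DatastoreEngine # hypothetical parameter
    Secrets:
      # Sensitive values resolved from Secrets Manager at task start; the
      # execution role would need secretsmanager:GetSecretValue for this ARN.
      - Name: SPICEDB_GRPC_PRESHARED_KEY
        ValueFrom: !Ref PresharedKeySecretArn # hypothetical parameter holding a Secrets Manager ARN
```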

Author:
Yes, you are certainly welcome to pull my PR and adjust the CloudFormation template. There are also other issues worth addressing, for example, parametrization of the ECR repository, certificate, AWS account ID, etc. I started the TODO section in the README. Please add anything you see fit there, as well as push further improvements.

This was meant as a PoC reference implementation. I would certainly be delighted to pull any upstream changes.

Contributor:
Understood! I think the Dockerfile bits are at least not in line with our own standards, and it's only fair we tackle those. Another thing is dispatch, which isn't really enabled. Thanks for your contribution! As soon as we get some spare cycles we will get back to this 🙇🏻

Author:
Yes, I do understand. You are correct. I appreciate that this is a draft rather than a finished production-quality template, and it does require someone to co-pilot it before the merge.

Perhaps in the future I will get to the point where I can carry an entire issue by myself, all the way over the finish line. Right now, I have built some maybe-useful pieces, but I don't have enough mileage to see every angle the way the core team members do. I'm just sharing small bits I developed while getting the initial SpiceDB integration functional enough to cover some of my use cases.


ENTRYPOINT ["spicedb", "serve"]

# Ports: 50051 gRPC API, 8080 HTTP dashboard, 9090 Prometheus metrics.
EXPOSE 50051/tcp 8080/tcp 9090/tcp
15 changes: 15 additions & 0 deletions aws-fargate/README.md
@@ -0,0 +1,15 @@
# SpiceDB AWS Fargate cluster
This reference config creates a Fargate cluster exposing gRPC port `50051` over an SSL endpoint.

## Setup
1. A network VPC is required for Fargate; usually one VPC is enough for multiple services.
   We will reference the SubnetA, SubnetB, and VPC IDs of this network VPC in the subsequent steps.
2. Create a cluster using your values for SubnetA, SubnetB, and VPC:
```
aws cloudformation create-stack --stack-name permissions-staging --template-body file://./aws-fargate/fargate.yaml --parameters ParameterKey=SubnetA,ParameterValue=subnet-0dd.........d9 ParameterKey=SubnetB,ParameterValue=subnet-062........a5 ParameterKey=VPC,ParameterValue=vpc-09a........57 --capabilities CAPABILITY_NAMED_IAM
```

TODO:
1. Parametrize: AWSAccountId, CertificateId, HostedZoneName, ServiceName, etc.
2. Create the ECR repository along with the rest of the stack resources.
3. Route traffic on ports 8080 and 9090 for the SpiceDB dashboard and metrics (a rough sketch follows below).
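
For TODO item 3, one possible direction (untested, assuming the dashboard and metrics stay behind the same load balancer) is an extra target group and listener per port; the resource names below are illustrative, and matching `PortMappings` and security-group ingress rules would also be needed:

```yaml
MetricsTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Port: 9090 # SpiceDB Prometheus metrics; use 8080 for the dashboard
    Protocol: HTTP
    TargetType: ip
    VpcId: !Ref VPC
MetricsListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref LoadBalancer
    Port: 9090
    Protocol: HTTPS
    Certificates:
      - CertificateArn: !Ref Certificate
    DefaultActions:
      - TargetGroupArn: !Ref MetricsTargetGroup
        Type: forward
```

The ECS service's `LoadBalancers` list and the task definition's `PortMappings` would need a corresponding entry for each additional port.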
283 changes: 283 additions & 0 deletions aws-fargate/fargate.yaml
@@ -0,0 +1,283 @@
AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation template for a SpiceDB cluster.
Parameters:
  VPC:
    Type: AWS::EC2::VPC::Id
  SubnetA:
    Type: AWS::EC2::Subnet::Id
  SubnetB:
    Type: AWS::EC2::Subnet::Id
  Certificate:
    Type: String
    # Update with the certificate ARN from Certificate Manager, which must exist in the same region.
    # In our case, it is permissions-staging.domain.com
    Default: 'arn:aws:acm:us-east-1:5xxxxxxxxx3:certificate/6e603b3c-....-....-....-72b1d8d4711b'
  Image:
    Type: String
    # Update with the Docker image. "You can use images in the Docker Hub registry or specify other repositories (repository-url/image:tag)."
    Default: 5xxxxxxxxx3.dkr.ecr.us-east-1.amazonaws.com/permissions-staging:latest
  ServiceName:
    Type: String
    # Update with the name of the service.
    Default: perms-stg-service
  ContainerPort:
    Type: Number
    Default: 50051
  LoadBalancerPort:
    Type: Number
    Default: 50051
  HealthCheckPath:
    Type: String
    Default: '/grpc.health.v1.Health/Check'
  HostedZoneName:
    Type: String
    Default: domain.com
  Subdomain:
    Type: String
    Default: permissions-staging
  # For autoscaling
  MinContainers:
    Type: Number
    Default: 1
  # For autoscaling
  MaxContainers:
    Type: Number
    Default: 2
  # Target CPU utilization (%)
  AutoScalingTargetValue:
    Type: Number
    Default: 50
Comment on lines +47 to +49

Contributor:

Autoscaling could be disruptive to a SpiceDB cluster, because the ring has to reconfigure, and it would potentially affect your deployment's cache hit ratios.

Author (@igorshmukler, Sep 13, 2022):

Could you please rephrase your comment as a question?

BTW, I do have a question here, since I know next to nothing about the project: Does SpiceDB support dynamic sizing/scaling or not?

Contributor:

My initial suggestion was not to enable autoscaling. While SpiceDB supports adding and removing nodes, this comes at the cost of reorganizing the hash ring and leads to a lower cache hit rate. However, after discussing with the team, there are things we could explore to reduce the impact of SpiceDB instances coming and going, and we certainly want to make autoscaling fully supported without any performance impact.

So you can dismiss my comment!
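
For deployments where the reduced cache hit rate matters, one low-effort mitigation (not part of this PR) is to pin the service size by giving the template's existing MinContainers and MaxContainers parameters the same value, so the hash ring never has to reconfigure:

```yaml
# Setting min == max keeps the scaling plumbing in place but holds the
# node count constant, avoiding hash-ring reshuffles and cache misses.
MinContainers:
  Type: Number
  Default: 2
MaxContainers:
  Type: Number
  Default: 2
```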

Resources:
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Join ['', [!Ref ServiceName, Cluster]]
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    # Makes sure the log group is created before it is used.
    DependsOn: LogGroup
    Properties:
      # Name of the task definition. Subsequent versions of the task definition are grouped together under this name.
      Family: !Join ['', [!Ref ServiceName, TaskDefinition]]
      # awsvpc is required for Fargate
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      # 256 (.25 vCPU) - Available memory values: 0.5GB, 1GB, 2GB
      # 512 (.5 vCPU) - Available memory values: 1GB, 2GB, 3GB, 4GB
      # 1024 (1 vCPU) - Available memory values: 2GB, 3GB, 4GB, 5GB, 6GB, 7GB, 8GB
      # 2048 (2 vCPU) - Available memory values: Between 4GB and 16GB in 1GB increments
      # 4096 (4 vCPU) - Available memory values: Between 8GB and 30GB in 1GB increments
      Cpu: 256
      # 0.5GB, 1GB, 2GB - Available cpu values: 256 (.25 vCPU)
      # 1GB, 2GB, 3GB, 4GB - Available cpu values: 512 (.5 vCPU)
      # 2GB, 3GB, 4GB, 5GB, 6GB, 7GB, 8GB - Available cpu values: 1024 (1 vCPU)
      # Between 4GB and 16GB in 1GB increments - Available cpu values: 2048 (2 vCPU)
      # Between 8GB and 30GB in 1GB increments - Available cpu values: 4096 (4 vCPU)
      Memory: 512
      # A role needed by ECS.
      # "The ARN of the task execution role that containers in this task can assume. All containers in this task are granted the permissions that are specified in this role."
      # "There is an optional task execution IAM role that you can specify with Fargate to allow your Fargate tasks to make API calls to Amazon ECR."
      ExecutionRoleArn: !Ref ExecutionRole
      # "The Amazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role that grants containers in the task permission to call AWS APIs on your behalf."
      TaskRoleArn: !Ref TaskRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: !Ref Image
          PortMappings:
            - ContainerPort: !Ref ContainerPort
          # Send logs to CloudWatch Logs
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region
              awslogs-group: !Ref LogGroup
              awslogs-stream-prefix: ecs
  # A role needed by ECS
  ExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ['', [!Ref ServiceName, ExecutionRole]]
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
  # A role for the containers
  TaskRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ['', [!Ref ServiceName, TaskRole]]
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      # ManagedPolicyArns:
      #   -
      # Policies:
      #   -
  # A role needed for auto scaling
  AutoScalingRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ['', [!Ref ServiceName, AutoScalingRole]]
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceAutoscaleRole'
  ContainerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Join ['', [!Ref ServiceName, ContainerSecurityGroup]]
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: !Ref ContainerPort
          ToPort: !Ref ContainerPort
          SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup
  LoadBalancerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Join ['', [!Ref ServiceName, LoadBalancerSecurityGroup]]
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: !Ref LoadBalancerPort
          ToPort: !Ref LoadBalancerPort
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
  Service:
    Type: AWS::ECS::Service
    # This dependency is needed so that the load balancer is set up correctly in time.
    DependsOn:
      - LoadBalancerListener
    Properties:
      ServiceName: !Ref ServiceName
      Cluster: !Ref Cluster
      TaskDefinition: !Ref TaskDefinition
      DeploymentConfiguration:
        MinimumHealthyPercent: 100
        MaximumPercent: 200
      DesiredCount: 2
      # This may need to be adjusted if the container takes a while to start up.
      HealthCheckGracePeriodSeconds: 30
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          # Change to DISABLED if you're using private subnets that have access to a NAT gateway.
          AssignPublicIp: ENABLED
          Subnets:
            - !Ref SubnetA
            - !Ref SubnetB
          SecurityGroups:
            - !Ref ContainerSecurityGroup
      LoadBalancers:
        - ContainerName: !Ref ServiceName
          ContainerPort: !Ref ContainerPort
          TargetGroupArn: !Ref TargetGroup
  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 300
      HealthCheckPath: !Ref HealthCheckPath
      HealthCheckTimeoutSeconds: 5
      UnhealthyThresholdCount: 2
      HealthyThresholdCount: 2
      HealthCheckEnabled: true
      HealthCheckPort: 'traffic-port'
      HealthCheckProtocol: HTTP
      # end of gRPC-specific configuration
      Name: !Join ['', [!Ref ServiceName, TargetGroup]]
      Port: !Ref ContainerPort
      Protocol: HTTP
      ProtocolVersion: GRPC
      Matcher:
        GrpcCode: 0
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 300 # default is 300
      TargetType: ip
      VpcId: !Ref VPC
  LoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref LoadBalancer
      Port: !Ref LoadBalancerPort
      Protocol: HTTPS
      SslPolicy: "ELBSecurityPolicy-2016-08"
      Certificates:
        - CertificateArn: !Ref Certificate
      DefaultActions:
        - Order: 1
          TargetGroupArn: !Ref TargetGroup
          Type: "forward"
  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      LoadBalancerAttributes:
        # This is the default, but is specified here in case it needs to be changed.
        - Key: idle_timeout.timeout_seconds
          Value: 60
      Name: !Join ['', [!Ref ServiceName, LoadBalancer]]
      # "internal" is also an option
      Scheme: internet-facing
      SecurityGroups:
        - !Ref LoadBalancerSecurityGroup
      Subnets:
        - !Ref SubnetA
        - !Ref SubnetB
  DNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: !Join ['', [!Ref HostedZoneName, '.']]
      Name: !Join ['', [!Ref Subdomain, '.', !Ref HostedZoneName, '.']]
      Type: A
      AliasTarget:
        DNSName: !GetAtt LoadBalancer.DNSName
        HostedZoneId: !GetAtt LoadBalancer.CanonicalHostedZoneID
  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Join ['', ['/ecs/', !Ref Subdomain]]
      RetentionInDays: 1
  AutoScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MinCapacity: !Ref MinContainers
      MaxCapacity: !Ref MaxContainers
      ResourceId: !Join ['/', [service, !Ref Cluster, !GetAtt Service.Name]]
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      # "The Amazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role that allows Application Auto Scaling to modify your scalable target."
      RoleARN: !GetAtt AutoScalingRole.Arn
  AutoScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: !Join ['', [!Ref ServiceName, AutoScalingPolicy]]
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 10
        ScaleOutCooldown: 10
        # Keep things at or lower than 50% CPU utilization, for example.
        TargetValue: !Ref AutoScalingTargetValue

Outputs:
  Endpoint:
    Description: Endpoint
    Value: !Join ['', ['https://', !Ref DNSRecord]]
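
With the default Subdomain and HostedZoneName parameters, this output resolves to https://permissions-staging.domain.com. Note that the listener serves gRPC over TLS on port 50051 (the LoadBalancerPort default) rather than standard HTTPS on 443, so clients must specify the port when connecting.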