The High Throughput Compute Grid project (HTC-Grid) is a container-based, cloud-native HPC/grid environment. The project provides a reference architecture that can be used to build and adapt a modern high throughput compute solution on underlying AWS services, allowing users to submit high volumes of short- and long-running tasks and to scale environments dynamically.
Warning: This project is Open Source (Apache 2.0 License) and is not a supported AWS service offering.
HTC-Grid should be used when the following criteria are met:
- A high task throughput is required (from 250 to 10,000+ tasks per second).
- The tasks are loosely coupled.
- Variable workloads (tasks with heterogeneous execution times) are expected and the solution needs to dynamically scale with the load.
HTC-Grid might not be the best choice if:
- The required task throughput is below 250 tasks per second: use AWS Batch instead.
- The tasks are tightly coupled or use MPI. Consider using either AWS ParallelCluster or AWS Batch multi-node parallel jobs instead.
- The tasks use third-party licensed software.
The following documentation describes HTC-Grid's system architecture, development guides, and troubleshooting in further detail.
This section steps through HTC-Grid's AWS infrastructure and software prerequisites. An AWS account is required, along with some limited familiarity with AWS services and Terraform. Executing the Getting Started section will create AWS resources that are not included in the free tier and will therefore incur costs to your AWS account. The complete execution of this section will cost at least $50 per day.
The following resources should be installed on your local machine (only Linux and macOS are supported):
- docker version > 1.19
- kubectl version > 1.19 (usually installed alongside Docker)
- python 3.7
- helm version > 3
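You can quickly confirm that the required tools are available and recent enough with the commands below (exact output formats vary by version):
docker --version
kubectl version --client
python3.7 --version
helm version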
Unpack the provided HTC-Grid software archive (e.g., htc-grid-0.1.0.tar.gz) or clone the repository into a local directory of your choice; this directory is referred to in this documentation as <project_root>. Unless stated otherwise, all paths referenced in this documentation are relative to <project_root>.
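For example, if you downloaded the release archive, unpacking it could look like the sketch below (the archive and directory names depend on the version you downloaded and are shown here as an assumption):
tar -xzf htc-grid-0.1.0.tar.gz
cd htc-grid-0.1.0   # this directory becomes <project_root>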
For first-time users or Windows users, we recommend using Cloud9 as the platform to deploy HTC-Grid. The installation process uses Terraform and make to build the artifacts and the environment. This project provides a CloudFormation Cloud9 stack that installs all the prerequisites listed above to deploy and develop HTC-Grid. Just follow the standard process in your account to deploy the Cloud9 CloudFormation stack. Once the CloudFormation stack has been created, open either the Outputs section in CloudFormation or go to Cloud9 in your AWS console and open the newly created Cloud9 environment.
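If you prefer the CLI to the console, deploying a CloudFormation template generally follows the pattern sketched below; the template path and stack name are placeholders, not the actual artifact names shipped with the project:
aws cloudformation deploy \
  --stack-name htc-grid-cloud9 \
  --template-file <path-to-the-cloud9-template>.yaml \
  --capabilities CAPABILITY_IAM   # IAM capability assumed, since such stacks typically create roles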
Configure the AWS CLI to use your AWS account: see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
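If the CLI has not been configured yet, a minimal interactive setup looks like this (the values shown are placeholders; on Cloud9, temporary managed credentials are usually already in place and this step can be skipped):
$ aws configure
AWS Access Key ID [None]: <your access key id>
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: eu-west-1
Default output format [None]: json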
Check connectivity as follows:
$ aws sts get-caller-identity
{
"Account": "XXXXXXXXXXXX",
"UserId": "XXXXXXXXXXXXXXXXXXXXX",
"Arn": "arn:aws:iam::XXXXXXXXXXXX:user/XXXXXXX"
}
The current release of HTC-Grid requires Python 3.7, and the documentation assumes the use of virtualenv. Set this up as follows:
$ cd <project_root>/
$ virtualenv --python=$PATH/python3.7 venv
created virtual environment CPython3.7.10.final.0-64 in 1329ms
creator CPython3Posix(dest=<project_root>/venv, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/user/Library/Application Support/virtualenv)
added seed packages: pip==21.0.1, setuptools==54.1.2, wheel==0.36.2
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
Check that you have the correct version of Python (3.7.x), with a path rooted at <project_root>, then activate the environment:
$ source ./venv/bin/activate
(venv) 8c8590cffb8f:htc-grid-0.0.1 $
Check the python version as follows:
$ which python
<project_root>/venv/bin/python
$ python -V
Python 3.7.10
For further details on virtualenv see https://sourabhbajaj.com/mac-setup/Python/virtualenv.html
- To simplify this installation, it is suggested that a unique name (to be used later) is also used to prefix the different required buckets. The TAG needs to follow the S3 bucket naming rules.
export TAG=<Your tag>
- Define the AWS account ID where the grid will be deployed:
export HTCGRID_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
- Define the region where the grid will be deployed:
export HTCGRID_REGION=<Your region>
<Your region> can be, for example (the list is not exhaustive): eu-west-1, eu-west-2, eu-west-3, eu-central-1, us-east-1, us-west-2, ap-northeast-1, ap-southeast-1
- In the following step we create a unique suffix (taken from a UUID) to ensure the buckets created are unique, then we create the three environment variables with the S3 bucket names. The S3 buckets will contain:
  - S3_IMAGE_TFSTATE_HTCGRID_BUCKET_NAME: the environment variable for the S3 bucket that holds the Terraform state used to transfer the HTC-Grid docker images to your ECR repository.
  - S3_TFSTATE_HTCGRID_BUCKET_NAME: the environment variable for the S3 bucket that holds the Terraform state for the installation of the HTC-Grid project.
  - S3_LAMBDA_HTCGRID_BUCKET_NAME: the environment variable for the S3 bucket that holds the code to be executed when a task is invoked.
export S3_UUID=$(uuidgen | sed 's/.*-\(.*\)/\1/g' | tr '[:upper:]' '[:lower:]')
export S3_IMAGE_TFSTATE_HTCGRID_BUCKET_NAME="${TAG}-image-tfstate-htc-grid-${S3_UUID}"
export S3_TFSTATE_HTCGRID_BUCKET_NAME="${TAG}-tfstate-htc-grid-${S3_UUID}"
export S3_LAMBDA_HTCGRID_BUCKET_NAME="${TAG}-lambda-unit-htc-grid-${S3_UUID}"
- The following step creates the S3 buckets that will be needed during the installation:
aws s3 --region $HTCGRID_REGION mb s3://$S3_IMAGE_TFSTATE_HTCGRID_BUCKET_NAME
aws s3 --region $HTCGRID_REGION mb s3://$S3_TFSTATE_HTCGRID_BUCKET_NAME
aws s3 --region $HTCGRID_REGION mb s3://$S3_LAMBDA_HTCGRID_BUCKET_NAME
The HTC-Grid project has external software dependencies that are deployed as container images. Instead of downloading them each time from the public DockerHub repository, this step pulls those dependencies and uploads them into your Amazon Elastic Container Registry (ECR).
Important Note: HTC-Grid uses a few open source projects with container images stored on DockerHub. DockerHub has a download rate limit policy, which may impact you when running this step as an anonymous user, as you can get errors when running the terraform command below. To overcome those errors, you can re-run the terraform command and wait until the throttling limit is lifted, or optionally you can create an account on hub.docker.com and use its credentials with docker login locally to avoid the anonymous throttling limitations.
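If you do hit the rate limit during the terraform apply step described below, one option (a sketch, not part of the project tooling) is to wrap the apply in a simple retry loop; note that -auto-approve skips the interactive confirmation:
until terraform apply -var-file ./images_config.json -var "region=$HTCGRID_REGION" -parallelism=1 -auto-approve; do
  echo "terraform apply failed (possibly registry throttling), retrying in 60 seconds..."
  sleep 60
done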
- As you'll be uploading images to ECR, to avoid timeouts, refresh your ECR authentication token:
aws ecr get-login-password --region $HTCGRID_REGION | docker login --username AWS --password-stdin $HTCGRID_ACCOUNT_ID.dkr.ecr.$HTCGRID_REGION.amazonaws.com
- From the <project_root>, go to the image repository folder:
cd ./deployment/image_repository/terraform
- Now run the command:
terraform init -backend-config="bucket=$S3_IMAGE_TFSTATE_HTCGRID_BUCKET_NAME" \
               -backend-config="region=$HTCGRID_REGION"
- If successful, you can now run terraform apply to create the HTC-Grid infrastructure. This can take between 10 and 15 minutes depending on your Internet connection.
terraform apply -var-file ./images_config.json -var "region=$HTCGRID_REGION" -parallelism=1
NB: This operation fetches images from external repositories and creates a copy in your ECR account. The fetch from external repositories may occasionally fail temporarily, depending on the state of those repositories. If the terraform apply fails with errors such as the one below, re-run the command until terraform apply completes successfully.
name unknown: The repository with name 'xxxxxxxxx' does not exist in the registry with id
HTC-Grid artifacts include: Python packages, docker images, and configuration files for HTC-Grid and Kubernetes. To build and install these:
- Now build the images for the HTC agent. Return to <project_root> and run the command:
make happy-path TAG=$TAG ACCOUNT_ID=$HTCGRID_ACCOUNT_ID REGION=$HTCGRID_REGION BUCKET_NAME=$S3_LAMBDA_HTCGRID_BUCKET_NAME
- If TAG is omitted, then mainline will be used as the default value.
- If ACCOUNT_ID is omitted, then the value will be resolved by the following command:
aws sts get-caller-identity --query 'Account' --output text
- If REGION is omitted, then eu-west-1 will be used.
- BUCKET_NAME refers to the name of the bucket created at the beginning for storing the HTC-Grid workload lambda function. This variable is mandatory.
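For example, a fully explicit invocation could look like the line below (all values are illustrative; substitute your own tag, account ID, region and bucket name):
make happy-path TAG=myhtcgrid ACCOUNT_ID=123456789012 REGION=eu-west-1 BUCKET_NAME=myhtcgrid-lambda-unit-htc-grid-abc123def456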
A folder named generated will be created at <project_root>. This folder should contain the following two files:
- grid_config.json: a configuration file for the grid with basic settings
- single-task-test.yaml: the Kubernetes configuration for running a single task on the grid
The grid_config.json is ready to deploy, but you can tune it before deployment.
Some important parameters are:
- region : the AWS region where all resources are going to be created.
- grid_storage_service : the type of storage used for task payloads, configurable between S3 and Redis
- eks_worker : an array describing the autoscaling group used by EKS
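If you want to inspect these settings before deploying, a quick way is the jq query below (this assumes jq is installed and that the parameters appear as top-level keys, which may vary between releases):
jq '{region, grid_storage_service, eks_worker}' ./generated/grid_config.json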
The deployment time is about 30 min.
- From the project root:
cd ./deployment/grid/terraform
- Run:
terraform init -backend-config="bucket=$S3_TFSTATE_HTCGRID_BUCKET_NAME" \
               -backend-config="region=$HTCGRID_REGION"
- If successful, you can run terraform apply to create the infrastructure. HTC-Grid deploys a Grafana instance behind Cognito. The admin password is configurable and should be passed at this stage.
terraform apply -var-file ../../../generated/grid_config.json -var="grafana_admin_password=<my_grafana_admin_password>"
- If terraform apply is successful, then two files are created in the terraform folder:
  - kubeconfig_htc_$TAG: this file gives access to the EKS cluster through kubectl (example: kubeconfig_htc_aws_my_project)
  - Agent_config.json: this file contains all the parameters needed for the agent to run in the infrastructure
- Set the connection with the EKS cluster:
- If using terraform v0.14.9:
export KUBECONFIG=$(terraform output -raw kubeconfig)
- If using terraform v0.13.4:
export KUBECONFIG=$(terraform output kubeconfig)
- Testing the Deployment
- Get the number of nodes in the cluster using the command below. Note: you should have one or more nodes. If not, please review the configuration files, particularly the variable eks_worker.
kubectl get nodes
- Check if the system pods are running using the command below. Note: all pods should be in the Running state (this might take a minute, but no more).
kubectl -n kube-system get po
- Check if logging and monitoring are deployed using the command below. Note: all pods should be in the Running state (this might take a minute, but no more).
kubectl -n amazon-cloudwatch get po
- Check if the metrics server is deployed using the command below. Note: all pods should be in the Running state (this might take a minute, but no more).
kubectl -n custom-metrics get po
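Instead of re-running the get po commands until everything settles, you can optionally block until the pods report Ready, for example:
kubectl wait --for=condition=Ready pods --all -n kube-system --timeout=300s
kubectl wait --for=condition=Ready pods --all -n amazon-cloudwatch --timeout=300s
kubectl wait --for=condition=Ready pods --all -n custom-metrics --timeout=300s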
In the folder mock_computation, you will find the code of the C++ program mocking computation. This program can sleep for a given duration or emulate CPU/memory consumption based on the input parameters. We will use a Kubernetes Job to submit one execution of 1 second of this C++ program. The communication between the job and the grid is implemented by a client in the folder ./examples/client/python.
- Make sure the connection with the grid is established:
kubectl get nodes
If an error is returned, please go back to step 2 of the previous section.
- Change directory to <project_root>
- Run the test:
kubectl apply -f ./generated/single-task-test.yaml
- Look at the log of the submission:
kubectl logs job/single-task -f
The test should take about 3 seconds to execute. If you see a successful message without exceptions raised, then the test has executed successfully.
- Clean up the job submission instance:
kubectl delete -f ./generated/single-task-test.yaml
The HTC-Grid project captures metrics into InfluxDB and exposes those metrics through Grafana. To secure Grafana, we use Amazon Cognito. You will need to add a user, using your email and a password, to access the Grafana landing page.
- To find out the HTTPS endpoint where Grafana has been deployed, type:
kubectl -n grafana get ingress | tail -n 1 | awk '{ print "Grafana URL -> https://"$4 }'
It should output something like:
Grafana URL -> https://k8s-grafana-grafanai-XXXXXXXXXXXX-YYYYYYYYYYY.eu-west-2.elb.amazonaws.com
Then take the ADDRESS part and point your browser at it. Note: it will generate a warning as we are using self-signed certificates. Just accept the self-signed certificate to get into Grafana.
- Log into the URL. The Cognito login screen will come up; use it to sign up with your email and a password.
- On the AWS Console, open Cognito and select htc_pool in the users_pool section, then select users and groups and confirm the user that you just created. This will allow the user to log in with the credentials you provided in the previous step.
- Go to the Grafana URL above and log in with the credentials that you just signed up with and confirmed. This will take you to the Grafana dashboard landing page.
- Finally, on the Grafana landing page, you can use the user admin and the password that you provided in the Deploying HTC-Grid section. If you did not provide any password, the project sets the default htcadmin. We encourage everyone to set a password, even if the Grafana dashboard is protected through Cognito.
The destruction time is about 15 min.
- Go to the terraform grid folder ./deployment/grid/terraform.
- To remove the grid resources, run the following command:
terraform destroy -var-file ../../../generated/grid_config.json
- To remove the images from the ECR repository, go to the images folder deployment/image_repository/terraform and execute:
terraform destroy -var-file ./images_config.json -var "region=$HTCGRID_REGION"
- Finally, this leaves only three resources that you need to clean up manually: the S3 buckets. You can remove them using the following commands:
aws s3 --region $HTCGRID_REGION rb --force s3://$S3_IMAGE_TFSTATE_HTCGRID_BUCKET_NAME
aws s3 --region $HTCGRID_REGION rb --force s3://$S3_TFSTATE_HTCGRID_BUCKET_NAME
aws s3 --region $HTCGRID_REGION rb --force s3://$S3_LAMBDA_HTCGRID_BUCKET_NAME
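To double-check that the buckets are gone, you can list the remaining buckets filtered by your tag; assuming no other buckets in the account share the prefix, this should print nothing:
aws s3 ls | grep "$TAG" || echo "No HTC-Grid buckets remaining"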
- Go to the root of the git repository.
- To build the documentation, run the following command:
make doc
or, for serving the documentation:
make serve