Flame is a distributed system for intelligent workloads; it provides a suite of mechanisms that are commonly required by many classes of intelligent workload, including AI/ML, HPC, BigData and so on. Flame builds upon a decade and a half of experience running a wide variety of high performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open source community.
As more and more intelligent workload patten are adopted for the innovation, a common workload runtime is helpful to speed up those intelligent workloads by following aspects:
- Scale: Compared to the application in a single node, the Flame will scale up the workload to multiple nodes as much as possible to speed up it, e.g. distributed training; and it makes sure the resources are shared fairly between multiple tenants.
- Data sharing: Data is one of key factor for intelligent workload; the Flame will schedule not only the workload, but also the data. A distributed cached will be introduced in Flame, and it will schedule data & resources together to improve data sharing.
- Mix workloads: Batch (e.g. MPI) and Elastic are two major pattern for intelligent workloads, the Flame will manage those two kind of workload together by migrating 'message passing' to 'data driven'.
- Roundtrip/Throughput: Usually, intelligent workload includes tens of thousands of short tasks; the Flame leverages the latest features (e.g. Future, CondVar) to improve roundtrip and throughput in a large scale environment.
Session: One Session
represents a group of tasks of a job, the Session Scheduler
will allocate resources to each session based on scheduling configurations, by asking for resource manager (e.g. Kubernetes) to launch executors.
Task: The task of Session
which includes the major algorithm of the job by task's metadata and input/output info, e.g. volume path.
Executor: The Executor will handle the lifecycle management of Application/Service which is user's code to execute tasks. Usually, the applications are not reused between sessions, but the image maybe reused to avoid download.
Shim: The protocol implementation of Executor to manage application, e.g. gRPC, Restful, stdio and so on.
Cache: As the data is the key factor of intelligent workload, a distributed cache will be introduced in Flame to schedule data & resources together for better performance.
The Flame will accept connection from user's client, and create Session
s for the job; the client keeps submit tasks to the session until closing it, pre-defined replica is not necessary.
The Session Scheduler
will allocate resources to each session based on scheduling configurations, by asking for resource manager (e.g. Kubernetes) to launch executor.
The Executor will connect back to Flame by gRPC
to pull tasks from related Session
to reuse executor. The executor will be released/deleted if no more tasks in related session.
The service will get the notification when it's bound or unbound to the related session, so it can take action accordingly, e.g. connecting to database; and then, the service can pull tasks from Session
,
and reuse those data to speed up execution.
In the future, the Session scheduler
will provide several features to improve the performance and usage, e.g. proportion, delay release, min/max and so on.
In this guidance, minikube and skaffold are used to start a local kuberentes with Flame. After installing minikube and skaffold, we can start a local Flame cluster by the following steps:
$ minikube start --driver=docker
$ skaffold run
After the Flame clsuter was launched, use the following steps to login into flame-console
pod which is a debug tool for
both developer and SRE.
$ CONSOLE_POD=`kubectl get pod -n flame-system | grep flame-console | cut -d" " -f 1`
$ kubectl exec -it ${CONSOLE_POD} -n flame-system -- /bin/bash
And then, let's verify it with flmping
in the pod. In addition, there are also more meaningful examples here.
# flmping -t 10000
Session <4> was created in <0 ms>, start to run <10,000> tasks in the session:
[100%] ============================================= 10000/10000
<10,000> tasks was completed in <1,603 ms>.
We can check sessions' status by flmctl
as follow, it also includes several sub-commands, e.g. view
.
# flmctl list
ID State App Slots Pending Running Succeed Failed Created
1 Closed pi 1 0 0 100 0 07:33:58
2 Closed flmexec 1 0 0 10 0 07:34:12
3 Closed flmexec 1 0 0 10000 0 07:34:17
4 Closed flmexec 1 0 0 10000 0 08:34:20