docs: Add ModelMesh documentation (#110)

- Create /docs directory
- Add new docs on:
  - VModels
  - Configuration/Tuning
  - Scaling
  - Metrics
- Move existing docs into /docs directory:
  - Payload processing
  - Getting started, build, deployment

Signed-off-by: Rafael Vasquez <[email protected]>

Showing 10 changed files with 353 additions and 77 deletions.

---

# Overview

ModelMesh is a mature, general-purpose model serving management/routing layer designed for high-scale, high-density, and frequently-changing model use cases. It works with existing or custom-built model servers and acts as a distributed LRU cache for serving runtime models.

For full Kubernetes-based deployment and management of ModelMesh clusters and models, see the [ModelMesh Serving](https://github.com/kserve/modelmesh-serving) repo. This includes a separate controller and provides K8s custom resource based management of ServingRuntimes and InferenceServices, along with common, abstracted handling of model repository storage and ready-to-use integrations with some existing OSS model servers.

For more information on supported features and design details, see [these charts](https://github.com/kserve/modelmesh/files/8854091/modelmesh-jun2022.pdf).

## What is a model?

In ModelMesh, a **model** is an abstraction over a machine learning model; ModelMesh itself is not aware of the underlying model format. There are two model types: regular models and vmodels. Regular models in ModelMesh are assumed and required to be immutable, while vmodels add a layer of indirection in front of the immutable models. See the [VModels Reference](/docs/vmodels.md) for further reading.

## Implement a model runtime

1. Wrap your model-loading and invocation logic in this [model-runtime.proto](/src/main/proto/current/model-runtime.proto) gRPC service interface:
   - `runtimeStatus()` - called only during startup to obtain some basic configuration parameters from the runtime, such as version, capacity, and model-loading timeout.
   - `loadModel()` - load the specified model into memory from backing storage, returning when complete.
   - `modelSize()` - determine the size (memory usage) of a previously loaded model. If very fast, this can be omitted and instead provided in the response from `loadModel()`.
   - `unloadModel()` - unload a previously loaded model, returning when complete.
   - Use a separate, arbitrary gRPC service interface for model inferencing requests. It can have any number of methods, and they are assumed to be idempotent. See [predictor.proto](/src/test/proto/predictor.proto) for a very simple example.
   - The methods of your custom applier interface will be called only for already fully-loaded models.
2. Build a gRPC server Docker container which exposes these interfaces on localhost port 8085 or via a mounted unix domain socket.
3. Extend the [Kustomize-based Kubernetes manifests](/config) to use your Docker image, with appropriate memory and CPU resource allocations for your container.
4. Deploy to a Kubernetes cluster as a regular Service, which will expose [this gRPC service interface](/src/main/proto/current/model-mesh.proto) via kube-dns (you do not implement this interface yourself). Consume it from your upstream service components using the gRPC client of your choice (see the client sketch after this list):
   - `registerModel()` and `unregisterModel()` for registering/removing models managed by the cluster
   - Any custom inferencing interface methods to make a runtime invocation of a previously registered model, making sure to set an `mm-model-id` or `mm-vmodel-id` metadata header (or the `-bin` suffix equivalents for UTF-8 ids)
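
For illustration, here is a minimal, hypothetical Java client sketch for step 4. It assumes stubs generated from the custom inferencing interface of step 1 (`PredictorGrpc`, `PredictRequest`, and `predict` are placeholder names loosely modeled on [predictor.proto](/src/test/proto/predictor.proto)), and that `my-model-id` has already been registered via `registerModel()`:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Metadata;
import io.grpc.stub.MetadataUtils;

public class ModelMeshClientSketch {

    // Metadata header telling model-mesh which registered model to route to
    private static final Metadata.Key<String> MODEL_ID_HEADER =
            Metadata.Key.of("mm-model-id", Metadata.ASCII_STRING_MARSHALLER);

    public static void main(String[] args) {
        // kube-dns name of the deployed Service; 8033 is the default external gRPC port
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("model-mesh-service", 8033)
                .usePlaintext() // assumes TLS has not been configured
                .build();

        Metadata headers = new Metadata();
        headers.put(MODEL_ID_HEADER, "my-model-id");

        // Attach the routing header to every call made through this stub
        PredictorGrpc.PredictorBlockingStub predictor = PredictorGrpc.newBlockingStub(channel)
                .withInterceptors(MetadataUtils.newAttachHeadersInterceptor(headers));

        // Invoke the custom inferencing method; request fields are model-specific
        PredictResponse response = predictor.predict(PredictRequest.newBuilder().build());
        System.out.println(response);

        channel.shutdown();
    }
}
```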

## Development

Please see the [Developer Guide](/developer-guide.md) for details.

---

A core goal of the ModelMesh framework is to minimize the amount of custom configuration required; it should be possible to get up and running without changing most of these settings.

## Model Runtime Configuration

There are a few basic parameters (some optional) that the model runtime implementation must report in a `RuntimeStatusResponse` response to the `ModelRuntime.runtimeStatus` rpc method once it has successfully initialized:

- `uint64 capacityInBytes`
- `uint32 maxLoadingConcurrency`
- `uint32 modelLoadingTimeoutMs`
- `uint64 defaultModelSizeInBytes`
- `string runtimeVersion` (optional)
- ~~`uint64 numericRuntimeVersion`~~ (deprecated, unused)
- `map<string,MethodInfo> methodInfos` (optional)
- `bool allowAnyMethod` - applicable only if one or more `methodInfos` are provided
- `bool limitModelConcurrency` - (experimental)

It's expected that all model runtime instances in the same cluster (with the same Kubernetes deployment config, including image version) will report the same values for these parameters, although this is not strictly necessary.
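
As an illustrative sketch only, a runtime might report these values as follows. This assumes Java classes generated from [model-runtime.proto](/src/main/proto/current/model-runtime.proto); the `ModelRuntimeGrpc` and `RuntimeStatusResponse` names follow standard protobuf codegen conventions, but treat them (and the `Status.READY` value) as assumptions to verify against the proto file:

```java
import io.grpc.stub.StreamObserver;

// Skeleton of a runtime's status reporting; class names assumed to be
// generated from model-runtime.proto.
public class MyModelRuntime extends ModelRuntimeGrpc.ModelRuntimeImplBase {

    @Override
    public void runtimeStatus(RuntimeStatusRequest request,
                              StreamObserver<RuntimeStatusResponse> observer) {
        RuntimeStatusResponse status = RuntimeStatusResponse.newBuilder()
                .setStatus(RuntimeStatusResponse.Status.READY) // initialization is complete
                // Memory available to this instance for holding loaded models
                .setCapacityInBytes(8L * 1024 * 1024 * 1024) // 8 GiB
                // At most this many loadModel() calls will run concurrently
                .setMaxLoadingConcurrency(4)
                // Loads taking longer than this are treated as failed
                .setModelLoadingTimeoutMs(120_000)
                // Size assumed for a model before its actual size is known
                .setDefaultModelSizeInBytes(200L * 1024 * 1024) // 200 MiB
                .setRuntimeVersion("my-runtime-0.1.0") // optional
                .build();
        observer.onNext(status);
        observer.onCompleted();
    }
}
```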

## TLS (SSL) Configuration

This can be configured via environment variables on the ModelMesh container; refer to [the TLS documentation](/docs/configuration/tls.md).

## Model Auto-Scaling

Nothing needs to be configured to enable this; it is on by default. There is a single configuration parameter which can optionally be used to tune the sensitivity of the scaling, based on the rate of requests per model. Note that this applies to scaling copies of models within existing pods, not to scaling of the pods themselves.

The scale-up RPM threshold specifies a target request rate per model **copy**, measured in requests per minute. ModelMesh balances requests evenly between loaded copies of a given model, and if one copy's share of requests rises above this threshold, more copies will be added where possible in instances (replicas) that do not currently have the model loaded.

The default for this parameter is 2000 RPM. It can be overridden by setting either the `MM_SCALEUP_RPM_THRESHOLD` environment variable or the `scaleup_rpm_threshold` etcd/zookeeper dynamic config parameter, with the latter taking precedence.
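
For example, with the default threshold, a model receiving a sustained 5000 requests per minute across two loaded copies (2500 RPM per copy) exceeds the 2000 RPM per-copy target, so additional copies will be loaded on instances that do not yet have the model, where possible.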

Other points to note:

- Scale-up can happen by more than one additional copy at a time if the request rate breaches the configured threshold by a sufficient amount.
- The number of replicas in the deployment dictates the maximum number of copies that a given model can be scaled to (one in each pod).
- Models will scale to two copies if they have been used recently, regardless of load - the autoscaling behaviour applies between 2 and N>2 copies.
- Scale-down will occur slowly once the per-copy load remains below the configured threshold for long enough.
- If the runtime is in latency-based auto-scaling mode (i.e. it returns a non-default `limitModelConcurrency = true` in the `RuntimeStatusResponse`), scaling is triggered based on measured latencies/queuing rather than request rates, and the RPM threshold parameter has no effect.

## Request Header Logging

To have particular gRPC request metadata headers included in any request-scoped log messages, set the `MM_LOG_REQUEST_HEADERS` environment variable to a JSON string->string map (object) whose keys are the header names to log and whose values are the names of the corresponding entries to insert into the logger thread context map (MDC).

Values can be either raw ASCII or base64-encoded UTF-8; in the latter case the corresponding header name must end with `-bin`. For example:

```json
{
  "transaction_id": "txid",
  "user_id-bin": "user_id"
}
```

**Note**: this does not generate new log messages, and successful requests aren't logged by default. To log a message for every request, additionally set the `MM_LOG_EACH_INVOKE` environment variable to `true`.

## Other Optional Parameters

Set via environment variables on the ModelMesh container:

- `MM_SVC_GRPC_PORT` - external gRPC port, default 8033
- `INTERNAL_GRPC_SOCKET_PATH` - unix domain socket, which should be a file location on a persistent volume mounted in both the model-mesh and model runtime containers, default /tmp/mmesh/grpc.sock
- `INTERNAL_SERVING_GRPC_SOCKET_PATH` - unix domain socket to use for inferencing requests, defaults to the same as the primary domain socket
- `INTERNAL_GRPC_PORT` - pod-internal gRPC port (model runtime localhost), default 8056
- `INTERNAL_SERVING_GRPC_PORT` - pod-internal gRPC port to use for inferencing requests, defaults to the same as the primary pod-internal gRPC port
- `MM_SVC_GRPC_MAX_MSG_SIZE` - max message size in bytes, default 16MiB
- `MM_SVC_GRPC_MAX_HEADERS_SIZE` - max headers size in bytes, defaults to the gRPC default
- `MM_METRICS` - metrics configuration, see the Metrics documentation
- `MM_MULTI_PARALLELISM` - max multi-model request parallelism, default 4
- `KV_READ_ONLY` (advanced) - run in "read-only" mode where new (v)models cannot be registered or unregistered
- `MM_LOG_EACH_INVOKE` - log an INFO-level message for every request; default is false, set to true to enable
- `MM_SCALEUP_RPM_THRESHOLD` - see [Model Auto-Scaling](#model-auto-scaling) above

**Note**: only one of `INTERNAL_GRPC_SOCKET_PATH` and `INTERNAL_GRPC_PORT` can be set. The same goes for `INTERNAL_SERVING_GRPC_SOCKET_PATH` and `INTERNAL_SERVING_GRPC_PORT`.

Set dynamically in the kv-store (etcd or zookeeper):

- `log_each_invocation` - dynamic override of the `MM_LOG_EACH_INVOKE` env var
- `logger_level` - TODO
- `scaleup_rpm_threshold` - dynamic override of the `MM_SCALEUP_RPM_THRESHOLD` env var, see [Model Auto-Scaling](#model-auto-scaling) above

---

## Payload Processing Overview

ModelMesh exchanges `Payloads` with models deployed within runtimes. In ModelMesh, a `Payload` consists of information regarding the id of the model and the method of the model being called, together with some data (the actual binary requests or responses) and metadata (e.g., headers).

A `PayloadProcessor` is responsible for processing such `Payloads` for models served by ModelMesh. Examples include loggers of prediction requests, data sinks for data visualization, and model quality assessment or monitoring tooling.

`PayloadProcessors` can be configured to only look at payloads that are consumed and produced by certain models, payloads containing certain headers, etc. This configuration is performed at the ModelMesh instance level. Multiple `PayloadProcessors` can be configured per ModelMesh instance, and each can be set to care about specific portions of the payload (e.g., model inputs, model outputs, metadata, specific headers, etc.).

As an example, a `PayloadProcessor` can see input data like the following:

```text
[mmesh.ExamplePredictor/predict, Metadata(content-type=application/grpc,user-agent=grpc-java-netty/1.51.1,mm-model-id=myModel,another-custom-header=custom-value,grpc-accept-encoding=gzip,grpc-timeout=1999774u), CompositeByteBuf(ridx: 0, widx: 2000004, cap: 2000004, components=147)
```

and/or output data as a `ByteBuf`:

```text
java.nio.HeapByteBuffer[pos=0 lim=65 cap=65]
```

A `PayloadProcessor` can be configured by means of a whitespace-separated `String` of URIs. For example, in a URI like `logger:///*?pytorch1234#predict`:

- the scheme represents the type of processor, e.g., `logger`
- the query represents the model id to observe, e.g., `pytorch1234`
- the fragment represents the method to observe, e.g., `predict`

## Featured `PayloadProcessors`

- `logger` : logs request/response payloads to the `model-mesh` logs (at _INFO_ level), e.g., use `logger://*` to log every `Payload`
- `http` : sends request/response payloads to a remote service (via _HTTP POST_), e.g., use `http://10.10.10.1:8080/consumer/kserve/v2` to send every `Payload` to the specified HTTP endpoint
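
Since the configuration value is a whitespace-separated string of URIs, multiple processors can be active at once. For example, the following value (reusing the example URIs above; the HTTP endpoint address is illustrative) logs the `predict` payloads of model `pytorch1234` and also POSTs every payload to the remote consumer:

```text
logger:///*?pytorch1234#predict http://10.10.10.1:8080/consumer/kserve/v2
```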

---

ModelMesh relies on [Kubernetes for rolling updates](https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/). For the sake of simplicity and elasticity, ModelMesh does not track update state internally.

## Scaling Up/Down

ModelMesh follows the process below, skipping the termination/migration steps in the context of scaling up (adding new pods):

1. A new Pod with updates starts.
2. Kubernetes waits for the new Pod to report a `Ready` state.
3. Once it is ready, Kubernetes triggers termination of the old Pod.
4. Once the old Pod receives a termination signal from Kubernetes, it begins to migrate its models to other instances.

Asynchronously, ModelMesh will try to rebalance the model distribution among all pods in the `Ready` state.

## Fail Fast with Readiness Probes

When an update triggers a cluster-wide failure in which existing models fail to load on the new pods, fail-fast protection prevents the old cluster from shutting down completely by using [readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes).

ModelMesh achieves fail-fast by collecting statistics about loading failures during the startup period. Specifically:

1. Critical failure - a model loaded successfully on other pods, but cannot be loaded on this pod.
2. General failure - a new model cannot be loaded on this pod.

These statistics are only collected during the startup period, whose length is controlled by the environment variable `BOOTSTRAP_CLEARANCE_PERIOD_MS`. Once the failure statistics exceed the threshold on certain pods, those pods will start to report a `NOT READY` state, which prevents the old pods from terminating.

The default `BOOTSTRAP_CLEARANCE_PERIOD_MS` is 3 minutes (180,000 ms).

**Note**: you may also want to tweak the readiness probes' parameters. For example, increasing `initialDelaySeconds` can help avoid shutting down the old pods too early.

## Rolling Update Configuration

Specify `maxUnavailable` and `maxSurge` [as described here](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment) to control the rolling update process.