* Support Speculative Decoding with vLLM runtime [https://github.com/kserve/kserve/issues/3800].
* Support LoRA adapters [https://github.com/kserve/kserve/issues/3750].
* Support LLM serving runtimes for TensorRT-LLM and TGI, and provide benchmarking comparisons [https://github.com/kserve/kserve/issues/3868].
* Support multi-host, multi-GPU inference runtime [https://github.com/kserve/kserve/issues/2145].
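
For context on the serving-runtime items above, the sketch below shows a minimal `InferenceService` using KServe's Hugging Face serving runtime, which can delegate generation to a vLLM backend where the model is supported. The service name, model ID, and resource sizes are illustrative placeholders, not values prescribed by this roadmap.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3                # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                 # Hugging Face serving runtime (vLLM backend where supported)
      args:
        - --model_name=llama3             # name the runtime exposes for inference requests
        - --model_id=meta-llama/meta-llama-3-8b-instruct   # illustrative model ID
      resources:
        requests:
          nvidia.com/gpu: "1"             # single-GPU example; multi-host/multi-GPU is a roadmap item above
        limits:
          nvidia.com/gpu: "1"
```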
- LLM Autoscaling
* Support Model Caching with automatic PV/PVC provisioning [https://github.com/kserve/kserve/issues/3869].
* Support Autoscaling settings for serving runtimes.
* Support Autoscaling based on custom metrics [https://github.com/kserve/kserve/issues/3561].
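
The `InferenceService` spec already exposes basic autoscaling fields, sketched below; the items above would extend this with per-runtime defaults and custom metrics. The values shown are illustrative, and which built-in metrics apply depends on the deployment mode.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-autoscaling-example           # placeholder name
spec:
  predictor:
    minReplicas: 1                        # scale bounds
    maxReplicas: 5
    scaleMetric: concurrency              # built-in options today include concurrency, rps, cpu, memory
    scaleTarget: 2                        # target value per replica for the chosen metric
    model:
      modelFormat:
        name: huggingface
```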
- LLM RAG/Agent Pipeline Orchestration
* Support declarative RAG/Agent workflow using KServe Inference Graph [https://github.com/kserve/kserve/issues/3829].
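
As an illustration of the Inference Graph item above, here is a hypothetical two-step RAG-style sequence. The graph name and the `retriever`/`generator` services are placeholders for separately deployed `InferenceService` resources.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: rag-pipeline                      # placeholder name
spec:
  nodes:
    root:
      routerType: Sequence                # run the steps one after another
      steps:
        - name: retrieve
          serviceName: retriever          # placeholder InferenceService that fetches context documents
        - name: generate
          serviceName: generator          # placeholder LLM InferenceService
          data: $response                 # pass the previous step's response as this step's request
```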
- Open Inference Protocol extension to GenAI Task APIs
* Community-maintained Open Inference Protocol repo for OpenAI schema [https://docs.google.com/document/d/1odTMdIFdm01CbRQ6CpLzUIGVppHSoUvJV_zwcX6GuaU].
* Support vertical GenAI Task APIs such as embedding, Text-to-Image, Text-to-Code, and Doc-to-Text [https://github.com/kserve/kserve/issues/3572].
- LLM Gateway
* Support multiple LLM providers.
* Support token-based rate limiting.
* Support an LLM router with traffic shaping, fallback, and load balancing.
* LLM Gateway observability for metrics and cost reporting.
## Objective: "Graduate core inference capability to stable/GA"
- Promote the `InferenceService` and `ClusterServingRuntime`/`ServingRuntime` CRDs to v1
* Improve `InferenceService` CRD for REST/gRPC protocol interface
* Improve model storage interface
* Deprecate the `TrainedModel` CRD and add multiple-model support for co-hosting, draft models, and LoRA adapters to `InferenceService`.
* Improve YAML UX for predictor and transformer container collocation (see the sketch below).
* Close the feature gap between `RawDeployment` and `Serverless` mode.
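
For the collocation item above, KServe already allows placing a transformer next to the model server by listing both containers under the predictor; the roadmap item is about making this YAML friendlier. The image names, args, and port below are illustrative, and the container names follow the current convention of `kserve-container` for the model server and `transformer-container` for the collocated transformer.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: collocation-example               # placeholder name
spec:
  predictor:
    containers:
      - name: kserve-container            # model server container
        image: pytorch/torchserve:latest-cpu      # illustrative image
      - name: transformer-container       # pre/post-processing container in the same pod
        image: example/my-transformer:latest      # placeholder image
        args:
          - --model_name=example-model            # placeholder model name
          - --predictor_host=localhost:8085       # reach the predictor over localhost; port depends on the model server
```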
- Open Inference Protocol
* Support batching for v2 inference protocol
* Transformer and Explainer v2 inference protocol interoperability
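
For reference, the items above build on the Open Inference Protocol (v2) request shape sketched below. It is shown as YAML for consistency with the other sketches, although on the wire it is a JSON body; the request ID, tensor name, and values are illustrative.

```yaml
# Open Inference Protocol (v2) inference request body (illustrative values)
id: request-1
inputs:
  - name: input-0
    shape: [2, 2]
    datatype: FP32
    data: [1.0, 2.0, 3.0, 4.0]            # row-major, flattened tensor contents
```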