* Support Speculative Decoding with vLLM runtime [https://github.com/kserve/kserve/issues/3800].
* Support LoRA adapters [https://github.com/kserve/kserve/issues/3750].
* Support LLM serving runtimes for TensorRT-LLM and TGI, and provide benchmarking comparisons [https://github.com/kserve/kserve/issues/3868].
* Support multi-host, multi-GPU inference runtime [https://github.com/kserve/kserve/issues/2145].
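
For context on the serving-runtime items above, the sketch below shows a minimal `InferenceService` using KServe's Hugging Face serving runtime, which can delegate generation to a vLLM backend where the model is supported. The service name, model ID, and resource sizes are illustrative placeholders, not values prescribed by this roadmap.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3                # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                 # Hugging Face serving runtime (vLLM backend where supported)
      args:
        - --model_name=llama3             # name the runtime exposes for inference requests
        - --model_id=meta-llama/meta-llama-3-8b-instruct   # illustrative model ID
      resources:
        requests:
          nvidia.com/gpu: "1"             # single-GPU example; multi-host/multi-GPU is a roadmap item above
        limits:
          nvidia.com/gpu: "1"
```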
- LLM Autoscaling
* Support Model Caching with automatic PV/PVC provisioning [https://github.com/kserve/kserve/issues/3869].
* Support Autoscaling settings for serving runtimes.
* Support Autoscaling based on custom metrics [https://github.com/kserve/kserve/issues/3561].
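
The `InferenceService` spec already exposes basic autoscaling fields, sketched below; the items above would extend this with per-runtime defaults and custom metrics. The values shown are illustrative, and which built-in metrics apply depends on the deployment mode.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-autoscaling-example           # placeholder name
spec:
  predictor:
    minReplicas: 1                        # scale bounds
    maxReplicas: 5
    scaleMetric: concurrency              # built-in options today include concurrency, rps, cpu, memory
    scaleTarget: 2                        # target value per replica for the chosen metric
    model:
      modelFormat:
        name: huggingface
```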
- LLM RAG/Agent Pipeline Orchestration
* Support declarative RAG/Agent workflow using KServe Inference Graph [https://github.com/kserve/kserve/issues/3829].
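
As an illustration of the Inference Graph item above, here is a hypothetical two-step RAG-style sequence. The graph name and the `retriever`/`generator` services are placeholders for separately deployed `InferenceService` resources.

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: rag-pipeline                      # placeholder name
spec:
  nodes:
    root:
      routerType: Sequence                # run the steps one after another
      steps:
        - name: retrieve
          serviceName: retriever          # placeholder InferenceService that fetches context documents
        - name: generate
          serviceName: generator          # placeholder LLM InferenceService
          data: $response                 # pass the previous step's response as this step's request
```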
- Open Inference Protocol extension to GenAI Task APIs
* Community-maintained Open Inference Protocol repo for OpenAI schema [https://docs.google.com/document/d/1odTMdIFdm01CbRQ6CpLzUIGVppHSoUvJV_zwcX6GuaU].
* Support vertical GenAI Task APIs such as embedding, Text-to-Image, Text-to-Code, and Doc-to-Text [https://github.com/kserve/kserve/issues/3572].
- LLM Gateway
* Support multiple LLM providers.
* Support token-based rate limiting.
* Support an LLM router with traffic shaping, fallback, and load balancing.
* LLM Gateway observability for metrics and cost reporting.
## Objective: "Graduate core inference capability to stable/GA"
- Promote the `InferenceService` and `ClusterServingRuntime`/`ServingRuntime` CRDs to v1
* Improve `InferenceService` CRD for REST/gRPC protocol interface
* Improve model storage interface
* Deprecate the `TrainedModel` CRD and add multiple-model support for co-hosting, draft models, and LoRA adapters to `InferenceService`.
* Improve YAML UX for predictor and transformer container collocation (see the sketch below).
* Close the feature gap between `RawDeployment` and `Serverless` mode.
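
For the collocation item above, KServe already allows placing a transformer next to the model server by listing both containers under the predictor; the roadmap item is about making this YAML friendlier. The image names, args, and port below are illustrative, and the container names follow the current convention of `kserve-container` for the model server and `transformer-container` for the collocated transformer.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: collocation-example               # placeholder name
spec:
  predictor:
    containers:
      - name: kserve-container            # model server container
        image: pytorch/torchserve:latest-cpu      # illustrative image
      - name: transformer-container       # pre/post-processing container in the same pod
        image: example/my-transformer:latest      # placeholder image
        args:
          - --model_name=example-model            # placeholder model name
          - --predictor_host=localhost:8085       # reach the predictor over localhost; port depends on the model server
```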
- Open Inference Protocol
* Support batching for v2 inference protocol
* Transformer and Explainer v2 inference protocol interoperability
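
For reference, the items above build on the Open Inference Protocol (v2) request shape sketched below. It is shown as YAML for consistency with the other sketches, although on the wire it is a JSON body; the request ID, tensor name, and values are illustrative.

```yaml
# Open Inference Protocol (v2) inference request body (illustrative values)
id: request-1
inputs:
  - name: input-0
    shape: [2, 2]
    datatype: FP32
    data: [1.0, 2.0, 3.0, 4.0]            # row-major, flattened tensor contents
```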