LMCacheConnectorV1 in vLLMDisaggregated mode does not perform real KV reuse #1069

@xrwang8

Description

Summary

When deploying a ModelBooster with backend.type: vLLMDisaggregated and vLLM LMCacheConnectorV1, Kthena can create the resources and route requests through prefill and decode pods, but LMCache does not appear to perform real KV reuse between the prefill and decode instances.

The current router implementation seems to map lmcache to the generic HTTPConnector, so the prefill response is not parsed and no LMCache/vLLM transfer metadata is passed to the decode request. This makes the deployment look like PD disaggregation at the request-routing level, but the decode side does not reuse KV cache generated by the prefill side.

Environment

  • Cluster: 2-node RTX 3090 cluster (xrnode41, xrnode44)
  • Kthena version: v0.4.0
  • Namespace: default
  • vLLM image: vllm-openai:v0.20.0
  • Model: Qwen/Qwen2.5-0.5B-Instruct
  • Backend type: vLLMDisaggregated
  • Connector under test: LMCacheConnectorV1

Minimal configuration used

The following minimal LMCache configuration follows the current Kthena example style:

apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: qwen-pd-offline
  namespace: default
spec:
  backend:
    name: qwen-pd
    type: vLLMDisaggregated
    modelURI: "ms://Qwen/Qwen2.5-0.5B-Instruct"
    cacheURI: "hostpath://tmp/cache"
    runtimeClassName: nvidia
    minReplicas: 1
    maxReplicas: 1
    env:
      - name: KTHENA_SKIP_ENGINE_DEPENDENCY_INSTALL
        value: "1"
      - name: PYTORCH_CUDA_ALLOC_CONF
        value: "expandable_segments:True"
      - name: PYTHONHASHSEED
        value: "0"
      - name: UCX_TLS
        value: "tcp"
    workers:
      - type: prefill
        image: vllm-openai:v0.20.0-cu130-lmcache-nixl-mooncake
        replicas: 1
        pods: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "4"
            memory: "8Gi"
        config:
          served-model-name: Qwen2.5-0.5B-Instruct
          tensor-parallel-size: 1
          gpu-memory-utilization: "0.90"
          max-model-len: 32768
          kv-transfer-config: |
            {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config":{"discard_partial_chunks":true,"lmcache_rpc_port":"10086"}}
      - type: decode
        image: 172.28.0.32:3443/jdcloud/kserve/servingruntimes/vllm-openai:v0.20.0-cu130-lmcache-nixl-mooncake
        replicas: 1
        pods: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "4"
            memory: "8Gi"
        config:
          served-model-name: Qwen2.5-0.5B-Instruct
          tensor-parallel-size: 1
          gpu-memory-utilization: "0.90"
          max-model-len: 32768
          kv-transfer-config: |
            {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config":{"discard_partial_chunks":true,"lmcache_rpc_port":"10086"}}

Observed behavior

The generated ModelServer has kvConnector.type: lmcache:

spec:
  inferenceEngine: vLLM
  kvConnector:
    type: lmcache
  model: Qwen2.5-0.5B-Instruct

Both prefill and decode pods become ready:

qwen-pd-offline-qwen-pd-0-decode-0-0    2/2     Running
qwen-pd-offline-qwen-pd-0-prefill-0-0   2/2     Running

A request through the Kthena router succeeds:

curl -X POST http://10.43.65.139/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen-pd-offline","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'

The response is a normal chat completion, and the router logs show a successful route through the ModelRoute and ModelServer:

"POST /v1/chat/completions HTTP/1.1" 200 model_name=qwen-pd-offline model_route=default/qwen-pd-offline model_server=default/qwen-pd-offline-qwen-pd

However, the vLLM/LMCache logs show no external KV cache reuse on the decode side:

LMCache INFO: Reqid: ..., Total tokens 30, Inference Engine computed tokens: 0, LMCache hit tokens: 0, need to load: 0
Engine 000: ... Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%

This suggests that the deployment is only performing request-level prefill/decode routing, not real LMCache-backed KV reuse.

Additional attempted configuration

We also tried a more explicit LMCache PD/NIXL-style configuration with fields such as:

{
  "lmcache.enable_pd": true,
  "lmcache.pd_role": "sender",
  "lmcache.transfer_channel": "nixl",
  "lmcache.nixl_backends": ["UCX"],
  "lmcache.pd_skip_proxy_notification": true,
  "lmcache.pd_peer_host": "...",
  "lmcache.pd_peer_init_port": "10086",
  "lmcache.pd_peer_alloc_port": "10087",
  "lmcache.pd_buffer_device": "cuda"
}

In that mode, requests failed with decode-side errors (HTTP 500 responses from the decode request). The failure is consistent with the decode side expecting LMCache/orchestrator state that is not populated by the current Kthena router path.

Code analysis

From the current code, LMCacheConnectorV1 is converted into Kthena's ConnectorTypeLMCache:

var VLLMKvConnectorType = map[string]networking.KVConnectorType{
    "MooncakeConnector":  networking.ConnectorTypeMoonCake,
    "NixlConnector":      networking.ConnectorTypeNIXL,
    "LMCacheConnectorV1": networking.ConnectorTypeLMCache,
}

But the router factory currently registers the generic HTTP connector for lmcache:

factory.RegisterConnectorBuilder(v1alpha1.ConnectorTypeLMCache, NewHTTPConnector) // LMCache uses HTTP connector for now

HTTPConnector.Proxy() sends a prefill request and then a decode request, but does not parse the prefill response or pass any KV transfer metadata to decode:

err := h.prefill(h.prefillRequest, prefillAddr)
if err != nil {
    return 0, err
}
result, decodeErr := h.decode(c, h.decodeRequest, decodeAddr)

The prefill transport likewise checks only the HTTP status and discards the response body:

func prefillerProxy(_ *gin.Context, req *http.Request) error {
    resp, err := http.DefaultTransport.RoundTrip(req)
    ...
    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        return fmt.Errorf("prefill request failed with status %d", resp.StatusCode)
    }
    return nil
}

By contrast, NIXLConnector actively parses kv_transfer_params from the prefill response and injects them into the decode request:

kvTransferParams, err := n.prefill(n.prefillRequest, prefillAddr)
...
decodeReq := n.buildDecodeRequest(c, n.decodeRequestBody, kvTransferParams)

This difference appears to explain why NixlConnector can achieve real KV transfer in the same environment, while LMCacheConnectorV1 cannot.

The proposal document also states that lmcache currently defaults to HTTP connector behavior:

This is the default connector and is used for http, lmcache, and mooncake connector types.

and later:

While lmcache and mooncake currently default to the http connector's behavior...

Expected behavior

One of the following would be helpful:

  1. A dedicated Kthena LMCacheConnector implementation that performs the LMCache/vLLM PD orchestration required for real KV reuse.
  2. Documentation clarifying that lmcache currently means HTTP-level prefill/decode routing only, and does not imply real LMCache KV transfer/reuse.
  3. A supported deployment recipe using the LMCache/vLLM disaggregated prefill proxy/orchestrator, if that is the intended integration model.

Questions

  • Is real LMCache-backed KV reuse currently expected to work with ModelBooster + vLLMDisaggregated in Kthena?
  • If yes, what is the intended configuration for the LMCache orchestrator/proxy path?
  • If not, should the current docs/examples clarify that lmcache is currently mapped to the generic HTTP connector and does not provide real KV reuse?
