LMCacheConnectorV1 in vLLMDisaggregated mode does not perform real KV reuse #1069

@xrwang8

Description

Summary

When deploying a ModelBooster with backend.type: vLLMDisaggregated and vLLM LMCacheConnectorV1, Kthena can create the resources and route requests through prefill and decode pods, but LMCache does not appear to perform real KV reuse between the prefill and decode instances.

The current router implementation seems to map lmcache to the generic HTTPConnector, so the prefill response is not parsed and no LMCache/vLLM transfer metadata is passed to the decode request. This makes the deployment look like PD disaggregation at the request-routing level, but the decode side does not reuse KV cache generated by the prefill side.

Environment

  • Cluster: 2-node RTX 3090 cluster (xrnode41, xrnode44)
  • Kthena version: v0.4.0
  • Namespace: default
  • vLLM image: vllm-openai:v0.20.0
  • Model: Qwen/Qwen2.5-0.5B-Instruct
  • Backend type: vLLMDisaggregated
  • Connector under test: LMCacheConnectorV1

Minimal configuration used

The following minimal LMCache configuration follows the current Kthena example style:

apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelBooster
metadata:
  name: qwen-pd-offline
  namespace: default
spec:
  backend:
    name: qwen-pd
    type: vLLMDisaggregated
    modelURI: "ms://Qwen/Qwen2.5-0.5B-Instruct"
    cacheURI: "hostpath://tmp/cache"
    runtimeClassName: nvidia
    minReplicas: 1
    maxReplicas: 1
    env:
      - name: KTHENA_SKIP_ENGINE_DEPENDENCY_INSTALL
        value: "1"
      - name: PYTORCH_CUDA_ALLOC_CONF
        value: "expandable_segments:True"
      - name: PYTHONHASHSEED
        value: "0"
      - name: UCX_TLS
        value: "tcp"
    workers:
      - type: prefill
        image: vllm-openai:v0.20.0-cu130-lmcache-nixl-mooncake
        replicas: 1
        pods: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "4"
            memory: "8Gi"
        config:
          served-model-name: Qwen2.5-0.5B-Instruct
          tensor-parallel-size: 1
          gpu-memory-utilization: "0.90"
          max-model-len: 32768
          kv-transfer-config: |
            {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config":{"discard_partial_chunks":true,"lmcache_rpc_port":"10086"}}
      - type: decode
        image: 172.28.0.32:3443/jdcloud/kserve/servingruntimes/vllm-openai:v0.20.0-cu130-lmcache-nixl-mooncake
        replicas: 1
        pods: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "4"
            memory: "8Gi"
        config:
          served-model-name: Qwen2.5-0.5B-Instruct
          tensor-parallel-size: 1
          gpu-memory-utilization: "0.90"
          max-model-len: 32768
          kv-transfer-config: |
            {"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config":{"discard_partial_chunks":true,"lmcache_rpc_port":"10086"}}

Observed behavior

The generated ModelServer has kvConnector.type: lmcache:

spec:
  inferenceEngine: vLLM
  kvConnector:
    type: lmcache
  model: Qwen2.5-0.5B-Instruct

Both prefill and decode pods become ready:

qwen-pd-offline-qwen-pd-0-decode-0-0    2/2     Running
qwen-pd-offline-qwen-pd-0-prefill-0-0   2/2     Running

A request through the Kthena router succeeds:

curl -X POST http://10.43.65.139/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen-pd-offline","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'

The response is a normal chat completion, and the router logs show a successful route through the ModelRoute and ModelServer:

"POST /v1/chat/completions HTTP/1.1" 200 model_name=qwen-pd-offline model_route=default/qwen-pd-offline model_server=default/qwen-pd-offline-qwen-pd

However, the vLLM/LMCache logs show no external KV cache reuse on the decode side:

LMCache INFO: Reqid: ..., Total tokens 30, Inference Engine computed tokens: 0, LMCache hit tokens: 0, need to load: 0
Engine 000: ... Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%

This suggests that the deployment is only performing request-level prefill/decode routing, not real LMCache-backed KV reuse.

Additional attempted configuration

We also tried a more explicit LMCache PD/NIXL-style configuration with fields such as:

{
  "lmcache.enable_pd": true,
  "lmcache.pd_role": "sender",
  "lmcache.transfer_channel": "nixl",
  "lmcache.nixl_backends": ["UCX"],
  "lmcache.pd_skip_proxy_notification": true,
  "lmcache.pd_peer_host": "...",
  "lmcache.pd_peer_init_port": "10086",
  "lmcache.pd_peer_alloc_port": "10087",
  "lmcache.pd_buffer_device": "cuda"
}

In that mode, requests failed with decode-side errors (HTTP 500 responses from the decode request). The failure is consistent with the decode side expecting LMCache/orchestrator state that is not populated by the current Kthena router path.

Code analysis

From the current code, LMCacheConnectorV1 is converted into Kthena's ConnectorTypeLMCache:

var VLLMKvConnectorType = map[string]networking.KVConnectorType{
    "MooncakeConnector":  networking.ConnectorTypeMoonCake,
    "NixlConnector":      networking.ConnectorTypeNIXL,
    "LMCacheConnectorV1": networking.ConnectorTypeLMCache,
}

But the router factory currently registers the generic HTTP connector for lmcache:

factory.RegisterConnectorBuilder(v1alpha1.ConnectorTypeLMCache, NewHTTPConnector) // LMCache uses HTTP connector for now

HTTPConnector.Proxy() sends a prefill request and then a decode request, but does not parse the prefill response or pass any KV transfer metadata to decode:

err := h.prefill(h.prefillRequest, prefillAddr)
if err != nil {
    return 0, err
}
result, decodeErr := h.decode(c, h.decodeRequest, decodeAddr)

The prefill transport likewise checks only the HTTP status and discards the response body:

func prefillerProxy(_ *gin.Context, req *http.Request) error {
    resp, err := http.DefaultTransport.RoundTrip(req)
    ...
    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        return fmt.Errorf("prefill request failed with status %d", resp.StatusCode)
    }
    return nil
}

By contrast, NIXLConnector actively parses kv_transfer_params from the prefill response and injects them into the decode request:

kvTransferParams, err := n.prefill(n.prefillRequest, prefillAddr)
...
decodeReq := n.buildDecodeRequest(c, n.decodeRequestBody, kvTransferParams)

This difference appears to explain why NixlConnector can achieve real KV transfer in the same environment, while LMCacheConnectorV1 cannot.

The proposal document also states that lmcache currently defaults to HTTP connector behavior:

This is the default connector and is used for http, lmcache, and mooncake connector types.

and later:

While lmcache and mooncake currently default to the http connector's behavior...

Expected behavior

One of the following would be helpful:

  1. A dedicated Kthena LMCacheConnector implementation that performs the LMCache/vLLM PD orchestration required for real KV reuse.
  2. Documentation clarifying that lmcache currently means HTTP-level prefill/decode routing only, and does not imply real LMCache KV transfer/reuse.
  3. A supported deployment recipe using the LMCache/vLLM disaggregated prefill proxy/orchestrator, if that is the intended integration model.

Questions

  • Is real LMCache-backed KV reuse currently expected to work with ModelBooster + vLLMDisaggregated in Kthena?
  • If yes, what is the intended configuration for the LMCache orchestrator/proxy path?
  • If not, should the current docs/examples clarify that lmcache is currently mapped to the generic HTTP connector and does not provide real KV reuse?
