-
Notifications
You must be signed in to change notification settings - Fork 137
session boost queue to optimize multi conversation scenario #1183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
4656457
4ad4828
5bfe37a
2e3e4fc
3cae68b
1080c9f
a67cb96
5b36e3c
9e8a1d0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,244 @@ | ||
| # Session Boost Queue | ||
|
|
||
| Kthena Router session boost queue optimizes multi-turn conversation performance by prioritizing follow-up requests from the same conversation session. This maximizes **prefix cache hit rate** on LLM inference backends (e.g., vLLM), significantly reducing Time-to-First-Token (TTFT) for multi-turn conversations. | ||
|
|
||
| This guide explains what session boost does, when to use it, how to enable it, and how to verify it is working. | ||
|
|
||
| ## Overview | ||
|
|
||
| Modern LLM inference engines like vLLM maintain a **prefix cache** (KV cache) that stores previously computed key-value attention states. In multi-turn conversations, each follow-up message shares a large prefix with the previous turn. If the follow-up request is processed shortly after the prior turn completes, the prefix cache is still warm, and the engine can skip recomputing attention for the shared prefix—reducing TTFT by 50–80%. | ||
|
|
||
| Without session boost, a follow-up request may be queued behind unrelated requests. By the time it reaches the backend, the prefix cache may have been evicted, forcing a full recomputation. | ||
|
|
||
| When session boost is enabled, the router does the following: | ||
|
|
||
| 1. Extracts the session identifier from the HTTP header configured via `SESSION_BOOST_HEADER` environment variable. | ||
| 2. Checks whether the same session completed a request recently (within TTL). | ||
| 3. If yes, marks the new request as **boosted** and promotes it ahead of non-boosted requests in the queue. | ||
| 4. After a request completes, optionally enters a brief **grace period** (disabled by default) to give a potential follow-up from the same session time to arrive. | ||
|
|
||
| ## When to Use Session Boost | ||
|
|
||
| Session boost is designed for these scenarios: | ||
|
|
||
| - **Multi-turn chat applications**: ChatGPT-like interfaces where users have back-and-forth conversations with an LLM. Each turn builds on the previous conversation context. | ||
| - **Agentic workflows and RAG chains**: Automated pipelines that issue multiple sequential requests in the same session, where each request depends on the previous response. | ||
| - **Low-latency prefix cache optimization**: Workloads where minimizing TTFT is critical and the same session's requests benefit from being processed back-to-back on warm KV cache. | ||
| - **Environments without fairness requirements**: When you want prefix cache optimization without the complexity of multi-tenant fairness scheduling. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not sure what does this mean, |
||
|
|
||
| Session boost is **not** needed when: | ||
|
|
||
| - Your workload is single-turn (independent requests with no shared prefix). | ||
| - Requests are already routed with session sticky and no queuing contention exists. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. session sticky seems not conflict with session boost |
||
| - You need multi-tenant fairness as the primary scheduling concern (use [fairness scheduling](./fairness-scheduling) instead). | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - A Kubernetes cluster with Kthena installed. | ||
| - A deployed `ModelRoute` and backend `ModelServer` (e.g., vLLM) that supports prefix caching. | ||
| - Clients that include a consistent session identifier header across related requests in a conversation. | ||
|
|
||
| ## Enable Session Boost | ||
|
|
||
| ### Helm Values | ||
|
|
||
| The simplest way to enable session boost is through Helm values: | ||
|
|
||
| ```yaml | ||
| networking: | ||
| kthenaRouter: | ||
| sessionBoost: | ||
| enabled: true | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove |
||
| header: "X-Session-ID" | ||
| ttl: "60s" | ||
| ``` | ||
|
|
||
| Apply with Helm: | ||
|
|
||
| ```bash | ||
| helm upgrade --install kthena charts/kthena \ | ||
| --namespace kthena-system \ | ||
| --create-namespace \ | ||
| -f your-values.yaml | ||
| ``` | ||
|
|
||
| ### Environment Variables | ||
|
|
||
| You can also configure session boost directly via environment variables on the `kthena-router` Deployment: | ||
|
|
||
| ```yaml | ||
| env: | ||
| - name: ENABLE_SESSION_BOOST | ||
| value: "true" | ||
| - name: SESSION_BOOST_HEADER | ||
| value: "X-Session-ID" | ||
| - name: SESSION_BOOST_TTL | ||
| value: "60s" | ||
| # - name: SESSION_BOOST_GRACE_PERIOD | ||
| # value: "50ms" # Disabled by default; enable only if you understand the trade-off | ||
| - name: SESSION_BOOST_POLL_INTERVAL | ||
| value: "100ms" | ||
| # - name: SESSION_BOOST_INFLIGHT_PER_POD | ||
| # value: "16" # Tune based on your backend's concurrency capacity | ||
| ``` | ||
|
|
||
| ## Configuration Reference | ||
|
|
||
| | Environment Variable | Purpose | Default | Notes | | ||
| | -------------------------------- | -------------------------------------------------------------------------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------- | | ||
| | `ENABLE_SESSION_BOOST` | Enable the session boost queue | `false` | Global feature switch | | ||
| | `SESSION_BOOST_HEADER` | HTTP header used to identify conversation sessions | *(required)* | Must match what your clients send | | ||
| | `SESSION_BOOST_TTL` | Duration after which a session's boost expires | `60s` | Longer values help slow human conversations; shorter values suit fast automated pipelines | | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why does it not take effect after this? In reality, how could a user set this correctly? And btw, how do you set it when you do benchmarking |
||
| | `SESSION_BOOST_GRACE_PERIOD` | Wait time after a request completes for a same-session follow-up to arrive | `0` | Disabled by default. Only enable (e.g., `50ms`) if you understand the latency trade-off for non-boosted requests | | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure i understand |
||
| | `SESSION_BOOST_POLL_INTERVAL` | Backend capacity polling interval | `100ms` | Lower values reduce latency but increase polling load | | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. polling what? metrics? |
||
| | `SESSION_BOOST_INFLIGHT_PER_POD` | Maximum inflight requests per backend pod | `16` | Tune based on your backend's actual concurrency capacity (e.g., vLLM's `--max-num-seqs`). Lower values are more conservative | | ||
|
|
||
| ## Client Integration | ||
|
|
||
| Clients must include the configured session header in their requests. All requests belonging to the same conversation should use the same header value: | ||
|
|
||
| ```bash | ||
| # Turn 1 | ||
| curl -X POST http://kthena-router/v1/chat/completions \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "X-Session-ID: conv-abc-123" \ | ||
| -d '{"model": "llama-3", "messages": [{"role": "user", "content": "Hello"}]}' | ||
|
|
||
| # Turn 2 (same session) | ||
| curl -X POST http://kthena-router/v1/chat/completions \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "X-Session-ID: conv-abc-123" \ | ||
| -d '{"model": "llama-3", "messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}, {"role": "user", "content": "Tell me about Kubernetes"}]}' | ||
| ``` | ||
|
|
||
| ### Custom Header | ||
|
|
||
| If your clients already use a different header for session tracking, configure the router to match: | ||
|
|
||
| ```yaml | ||
| env: | ||
| - name: SESSION_BOOST_HEADER | ||
| value: "X-Session-ID" | ||
| ``` | ||
|
|
||
| Then clients send: | ||
|
|
||
| ```bash | ||
| curl -X POST http://kthena-router/v1/chat/completions \ | ||
| -H "X-Session-ID: my-conversation-42" \ | ||
| -d '{"model": "llama-3", "messages": [...]}' | ||
| ``` | ||
|
|
||
| ## How It Works | ||
|
|
||
| ### Priority Ordering | ||
|
|
||
| The session boost queue uses a simple two-level priority: | ||
|
|
||
| 1. **Boosted requests** (session completed recently) are always dequeued before non-boosted requests. | ||
| 2. **Within the same boost level**, requests are served in FIFO order (earliest arrival first). | ||
|
|
||
| ### Grace Period | ||
|
|
||
| The grace period is **disabled by default** (`SESSION_BOOST_GRACE_PERIOD=0`). When disabled, the queue immediately dequeues the next request after a completion without waiting. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As we design, should carefully think about how to make it understandable to both contributors and users. From my view, i can guess of this param at some point. But i am still more confused than clear. |
||
|
|
||
| When explicitly enabled (e.g., `SESSION_BOOST_GRACE_PERIOD=50ms`), the queue briefly holds the dequeue slot for a potential follow-up from the same session: | ||
|
|
||
| - If a boosted request arrives during the grace period, it is dequeued immediately. | ||
| - If no boosted request arrives before the grace period expires, the next non-boosted request proceeds normally. | ||
| - If the head of the queue is already a boosted request, the grace period is skipped entirely. | ||
|
|
||
| Only enable the grace period if you understand the trade-off: it adds latency to non-boosted requests in exchange for a higher chance of a same-session follow-up arriving in time. This is mainly useful for fast automated pipelines (RAG, agents) where follow-up requests arrive within milliseconds of the previous response. | ||
|
|
||
| ### Backpressure Control | ||
|
|
||
| The queue uses two-level admission control to avoid flooding backends: | ||
|
|
||
| 1. **Inflight limit**: at most `SESSION_BOOST_INFLIGHT_PER_POD × pod_count` requests can be in-flight simultaneously. | ||
| 2. **Backend metrics check**: the queue polls backend pod metrics to confirm at least one pod has available capacity before dispatching. | ||
|
|
||
| When a request completes, the queue immediately attempts to dequeue the next request (release-driven dequeue) rather than waiting for the next polling tick. | ||
|
|
||
| ## Session Boost vs Fairness Scheduling | ||
|
|
||
| Session boost and fairness scheduling are independent features that serve different goals: | ||
|
|
||
| | Aspect | Session Boost | Fairness Scheduling | | ||
| | ---------------- | ---------------------------------- | --------------------------------- | | ||
| | Goal | Maximize prefix cache hits | Equitable resource allocation | | ||
| | Activation | `ENABLE_SESSION_BOOST=true` | `ENABLE_FAIRNESS_SCHEDULING=true` | | ||
| | Requires user ID | No | Yes | | ||
| | Priority logic | Boosted > non-boosted, FIFO within | Lower recent usage wins | | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If there are two sessions, both are multi-turn conversation. For example session A request boosted at the front of the queue but not dequeued, then SessionB's request enqueued, will you boost SessionB request in front of A's. |
||
| | Best for | Multi-turn latency optimization | Multi-tenant contention | | ||
|
|
||
| The two features are mutually exclusive in the current implementation—enable one or the other based on your primary scheduling concern. | ||
|
|
||
| ## Choosing Good Settings | ||
|
|
||
| Start with the defaults unless you have a specific performance issue. | ||
|
|
||
| Recommended tuning: | ||
|
|
||
| - **Human chat applications** (slow follow-ups): increase `SESSION_BOOST_TTL` to `120s` or higher so the boost persists between human typing intervals. | ||
| - **Automated pipelines** (fast follow-ups): keep `SESSION_BOOST_TTL` at `60s`. Consider enabling `SESSION_BOOST_GRACE_PERIOD` (e.g., `50ms`) if your pipeline issues follow-up requests within milliseconds and you want to maximize prefix cache hits. | ||
| - **High-throughput backends**: the default `SESSION_BOOST_INFLIGHT_PER_POD=16` allows reasonable concurrency. Tune this to match your backend's actual concurrency capacity (e.g., vLLM's `--max-num-seqs`). Reduce for conservative admission control; increase for backends that handle high parallelism. | ||
| - **Latency-sensitive workloads**: reduce `SESSION_BOOST_POLL_INTERVAL` to `50ms` for faster reaction to backend capacity changes. | ||
|
|
||
| ## Verify Session Boost | ||
|
|
||
| ### 1. Check Router Environment | ||
|
|
||
| ```bash | ||
| kubectl -n kthena-system get deployment kthena-router -o yaml | grep SESSION_BOOST | ||
| ``` | ||
|
|
||
| Confirm the router is running with expected session boost variables. | ||
|
|
||
| ### 2. Inspect Logs | ||
|
|
||
| When a session boost is triggered, the router logs the event. Look for session boost enqueue and dequeue messages: | ||
|
|
||
| ```bash | ||
| kubectl -n kthena-system logs deploy/kthena-router | grep -i "session boost" | ||
| ``` | ||
|
|
||
| ### 3. Compare TTFT With and Without Boost | ||
|
|
||
| Send a multi-turn conversation and measure TTFT for the second turn: | ||
|
|
||
| - **Without session boost**: TTFT for Turn 2 ≈ TTFT for Turn 1 (full prefix computation). | ||
| - **With session boost**: TTFT for Turn 2 should be significantly lower (prefix cache hit). | ||
|
|
||
| ## Operational Notes | ||
|
|
||
| - Session state is **in memory** on each router instance. In multi-replica deployments, state is not shared across replicas. Combine with session sticky routing to ensure the same session hits the same router instance. | ||
| - Session boost does not guarantee pod affinity. For maximum prefix cache benefit, combine with session sticky routing so boosted requests reach the pod that holds the warm KV cache. | ||
| - Requests without the configured session header are enqueued as normal (non-boosted) requests. | ||
| - Session tracking does not survive router restarts. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Requests are not being boosted | ||
|
|
||
| Verify that: | ||
| 1. The client sends the configured header (set via `SESSION_BOOST_HEADER`) with a consistent value across turns. | ||
| 2. The follow-up request arrives within `SESSION_BOOST_TTL` of the previous request's completion. | ||
| 3. `ENABLE_SESSION_BOOST` is set to `true` in the router environment. | ||
|
|
||
| ### TTFT is not improving despite boost | ||
|
|
||
| Session boost only controls queue ordering. If the boosted request is routed to a different pod than the one holding the warm prefix cache, no TTFT improvement occurs. Combine session boost with session sticky routing for full benefit. | ||
|
|
||
| ### Grace period causes slight latency for non-boosted requests | ||
|
|
||
| The grace period is disabled by default. If you have explicitly enabled it and it adds unwanted delay for single-turn traffic, set `SESSION_BOOST_GRACE_PERIOD=0` to disable it. | ||
|
|
||
| ### High memory usage from session tracking | ||
|
|
||
| Each tracked session consumes minimal memory (session ID + timestamp). Sessions are automatically evicted after TTL. If you have millions of concurrent sessions, consider reducing `SESSION_BOOST_TTL`. | ||
|
|
||
| ## Related Guides | ||
|
|
||
| - [Fairness Scheduling](./fairness-scheduling) | ||
| - [Router Routing](./router-routing) | ||
| - [Router Observability](./router-observability) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need a worflow diagram here