
blog: add predicted-latency based scheduling for LLMs#208

Open
kaushikmitr wants to merge 2 commits into llm-d:main from kaushikmitr:predicted-latency-blog

Conversation

@kaushikmitr

Summary

This blog post introduces predicted-latency based scheduling for LLM inference in llm-d / Gateway API Inference Extension. Instead of manually tuning heuristic weights for load balancing signals (queue depth, KV cache, prefix cache), a lightweight XGBoost model is trained online from live traffic to directly predict TTFT and TPOT per candidate server.
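The routing decision itself is simple once the predictors exist. The sketch below is an illustration with hypothetical names, not the llm-d implementation: given per-server signals and trained TTFT/TPOT predictors (the post trains these online with XGBoost), each request is routed to the server with the lowest predicted end-to-end latency, i.e. predicted TTFT plus predicted TPOT times the expected output length.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ServerSnapshot:
    """Per-server signals at routing time (names are illustrative)."""
    name: str
    kv_cache_pct: float        # KV cache utilization, 0-1
    queue_depth: int           # requests waiting in queue
    running_requests: int      # requests currently decoding
    prefix_match_pct: float    # prefix cache hit fraction for this request, 0-1
    tokens_in_flight: int      # input tokens currently being prefilled

def pick_server(
    servers: Sequence[ServerSnapshot],
    input_len: int,
    expected_output_len: int,
    predict_ttft: Callable[[ServerSnapshot, int], float],
    predict_tpot: Callable[[ServerSnapshot, int], float],
) -> ServerSnapshot:
    """Route to the server minimizing predicted end-to-end latency:
    predicted TTFT + predicted TPOT * expected output length."""
    def predicted_e2e(s: ServerSnapshot) -> float:
        return predict_ttft(s, input_len) + predict_tpot(s, input_len) * expected_output_len
    return min(servers, key=predicted_e2e)
```

The key design point is that the per-signal weighting is learned by the latency model rather than hand-tuned: the scorer only compares predicted milliseconds.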

Key Results

  • 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic workload (Qwen3-480B, 13x8 H200s)
  • Predicted-latency routing matches or outperforms load+prefix-aware routing across all five benchmark scenarios
  • Eliminates the need for manual weight tuning that shifts as workload varies

Blog Contents

  • Problem statement: why fixed-weight load balancing fails under production LLM traffic (bursty sizes, uneven load, unstable cache)
  • System design: online XGBoost training, sidecar architecture, feature set (KV cache %, input length, queue depth, running requests, prefix cache match %, input tokens in flight)
  • Benchmark results across 5 synthetic scenarios (A-D + ShareGPT) varying cache pressure and system prompt overlap
  • Production-realistic workload comparison derived from 7 days of internal Google traffic
  • Appendix: prefix cache capacity analysis with LRU simulation
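The feature set and online-training loop listed above can be outlined as follows. This is a hedged sketch with hypothetical names: the post's sidecar trains an XGBoost regressor from live traffic, whereas here the fitting hook is left pluggable so the skeleton stays self-contained.

```python
from collections import deque

# Fixed feature order matching the six signals named in the post.
FEATURES = ("kv_cache_pct", "input_len", "queue_depth",
            "running_requests", "prefix_match_pct", "tokens_in_flight")

def feature_vector(server_metrics: dict, input_len: int) -> list:
    """Assemble server metrics plus the request's input length into a
    fixed-order numeric vector for the latency model."""
    m = dict(server_metrics, input_len=input_len)
    return [float(m[f]) for f in FEATURES]

class OnlineLatencyModel:
    """Collects (features, observed latency) pairs from live traffic and
    refits periodically over a sliding window. `fit_fn` stands in for the
    XGBoost training call used in the actual system."""
    def __init__(self, fit_fn, window=10_000, refit_every=500):
        self.samples = deque(maxlen=window)   # sliding window of training pairs
        self.fit_fn = fit_fn
        self.refit_every = refit_every
        self.model = None

    def observe(self, features, latency_ms):
        """Record one completed request; retrain every `refit_every` samples."""
        self.samples.append((features, latency_ms))
        if len(self.samples) % self.refit_every == 0:
            X, y = zip(*self.samples)
            self.model = self.fit_fn(list(X), list(y))
```

Training from live traffic means the model tracks workload drift (bursty sizes, shifting cache hit rates) without anyone re-tuning weights by hand.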

Files

  • blog/2026-03-13_predicted-latency-based-scheduling-for-llms.md — blog post
  • blog/authors.yml — 3 new authors added
  • blog/tags.yml — 2 new tags (scheduling, inference)
  • static/img/blogs/predicted-latency/image{1-16}.webp — 16 lossless WebP figures

@netlify

netlify bot commented Mar 14, 2026

Deploy Preview for elaborate-kangaroo-25e1ee ready!

🔨 Latest commit: fe37306
🔍 Latest deploy log: https://app.netlify.com/projects/elaborate-kangaroo-25e1ee/deploys/69c039eee3c5950008229f92
😎 Deploy Preview: https://deploy-preview-208--elaborate-kangaroo-25e1ee.netlify.app

@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new blog post describing predicted-latency based scheduling for LLM inference in llm-d / Gateway API Inference Extension, along with supporting metadata (authors/tags) and accompanying figures.

Changes:

  • Adds the blog post 2026-03-13_predicted-latency-based-scheduling-for-llms.md describing the design, benchmarks, and results.
  • Updates blog/authors.yml (adds 3 authors) and blog/tags.yml (adds scheduling and inference tags).
  • Adds a set of WebP images under static/img/blogs/predicted-latency/ used by the post.

Reviewed changes

Copilot reviewed 3 out of 19 changed files in this pull request and generated 6 comments.

  • blog/2026-03-13_predicted-latency-based-scheduling-for-llms.md — New long-form post (MDX-in-Markdown) covering motivation, system design, benchmarks, and appendix with cache analysis.
  • blog/authors.yml — Adds new author entries and adjusts an existing author line.
  • blog/tags.yml — Fixes indentation for storage.description and adds scheduling + inference tags used by the post.
  • static/img/blogs/predicted-latency/image6.webp — Figure asset referenced by the post (predicted vs actual TTFT).
  • static/img/blogs/predicted-latency/image13.webp — Figure asset referenced by the post (Workload A cache behavior).


@ahg-g

ahg-g commented Mar 19, 2026

/lgtm

I reviewed this on the Google Doc.

@github-actions

Cannot apply the lgtm label because Error: ahg-g is not included in the reviewers role in the OWNERS file


@petecheslock
Member

Thanks @kaushikmitr, can you check your commits and sign them?

kaushikmitr force-pushed the predicted-latency-blog branch from 361378b to 19e718d on March 20, 2026 at 20:20
@kaushikmitr
Author

Thanks @kaushikmitr, can you check your commits and sign them?

Thanks @petecheslock, I squashed my commits into one and signed them.

kaushikmitr force-pushed the predicted-latency-blog branch from 19e718d to 5b3df1b on March 20, 2026 at 20:32
Signed-off-by: kaushikmitr <kaushikmitra.umd@gmail.com>
kaushikmitr force-pushed the predicted-latency-blog branch from 5b3df1b to 038bd92 on March 20, 2026 at 21:25
Member

@Gregory-Pereira Gregory-Pereira left a comment


/lgtm

Most of these are not required, just what I thought would make it flow better. This looks really good and is well written; close to merge.

Signed-off-by: kaushikmitr <kaushikmitra.umd@gmail.com>
Member

@Gregory-Pereira Gregory-Pereira left a comment


/lgtm

@Gregory-Pereira
Member

Good with merging this whenever, but I'm going to give it a bit more time in case @robertgshaw2-redhat or @smarterclayton want to comment.

@petecheslock
Member

Also just want to confirm that @chcost is good with this post as well.


5 participants