
blog: add predicted-latency based scheduling for LLMs#208

Open
kaushikmitr wants to merge 2 commits into llm-d:main from kaushikmitr:predicted-latency-blog

Conversation

@kaushikmitr

Summary

This blog post introduces predicted-latency based scheduling for LLM inference in llm-d / Gateway API Inference Extension. Instead of manually tuning heuristic weights for load balancing signals (queue depth, KV cache, prefix cache), a lightweight XGBoost model is trained online from live traffic to directly predict TTFT and TPOT per candidate server.
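The routing decision itself is simple once the predictors exist. The sketch below is an illustration with hypothetical names, not the llm-d implementation: given per-server signals and trained TTFT/TPOT predictors (the post trains these online with XGBoost), each request is routed to the server with the lowest predicted end-to-end latency, i.e. predicted TTFT plus predicted TPOT times the expected output length.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ServerSnapshot:
    """Per-server signals at routing time (names are illustrative)."""
    name: str
    kv_cache_pct: float        # KV cache utilization, 0-1
    queue_depth: int           # requests waiting in queue
    running_requests: int      # requests currently decoding
    prefix_match_pct: float    # prefix cache hit fraction for this request, 0-1
    tokens_in_flight: int      # input tokens currently being prefilled

def pick_server(
    servers: Sequence[ServerSnapshot],
    input_len: int,
    expected_output_len: int,
    predict_ttft: Callable[[ServerSnapshot, int], float],
    predict_tpot: Callable[[ServerSnapshot, int], float],
) -> ServerSnapshot:
    """Route to the server minimizing predicted end-to-end latency:
    predicted TTFT + predicted TPOT * expected output length."""
    def predicted_e2e(s: ServerSnapshot) -> float:
        return predict_ttft(s, input_len) + predict_tpot(s, input_len) * expected_output_len
    return min(servers, key=predicted_e2e)
```

The key design point is that the per-signal weighting is learned by the latency model rather than hand-tuned: the scorer only compares predicted milliseconds.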

Key Results

  • 43% improvement in P50 end-to-end latency and 70% improvement in TTFT on a production-realistic workload (Qwen3-480B, 13x8 H200s)
  • Predicted-latency routing matches or outperforms load+prefix-aware routing across all five benchmark scenarios
  • Eliminates the need for manual weight tuning that shifts as workload varies

Blog Contents

  • Problem statement: why fixed-weight load balancing fails under production LLM traffic (bursty sizes, uneven load, unstable cache)
  • System design: online XGBoost training, sidecar architecture, feature set (KV cache %, input length, queue depth, running requests, prefix cache match %, input tokens in flight)
  • Benchmark results across 5 synthetic scenarios (A-D + ShareGPT) varying cache pressure and system prompt overlap
  • Production-realistic workload comparison derived from 7 days of internal Google traffic
  • Appendix: prefix cache capacity analysis with LRU simulation
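The feature set and online-training loop listed above can be outlined as follows. This is a hedged sketch with hypothetical names: the post's sidecar trains an XGBoost regressor from live traffic, whereas here the fitting hook is left pluggable so the skeleton stays self-contained.

```python
from collections import deque

# Fixed feature order matching the six signals named in the post.
FEATURES = ("kv_cache_pct", "input_len", "queue_depth",
            "running_requests", "prefix_match_pct", "tokens_in_flight")

def feature_vector(server_metrics: dict, input_len: int) -> list:
    """Assemble server metrics plus the request's input length into a
    fixed-order numeric vector for the latency model."""
    m = dict(server_metrics, input_len=input_len)
    return [float(m[f]) for f in FEATURES]

class OnlineLatencyModel:
    """Collects (features, observed latency) pairs from live traffic and
    refits periodically over a sliding window. `fit_fn` stands in for the
    XGBoost training call used in the actual system."""
    def __init__(self, fit_fn, window=10_000, refit_every=500):
        self.samples = deque(maxlen=window)   # sliding window of training pairs
        self.fit_fn = fit_fn
        self.refit_every = refit_every
        self.model = None

    def observe(self, features, latency_ms):
        """Record one completed request; retrain every `refit_every` samples."""
        self.samples.append((features, latency_ms))
        if len(self.samples) % self.refit_every == 0:
            X, y = zip(*self.samples)
            self.model = self.fit_fn(list(X), list(y))
```

Training from live traffic means the model tracks workload drift (bursty sizes, shifting cache hit rates) without anyone re-tuning weights by hand.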

Files

  • blog/2026-03-13_predicted-latency-based-scheduling-for-llms.md — blog post
  • blog/authors.yml — 3 new authors added
  • blog/tags.yml — 2 new tags (scheduling, inference)
  • static/img/blogs/predicted-latency/image{1-16}.webp — 16 lossless WebP figures

@netlify

netlify bot commented Mar 14, 2026

Deploy Preview for elaborate-kangaroo-25e1ee ready!

🔨 Latest commit: fe37306
🔍 Latest deploy log: https://app.netlify.com/projects/elaborate-kangaroo-25e1ee/deploys/69c039eee3c5950008229f92
😎 Deploy Preview: https://deploy-preview-208--elaborate-kangaroo-25e1ee.netlify.app

@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new blog post describing predicted-latency based scheduling for LLM inference in llm-d / Gateway API Inference Extension, along with supporting metadata (authors/tags) and accompanying figures.

Changes:

  • Adds the blog post 2026-03-13_predicted-latency-based-scheduling-for-llms.md describing the design, benchmarks, and results.
  • Updates blog/authors.yml (adds 3 authors) and blog/tags.yml (adds scheduling and inference tags).
  • Adds a set of WebP images under static/img/blogs/predicted-latency/ used by the post.

Reviewed changes

Copilot reviewed 3 out of 19 changed files in this pull request and generated 6 comments.

  • blog/2026-03-13_predicted-latency-based-scheduling-for-llms.md — New long-form post (MDX-in-Markdown) covering motivation, system design, benchmarks, and appendix with cache analysis.
  • blog/authors.yml — Adds new author entries and adjusts an existing author line.
  • blog/tags.yml — Fixes indentation for storage.description and adds scheduling + inference tags used by the post.
  • static/img/blogs/predicted-latency/image6.webp — Figure asset referenced by the post (predicted vs actual TTFT).
  • static/img/blogs/predicted-latency/image13.webp — Figure asset referenced by the post (Workload A cache behavior).


@ahg-g

ahg-g commented Mar 19, 2026

/lgtm

I reviewed this on the Google Doc.

@github-actions

Cannot apply the lgtm label because Error: ahg-g is not included in the reviewers role in the OWNERS file


@petecheslock
Member

Thanks @kaushikmitr, can you check your commits and sign them?

kaushikmitr force-pushed the predicted-latency-blog branch from 361378b to 19e718d on March 20, 2026 at 20:20
@kaushikmitr
Author

Thanks @kaushikmitr, can you check your commits and sign them?

Thanks @petecheslock, I squashed my commits into one and signed them.

kaushikmitr force-pushed the predicted-latency-blog branch from 19e718d to 5b3df1b on March 20, 2026 at 20:32
Signed-off-by: kaushikmitr <kaushikmitra.umd@gmail.com>
kaushikmitr force-pushed the predicted-latency-blog branch from 5b3df1b to 038bd92 on March 20, 2026 at 21:25
Member

@Gregory-Pereira Gregory-Pereira left a comment


/lgtm

Most of these are not required, just what I thought would make it flow better. This looks really good and is well written; close to merge.

Signed-off-by: kaushikmitr <kaushikmitra.umd@gmail.com>
Member

@Gregory-Pereira Gregory-Pereira left a comment


/lgtm

@Gregory-Pereira
Member

Good with merging this whenever, but I'm going to give it a bit more time in case @robertgshaw2-redhat or @smarterclayton want to comment.

@petecheslock
Member

Also just want to confirm that @chcost is good with this post as well.


5 participants