blog: add predicted-latency based scheduling for LLMs #208
kaushikmitr wants to merge 2 commits into llm-d:main
Conversation
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see the GitHub documentation.
Pull request overview
Adds a new blog post describing predicted-latency based scheduling for LLM inference in llm-d / Gateway API Inference Extension, along with supporting metadata (authors/tags) and accompanying figures.
Changes:
- Adds the blog post 2026-03-13_predicted-latency-based-scheduling-for-llms.md describing the design, benchmarks, and results.
- Updates blog/authors.yml (adds 3 authors) and blog/tags.yml (adds scheduling and inference tags).
- Adds a set of WebP images under static/img/blogs/predicted-latency/ used by the post.
Reviewed changes
Copilot reviewed 3 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| blog/2026-03-13_predicted-latency-based-scheduling-for-llms.md | New long-form post (MDX-in-Markdown) covering motivation, system design, benchmarks, and appendix with cache analysis. |
| blog/authors.yml | Adds new author entries and adjusts an existing author line. |
| blog/tags.yml | Fixes indentation for storage.description and adds scheduling + inference tags used by the post. |
| static/img/blogs/predicted-latency/image6.webp | Figure asset referenced by the post (predicted vs actual TTFT). |
| static/img/blogs/predicted-latency/image13.webp | Figure asset referenced by the post (Workload A cache behavior). |
/lgtm I reviewed this on the Google Doc.
Cannot apply the lgtm label because Error: ahg-g is not included in the reviewers role in the OWNERS file.
Thanks @kaushikmitr, can you just check your commits and sign them?
Force-pushed from 361378b to 19e718d.
Thanks @petecheslock, I squashed my commits into 1 and signed them.
Force-pushed from 19e718d to 5b3df1b.
Signed-off-by: kaushikmitr <kaushikmitra.umd@gmail.com>
Force-pushed from 5b3df1b to 038bd92.
Gregory-Pereira left a comment:

/lgtm
Most of these suggestions are not required, just what I thought would make it flow better. This looks really good and is well written; close to merge.
Good with merging this whenever, but going to give a bit more time in case @robertgshaw2-redhat or @smarterclayton want to comment.

Also, I just want to confirm that @chcost is good with this post as well.
Summary
This blog post introduces predicted-latency based scheduling for LLM inference in llm-d / Gateway API Inference Extension. Instead of manually tuning heuristic weights for load-balancing signals (queue depth, KV cache, prefix cache), a lightweight XGBoost model is trained online from live traffic to directly predict time-to-first-token (TTFT) and time-per-output-token (TPOT) for each candidate server.
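To make the routing decision concrete, here is a minimal sketch of the idea, not the actual llm-d implementation: a predictor maps each server's load signals to estimated TTFT/TPOT, and the scheduler routes to the server with the lowest predicted end-to-end latency. The function names, feature set, and coefficients below are all illustrative assumptions; in the real system the stub predictor would be replaced by the online-trained XGBoost regressor.

```python
# Hypothetical sketch of predicted-latency scheduling (illustrative names,
# not the llm-d API). A trained regressor (XGBoost in the post) would
# replace the stub predictor below.

def predict_latency(server):
    """Stub for the online-trained model: maps load signals (queue depth,
    KV-cache utilization, prefix-cache hit rate) to predicted TTFT/TPOT in ms.
    The linear coefficients here are made up for illustration."""
    ttft = 50 + 20 * server["queue_depth"] + 100 * server["kv_cache_util"]
    tpot = 10 + 5 * server["kv_cache_util"] - 2 * server["prefix_hit_rate"]
    return ttft, tpot

def pick_server(servers, output_tokens=128):
    """Route to the server with the lowest predicted end-to-end latency:
    TTFT + TPOT * expected output length."""
    def predicted_e2e(s):
        ttft, tpot = predict_latency(s)
        return ttft + tpot * output_tokens
    return min(servers, key=predicted_e2e)

servers = [
    {"name": "pod-a", "queue_depth": 4, "kv_cache_util": 0.9, "prefix_hit_rate": 0.1},
    {"name": "pod-b", "queue_depth": 1, "kv_cache_util": 0.3, "prefix_hit_rate": 0.8},
]
best = pick_server(servers)
print(best["name"])  # pod-b: shorter queue and cooler KV cache win here
```

The point of the design is that the predictor's features and weights are learned from live traffic rather than hand-tuned, so the scheduler's notion of "least loaded" adapts as workload characteristics shift.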
Key Results
Blog Contents
Files
- blog/2026-03-13_predicted-latency-based-scheduling-for-llms.md: blog post
- blog/authors.yml: 3 new authors added
- blog/tags.yml: 2 new tags (scheduling, inference)
- static/img/blogs/predicted-latency/image{1-16}.webp: 16 lossless WebP figures