
Conversation

@michaelfeil (Contributor) commented Nov 26, 2025

What does this PR do?

There is a performance bug that has been present since the initial release of TEI. The embeddings are serialized in a dedicated std::thread, with the intent of not blocking the main backend. While that idea is sound, it is better to run this work on a thread pool. tokio::task::spawn_blocking dispatches to exactly such a pool, with lazy warmup, so we can simply reuse the runtime's pool for this.
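The cost being removed here is per-request thread creation. A minimal std-only sketch of the idea (the `serialize` function and the single-worker channel pool are illustrative stand-ins, not TEI's actual code; tokio's blocking pool generalizes the pooled variant to many lazily-warmed workers):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the per-request serialization work.
fn serialize(v: u64) -> u64 {
    v.wrapping_mul(2654435761)
}

fn main() {
    const TASKS: u64 = 200;

    // Old approach: a fresh OS thread per task, so every request pays
    // thread creation and teardown.
    let mut sum_spawn = 0u64;
    for i in 0..TASKS {
        let handle = thread::spawn(move || serialize(i));
        sum_spawn = sum_spawn.wrapping_add(handle.join().unwrap());
    }

    // Pool approach: one long-lived worker thread; tasks are sent over a
    // channel, so thread creation is amortized across all requests. This is
    // the effect tokio::task::spawn_blocking provides via the runtime's pool.
    let (task_tx, task_rx) = mpsc::channel::<u64>();
    let (res_tx, res_rx) = mpsc::channel::<u64>();
    let worker = thread::spawn(move || {
        for v in task_rx {
            res_tx.send(serialize(v)).unwrap();
        }
    });

    let mut sum_pool = 0u64;
    for i in 0..TASKS {
        task_tx.send(i).unwrap();
        sum_pool = sum_pool.wrapping_add(res_rx.recv().unwrap());
    }
    drop(task_tx); // close the channel so the worker loop ends
    worker.join().unwrap();

    // Both strategies compute the same results; only the scheduling differs.
    assert_eq!(sum_spawn, sum_pool);
    println!("results match for {} tasks", TASKS);
}
```

For short-lived jobs like per-request serialization, the pooled variant avoids paying thread spawn latency on the request path, which is where the single-request latency win below comes from.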

Running a small model yields around ~20% more performance, in some cases up to 50%. It also leads to ~15% throughput improvements for small models.

text-embeddings-router --model-id TaylorAI/bge-micro --max-batch-tokens 280960 --port 7998 --max-client-batch-size 512

mainline

1 token requests, 512 clients:
Requests per second:    709.84 [#/sec] (mean)
512 token requests, 32 clients
Requests per second:    93.99 [#/sec] (mean)
1 token request, 1 client:
Time per request:       1.865 [ms] (mean)

This branch

1 token requests, 512 clients:
Requests per second:    888.68 [#/sec] (mean)
512 token requests, 32 clients
Requests per second:    118.40 [#/sec] (mean)
1 token request, 1 client:
Time per request:       1.267 [ms] (mean, across all concurrent requests)

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil michaelfeil changed the title draft: spawn blocking Serialization in tokio thread instead of blocking thread Nov 26, 2025
@michaelfeil michaelfeil changed the title Serialization in tokio thread instead of blocking thread Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models Nov 26, 2025
@michaelfeil (Contributor, Author) commented:

The idea to look into this mostly came up when finding #766. However, in #766 it is actually a good idea to just use std::thread, since that thread runs for the lifetime of the process. The threads in this PR are short-lived, so the tokio pool is the better fit.

@michaelfeil (Contributor, Author) commented:

openai codex review: michaelfeil#1 (comment)

@kozistr (Contributor) left a comment:

Looks good to me! Great findings!

@alvarobartt alvarobartt self-requested a review December 1, 2025 05:29
@alvarobartt alvarobartt self-assigned this Dec 1, 2025
