
Conversation

@michaelfeil (Contributor) commented Nov 26, 2025

What does this PR do?

There is a performance bug that has been present since the initial release of TEI. The embeddings are serialized in a dedicated std::thread, with the intent of not blocking the main backend. While that idea is sound, it is better to run this work on a thread pool. tokio::task::spawn_blocking dispatches to exactly such a pool, with lazy warmup, so we can simply reuse the runtime's pool for this.
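The cost being removed here is per-request thread creation. A minimal std-only sketch of the idea (the `serialize` function and the single-worker channel pool are illustrative stand-ins, not TEI's actual code; tokio's blocking pool generalizes the pooled variant to many lazily-warmed workers):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the per-request serialization work.
fn serialize(v: u64) -> u64 {
    v.wrapping_mul(2654435761)
}

fn main() {
    const TASKS: u64 = 200;

    // Old approach: a fresh OS thread per task, so every request pays
    // thread creation and teardown.
    let mut sum_spawn = 0u64;
    for i in 0..TASKS {
        let handle = thread::spawn(move || serialize(i));
        sum_spawn = sum_spawn.wrapping_add(handle.join().unwrap());
    }

    // Pool approach: one long-lived worker thread; tasks are sent over a
    // channel, so thread creation is amortized across all requests. This is
    // the effect tokio::task::spawn_blocking provides via the runtime's pool.
    let (task_tx, task_rx) = mpsc::channel::<u64>();
    let (res_tx, res_rx) = mpsc::channel::<u64>();
    let worker = thread::spawn(move || {
        for v in task_rx {
            res_tx.send(serialize(v)).unwrap();
        }
    });

    let mut sum_pool = 0u64;
    for i in 0..TASKS {
        task_tx.send(i).unwrap();
        sum_pool = sum_pool.wrapping_add(res_rx.recv().unwrap());
    }
    drop(task_tx); // close the channel so the worker loop ends
    worker.join().unwrap();

    // Both strategies compute the same results; only the scheduling differs.
    assert_eq!(sum_spawn, sum_pool);
    println!("results match for {} tasks", TASKS);
}
```

For short-lived jobs like per-request serialization, the pooled variant avoids paying thread spawn latency on the request path, which is where the single-request latency win below comes from.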

Running a small model yields around ~20% more performance, in some cases up to 50%. It also leads to ~15% throughput improvements for small models.

text-embeddings-router --model-id TaylorAI/bge-micro --max-batch-tokens 280960 --port 7998 --max-client-batch-size 512

mainline

1 token requests, 512 clients:
Requests per second:    709.84 [#/sec] (mean)
512 token requests, 32 clients
Requests per second:    93.99 [#/sec] (mean)
1 token request, 1 client:
Time per request:       1.865 [ms] (mean)

This branch

1 token requests, 512 clients:
Requests per second:    888.68 [#/sec] (mean)
512 token requests, 32 clients
Requests per second:    118.40 [#/sec] (mean)
1 token request, 1 client:
Time per request:       1.267 [ms] (mean, across all concurrent requests)

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil michaelfeil changed the title draft: spawn blocking Serialization in tokio thread instead of blocking thread Nov 26, 2025
@michaelfeil michaelfeil changed the title Serialization in tokio thread instead of blocking thread Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models Nov 26, 2025
@michaelfeil (Contributor, Author) commented:

The idea to look into this mostly came up when finding #766. However, in #766 it is actually a good idea to just use std::thread, since that thread runs for the lifetime of the process. The threads in this PR are short-lived, so the tokio pool is the better fit.

@michaelfeil (Contributor, Author) commented:

openai codex review: michaelfeil#1 (comment)

@kozistr (Contributor) left a comment:

Looks good to me! Great findings!

@alvarobartt alvarobartt self-requested a review December 1, 2025 05:29
@alvarobartt alvarobartt self-assigned this Dec 1, 2025
