Async Executor/Runner slows to a halt with jobs that auto-retry with default (high) max_wait
#642
Comments
Problem Investigation: when too many jobs (i.e. hundreds or more) are calling the same API, a throughput limit is hit and many of the jobs need to wait and retry. With the default (high) `max_wait`, those retries back off for so long that the whole run slows to a halt. Proposal: PR created: #643, please review.
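As an illustration of the failure mode, here is a minimal, generic sketch of this kind of retry policy: exponential backoff capped by a `max_wait`-style setting (shown with tenacity; the names, defaults, and simulated API are illustrative assumptions, not ragas's actual code). With a high cap and hundreds of throttled jobs, most tasks spend their time sleeping in backoff, so the run appears stuck.

```python
import asyncio
import random

from tenacity import retry, stop_after_attempt, wait_random_exponential

# Illustrative cap standing in for a "max_wait"-style setting.
MAX_WAIT_SECONDS = 60


class ThrottledError(Exception):
    """Raised when the simulated API reports a rate limit."""


@retry(
    wait=wait_random_exponential(multiplier=1, max=MAX_WAIT_SECONDS),
    stop=stop_after_attempt(10),
)
async def call_api(job_id: int) -> str:
    # Simulate a throughput limit: with many concurrent jobs, most calls
    # get throttled and re-enter the backoff loop.
    if random.random() < 0.7:
        raise ThrottledError(f"job {job_id} throttled")
    return f"job {job_id} ok"


async def main() -> None:
    # Hundreds of unbounded jobs, all backing off for up to MAX_WAIT_SECONDS
    # at a time, is what makes overall progress look frozen.
    results = await asyncio.gather(
        *(call_api(i) for i in range(500)), return_exceptions=True
    )
    print(sum(isinstance(r, str) for r in results), "jobs finished")


if __name__ == "__main__":
    asyncio.run(main())
```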
In case this doesn't get attention, excuse the mention: @jjmachan
Same problem here: when the number of document chunks is > 500, the program will probably get stuck. Appreciate your PR.
hey @joy13975, really sorry I have not been able to get to this - but you are right about the last comment: max_workers is an old arg we don't use. As for your PR - really, really appreciate it ❤️, will get both reviewed and respond tonight. Thanks again for your help bringing this up 🙌🏽
**Added optional Semaphore-based concurrency control for #642**

As for the default value for `max_concurrency`: I don't know the ratio of API users vs. local LLM users, so the proposed default is an opinionated value of `16`.

* I *think* more people use the OpenAI API for now vs. local LLMs, thus the default is not `-1` (no limit).
* `16` seems to be reasonably fast and doesn't seem to hit the throughput limit in my experience.

**Tests**

Embedding for 1k documents finished in <2 min, and the subsequent testset generation for `test_size=1000` proceeded without getting stuck:

<img width="693" alt="image" src="https://github.com/explodinggradients/ragas/assets/6729737/d83fecc8-a815-43ee-a3b0-3395d7a9d244">

Another 30 s later:

<img width="725" alt="image" src="https://github.com/explodinggradients/ragas/assets/6729737/d4ab08ba-5a79-45f6-84b1-e563f107d682">

Co-authored-by: Jithin James <[email protected]>
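For context, the pattern in question is an `asyncio.Semaphore` gating how many coroutines may be in flight at once. Below is a minimal, self-contained sketch of that idea, not the PR's actual code; `run_with_limit` and `fake_api_call` are hypothetical names, and `16` simply mirrors the proposed default.

```python
import asyncio

# Mirrors the proposed default of 16; -1 would mean "no limit".
MAX_CONCURRENCY = 16


async def run_with_limit(jobs, max_concurrency: int = MAX_CONCURRENCY):
    """Run awaitable factories with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        # Each job must acquire the semaphore before starting, so no more
        # than `max_concurrency` API calls are in flight at once.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_api_call(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an LLM or embedding request
    return i


if __name__ == "__main__":
    jobs = [lambda i=i: fake_api_call(i) for i in range(1000)]
    results = asyncio.run(run_with_limit(jobs))
    print(len(results), "jobs done")
```

Unlike lowering `max_wait`, this leaves the retry policy intact and instead prevents hundreds of requests from hitting the provider's throughput limit at the same time.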
Thanks to @joy13975, we have fixed this, guys 🙂
Describe the bug
Anything that a) has a throttle limit and b) uses `Executor` (and in turn, `Runner`), such as OpenAI's API, will slow to a halt if too many jobs are requested. Specifically, in my case I was attempting to generate a testset for ~1k documents.
This refuses to even start if either there are too many documents (roughly >300) or `test_size` is too high. For example, with 500 docs it was still stuck at 0% after 20 min. Reducing `test_size` didn't help with a high document count, because the place it got stuck was `docstore.add_documents`. However, reducing both the document count and `test_size` to < 100 did get it going at a reasonable speed.

Ragas version: 0.1.2.dev8+gc18c7f4
Python version: 3.9.13
Code to Reproduce
Below is just one example that uses `Executor` & `Runner` with a higher job count (i.e. 1,000). Document contents averaged ~800 characters. The OpenAI API is used as the LLM.
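A minimal sketch of this kind of reproduction, assuming the ragas 0.1-era testset-generation API (`TestsetGenerator.with_openai`, `generate_with_langchain_docs`) and synthetic LangChain documents; the reporter's actual script may have differed:

```python
# Sketch only: assumes the ragas 0.1-era testset generation API and an
# OPENAI_API_KEY in the environment; the actual reproduction may differ.
from langchain_core.documents import Document

from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

# ~1,000 synthetic documents of roughly 800 characters each.
documents = [
    Document(
        page_content=f"Section {i}. " + ("Lorem ipsum dolor sit amet. " * 28),
        metadata={"source": f"doc_{i}.txt"},
    )
    for i in range(1000)
]

generator = TestsetGenerator.with_openai()  # OpenAI API as LLM and embeddings

# With test_size=1000 (or even just >300 documents) this is where the run
# hangs: docstore.add_documents / generation never progresses.
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=1000,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
print(testset.to_pandas().head())
```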
Error trace
No error, it just gets stuck.
Expected behavior
Don't get stuck with a high job count.
Additional context
Will add my findings in comments below. Related: #394