Threads lock scenario at BulkIngester // FnCondition with high concurrency setup #651
Comments
Hi, I have the same problem with Java API client version 8.12.1 and Elasticsearch version 8.12.1. One detail I noticed is that when printing the id of the bulk in beforeBulk, there were jumps where the ids should have been sequential. @codehustler, did you find any other solution? |
Hello, I'd like to try and reproduce this. I have used the BulkIngester recently with an 80K-row document and nothing of the sort happened. Could you provide the code for the BulkIngester configuration? Thank you. |
Hello, We have observed this when there is a high level of parallelism. With a c5.9xlarge instance (36 cores) it happened to us very often; however, after we changed the instance to a c5.4xlarge (16 cores), it has not happened again. The bulk Listener that we use just logs the errors and retries depending on the error. |
Thank you @victorGS18, we'll investigate this. |
I believe that we are running into this very same issue. We have multiple threads all sending updates to a single BulkIngester. They seem to hang in the same spot at |
@victorGS18 I tried to reproduce this again, starting from your configuration and then tweaking it, but I'm yet again failing to reproduce it. Could you try running it without the Listener and see if it gets stuck without it as well? @nmaves the BulkIngester already uses a threadpool underneath, so I'd avoid accessing it from multiple threads. |
@l-trotta Thanks for the update. I would ask that you update the docs to mention that this class is NOT thread safe. |
Hello again, sorry I didn't check the code thoroughly enough, the BulkIngester should indeed be thread safe. I'll perform more tests, sorry for the confusion. |
Final update: I've managed to reproduce the thread lock and find the issue. The problem lies with the fact that the Listener code can be executed by any of the BulkIngester threads. If the code performs many retries, or gets stuck for some reason, all of the BulkIngester threads will get stuck there at some point. The issue will be solved soon by having a thread pool execute the Listener code, and not just any thread. Thank you for your patience, this will be out in the next release after the PR gets merged. |
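The mechanism described above can be sketched without the client library. This is a minimal illustration, not the BulkIngester code: `ListenerOffloadDemo`, `sentWhileCallbacksBlocked`, and both pools are hypothetical names, and the "listener" is simulated by a callback that blocks until released. It shows that when a slow callback runs on the same small pool that sends requests, every worker ends up parked in the callback, whereas handing the callback to a separate pool keeps the workers free.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for "worker threads that also run listener code".
class ListenerOffloadDemo {

    /**
     * Runs `tasks` jobs on a pool of `workers` threads. Each job counts itself
     * as sent and then invokes a callback that blocks until released.
     * Returns how many jobs were sent while the callbacks were still blocked.
     */
    static int sentWhileCallbacksBlocked(int workers, int tasks, boolean offloadCallbacks) {
        ExecutorService workerPool = Executors.newFixedThreadPool(workers);
        ExecutorService callbackPool = Executors.newCachedThreadPool();
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger sent = new AtomicInteger();

        // Simulates a listener that retries or hangs for a long time.
        Runnable slowCallback = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        for (int i = 0; i < tasks; i++) {
            workerPool.submit(() -> {
                sent.incrementAndGet();
                if (offloadCallbacks) {
                    callbackPool.submit(slowCallback); // worker returns immediately
                } else {
                    slowCallback.run();                // worker is stuck in the callback
                }
            });
        }

        try {
            // Poll until all jobs were sent or a short deadline passes.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(500);
            while (sent.get() < tasks && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
            return sent.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Unblock the callbacks so both pools can drain and shut down.
            release.countDown();
            workerPool.shutdown();
            callbackPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline:    " + sentWhileCallbacksBlocked(2, 6, false));
        System.out.println("offloaded: " + sentWhileCallbacksBlocked(2, 6, true));
    }
}
```

With 2 workers and 6 jobs, the inline variant stalls after 2 jobs because both workers are blocked in the callback, while the offloaded variant sends all 6. The cached pool for callbacks is only one possible choice; a bounded pool would trade stalls for back-pressure.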
Thank you so much! @l-trotta |
@victorGS18 sorry for the confusion, the fix will be available in the next minor release, so 8.15.0, because we introduced breaking changes to the bulk ingester. |
Perfect @l-trotta , I got confused when I saw the changes in the versions |
Hi @l-trotta, I can confirm that this issue still persists in version 8.15.0. The root cause remains as the original author described, with threads getting stuck in the I am able to reproduce this issue by having multiple threads call The reason multiple threads are getting stuck is that when Additionally, during the flush process, only one thread is woken up due to the To resolve this issue, the solution is to replace the |
Hello @aliariff, I have managed to reproduce the original issue (ingester getting stuck at 99%), by configuring the bulk ingester so that it has very low |
Hi @l-trotta, Yes, that's correct. In our configuration, the |
@l-trotta I've started using the latest version of the client (8.15.1) and the same behavior persists. Threads keep blocking when there are more threads running than configured in maxConcurrentRequests. |
@victorGS18 at this point I think I'll need either a full reproducer project or detailed information on how you're running the BulkIngester, because I really cannot reproduce this and I'm suspecting configuration differences. The information needed (generalizing for anyone else who might have the same problem)
Sorry this is taking so long, and thanks for the patience. |
@aliariff did 8.15.1 solve your case? |
Hi @l-trotta Yes, the |
Hi @l-trotta , BulkIngester configuration
Scheduler configuration
threadpool configuration Listener configuration
I think this is all the information you asked for. Thanks! |
@victorGS18 which library is |
@l-trotta It is an internal utility class. Sorry, I thought I had added all the code for external classes.
|
@victorGS18 I set up a project using your code with high concurrency adding operations to a BulkIngester instance configured in exactly the same way, but I wasn't able to reproduce the thread lock, and unfortunately there's not much I can do if I cannot reproduce the issue locally. The way forward is, if you're able to set up a reproducer which can reliably reproduce the issue then we can keep working on it, otherwise it could be a machine/OS/JVM specific problem which we have no control over. A couple of things that could help, that I noticed while working with your code: Thank you for all the information provided so far, I'll keep this open in case there are other updates or other confirmed cases from other users. |
I am able to reproduce a thread lock using Elasticsearch client 8.15.1, 8.15.5 or 8.17.0. It seems the issue happens when multiple bulks are blocked at the same time, because I can't reproduce it with bulk concurrent requests = 1 or with bigger bulks (1000 records per bulk). Note that the same test is OK with the deprecated Bulk Processor. |
@nicolasm35 this looks like a different issue to me, and it's tied to the BulkIngester's lack of a retry functionality (which we are working on), something the Bulk Processor used to have. |
@l-trotta but how do you explain there is no issue with 1 bulk concurrent request? |
Hey @nicolasm35, sorry, I misread parts of your bug explanation. This is something we need to investigate, regardless of retry policy. Could you please open a new issue, including as many details as possible (for example, whether any listener was configured and which Java version you were using)? Thank you. |
Java API client version
7.17.12
Java version
11
Elasticsearch Version
7.17.12
Problem description
Hi, I think I have found a bug with the BulkIngester, maybe an issue with the locks.
The problem is that only certain dev machines and some servers show this issue. We run the 7.17.12 Java client library. I cannot 100% figure out what is going on, and it probably makes no sense to create a ticket for this without being able to reproduce it properly. I have attached a thread dump which shows several threads still waiting; I hope this helps.
More context:
We use the bulk ingester to index a file with ~12k documents (just one example file). It runs to 99%, then gets stuck, and because we have configured a 10-second flush interval on the BulkIngester, every 10 seconds we see a bulk context getting flushed with just a single document in it. This goes on for 3 to 4 minutes, and every 10 seconds it is the same picture: one bulk request with a single add operation. A thread dump shows that some threads are waiting in BulkIngester.add, which is waiting inside FnCondition.whenReadyIf(...) at the "awaitUninterruptibly" call. So it seems that one bulk request comes back with a single request in it, which triggers the addCondition.signalIfReady() call, which then lets the next request through, but again with just one single request in it. This does not happen when debugging, and it does not happen when adding a per-document log message; that's why I think it is a race condition somewhere. If I change addCondition.signalIfReady() to signalAllIfReady, it works, but I would really like to find out the actual root cause of this!
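To make the difference between waking one waiter and waking all of them concrete, here is a minimal sketch using plain `java.util.concurrent.locks`. It is not the client's FnCondition implementation; `SignalVsSignalAllDemo` and `threadsReleased` are hypothetical names, and the guarded state is just a boolean flag. Several threads park on one `Condition` via `awaitUninterruptibly()`, the flag is flipped once, and the method reports how many threads got through.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of signal() versus signalAll(); not the client's code.
class SignalVsSignalAllDemo {

    /** Parks `waiters` threads on one condition, opens the gate once, and counts who passes. */
    static int threadsReleased(int waiters, boolean useSignalAll) {
        ReentrantLock lock = new ReentrantLock();
        Condition ready = lock.newCondition();
        boolean[] open = {false};               // guarded by lock
        AtomicInteger released = new AtomicInteger();
        Thread[] threads = new Thread[waiters];

        for (int i = 0; i < waiters; i++) {
            threads[i] = new Thread(() -> {
                lock.lock();
                try {
                    while (!open[0]) {
                        ready.awaitUninterruptibly();
                    }
                    released.incrementAndGet();
                } finally {
                    lock.unlock();
                }
            });
            threads[i].setDaemon(true);         // leftover waiters must not keep the JVM alive
            threads[i].start();
        }

        try {
            // Wait until every thread is parked on the condition.
            boolean allParked = false;
            while (!allParked) {
                lock.lock();
                try {
                    allParked = lock.getWaitQueueLength(ready) == waiters;
                } finally {
                    lock.unlock();
                }
                if (!allParked) {
                    Thread.sleep(1);
                }
            }

            // Open the gate and wake either one waiter or all of them.
            lock.lock();
            try {
                open[0] = true;
                if (useSignalAll) {
                    ready.signalAll();
                } else {
                    ready.signal();
                }
            } finally {
                lock.unlock();
            }

            // Give woken threads a bounded amount of time to finish.
            for (Thread t : threads) {
                t.join(200);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return released.get();
    }

    public static void main(String[] args) {
        System.out.println("signal():    " + threadsReleased(4, false));
        System.out.println("signalAll(): " + threadsReleased(4, true));
    }
}
```

In this sketch a single `signal()` lets exactly one of the four waiters proceed even though the condition is true for all of them, which matches the "one request every flush interval" pattern described above. Whether that is the actual root cause in the client is not established by this example.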
I have a 32 core CPU, we are collecting and preparing our index documents in parallel. When I limit the pool to 8 threads, then it also works just fine.
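The workaround of limiting the producer pool can be sketched as follows. This is an assumption-laden illustration, not our actual indexing code: `BoundedProducersDemo` and `peakConcurrency` are hypothetical names, and the `Thread.sleep` call stands in for preparing a document and handing it to the ingester. It only demonstrates that a fixed-size pool caps how many producers run at once, regardless of core count.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap the number of concurrent producers with a fixed pool.
class BoundedProducersDemo {

    /** Runs `tasks` simulated producers on `poolSize` threads and returns the peak concurrency seen. */
    static int peakConcurrency(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // stand-in for preparing a document and adding it
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent producers: " + peakConcurrency(8, 100));
    }
}
```

Capping the pool reduces contention on the ingester but does not explain why higher parallelism stalls it, so it should be read as a mitigation only.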
thread_dump.txt