-
Notifications
You must be signed in to change notification settings - Fork 5.5k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
9 additions
and
12 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
5249eac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dwoz Are you absolutely sure that replacing salt.transport.client.AsyncReqChannel.factory() with salt.channel.client.ReqChannel.factory() for _send_req_async() is the right way to go? I'm investigating poor syndic performance after 3003 -> 3005 upgrade and this change kindly backported to 3003 with #64032 is now on my short list of the performance bottlenecks.
To be honest I can't say why both _send_req_sync() and _send_req_async() used AsyncReqChannel.factory() before but salt-syndic used _send_req_async() to for _return_pub_multi() so chances are _send_req_sync() was not used at all.
5249eac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After running many tests I'm quite sure _forward_events() blocks for pretty much as long as the actual send takes when it uses that salt.channel.client.ReqChannel.factory() as self.async_req_channel. With AsyncReqChannel.factory() it returns promptly which allows the syndic to do other stuff such as send events back to other masters (that actually happens in the same loop, just with that pseudo-asynchronous code each MoM needs to wait its turn for long) or receive new job results from the local master while send() still goes on asynchronously.
Tested with big job return payloads and syndics/targets located on a different continent than the MoM.
5249eac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if @dwoz or anyone else reads it but syndic performance with this patch is terrible under load - changing the above code to use AsyncReqChannel.factory() for self.async_req_channel helps a bit but it's still far behind 3003 performance-wise. I'm talking an installation with ~100k minions, dozens of syndics and many MoMs. It ends up with timeouts, duplicate events, big delays etc. Funnily enough despite keeping a connection to 4506 open to each MoM syndic keeps opening new ones too (they get closed why that initial one is kept open). I could not figure out the exact reason why this is happening.
Reverting Syndic._send_req_async() to what it was in 3003 seems to be a real game changer. Not only fixes it performance but also stops these dodgy "Got events for closed stream None" warnings from appearing in the log.
Maybe it all works smooth in 3006, I only tried 3005 with #64032 but I haven't noticed any revolutionary changes in the relevant parts of the latest Salt code.
Mind you, I doubt Syndic uses send_req_sync for anything.
5249eac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We reverted this change in 3005 (#64032 backport) and made syndic send results as it did in 3003 and it started flying. For some reason these persistent req channels are counter-productive, despite common sense. Not only doing it over these persistent channels results in these dodgy "Got events for closed stream None" warnings but it causes long delays and duplicate results sent back to MoMs under stress.