[DOCS] Master Cluster Tutorial - Suggested HAProxy Timeout Settings Cause Unstable Communication #66888

Open
clayoster opened this issue Sep 14, 2024 · 2 comments
Labels
Documentation Relates to Salt documentation needs-triage

Comments

@clayoster
Contributor

Description
I have been testing the suggested HAProxy configuration from the Master Cluster tutorial and have found that the suggested client/server timeout values of 1m cause unstable minion communication, specifically with the publisher port (4505).

I am using the default transport with 3 masters and 50 minions all running 3007.1. My HAProxy version is 2.6.12-1 (Debian 12), though I have tested older and newer versions with the same results.

Adjusting TCP keepalive values on the masters and minions does not seem to affect HAProxy closing TCP sessions after 1 minute of inactivity. Reducing tcp_keepalive_idle does speed up minions reconnecting after HAProxy closes the connection though.

It seems that no matter how frequently the master and minions send keepalives, HAProxy will close the sessions after 1 minute if no data is sent through the session. If I run something like salt '*' test.ping every 30 seconds, this keeps the sessions with the publisher port alive for longer than 1 minute.

To Reproduce:

  • On the HAProxy server, run watch "netstat -nalpt | grep 4505" and watch for TCP sessions to switch from "ESTABLISHED" to "TIME_WAIT". This should happen within a minute. Run salt '*' test.ping from the master while sessions are in this state and you'll see minions fail to respond, because they never saw the event published from the master.
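
For anyone who would rather tally the session states than eyeball the watch output, here is a minimal Python sketch of the same check. The function name count_states and the SAMPLE text are made up for illustration (the addresses are documentation-range placeholders, not output from my systems), and it assumes the usual Linux netstat column layout.

```python
from collections import Counter

# Made-up sample in the usual Linux `netstat -nat` layout (illustrative only).
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:4505            0.0.0.0:*               LISTEN      812/haproxy
tcp        0      0 192.0.2.10:4505         192.0.2.21:40112        ESTABLISHED 812/haproxy
tcp        0      0 192.0.2.10:4505         192.0.2.22:51844        TIME_WAIT   -
tcp        0      0 192.0.2.10:39020        192.0.2.31:4505         ESTABLISHED 812/haproxy
tcp        0      0 192.0.2.10:4506         192.0.2.21:40200        ESTABLISHED 812/haproxy
"""


def count_states(netstat_output: str, port: int = 4505) -> Counter:
    """Count TCP states for sessions whose local or remote port is `port`."""
    suffix = f":{port}"
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or remote.endswith(suffix):
            states[state] += 1
    return states


print(dict(count_states(SAMPLE)))
```

Feeding it real netstat output every few seconds should show the ESTABLISHED count for 4505 dropping and TIME_WAIT rising once the idle timeout is hit.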

I am currently using timeout values of 12h on the publisher and request server ports to reduce how often TCP sessions are killed off. While this probably isn't the best solution, it does keep minion communication stable, since minions have to re-establish their connection with the master far less often.

Suggested Fix
Is there other configuration expected to be set on the master and minion to allow stable minion communication with the suggested 1 minute timeouts in HAProxy?

Type of documentation
Tutorial

Location or format of documentation
https://docs.saltproject.io/en/latest/topics/tutorials/master-cluster.html

@clayoster clayoster added Documentation Relates to Salt documentation needs-triage labels Sep 14, 2024
@dwoz
Contributor

dwoz commented Sep 14, 2024

Yes, the timeouts for the publish port should match 'publish_session'. The docs need to be updated. This has been in my backlog for some time.
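
One possible reading of "match", as a rough Python sketch rather than the documented fix: take the master's publish_session value (in seconds, 86400 by default) and emit the corresponding HAProxy timeout directives for the publish port. The helper name pub_port_timeouts is hypothetical, and placing timeout client in the frontend and timeout server in the backend is an assumption about how the tutorial's sections are laid out.

```python
from typing import Dict


def pub_port_timeouts(publish_session: int = 86400) -> Dict[str, str]:
    """Return HAProxy timeout directives equal to publish_session (seconds).

    Hypothetical helper: the keys name the HAProxy section each directive
    would go in for the publish port (4505).
    """
    if publish_session <= 0:
        raise ValueError("publish_session must be a positive number of seconds")
    value = f"{publish_session}s"  # HAProxy accepts an "s" unit suffix
    return {
        "frontend": f"timeout client {value}",
        "backend": f"timeout server {value}",
    }


for section, directive in pub_port_timeouts().items():
    print(f"{section}: {directive}")
```

With the default this yields timeout client 86400s and timeout server 86400s; a lower publish_session would shrink both timeouts together.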

@clayoster
Contributor Author

Would you recommend leaving publish_session at the default of 24 hours, or setting it to something lower? Additionally, would it be a good idea to set ping_on_rotate: True in this configuration?
