Skip to content

Conversation

the-mikedavis
Copy link
Collaborator

@the-mikedavis the-mikedavis commented Sep 25, 2025

Previously if the peer discovery backend returned an error from failing to discover nodes, the service_discovery_nodes/0 helper returned an empty list. During cleanup this would mean that any nodes unreachable during a partition would have destructive action taken against them: rabbit_db_cluster:forget_member/2 and rabbit_quorum_queue:shrink_all/1. The list_nodes/0 callback can fail transiently, though, and a failure shouldn't mean that the cluster is empty. It's safer to avoid cleaning up any nodes when the peer discovery backend fails to return the intended set of nodes.

I also raised the log level of the error from debug to info. Maybe we could go to warning without too much log spam as this cleanup action happens on a timer measured in seconds.


Opening this as a draft for now - I'd like to write a test case for this. I've seen this in the wild when rabbitmq_aws fails to refresh its session token (because of a transient timeout) -> unauthorized request to get EC2 metadata -> list_nodes/0 returns the error tuple -> any node with the bad luck of being unreachable at that moment is forgotten and has its QQ data deleted. Especially if a broker is underprovisioned and overloaded, the session token refresh can fail around the same time that nodes mistakenly think they are partitioned from one another due to busy_dist_port / clogged internode communication.

@the-mikedavis the-mikedavis self-assigned this Sep 25, 2025
Including this info in the error report can help with sanity checks in
debugging `?awaitMatch/4` failures.
@the-mikedavis the-mikedavis force-pushed the md/peer-disc-cleanup-error branch from 5540c69 to cfc50d5 Compare September 30, 2025 01:36
Previously if the peer discovery backend returned an error from failing
to discover nodes, the `service_discovery_nodes/0` helper returned an
empty list. During cleanup this would mean that any nodes unreachable
during a partition would have destructive action taken against them:
`rabbit_db_cluster:forget_member/2` and `rabbit_quorum_queue:shrink_all/1`.
The `list_nodes/0` callback can fail transiently, though, and a failure
shouldn't mean that the cluster is empty. It's safer to avoid cleaning
up any nodes when the peer discovery backend fails to return the
intended set of nodes.
@the-mikedavis the-mikedavis force-pushed the md/peer-disc-cleanup-error branch from cfc50d5 to 2d4f19c Compare September 30, 2025 01:45
@the-mikedavis the-mikedavis marked this pull request as ready for review September 30, 2025 02:20
@michaelklishin michaelklishin added this to the 4.3.0 milestone Oct 8, 2025
@michaelklishin michaelklishin merged commit df7b065 into main Oct 8, 2025
285 of 286 checks passed
@michaelklishin michaelklishin deleted the md/peer-disc-cleanup-error branch October 8, 2025 02:34
michaelklishin added a commit that referenced this pull request Oct 8, 2025
Skip peer discovery cleanup when backend returns error (backport #14606)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants