Skip peer discovery cleanup when backend returns error #14606

the-mikedavis · 2025-09-25T00:41:32Z

Previously, if the peer discovery backend returned an error while discovering nodes, the `service_discovery_nodes/0` helper returned an empty list. During cleanup this meant that any nodes unreachable during a partition had destructive actions taken against them: `rabbit_db_cluster:forget_member/2` and `rabbit_quorum_queue:shrink_all/1`. The `list_nodes/0` callback can fail transiently, though, and a failure shouldn't mean that the cluster is empty. It's safer to avoid cleaning up any nodes when the peer discovery backend fails to return the intended set of nodes.

I also raised the log level of the error from debug to info. Maybe we could go to warning without too much log spam as this cleanup action happens on a timer measured in seconds.

~~Opening this as a draft for now - I'd like to write a test case for this.~~

I've seen this in the wild: `rabbitmq_aws` fails to refresh its session token (because of a transient timeout) -> the request for EC2 metadata is unauthorized -> `list_nodes/0` returns the error tuple -> any node with the bad luck of being unreachable at that moment is forgotten and has its quorum queue data deleted. This is especially likely when a broker is underprovisioned and overloaded: the session token refresh can fail around the same time that nodes mistakenly think they are partitioned from one another due to `busy_dist_port` / clogged internode communication.
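The failure mode described above can be modeled in a few lines. This is a hedged sketch in Python, not the Erlang code changed by this PR: the function names `old_nodes_to_clean_up` and `nodes_to_clean_up`, the `("ok", ...)` / `("error", ...)` tuples, and the assumption that cleanup targets "unreachable nodes minus discovered nodes" are all illustrative stand-ins.

```python
from typing import Optional, Set, Tuple

# Hypothetical stand-in for an Erlang {ok, Nodes} | {error, Reason} result.
DiscoveryResult = Tuple[str, object]


def old_nodes_to_clean_up(discovery: DiscoveryResult, unreachable: Set[str]) -> Set[str]:
    """Model of the previous behaviour: an error collapses to an empty node list."""
    tag, payload = discovery
    discovered = set(payload) if tag == "ok" else set()
    # With an empty discovered set, every unreachable node looks like a stray.
    return unreachable - discovered


def nodes_to_clean_up(discovery: DiscoveryResult, unreachable: Set[str]) -> Optional[Set[str]]:
    """Model of the fixed behaviour: return None (skip cleanup) on a backend error."""
    tag, payload = discovery
    if tag == "error":
        # A failed lookup says nothing about cluster membership, so do nothing.
        return None
    return unreachable - set(payload)


if __name__ == "__main__":
    unreachable = {"rabbit@node-b"}
    failed = ("error", "unauthorized")
    print(old_nodes_to_clean_up(failed, unreachable))  # node-b would be removed
    print(nodes_to_clean_up(failed, unreachable))      # cleanup is skipped
```

Returning a distinct "skip" value instead of an empty collection keeps "the backend found no nodes" and "the backend could not answer" from being confused downstream.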

the-mikedavis self-assigned this Sep 25, 2025

rabbit_assert: Include timeout & polling interval in error

a28a5f7

Including this info in the error report can help with sanity checks in debugging `?awaitMatch/4` failures.
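To illustrate why the timeout and polling interval are useful in the failure report, here is a hedged Python sketch of a polling assertion. It is not the `?awaitMatch/4` macro itself: the helper name `await_match`, its parameters, and the message format are assumptions made for the example.

```python
import time
from typing import Any, Callable


def await_match(expected: Any, fetch: Callable[[], Any],
                timeout_ms: int, polling_interval_ms: int = 50) -> Any:
    """Poll fetch() until it returns expected, or fail once timeout_ms has elapsed."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        actual = fetch()
        if actual == expected:
            return actual
        if time.monotonic() >= deadline:
            # Report the timing parameters so a reader can tell at a glance
            # whether the assertion was simply given too little time.
            raise AssertionError(
                f"await_match failed: expected={expected!r}, last value={actual!r}, "
                f"timeout={timeout_ms}ms, polling_interval={polling_interval_ms}ms"
            )
        time.sleep(polling_interval_ms / 1000)
```

With the parameters in the message, a report showing an unexpectedly small timeout or a very coarse polling interval points at the test setup instead of the code under test.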

the-mikedavis force-pushed the md/peer-disc-cleanup-error branch from 5540c69 to cfc50d5 Compare September 30, 2025 01:36

the-mikedavis added 2 commits September 29, 2025 21:45

Test peer discovery cleanup

f11198a

the-mikedavis force-pushed the md/peer-disc-cleanup-error branch from cfc50d5 to 2d4f19c Compare September 30, 2025 01:45

the-mikedavis marked this pull request as ready for review September 30, 2025 02:20

michaelklishin added this to the 4.3.0 milestone Oct 8, 2025

michaelklishin added the backport-v4.2.x label Oct 8, 2025

michaelklishin merged commit df7b065 into main Oct 8, 2025
285 of 286 checks passed

michaelklishin deleted the md/peer-disc-cleanup-error branch October 8, 2025 02:34

mergify bot mentioned this pull request Oct 8, 2025

Skip peer discovery cleanup when backend returns error (backport #14606) #14706

Merged

michaelklishin added a commit that referenced this pull request Oct 8, 2025

Merge pull request #14706 from rabbitmq/mergify/bp/v4.2.x/pr-14606

0012652

Skip peer discovery cleanup when backend returns error (backport #14606)
