Use tick-related timeout to repair leader record #14672

acogoluegnes · 2025-10-03T12:25:27Z

A quorum queue tries to repair its record in a tick handler. This can happen during a network partition, when the metadata store may itself be unavailable, making the update likely to time out.

The default metadata store timeout is usually higher than the tick interval, so the tick handler may be stuck for several ticks. The record can then take up to timeout + tick interval (30 + 5 seconds by default) to be updated, which is significantly longer than it takes the metadata store to trigger an election and recover.

Client applications may rely on the quorum queue topology to connect to an appropriate node, so making the system reflect the actual topology faster is important to them.

This commit makes the record update operations use a timeout 1 second lower than the tick interval. The tick handler process should finish earlier when the metadata store is unavailable, and once the metadata store is available again it should not take more than a couple of ticks to update the record.
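To make the arithmetic concrete, here is a minimal Python sketch of the timing described above. It is an illustration only, not the RabbitMQ code: the function names and the 1 ms lower bound are assumptions made for this example, and the "after" figure simply applies the same timeout + tick interval bound to the new timeout.

```python
# Defaults quoted in the description above, in milliseconds.
DEFAULT_METADATA_STORE_TIMEOUT_MS = 30_000
DEFAULT_TICK_INTERVAL_MS = 5_000
MARGIN_MS = 1_000  # "1 second lower than the tick interval"


def repair_timeout_ms(tick_interval_ms: int, margin_ms: int = MARGIN_MS) -> int:
    """Hypothetical helper: timeout for the record update, below the tick interval.

    The floor of 1 ms is an assumption of this sketch so that very short
    tick intervals never produce a zero or negative timeout.
    """
    return max(tick_interval_ms - margin_ms, 1)


def worst_case_update_delay_ms(timeout_ms: int, tick_interval_ms: int) -> int:
    """Bound used in the description: one full timeout plus one tick interval."""
    return timeout_ms + tick_interval_ms


before = worst_case_update_delay_ms(
    DEFAULT_METADATA_STORE_TIMEOUT_MS, DEFAULT_TICK_INTERVAL_MS
)
after = worst_case_update_delay_ms(
    repair_timeout_ms(DEFAULT_TICK_INTERVAL_MS), DEFAULT_TICK_INTERVAL_MS
)
print(f"before: {before} ms, after: {after} ms")
```

With the default values, the bound drops from 35 seconds to 9 seconds under these assumptions, which is consistent with the "couple of ticks" expectation stated above.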

dumbbell

Two comments about type specs, otherwise it looks good to me.

deps/rabbit/src/rabbit_db_queue.erl

deps/rabbit/src/rabbit_amqqueue.erl

Use tick-related timeout to repair leader record (backport #14672)

acogoluegnes added this to the 4.3.0 milestone Oct 3, 2025

acogoluegnes added bug backport-v4.2.x labels Oct 3, 2025

acogoluegnes force-pushed the use-timeout-for-leader-record-repair branch from 5b2b01e to 183aaa1 Compare October 3, 2025 12:42

acogoluegnes marked this pull request as ready for review October 3, 2025 15:12

dumbbell requested changes Oct 3, 2025

deps/rabbit/src/rabbit_db_queue.erl (outdated, resolved)

deps/rabbit/src/rabbit_db_queue.erl (outdated, resolved)

acogoluegnes force-pushed the use-timeout-for-leader-record-repair branch from 183aaa1 to 489ee8b Compare October 6, 2025 07:23

kjnilsson reviewed Oct 6, 2025

deps/rabbit/src/rabbit_amqqueue.erl (resolved)

dumbbell approved these changes Oct 6, 2025

acogoluegnes force-pushed the use-timeout-for-leader-record-repair branch from 489ee8b to 8387d73 Compare October 6, 2025 08:43

acogoluegnes merged commit 3f719d5 into main Oct 6, 2025
576 of 577 checks passed

acogoluegnes deleted the use-timeout-for-leader-record-repair branch October 6, 2025 09:28

mergify bot mentioned this pull request Oct 6, 2025

Use tick-related timeout to repair leader record (backport #14672) #14698

Merged

acogoluegnes added a commit that referenced this pull request Oct 6, 2025

Merge pull request #14698 from rabbitmq/mergify/bp/v4.2.x/pr-14672

4100450

Use tick-related timeout to repair leader record (backport #14672)
