ZOOKEEPER-4925: Fix data loss due to propagation of discontinuous committedLog #2254

Open: wants to merge 1 commit into base: master

kezhuw
Member

@kezhuw kezhuw commented May 4, 2025

There are two variants of `ZooKeeperServer::processTxn`, and they have diverged significantly since ZOOKEEPER-3484. In addition to what `processTxn(TxnHeader hdr, Record txn)` does, `processTxn(Request request)` pops outstanding changes from `outstandingChanges` and adds the txn to `committedLog` for followers to sync. Since ZOOKEEPER-4394, the `Learner` uses `processTxn(TxnHeader hdr, Record txn)` to commit txns to memory, which means it leaves `committedLog` untouched in the `SYNCHRONIZATION` phase.
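To make the divergence concrete, here is a minimal sketch in plain Java. It is not ZooKeeper code: `ProcessTxnSketch`, `applyOnly`, and `applyAndLog` are hypothetical stand-ins for the two `processTxn` variants, zxids are modeled as plain `long` values, and the handling of `outstandingChanges` is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a server's in-memory state; not the real ZooKeeperServer.
class ProcessTxnSketch {
    // Txns applied to memory (stand-in for the in-memory database).
    final List<Long> applied = new ArrayList<>();
    // Recent txns kept for followers to sync from (stand-in for committedLog).
    final List<Long> committedLog = new ArrayList<>();

    // Models processTxn(TxnHeader hdr, Record txn): applies the txn only.
    void applyOnly(long zxid) {
        applied.add(zxid);
    }

    // Models processTxn(Request request): applies the txn and records it.
    void applyAndLog(long zxid) {
        applyOnly(zxid);
        committedLog.add(zxid);
    }

    // A follower that logs txns 1-2, syncs txns 3-4, then logs txn 5.
    static ProcessTxnSketch staleFollower() {
        ProcessTxnSketch s = new ProcessTxnSketch();
        s.applyAndLog(1L);
        s.applyAndLog(2L);
        s.applyOnly(3L);   // received during synchronization
        s.applyOnly(4L);   // received during synchronization
        s.applyAndLog(5L); // normal broadcast after sync
        return s;
    }
}
```

In this model the follower's applied state is complete, but its `committedLog` skips the txns that arrived while it was syncing.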

As a result, a stale follower will have a hole in its `committedLog` after joining the cluster. If it later becomes leader, it will propagate that in-memory hole to other stale nodes. This causes data loss.
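The propagation step can be illustrated with a simplified model of a diff-style sync, where the leader sends every entry of its `committedLog` that is newer than the peer's last zxid. The class `HoleyLeaderSketch` and the method `diffFor` are hypothetical names, and real leader-side sync logic is more involved than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a leader serving a diff from its committedLog.
class HoleyLeaderSketch {
    // Returns the txns a lagging peer would receive: everything in the
    // leader's committedLog with a zxid above the peer's last zxid.
    static List<Long> diffFor(List<Long> leaderCommittedLog, long peerLastZxid) {
        List<Long> diff = new ArrayList<>();
        for (long zxid : leaderCommittedLog) {
            if (zxid > peerLastZxid) {
                diff.add(zxid);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // The new leader's committedLog is missing txns 3 and 4.
        List<Long> leaderLog = List.of(1L, 2L, 5L);
        // A stale peer at zxid 2 receives only txn 5, so it never sees 3 and 4.
        System.out.println(diffFor(leaderLog, 2L));
    }
}
```

Under these assumptions the peer ends up at zxid 5 without ever applying txns 3 and 4, which is the data loss described above.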

The test case fails on master and 3.9.3 and passes on 3.9.2, so 3.9.3 is the only affected release.

This commit drops `processTxn(TxnHeader hdr, Record txn)`, since `processTxn(Request request)` also works in the `SYNCHRONIZATION` phase.

This commit also rejects discontinuous proposals in `syncWithLeader` and `committedLog`, so as to avoid possible data loss.
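A continuity check of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the check this commit implements: `ContinuousLogSketch` is a hypothetical class, zxids are plain `long` values, "continuous" is taken to mean exactly one more than the previous zxid, and epoch changes are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical log that refuses to record a gap; not the real committedLog.
class ContinuousLogSketch {
    private final List<Long> zxids = new ArrayList<>();

    // Appends a zxid, rejecting it if it does not directly follow the last one.
    // Assumption: continuity means previous zxid + 1 (epoch rollover ignored).
    void add(long zxid) {
        if (!zxids.isEmpty()) {
            long last = zxids.get(zxids.size() - 1);
            if (zxid != last + 1) {
                throw new IllegalStateException(
                    "discontinuous zxid " + zxid + " after " + last);
            }
        }
        zxids.add(zxid);
    }

    int size() {
        return zxids.size();
    }
}
```

Failing fast at insertion time keeps a hole from ever entering the log, so it cannot later be served to another node.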

Refs: ZOOKEEPER-4925, ZOOKEEPER-4394, ZOOKEEPER-3484
