
WIP replication: Fix snapshot fiber joining on itself #5183


Draft
abhijat wants to merge 1 commit into main from abhijat/fix/snapshot-fb-self-join

Conversation

@abhijat (Contributor) commented May 26, 2025

On the error path for an incremental snapshot, the stream is finalized from the snapshot fiber itself by calling SliceSnapshot::FinalizeJournalStream. This ends up joining the snapshot fiber from within that same fiber, which triggers an assertion.

This change removes that call. It is redundant and safe to remove because on the error path we call the context's ReportError method. The context has an error handler set from the replica info, which cancels the replication, and the cancellation chain eventually calls Finalize, so the removed call is not needed.

FIXES #5135
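
A minimal sketch of the failure mode with OS threads in Python (Dragonfly uses fibers, and the names below, Context, report_error, snapshot_worker, are illustrative rather than the actual Dragonfly API):

    # Sketch only: mirrors the chain described above
    # (ReportError -> context error handler -> cancel -> Finalize -> join),
    # with Python threads standing in for fibers.
    import threading


    class Context:
        """Collects the first reported error; the owner reacts to it."""

        def __init__(self):
            self._first_error = None
            self._flag = threading.Event()

        def report_error(self, err):
            if not self._flag.is_set():
                self._first_error = err
                self._flag.set()

        def wait_for_error(self, timeout=None):
            self._flag.wait(timeout)
            return self._first_error


    def snapshot_worker(ctx):
        # Joining yourself is never allowed: with OS threads it raises
        # "RuntimeError: cannot join current thread", with fibers it trips
        # an assertion. So on error the worker only reports and returns.
        try:
            raise RuntimeError("missing lsn")  # simulated partial-sync failure
        except RuntimeError as err:
            ctx.report_error(err)              # do NOT finalize/join from here


    def main():
        ctx = Context()
        worker = threading.Thread(target=snapshot_worker, args=(ctx,))
        worker.start()

        err = ctx.wait_for_error(timeout=5)
        if err is not None:
            # Owner-side error handler: cancel replication, finalize the
            # journal stream, and only then join the worker from outside it.
            print(f"cancelling replication after: {err}")
        worker.join()


    if __name__ == "__main__":
        main()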

@abhijat changed the title from "replication: Fix snapshot fiber joining on itself" to "WIP replication: Fix snapshot fiber joining on itself" on May 26, 2025
@romange (Collaborator) commented May 26, 2025

Is there any way to reproduce this? For example, would breaking the stable sync in a loop of 100 iterations help?

@adiholden (Collaborator)

> Is there any way to reproduce this? For example, would breaking the stable sync in a loop of 100 iterations help?

I don't think we have partial sync between a master and its replica.
The flow for which we enabled partial sync is (a rough sketch follows after these steps):

  1. create a master and 2 replicas
  2. close the master server, promote one replica to master, and make the other replica a replica of the new master
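
A rough pytest-style sketch of this flow using redis-py; the ports and the start_dragonfly fixture (and its stop() method) are hypothetical stand-ins for Dragonfly's own test fixtures:

    # Rough sketch only: start_dragonfly, its stop() method and the ports are
    # hypothetical stand-ins for Dragonfly's own pytest fixtures.
    import redis

    MASTER, REPLICA_1, REPLICA_2 = 6379, 6380, 6381


    def connect(port):
        return redis.Redis(port=port)


    def test_failover_partial_sync(start_dragonfly):
        # 1. create a master and two replicas
        procs = {p: start_dragonfly(port=p) for p in (MASTER, REPLICA_1, REPLICA_2)}
        connect(REPLICA_1).execute_command("REPLICAOF", "localhost", MASTER)
        connect(REPLICA_2).execute_command("REPLICAOF", "localhost", MASTER)

        # 2. close the master and promote the first replica
        procs[MASTER].stop()
        connect(REPLICA_1).execute_command("REPLICAOF", "NO", "ONE")

        # ... then re-point the second replica at the new master; with matching
        # replication IDs this should take the partial-sync path
        connect(REPLICA_2).execute_command("REPLICAOF", "localhost", REPLICA_1)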

@adiholden (Collaborator)

@abhijat check out test_partial_replication_on_same_source_master for an example.

@abhijat (Contributor, Author) commented May 27, 2025

> @abhijat check out test_partial_replication_on_same_source_master for an example.

Thanks, I will check it out. In the failing dftest we have the following steps:

  1. create master with 2 replicas
  2. populate to 75% memory
  3. in a loop:
  4. kill master
  5. wait for one of the replicas to become the new master; this is done automatically by the dfcloud reconciler (the first replica that is up to date is promoted)
  6. the other replica becomes a replica of the new master
  7. the old master is brought up and is also made a replica of the new master
  8. repeat the loop

The crash is seen around the second iteration of the loop, where a partial sync is running and fails because of a missing lsn.

I have been trying to recreate the above steps in pytest. I can get partial sync activated, but it doesn't hit the missing lsn and so never enters the error path.

@adiholden (Collaborator) commented May 27, 2025

I believe that to reproduce the missing-lsn flow you need to send data to the new master before replicating from it.
Does the failing dftest send write traffic while steps 4 and 5 are running?
To reproduce the missing lsn in pytest, you need to issue write commands to the new master while running the REPLICAOF command that moves the replica to replicate from the new master.
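
A hedged sketch of that suggestion with redis-py (the ports are assumed, and the promoted new master and the old replica are expected to come from whatever fixtures the test already uses): keep a writer running against the new master so its journal keeps advancing while the other replica is re-pointed at it.

    # Sketch only: assumes a promoted new master and an old replica are already
    # running on the (hypothetical) ports below.
    import threading

    import redis

    NEW_MASTER, OLD_REPLICA = 6380, 6381


    def write_traffic(port, stop):
        # Keep advancing the new master's journal while the REPLICAOF runs.
        client = redis.Redis(port=port)
        i = 0
        while not stop.is_set():
            client.set(f"key:{i}", "x" * 64)
            i += 1


    def repoint_replica_under_load():
        stop = threading.Event()
        writer = threading.Thread(target=write_traffic, args=(NEW_MASTER, stop))
        writer.start()
        try:
            # Issued while the new master keeps taking writes; this is the
            # window in which the partial-sync lsn can end up missing.
            redis.Redis(port=OLD_REPLICA).execute_command(
                "REPLICAOF", "localhost", NEW_MASTER)
        finally:
            stop.set()
            writer.join()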

On the error path for incremental snapshot, the stream is finalized from
the snapshot fiber. This ends up joining the snapshot fiber from within
itself, which triggers an assertion.

This change removes the call. The call is redundant because on the
error path we call the context's ReportError method. The context has an
error handler set from the replica info, which cancels the replication,
and the cancellation chain eventually calls Finalize, so the removed
call is not needed.

Signed-off-by: Abhijat Malviya <[email protected]>
@abhijat force-pushed the abhijat/fix/snapshot-fb-self-join branch from 485aab7 to 7c5186e on May 27, 2025 13:52
@abhijat (Contributor, Author) commented May 27, 2025

I added a test, test_partial_sync_error_handler, which gets into the partial sync code path, but it sometimes runs into another issue: we set the lsn in flow.start_partial_sync_at on a given shard, but we may migrate the connection to a different shard afterwards.

Since the lsn comes from a thread-local object, this can result in a mismatch between the flow lsn we set initially and the lsn of the final shard used during partial sync, causing an assertion like:

F20250527 19:03:06.460731 196529 snapshot.cc:206] Check failed: lsn <= journal->GetLsn() (2 vs. 1) The replica tried to sync from the future.

I added some log statements to confirm, and they do show a mismatch:

dragonfly.fedora.abhijat.log.INFO.20250527-190306.196513.log:I20250527 19:03:06.459610 196552 dflycmd.cc:348] the journal lsn on target shard is 1 and sync at is 2
dragonfly.fedora.abhijat.log.INFO.20250527-190306.196513.log:I20250527 19:03:06.459787 196529 dflycmd.cc:348] the journal lsn on target shard is 1 and sync at is 2

and then two assertion failures appear:

F20250527 19:03:06.460731 196529 snapshot.cc:206] Check failed: lsn <= journal->GetLsn() (2 vs. 1) The replica tried to sync from the future.
F20250527 19:03:06.460841 196552 snapshot.cc:206] Check failed: lsn <= journal->GetLsn() (2 vs. 1) The replica tried to sync from the future.

To fix this, IMO we could either set the lsn on the shard after migration or fetch the value from the correct shard. I tried the latter locally with:

        // Re-read the lsn from the shard that will actually serve the partial
        // sync, overwriting the value captured before the connection migrated.
        auto cb = [this, &flow](const EngineShard*) {
          VLOG(1) << "the journal lsn on target shard is " << sf_->journal()->GetLsn()
                  << " and sync at is " << flow.start_partial_sync_at.value();
          flow.start_partial_sync_at = sf_->journal()->GetLsn();
        };

        // Run the callback only on the shard whose id matches this flow.
        shard_set->RunBriefInParallel(cb, [flow_id](auto shard_id) { return shard_id == flow_id; });

and I no longer see any errors in my new test.

@abhijat marked this pull request as draft on May 28, 2025 10:31

Successfully merging this pull request may close these issues.

Dragonfly crash during replication: v1.30.0