
WIP replication: Fix snapshot fiber joining on itself #5183


Draft
abhijat wants to merge 1 commit into main from abhijat/fix/snapshot-fb-self-join

Conversation

@abhijat (Contributor) commented May 26, 2025

On the error path for an incremental snapshot, the stream is finalized from the snapshot fiber itself by calling SliceSnapshot::FinalizeJournalStream. This ends up joining the snapshot fiber from within that same fiber, which triggers an assertion.

This change removes that call. It is redundant and safe to remove because on the error path we call the context's ReportError method. The context has an error handler set from the replica info, which cancels the replication, and the cancellation chain eventually calls Finalize, so the removed call is not needed.

FIXES #5135
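
A minimal sketch of the failure mode with OS threads in Python (Dragonfly uses fibers, and the names below, Context, report_error, snapshot_worker, are illustrative rather than the actual Dragonfly API):

    # Sketch only: mirrors the chain described above
    # (ReportError -> context error handler -> cancel -> Finalize -> join),
    # with Python threads standing in for fibers.
    import threading


    class Context:
        """Collects the first reported error; the owner reacts to it."""

        def __init__(self):
            self._first_error = None
            self._flag = threading.Event()

        def report_error(self, err):
            if not self._flag.is_set():
                self._first_error = err
                self._flag.set()

        def wait_for_error(self, timeout=None):
            self._flag.wait(timeout)
            return self._first_error


    def snapshot_worker(ctx):
        # Joining yourself is never allowed: with OS threads it raises
        # "RuntimeError: cannot join current thread", with fibers it trips
        # an assertion. So on error the worker only reports and returns.
        try:
            raise RuntimeError("missing lsn")  # simulated partial-sync failure
        except RuntimeError as err:
            ctx.report_error(err)              # do NOT finalize/join from here


    def main():
        ctx = Context()
        worker = threading.Thread(target=snapshot_worker, args=(ctx,))
        worker.start()

        err = ctx.wait_for_error(timeout=5)
        if err is not None:
            # Owner-side error handler: cancel replication, finalize the
            # journal stream, and only then join the worker from outside it.
            print(f"cancelling replication after: {err}")
        worker.join()


    if __name__ == "__main__":
        main()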

@abhijat changed the title from "replication: Fix snapshot fiber joining on itself" to "WIP replication: Fix snapshot fiber joining on itself" on May 26, 2025
@romange (Collaborator) commented May 26, 2025

Is there any way to reproduce this? For example, would breaking the stable sync in a loop of 100 iterations help?

@adiholden (Collaborator)

> Is there any way to reproduce this? For example, would breaking the stable sync in a loop of 100 iterations help?

I don't think we have partial sync between a master and its replica.
The flow for which we enabled partial sync is (a rough sketch follows after these steps):

  1. create a master and 2 replicas
  2. close the master server, promote one replica to master, and make the other replica a replica of the new master
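
A rough pytest-style sketch of this flow using redis-py; the ports and the start_dragonfly fixture (and its stop() method) are hypothetical stand-ins for Dragonfly's own test fixtures:

    # Rough sketch only: start_dragonfly, its stop() method and the ports are
    # hypothetical stand-ins for Dragonfly's own pytest fixtures.
    import redis

    MASTER, REPLICA_1, REPLICA_2 = 6379, 6380, 6381


    def connect(port):
        return redis.Redis(port=port)


    def test_failover_partial_sync(start_dragonfly):
        # 1. create a master and two replicas
        procs = {p: start_dragonfly(port=p) for p in (MASTER, REPLICA_1, REPLICA_2)}
        connect(REPLICA_1).execute_command("REPLICAOF", "localhost", MASTER)
        connect(REPLICA_2).execute_command("REPLICAOF", "localhost", MASTER)

        # 2. close the master and promote the first replica
        procs[MASTER].stop()
        connect(REPLICA_1).execute_command("REPLICAOF", "NO", "ONE")

        # ... then re-point the second replica at the new master; with matching
        # replication IDs this should take the partial-sync path
        connect(REPLICA_2).execute_command("REPLICAOF", "localhost", REPLICA_1)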

@adiholden (Collaborator)

@abhijat check out test_partial_replication_on_same_source_master for an example.

@abhijat (Contributor, Author) commented May 27, 2025

> @abhijat check out test_partial_replication_on_same_source_master for an example.

Thanks, I will check it out. In the failing dftest we have the following steps:

  1. create master with 2 replicas
  2. populate to 75% memory
  3. in a loop:
  4. kill master
  5. wait for one of the replicas to become the new master; this is done automatically by the dfcloud reconciler (the first replica that is up to date is promoted)
  6. the other replica becomes a replica of the new master
  7. the old master is brought up and is also made a replica of the new master
  8. repeat the loop

The crash is seen around the second iteration of the loop, where a partial sync is running and fails because of a missing lsn.

I have been trying to recreate the above steps in pytest. I can get partial sync activated, but it doesn't hit the missing lsn and so never enters the error path.

@adiholden (Collaborator) commented May 27, 2025

I believe that to reproduce the missing-lsn flow you need to send data to the new master before replicating from it.
Does the failing dftest send write traffic while steps 4 and 5 are running?
To reproduce the missing lsn in pytest, you need to issue write commands to the new master while running the REPLICAOF command that moves the replica to replicate from the new master.
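
A hedged sketch of that suggestion with redis-py (the ports are assumed, and the promoted new master and the old replica are expected to come from whatever fixtures the test already uses): keep a writer running against the new master so its journal keeps advancing while the other replica is re-pointed at it.

    # Sketch only: assumes a promoted new master and an old replica are already
    # running on the (hypothetical) ports below.
    import threading

    import redis

    NEW_MASTER, OLD_REPLICA = 6380, 6381


    def write_traffic(port, stop):
        # Keep advancing the new master's journal while the REPLICAOF runs.
        client = redis.Redis(port=port)
        i = 0
        while not stop.is_set():
            client.set(f"key:{i}", "x" * 64)
            i += 1


    def repoint_replica_under_load():
        stop = threading.Event()
        writer = threading.Thread(target=write_traffic, args=(NEW_MASTER, stop))
        writer.start()
        try:
            # Issued while the new master keeps taking writes; this is the
            # window in which the partial-sync lsn can end up missing.
            redis.Redis(port=OLD_REPLICA).execute_command(
                "REPLICAOF", "localhost", NEW_MASTER)
        finally:
            stop.set()
            writer.join()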

On the error path for incremental snapshot, the stream is finalized from
the snapshot fiber. This ends up joining the snapshot fiber from within
itself, which triggers an assertion.

This change removes the call. The call is redundant because on the
error path we call the context's ReportError method. The context has an
error handler set from the replica info, which cancels the replication,
and the cancellation chain eventually calls Finalize, so the removed
call is not needed.

Signed-off-by: Abhijat Malviya <[email protected]>
@abhijat force-pushed the abhijat/fix/snapshot-fb-self-join branch from 485aab7 to 7c5186e on May 27, 2025 13:52
@abhijat (Contributor, Author) commented May 27, 2025

I added a test, test_partial_sync_error_handler, which gets into the partial sync code path, but it sometimes runs into another issue: we set the lsn in flow.start_partial_sync_at on a given shard, but we may migrate the connection to a different shard afterwards.

Since the lsn comes from a thread-local object, this can result in a mismatch between the flow lsn we set initially and the lsn of the final shard used during partial sync, causing an assertion like:

F20250527 19:03:06.460731 196529 snapshot.cc:206] Check failed: lsn <= journal->GetLsn() (2 vs. 1) The replica tried to sync from the future.

I added some log statements to confirm, and they do show a mismatch:

dragonfly.fedora.abhijat.log.INFO.20250527-190306.196513.log:I20250527 19:03:06.459610 196552 dflycmd.cc:348] the journal lsn on target shard is 1 and sync at is 2
dragonfly.fedora.abhijat.log.INFO.20250527-190306.196513.log:I20250527 19:03:06.459787 196529 dflycmd.cc:348] the journal lsn on target shard is 1 and sync at is 2

and then two assertion failures appear:

F20250527 19:03:06.460731 196529 snapshot.cc:206] Check failed: lsn <= journal->GetLsn() (2 vs. 1) The replica tried to sync from the future.
F20250527 19:03:06.460841 196552 snapshot.cc:206] Check failed: lsn <= journal->GetLsn() (2 vs. 1) The replica tried to sync from the future.

To fix this, IMO we could either set the lsn on the shard after migration or fetch the value from the correct shard. I tried the latter locally with:

        // Re-read the lsn from the shard that will actually serve the partial
        // sync, overwriting the value captured before the connection migrated.
        auto cb = [this, &flow](const EngineShard*) {
          VLOG(1) << "the journal lsn on target shard is " << sf_->journal()->GetLsn()
                  << " and sync at is " << flow.start_partial_sync_at.value();
          flow.start_partial_sync_at = sf_->journal()->GetLsn();
        };

        // Run the callback only on the shard whose id matches this flow.
        shard_set->RunBriefInParallel(cb, [flow_id](auto shard_id) { return shard_id == flow_id; });

and I no longer see any errors in my new test.

@abhijat marked this pull request as draft on May 28, 2025 10:31

Successfully merging this pull request may close these issues.

Dragonfly crash during replication: v1.30.0