WIP replication: Fix snapshot fiber joining on itself #5183
Conversation
Is there any way to reproduce this? For example, could breaking the stable sync in a loop of 100 iterations help? |
I don't think we have partial sync between a master and its replica. |
@abhijat check out test_partial_replication_on_same_source_master for example |
Thanks, I will check it out. In the failing dftest we have the following steps:
The crash is seen around the second iteration of the loop, where a partial sync is running and fails because of a missing lsn. I have been trying to recreate the above steps in pytest. I can get the partial sync activated, but it doesn't reach the missing lsn and so doesn't enter the error path. |
I believe that in order to reproduce the flow of the missing lsn you need to send data to the new master before replicating from it. |
I added a test. Since the lsn comes from a thread-local object, this can result in a mismatch between the flow lsn we set initially and the final shard lsn used during partial sync, causing an assertion like:
I added some log statements to confirm, and they do show a mismatch like
and then two assertion failures appear.
To fix this, IMO we could either set the lsn on the shard after migration or fetch the value from the correct shard. I tried the latter locally with
and I no longer see any errors in my new test. |
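To illustrate the hazard, here is a minimal sketch in plain C++ — journal_lsn is a hypothetical stand-in, not Dragonfly's actual field. A thread_local read from the wrong thread yields that thread's own copy, not the value the owning shard advanced, which is why fetching the value from the correct shard removes the mismatch:

```cpp
#include <cstdint>
#include <future>
#include <iostream>
#include <thread>

// Hypothetical stand-in for a thread-local journal lsn; every thread
// gets its own independent copy of this variable.
thread_local uint64_t journal_lsn = 0;

int main() {
  std::promise<uint64_t> promise;
  std::future<uint64_t> fetched = promise.get_future();

  std::thread shard([&promise] {
    journal_lsn = 42;                // the owning shard thread advances it
    promise.set_value(journal_lsn);  // value read on the correct thread
  });
  shard.join();

  // Reading the thread_local from this thread touches a different copy,
  // which is the flow-vs-shard lsn mismatch described above.
  std::cout << "wrong thread sees: " << journal_lsn << '\n';    // prints 0
  std::cout << "owning thread saw: " << fetched.get() << '\n';  // prints 42
  return 0;
}
```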
On the error path for an incremental snapshot, the stream is finalized from the snapshot fiber by calling SliceSnapshot::FinalizeJournalStream. This ends up joining the snapshot fiber on itself, which triggers an assertion. This change removes the call. It is redundant and safe to remove: on the error path we call the context's ReportError method, and the context has an error handler set from the replica info which cancels the replication; that cancel call chain ends up calling the Finalize method anyway.
FIXES #5135
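To see why the removed call trips an assertion, here is a minimal sketch of the flow using std::thread in place of a fiber; all names are illustrative stand-ins, not Dragonfly's actual classes. std::thread surfaces a self-join as a deadlock error where a fiber library typically asserts:

```cpp
#include <future>
#include <iostream>
#include <system_error>
#include <thread>

// std::thread analogue of the snapshot fiber.
std::thread snapshot_fb;

void FinalizeStream() {
  // Analogue of FinalizeJournalStream: wait for the snapshot "fiber".
  if (snapshot_fb.joinable())
    snapshot_fb.join();
}

int main() {
  std::promise<void> started;         // gate: snapshot_fb fully assigned
  std::promise<void> error_reported;  // stands in for ReportError

  snapshot_fb = std::thread([&] {
    started.get_future().wait();
    // Error path of the incremental snapshot:
    try {
      FinalizeStream();  // the removed call: a join on ourselves
    } catch (const std::system_error& e) {
      // std::thread reports the self-join as a deadlock error; a fiber
      // library asserts instead, which is the crash this PR fixes.
      std::cout << "self-join rejected: " << e.what() << '\n';
    }
    error_reported.set_value();  // surviving path: only report the error
  });
  started.set_value();

  // The context's error handler (the replication cancel chain) later
  // calls Finalize from a different thread, where joining is legal.
  error_reported.get_future().wait();
  FinalizeStream();
  return 0;
}
```

The surviving path mirrors the PR's reasoning: the snapshot routine only reports the error, and the join happens later from the cancel chain on a different thread of execution.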