ReplayMissedEvents Incorrectly assumes autoincrement value of 1 for all deployments. #5341
Comments
From reading extensively, the system is using Group Replication, which means that each server is tuned to avoid issuing auto-increment values that might be issued elsewhere. This removes the need to check that the same ID isn't issued by two different servers at the same time, as might occur when switching the primary write server. With two servers, each generates IDs from its own sequence, for example 1, 3, 5, 7 on one server and 2, 4, 6, 8 on the other.
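The interleaving can be sketched in a few lines of Go. This is only an illustration, assuming a two-server setup with `auto_increment_increment` set to 2 and `auto_increment_offset` set to 1 and 2; the helpers `nextID` and `issue` are hypothetical names and are not part of SPIRE or MySQL.

```go
package main

import "fmt"

// nextID returns the smallest id greater than last that has the form
// offset + n*increment (n >= 0). Hypothetical helper for illustration only.
func nextID(last, increment, offset int) int {
	if last < offset {
		return offset
	}
	n := (last-offset)/increment + 1
	return offset + n*increment
}

// issue generates count ids on one server, starting above last.
func issue(last, count, increment, offset int) []int {
	ids := make([]int, 0, count)
	for i := 0; i < count; i++ {
		last = nextID(last, increment, offset)
		ids = append(ids, last)
	}
	return ids
}

func main() {
	// Assumed settings: increment 2, offsets 1 and 2.
	serverA := issue(0, 4, 2, 1)
	serverB := issue(0, 4, 2, 2)
	fmt.Println("server A alone:", serverA)
	fmt.Println("server B alone:", serverB)

	// If the writer switches from A to B after A's last id, B continues
	// with its own pattern above the last id it knows was used.
	afterSwitch := issue(serverA[len(serverA)-1], 3, 2, 2)
	fmt.Println("A then B:", append(serverA, afterSwitch...))
}
```

The point of the sketch is that the gap between consecutive IDs depends on which server is writing and on when the writer changes, not on a single fixed step.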
Under normal operation, only one server generates IDs at a time. To interoperate better with programs that assume incrementing IDs, most systems generate their "next" ID after a cutover to be higher than the last one they know was used. So a sequence might look like 1, 3, 5, 7, 8, 10, 12 when a cutover occurs at the 7 to 8 transition. This means there is no guaranteed skip pattern to follow, because the replication cluster may have switched to a new primary write server after any given ID.

The logic that ensures no IDs are skipped works without error as-is, so the remaining concern is reducing resource consumption (the original goal of the entire effort). Without database replication, the logic is very efficient. With replication of this form, each added entry incurs polling costs for 24 hours (the longest a transaction remains open on these platforms) and is efficient after that.

We are currently investigating backoff algorithms that keep the main logic as-is while reducing polling for skipped IDs that linger for more than a few minutes. This should reduce the cost of skipped IDs during the first 24 hours while preserving the zero maintenance cost of entries that have been static for more than 24 hours.
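To give a feel for what such a backoff could save, here is a minimal sketch of a capped exponential backoff for re-polling one skipped ID. It is not the algorithm under investigation: the 5 second base interval, the 5 minute cap, and the names `nextPollDelay` and `pollsWithin` are all assumptions made for the example. Only the 24 hour window comes from the discussion above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Second // assumed initial polling interval
	maxDelay  = 5 * time.Minute // assumed cap on the interval
	maxAge    = 24 * time.Hour  // window after which a skipped id is dropped
)

// nextPollDelay doubles the delay for each previous attempt, capped at maxDelay.
func nextPollDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// pollsWithin counts how many polls fit into the window when backing off.
func pollsWithin(window time.Duration) int {
	var elapsed time.Duration
	polls := 0
	for {
		delay := nextPollDelay(polls)
		if elapsed+delay > window {
			return polls
		}
		elapsed += delay
		polls++
	}
}

func main() {
	fixed := int(maxAge / baseDelay)
	fmt.Println("polls per skipped id at a fixed interval:", fixed)
	fmt.Println("polls per skipped id with capped backoff:", pollsWithin(maxAge))
}
```

Under these assumed numbers, the per-ID polling over 24 hours drops by well over an order of magnitude, while an ID that is populated within the first few seconds is still picked up quickly.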
@amartinezfayo Please assign this issue to me.
We will be looking at the implementation of a different algorithm for the events-based cache event tracking in #5624.
During a validation that we did for 1.9.6 Server, we saw a single `Detected skipped attested node event` log, followed by constantly recurring `Event not yet populated in database` logs.
After further discussion, it was found that the updateCache method assumes that all SQL stores use a value of `1` for `auto_increment_increment`. In deployments where this is not true, this method incorrectly populates the missedEvents map with all of the in-between entries, which will never be populated.

This behavior incurs additional CPU cycles, log noise, and memory usage from storing the false positives and running the replayMissedEvents poll. At a scale of tens of thousands per deployment, or in a highly dynamic environment, a new false-positive missed event occurs every single time an entry is created.
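The effect can be reproduced with a simplified model. The sketch below is not the SPIRE implementation; `trackSkipped` is a hypothetical stand-in that, like the behavior described above, treats every ID between two observed event IDs as a missed event to poll for later.

```go
package main

import "fmt"

// trackSkipped records every id that lies strictly between two consecutive
// observed ids, i.e. it assumes ids grow by exactly 1.
// Hypothetical simplification for illustration only.
func trackSkipped(observed []int) map[int]struct{} {
	missed := make(map[int]struct{})
	last := 0
	for _, id := range observed {
		for gap := last + 1; gap < id; gap++ {
			missed[gap] = struct{}{}
		}
		last = id
	}
	return missed
}

func main() {
	// Ids as issued by a store with an assumed auto_increment_increment of 2.
	observed := []int{1, 3, 5, 7}
	missed := trackSkipped(observed)
	fmt.Println("false-positive missed events:", len(missed))
}
```

With a step of 2, every new entry adds one ID that will never exist to the tracked set, which matches the single "skipped" log followed by repeated "not yet populated" logs reported above.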