
ReplayMissedEvents Incorrectly assumes autoincrement value of 1 for all deployments. #5341

Open
stevend-uber opened this issue Jul 30, 2024 · 3 comments
Labels
priority/backlog Issue is approved and in the backlog

Comments

@stevend-uber
Contributor

During a validation that we did for the 1.9.6 server, we noticed a single:

Detected skipped attested node event

log line, followed by constantly recurring logs of:

Event not yet populated in database

After further discussion, it was found that the updateCache method assumes that all SQL stores use a value of 1 for auto_increment_increment. In deployments where this is not true, the method incorrectly populates the missedEvents map with all of the in-between IDs that will never exist.

This behavior incurs additional CPU cycles, log noise, and memory usage from storing the false positives and running the replayMissedEvents poll against them. At a scale of tens of thousands of entries per deployment, and/or in a highly dynamic environment, this causes a new false-positive missed event every single time an entry is created.
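For illustration, here is a minimal sketch (in Go, since SPIRE is written in Go) of the gap-detection pattern described above. This is not SPIRE's actual updateCache code; apart from missedEvents, all names are hypothetical. It only shows how an assumed increment of 1 turns never-issued IDs into false positives:

```go
package main

import "fmt"

type tracker struct {
	lastEventID  uint
	missedEvents map[uint]struct{} // IDs believed to be skipped
}

func (t *tracker) recordEvent(id uint) {
	// Assumes event IDs always advance by exactly 1; with
	// auto_increment_increment = 2 this marks IDs the database
	// will never create as "missed".
	for missed := t.lastEventID + 1; missed < id; missed++ {
		t.missedEvents[missed] = struct{}{}
	}
	if id > t.lastEventID {
		t.lastEventID = id
	}
}

func main() {
	t := &tracker{missedEvents: map[uint]struct{}{}}
	for _, id := range []uint{1, 3, 5} { // server using increment 2, offset 1
		t.recordEvent(id)
	}
	fmt.Println(t.missedEvents) // map[2:{} 4:{}] -- neither ID will ever exist
}
```

Every entry in that map is then polled, which is where the extra CPU, noise, and memory come from.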

@stevend-uber stevend-uber changed the title ReplayMissedEvents Incorrectly assumes autoincrement value of 1 for All Ecosystems. ReplayMissedEvents Incorrectly assumes autoincrement value of 1 for all deployments. Jul 30, 2024
@edwbuck
Contributor

edwbuck commented Jul 31, 2024

From reading extensively, it appears the system is using Group Replication, which means that each server is tuned to avoid issuing auto-increment values that might be issued elsewhere. This removes the need for coordination to ensure that the same ID isn't issued by two different servers at the same time, as might otherwise occur when switching the primary write server.

With two servers, this means each generates IDs in sequences that might resemble:

  • server 1 - 1, 3, 5, 7, 9
  • server 2 - 2, 4, 6, 8, 10

Under normal operations, only one server is generating IDs at a time, and to ensure better interoperability with programs that assume monotonically incrementing IDs, after a cutover most systems generate their "next" ID higher than the last one they know was used.

So a sequence might look like 1, 3, 5, 7, 8, 10, 12 when a cutover occurs at the 7 to 8 transition. This means there is no guaranteed skip pattern that can be followed to match the server, because there is no guarantee that the replication cluster didn't get a new primary write server after any given ID. The sketch below makes this concrete.
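A small hypothetical demonstration of why this matters for detection: the same observed sequence admits more than one interpretation, so no fixed increment assumption is safe. gapsAssumingIncrement is an invented helper, not SPIRE code:

```go
package main

import "fmt"

// gapsAssumingIncrement returns the IDs a detector would flag as missed,
// given an assumed fixed auto-increment step.
func gapsAssumingIncrement(seen []uint, inc uint) []uint {
	var gaps []uint
	for i := 1; i < len(seen); i++ {
		for id := seen[i-1] + inc; id < seen[i]; id += inc {
			gaps = append(gaps, id)
		}
	}
	return gaps
}

func main() {
	observed := []uint{1, 3, 5, 7, 8, 10, 12} // cutover at the 7 -> 8 transition

	// Assuming increment 1: IDs 2, 4, 6, 9, 11 look "missed",
	// but none of them was ever issued.
	fmt.Println(gapsAssumingIncrement(observed, 1)) // [2 4 6 9 11]

	// Assuming increment 2: nothing looks missed here, but a genuine skip
	// (say, a rolled-back transaction) would be indistinguishable from the
	// offset change caused by the cutover.
	fmt.Println(gapsAssumingIncrement(observed, 2)) // []
}
```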

The logic to ensure no IDs are skipped will work without error as-is, so the concerns moving forward are about reducing resource consumption (the original goal of the entire effort). Without database replication, the logic is very efficient. With replication of this form, each added entry incurs polling costs for 24 hours (the longest a transaction remains open on these platforms) and is efficient thereafter.

We are currently investigating suitable backoff algorithms that keep the main logic as-is while reducing polling for skipped IDs that linger longer than a few minutes. This should reduce the current cost of skipped IDs during the first 24 hours, while preserving the zero ongoing cost of entries that remain static beyond 24 hours.
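For concreteness, one possible shape of such a backoff might look like the sketch below. This is an illustration under stated assumptions (per-ID exponential backoff with a cap and a 24-hour expiry), not the algorithm actually under investigation; all names and intervals are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

type missedEvent struct {
	firstSeen time.Time
	nextPoll  time.Time
	interval  time.Duration
}

const (
	initialInterval = 5 * time.Second
	maxInterval     = 10 * time.Minute
	expiry          = 24 * time.Hour // longest a transaction stays open
)

// shouldPoll reports whether this ID is due for another poll, doubling the
// interval after each attempt; once the entry has lingered past expiry it is
// reported as expired so the caller can stop tracking it.
func (m *missedEvent) shouldPoll(now time.Time) (poll, expired bool) {
	if now.Sub(m.firstSeen) > expiry {
		return false, true // never issued (or long gone); drop it
	}
	if now.Before(m.nextPoll) {
		return false, false
	}
	m.interval *= 2
	if m.interval > maxInterval {
		m.interval = maxInterval
	}
	m.nextPoll = now.Add(m.interval)
	return true, false
}

func main() {
	start := time.Now()
	m := &missedEvent{
		firstSeen: start,
		nextPoll:  start.Add(initialInterval),
		interval:  initialInterval,
	}
	fmt.Println(m.shouldPoll(start))                      // false false: not due yet
	fmt.Println(m.shouldPoll(start.Add(6 * time.Second))) // true false: poll, next in 10s
	fmt.Println(m.shouldPoll(start.Add(25 * time.Hour)))  // false true: expired, stop tracking
}
```

The key property is that entries that never resolve stop costing anything after the 24-hour window, while genuinely skipped IDs are still caught quickly in the first few polls.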

@MarcosDY MarcosDY added the triage/in-progress Issue triage is in progress label Aug 1, 2024
@azdagron azdagron added this to the 1.10.2 milestone Aug 6, 2024
@azdagron azdagron added priority/backlog Issue is approved and in the backlog and removed triage/in-progress Issue triage is in progress labels Aug 6, 2024
@amartinezfayo amartinezfayo modified the milestones: 1.10.2, 1.11.0 Aug 22, 2024
@edwbuck
Contributor

edwbuck commented Sep 10, 2024

@amartinezfayo Please assign this issue to me.

@amartinezfayo
Member

We will be looking at the implementation of a different algorithm for the events-based cache event tracking in #5624.
The new algorithm should solve existing issues related to the events-based cache. Once the algorithm described in #5624 is implemented, we can come back to this issue and determine whether any additional work is needed.
