-
Notifications
You must be signed in to change notification settings - Fork 487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full missed-event-reconciliation based on SQLTransactionTimeout Proposal #5470
Comments
SQLTransactionTimeout settings don't impact the initial fill of the cache, because it's filled on startup regardless of the setting, or the value of the setting. What you have identified in your first item is the polling of "before the first" events to determine if a transaction was holding events before the first discovered event. Such polling only needs to occur for sql_transaction_timeout seconds, because after then, no transaction could hold an event prior to the first detected event. I'll open a new PR to improve the naming of element in this area, I agree it is confusing and not clear. For the second point; yes, we remove the Event IDs from the polling list sql_transaction_timeout seconds after the gap they represent was detected, because it is now impossible for them to be in an uncommitted transaction that can still commit them to database.
The event ids are never in the cache, because the cache is a cache of Entries, not of Events. The Event IDs for gaps are in the update routines of the cache to implement knowledge of what Events might arrive late (or out of order). While they are present, let's clarify that this is a cache of Entries, not Events. The approach you suggest works well for regions that are mostly empty. The returned data is small, and we currently implement that approach in the "before the first" gap polling. We scan for every Event before the first detected Event, we filter out the ones we successfully processed in prior updates, and we process any new ones. For ranges that are primarily full, such an algorithm comes with many issues. The returned data is significant, especially due to it being polled frequently. For most systems, nearly 100% of the range will be full. Even in systems that create an Event Gap for each Event they register, nearly 50% of the range will be full. This incurs additional costs not mentioned above, including both CPU costs to marshall the data into the reply packets as well as I/O costs. Those costs would be felt by every user of SPIRE, even those that do not have database replication setups that impose Event gaps for every registered Event. That's why #5341 and #5342 exist. They attempt to solve the issue of the costs of session creation and destruction for systems with large gaps in the Event table, through the following means:
I recommend that we close this issue, as it already is implemented for the "before the first" events, and the scenario of scanning already processed events incurs CPU and I/O costs that were not considered, as well as reestablishes the scaling issues (perhaps on a smaller scale) that the Event system was designed to remove. I'll also prioritize #5341 and #5342 to ensure that our current polling of Event gaps reduces impact on the database by using logarithmic backoff and batch querying mechanisms respectively. |
Currently SQLTransactionTimeout is used to do the following:
Assuming every eventID should exist, this puts all missing eventIDs into the in-memory cache. The proposal here is to remove the in-memory cache of assumed missed events and to instead make it a seen-events cache. The missed-events loop can then query for all events in the last SQLTransactionTimeout timespan (+wiggle room to account for clocks) and do a delta against the seen-events list. What this gives us is removing the assumption that every eventID should exist and enable SPIRE deployments with auto_increment_increment ≠ 1 without the included false-positive noise that comes with false-missed-events. This allows the same continued functionality at the cost of likely negligible CPU cycles.
Additional performance enhancements can additionally be made to do
We can take the older of
In order to account for db failure scenarios.
Additionally, this can be done in a separate go-routine so that it does not affect the performance of the hot-path of the normal cache sync.
The text was updated successfully, but these errors were encountered: