fix(grouping): Handle race condition creating grouphash metadata #85917

lobsterkatie · 2025-02-26T07:35:08Z

This is a second attempt to fix the errors we're seeing when enabling backfill for grouphash metadata, wherein a small percentage of new grouphashes can't have their Seer metadata updated because the metadata record "hasn't been created yet."

We noticed that the metadata records to which this happened all seemed to have an id of None, so the first attempt at a fix was to call save before doing the update in those cases, to make sure the record was in the database. This introduced its own problems, however, in that the save call then started throwing an integrity error, complaining that a metadata record for that grouphash already existed. Since the call to Seer only happens with new, unrecognized grouphashes, and since we have a lock to prevent double group creation in those cases, this didn't make a ton of sense... until I realized that the lock doesn’t kick in until after we're done calling Seer. Whoops.

As a result, here's what I think is currently happening when the backfill is off:

Two events with the same unrecognized hash come in essentially simultaneously.
They both hit the GroupHash.objects.get_or_create call inside of get_or_create_grouphashes. (We use get_or_create there because the same codepath serves both new and existing hashes.) The slightly-sooner event creates a new grouphash and the slightly-later event gets that newly-created grouphash from the database. Each event has its own GroupHash instance to represent the shared database record.
With backfill turned off, only the slightly-sooner event gets routed to create_or_update_grouphash_metadata_if_needed.
The slightly-sooner event creates metadata for the grouphash, and Django attaches the resulting GroupHashMetadata object to that event's GroupHash instance.
Both events get sent to Seer.
When it comes time to store the results in metadata, only the slightly sooner event tries to do so, because the slightly-later event's GroupHash instance hasn't had a metadata record attached to it.
Everything works as expected and the rest of the save_event process proceeds as it normally would.

And here's what I think is happening when the backfill is on:

Same as above - two events come in
Same as above - grouphash is created by slightly-sooner event, picked up by slightly-later event
Because the backfill is on, this time both events get routed to create_or_update_grouphash_metadata_if_needed. Within create_or_update_grouphash_metadata_if_needed, they both fall into the if not grouphash.metadata branch, since neither one has yet had a chance to create a metadata record.
Same as above - slightly-sooner event creates metadata
4a. The slightly-later event also tries to create metadata for the grouphash, but can't, because the slightly-sooner event already has. Django still attaches a GroupHashMetadata object to the event's GroupHash instance, but an id never gets filled in because no metadata record was successfully created.
Same as above - both events sent to Seer
As before, the slightly-sooner event stores its Seer results in metadata. The slightly-later event also tries to, since it now has a metadata object attached to its grouphash, but it can't, because the metadata object isn't actually tied to a real database record. (Or, after the first attempted fix, it tries to save its metadata before updating, and can't, because the metadata from the slightly-sooner event already exists.)
The slightly-sooner event proceeds as normal, while the slightly-later event throws an error.

With that scenario in mind, this PR makes three changes:

Inside of create_or_update_grouphash_metadata_if_needed, switch from using GroupHashMetadata.objects.create to using GroupHashMetadata.objects.get_or_create. Do this with just the grouphash filled in, because we know it will be the same between the events.
If get_or_create indicates that it got an already-existent record, log some info and then bail rather than trying to update the record with more data. That way, only the slightly-sooner event will have a metadata instance attached to its grouphash, and the Seer codepath will behave the way it did before backfill was turned on.
Just in case something about this theory is wrong and we do still try to update a bad metadata record with Seer results, bail early so neither the original error nor the new integrity error is thrown. Keep the log there, so we can know if we ever hit that logic.

If this works, we can then remove the band-aid from the Seer code.

lobsterkatie added 3 commits February 25, 2025 22:00

use get_or_create for new metadata records

d322dde

log and bail if we hit the race condition

722d1d1

bail when metadata has no id

c266ac2

github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Feb 26, 2025

lobsterkatie marked this pull request as ready for review February 26, 2025 08:04

lobsterkatie requested a review from a team as a code owner February 26, 2025 08:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(grouping): Handle race condition creating grouphash metadata #85917

fix(grouping): Handle race condition creating grouphash metadata #85917

lobsterkatie commented Feb 26, 2025 •

edited

Loading

fix(grouping): Handle race condition creating grouphash metadata #85917

Are you sure you want to change the base?

fix(grouping): Handle race condition creating grouphash metadata #85917

Conversation

lobsterkatie commented Feb 26, 2025 • edited Loading

lobsterkatie commented Feb 26, 2025 •

edited

Loading