Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(grouping): Handle race condition creating grouphash metadata #85917

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

lobsterkatie
Copy link
Member

@lobsterkatie lobsterkatie commented Feb 26, 2025

This is a second attempt to fix the errors we're seeing when enabling backfill for grouphash metadata, wherein a small percentage of new grouphashes can't have their Seer metadata updated because the metadata record "hasn't been created yet."

We noticed that the metadata records to which this happened all seemed to have an id of None, so the first attempt at a fix was to call save before doing the update in those cases, to make sure the record was in the database. This introduced its own problems, however, in that the save call then started throwing an integrity error, complaining that a metadata record for that grouphash already existed. Since the call to Seer only happens with new, unrecognized grouphashes, and since we have a lock to prevent double group creation in those cases, this didn't make a ton of sense... until I realized that the lock doesn’t kick in until after we're done calling Seer. Whoops.

As a result, here's what I think is currently happening when the backfill is off:

  1. Two events with the same unrecognized hash come in essentially simultaneously.
  2. They both hit the GroupHash.objects.get_or_create call inside of get_or_create_grouphashes. (We use get_or_create there because the same codepath serves both new and existing hashes.) The slightly-sooner event creates a new grouphash and the slightly-later event gets that newly-created grouphash from the database. Each event has its own GroupHash instance to represent the shared database record.
  3. With backfill turned off, only the slightly-sooner event gets routed to create_or_update_grouphash_metadata_if_needed.
  4. The slightly-sooner event creates metadata for the grouphash, and Django attaches the resulting GroupHashMetadata object to that event's GroupHash instance.
  5. Both events get sent to Seer.
  6. When it comes time to store the results in metadata, only the slightly sooner event tries to do so, because the slightly-later event's GroupHash instance hasn't had a metadata record attached to it.
  7. Everything works as expected and the rest of the save_event process proceeds as it normally would.

And here's what I think is happening when the backfill is on:

  1. Same as above - two events come in
  2. Same as above - grouphash is created by slightly-sooner event, picked up by slightly-later event
  3. Because the backfill is on, this time both events get routed to create_or_update_grouphash_metadata_if_needed. Within create_or_update_grouphash_metadata_if_needed, they both fall into the if not grouphash.metadata branch, since neither one has yet had a chance to create a metadata record.
  4. Same as above - slightly-sooner event creates metadata
    4a. The slightly-later event also tries to create metadata for the grouphash, but can't, because the slightly-sooner event already has. Django still attaches a GroupHashMetadata object to the event's GroupHash instance, but an id never gets filled in because no metadata record was successfully created.
  5. Same as above - both events sent to Seer
  6. As before, the slightly-sooner event stores its Seer results in metadata. The slightly-later event also tries to, since it now has a metadata object attached to its grouphash, but it can't, because the metadata object isn't actually tied to a real database record. (Or, after the first attempted fix, it tries to save its metadata before updating, and can't, because the metadata from the slightly-sooner event already exists.)
  7. The slightly-sooner event proceeds as normal, while the slightly-later event throws an error.

With that scenario in mind, this PR makes three changes:

  • Inside of create_or_update_grouphash_metadata_if_needed, switch from using GroupHashMetadata.objects.create to using GroupHashMetadata.objects.get_or_create. Do this with just the grouphash filled in, because we know it will be the same between the events.
  • If get_or_create indicates that it got an already-existent record, log some info and then bail rather than trying to update the record with more data. That way, only the slightly-sooner event will have a metadata instance attached to its grouphash, and the Seer codepath will behave the way it did before backfill was turned on.
  • Just in case something about this theory is wrong and we do still try to update a bad metadata record with Seer results, bail early so neither the original error nor the new integrity error is thrown. Keep the log there, so we can know if we ever hit that logic.

If this works, we can then remove the band-aid from the Seer code.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Feb 26, 2025
@lobsterkatie lobsterkatie marked this pull request as ready for review February 26, 2025 08:04
@lobsterkatie lobsterkatie requested a review from a team as a code owner February 26, 2025 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant