fix(grouping): Schedule seer deletion tasks with less hashes #95156
Conversation
for i in range(last_deleted_index, len_hashes, BATCH_SIZE):
    # Slice operations are safe and will not raise IndexError
    chunked_hashes = hashes[i : i + BATCH_SIZE]
    delete_seer_grouping_records_by_hash.apply_async(args=[project_id, chunked_hashes, 0])
Newer tasks will be scheduled with last_deleted_index=0
since we're scheduling a chunked task.
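Concretely, if a task received 250 hashes and BATCH_SIZE were 100 (an illustrative value, not necessarily the real constant), it would schedule three chunk tasks with args [project_id, hashes[0:100], 0], [project_id, hashes[100:200], 0], and [project_id, hashes[200:250], 0].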
# Iterate through hashes in chunks and schedule a task for each chunk
# There are tasks passing last_deleted_index, thus we need to start from that index
# Eventually all tasks will pass 0
for i in range(last_deleted_index, len_hashes, BATCH_SIZE):
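For readers skimming the diff, here is a minimal, self-contained sketch of the scheduling shape these two hunks add up to. The helper names schedule_chunked_deletions and schedule_task, the BATCH_SIZE value, and the returned count are illustrative assumptions; the real code is the delete_seer_grouping_records_by_hash task, which calls apply_async directly.

# Minimal sketch of the chunked scheduling pattern, NOT the actual Sentry task.
# BATCH_SIZE, schedule_task, and the returned count are assumptions for illustration.
BATCH_SIZE = 100  # assumed value; the real constant lives in the Sentry codebase


def schedule_task(project_id: int, chunk: list[str], last_deleted_index: int) -> None:
    # Stand-in for delete_seer_grouping_records_by_hash.apply_async(
    #     args=[project_id, chunk, last_deleted_index])
    pass


def schedule_chunked_deletions(
    project_id: int, hashes: list[str], last_deleted_index: int = 0
) -> int:
    """Schedule one deletion task per chunk of hashes; return how many were scheduled."""
    scheduled = 0
    # Older in-flight tasks may still pass a non-zero last_deleted_index, so start
    # from there; every newly spawned chunk task is scheduled with index 0.
    for i in range(last_deleted_index, len(hashes), BATCH_SIZE):
        chunk = hashes[i : i + BATCH_SIZE]  # slicing past the end is safe
        schedule_task(project_id, chunk, 0)
        scheduled += 1
    return scheduled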
do we want to add similar tests to make sure the right number of tasks get called?
I think this test I added yesterday covers it.
def test_call_delete_seer_grouping_records_by_hash_chunked(self) -> None:
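For what it's worth, a "right number of tasks" check can be sketched by mocking the scheduling call and counting invocations. The test below targets the hypothetical schedule_chunked_deletions helper from the sketch above, not the real Sentry task or its test, so the patch target and numbers are assumptions.

# Hedged sketch of a task-count test against the hypothetical helper above.
from unittest import mock


def test_schedules_one_task_per_chunk() -> None:
    project_id = 1
    hashes = [f"{i:032x}" for i in range(250)]  # 250 hashes -> 3 chunks at BATCH_SIZE=100

    with mock.patch(f"{__name__}.schedule_task") as mock_schedule:
        scheduled = schedule_chunked_deletions(project_id, hashes)

    assert scheduled == 3
    assert mock_schedule.call_count == 3
    # Every spawned chunk task carries at most BATCH_SIZE hashes and last_deleted_index=0.
    for call in mock_schedule.call_args_list:
        _, chunk, last_deleted_index = call.args
        assert len(chunk) <= 100
        assert last_deleted_index == 0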
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #95156      +/-   ##
==========================================
+ Coverage   87.84%   87.90%   +0.05%
==========================================
  Files       10469    10459      -10
  Lines      605374   604694     -680
  Branches    23674    23571     -103
==========================================
- Hits       531819   531575     -244
+ Misses      73195    72758     -437
- Partials      360      361       +1
Makes sense to me.
end_index = min(last_deleted_index + BATCH_SIZE, len_hashes)
call_seer_to_delete_these_hashes(project_id, hashes[last_deleted_index:end_index])
if end_index < len_hashes:
    delete_seer_grouping_records_by_hash.apply_async(args=[project_id, hashes, end_index])
Is this where all the continued stream of big tasks was coming from?
@markstory yes, this is where it came from
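For scale, back-of-the-envelope and assuming 32-character hex hashes serialized as JSON strings: each hash costs roughly 35 bytes with quotes and a comma, so 179,000 hashes come to about 6 MB, and under this pattern that full list was re-serialized into every re-scheduled task even though each task only deleted a single BATCH_SIZE slice.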
Force-pushed from 0161fd9 to 97e7fba
bugbot run
✅ BugBot reviewed your changes and found no bugs!
This simplifies the tests for deletions of hashes from Seer and also adds a test for #95156.
The original code always passed the full list of hashes to every task it spawned, so we could end up with massive task payloads that caused trouble for taskbroker.
We hit exactly that situation in the last few days, when deleting a project led to hundreds of thousands of hashes being passed to tasks (179k+ hashes -> 6MB+ task payloads).
With this change, a task takes the hashes it receives, chunks them, and spawns new tasks that each carry only a small number of hashes.
This moves us from sequential scheduling of tasks to parallelized scheduling, which could have an impact on the Seer service if a massive number of hashes is requested for deletion.
Ref inc-1236
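As a rough illustration of that trade-off, assuming a BATCH_SIZE on the order of 100 (which may not match the real constant): 179k hashes now fan out into roughly 1,800 tasks whose payloads are a few kilobytes each, instead of a chain of sequential tasks each carrying the full ~6 MB list, with the cost being that those deletion requests can reach Seer in parallel.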
Suspect Issues
This pull request was deployed and Sentry observed the following issues: