feat: add parallel component computation to timeseries #1054
Conversation
Codecov Report: ✅ All tests successful. No failed tests found.
Additional details and impacted files:

@@            Coverage Diff             @@
## main #1054 +/- ##
==========================================
+ Coverage 97.52% 97.53% +0.01%
==========================================
Files 462 463 +1
Lines 37902 37986 +84
==========================================
+ Hits 36963 37049 +86
+ Misses 939 937 -2
❌ 1 Tests Failed (details in the Test Analytics Dashboard).

✅ All tests successful. No failed tests were found.
Force-pushed from 9856c08 to 3ac01a8.
Similar to what I just did with the compute component task: when it comes time to upsert measurements in the save_commit_measurements task, I added a feature flag, and behind that flag I altered the behaviour so that instead of upserting the measurements for each component sequentially, it queues up a task for each component so they can upsert in parallel. I had to reorganize the code to avoid some circular imports, hence the save_commit_measurements function being moved to tasks/save_commit_measurements.py.
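A minimal sketch of the branch described above, assuming a Celery-based worker; the flag check, the task name upsert_component_task, and the helper get_components are illustrative stand-ins, not the exact identifiers in this PR:

from celery import group

def save_commit_measurements(commit, current_yaml, db_session, report):
    components = get_components(current_yaml)  # hypothetical helper
    if parallel_component_upsert_enabled(commit.repository):  # hypothetical flag check
        # Queue one upsert task per component so they run in parallel.
        g = group(
            upsert_component_task.s(commit.id_, component.component_id)
            for component in components
        )
        g.apply_async()
    else:
        # Previous behaviour: upsert each component's measurements sequentially.
        upsert_components_measurements(commit, current_yaml, db_session, report)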
Force-pushed from 3ac01a8 to e61e74d.
    )
    g.apply_async()
else:
    upsert_components_measurements(commit, current_yaml, db_session, report)
I see, we changed the maybe_upsert_component_measurements to this:
services/timeseries.py (Outdated)
@@ -105,67 +72,66 @@ def maybe_upsert_flag_measurements(commit, dataset_names, db_session, report):
         upsert_measurements(db_session, measurements)


-def maybe_upsert_components_measurements(
-    commit, current_yaml, dataset_names, db_session, report
+def find_duplicate_component_ids(components: list[Component], commit: Commit):
So we moved this outside the maybe fn, and that's what we extracted, now forming the upsert_components_measurements fn.
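A hedged sketch of that split, with illustrative signatures and dataset names: the maybe_ wrapper keeps only the is-this-dataset-enabled check, while the extracted function owns the per-component loop and can be called directly from a spawned task:

def maybe_upsert_components_measurements(
    commit, current_yaml, dataset_names, db_session, report
):
    # The "maybe" wrapper only decides whether the components dataset is enabled.
    if MeasurementName.component_coverage.value in dataset_names:  # assumed enum
        upsert_components_measurements(commit, current_yaml, db_session, report)


def upsert_components_measurements(commit, current_yaml, db_session, report):
    # Extracted body: builds and upserts measurements for every component.
    for component in get_components(current_yaml):  # hypothetical helper
        upsert_component_measurements(commit, component, db_session, report)  # hypothetical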
tasks/upsert_component.py (Outdated)
if component.paths or component.flag_regexes:
    report_and_component_matching_flags = component.get_matching_flags(
        list(report.flags.keys())
    )
    filtered_report = report.filter(
        flags=report_and_component_matching_flags, paths=component.paths
    )
I believe the primary slowness comes from the report.filter call, which is the thing we would like to parallelize. All the rest should be trivial and fast, and I would imagine that spawning a task, fetching the report, and fetching the yaml add quite a bit of overhead.
How about:
- Do the deduplication, resolution, and filtering of these components in the main task. As the main task is fetching the report anyway, you can end up with a list of component_id, flags, paths which are already filtered down to only the relevant ones.
- Then, when you have that filtered and resolved list, you either iterate over that list or spawn individual tasks, as in the sketch below.
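A sketch of this suggestion, with illustrative names: deduplicate and resolve the components once in the main task, so each spawned task receives only a small (component_id, flags, paths) triple and spends its time on the expensive report.filter(...) call. The get_matching_flags call is taken from the diff above; get_components is a hypothetical helper.

def resolve_components(current_yaml, report):
    resolved = []
    seen_ids = set()
    for component in get_components(current_yaml):  # hypothetical helper
        if component.component_id in seen_ids:
            # Deduplicate here instead of repeating the check in every task.
            continue
        seen_ids.add(component.component_id)
        flags = component.get_matching_flags(list(report.flags.keys()))
        resolved.append((component.component_id, flags, component.paths))
    return resolved

The main task can then either iterate over this list in-process or fan out one task per entry, with only the report filtering happening inside each task.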
- [save_commit_measurements] move work of checking timeseries_enabled to the caller
- [save_commit_measurements] move work of getting the dataset names to the caller
- [upsert_component, save_commit_measurements] move work of finding relevant flags and paths to save_commit_measurements, since work does not need to be duplicated in parallel tasks
- [upsert_component, timeseries] remove find_duplicate_component_ids
- update tests
Suspect Issues: This pull request was deployed and Sentry observed the following issues: