
attempt to fix cff duplicates #497

Open

wants to merge 1 commit into base: main

Conversation

adrian-codecov
Contributor

The change isn't too bad (the length is mostly from tests), but it's important to understand the problem. More is described here as well: https://github.com/codecov/internal-issues/issues/478. These are the changes in summary; the proper explanation is below:

  • Added _possibly_delete_existing_cffs and _determine_cffs_and_depths_in_db functions. These determine whether there are existing carryforward sessions that should be deleted (a rough sketch of how they fit together follows this list)
  • Exposed the parent_depth in session_extras for uploads
  • Added a script to delete uploads + associated DB records
  • Added a couple of tests for the above ^
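A rough sketch of how these pieces might fit together. The helper name comes from this PR, but the body below, including the model attributes (upload.upload_extras, upload.upload_type) and the "carriedforward" type value, is an illustrative assumption rather than the actual diff:

```python
# Illustrative sketch only, not the exact code in this PR.
from collections import defaultdict


def _determine_cffs_and_depths_in_db(db_session, commit) -> dict[int, list[int]]:
    """Group the carryforward uploads already stored for this commit's report
    by the parent_depth recorded in their extras."""
    # db_session kept to mirror a DB-backed implementation; here we only walk
    # the (assumed) ORM relationships on the commit's report.
    cffs_and_depths = defaultdict(list)
    for upload in commit.report.uploads:  # assumed relationship name
        if upload.upload_type == "carriedforward":  # assumed type value
            extras = upload.upload_extras or {}
            # parent_depth == 1 means "carried forward from the direct parent"
            cffs_and_depths[extras.get("parent_depth", 1)].append(upload.id_)
    return dict(cffs_and_depths)
```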

We've been seeing some very inconsistent and strange behaviour where customers using carryforward flags (cffs) see a disproportionate growth of cffs over time. It doesn't happen on every commit, nor is there a specific pattern to it, but it is confirmed. First, a little on carryforward uploads.

What are cffs? They're a feature we offer where customers can persist upload coverage across commits without having to explicitly upload on every commit; customers enable this in their codecov yaml. Behind the scenes, we look for a commit's parent commit and "copy-paste" the relevant upload information into the current commit.

The problem we're seeing is that certain commits carry forward uploads from two different parent commits; this should never happen. (You'll see me use "upload" and "session" interchangeably here.) How does this happen? See this example:

Example

Imagine you have a chain of commits, A -> B -> C, where C is the current commit, B is C's parent, and A is B's parent/C's grandparent. In normal circumstances, if a customer has cffs on for, say, the "unit" flag and doesn't provide an upload with the "unit" flag in B or C, B would carry forward the upload from A, and C would carry forward the upload from B.

While we have a parent_commit field on the Commit model, we have a dedicated function to determine which parent a commit should carry forward from. This function finds the closest suitable parent commit (normally the immediate parent) and carries forward reports from it. The function is needed because sometimes a commit's direct parent might not have a valid report (or any report at all), it might be in an error state, there might not be a direct parent at all, etc. I won't go into too much detail on the different states (you can read the function if you're curious), but I want to focus on the scenario where a parent_commit is processing, i.e. when it is in the "pending" state. A parent commit that is processing isn't a suitable candidate to carry forward a report from, because for all we know it doesn't have a valid report yet, or it could still error out, so we advance to its parent and use that as a candidate. In the example above, C would look at B, which is processing, so C would carry forward from A.
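A minimal sketch of that parent-selection walk, under the simplifying assumptions that parent_commit resolves to the parent Commit object and that only "pending" and unusable parents are skipped; the real get_appropriate_commit_to_carryforward_from handles more states than this:

```python
# Sketch only: walk up the ancestry until a usable parent is found.
def choose_carryforward_parent(commit, max_parenthood_deepness: int = 10):
    parent, depth = commit.parent_commit, 1
    while parent is not None and depth <= max_parenthood_deepness:
        if parent.state == "pending":
            # Parent is still processing: not a safe candidate, go one level up.
            parent, depth = parent.parent_commit, depth + 1
            continue
        if parent.state == "complete" and parent.report is not None:
            # `depth` is what this PR records as parent_depth (1 == direct parent).
            return parent, depth
        # Errored parent, missing report, etc.: keep walking up as well.
        parent, depth = parent.parent_commit, depth + 1
    return None, None
```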

This is so far expected and normal. The funky part occurs if you upload two or more times for commit C: one upload while B is processing, and one after B has successfully completed. In the eyes of C, it would carry forward from A the first time and from B the second time, and if you consider A to have "N" uploads to carry forward and B to have "M" uploads to carry forward, then C would end up with "M + N" carryforward uploads. This is wrong: C should only have the uploads from a single parent ("N" in this case). To take the example even further, if D eventually happens after C and carries the uploads from C, it would end up with "M + N" uploads too, hence you see this "duplicate" effect every now and then. -- end of example
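To make the arithmetic concrete, here's a tiny illustration with made-up numbers (the session names are obviously hypothetical):

```python
# Made-up numbers to illustrate the duplicate effect described above.
uploads_from_A = {"unit-a1", "unit-a2"}             # N = 2 sessions to carry forward
uploads_from_B = {"unit-b1", "unit-b2", "unit-b3"}  # M = 3 sessions to carry forward

# 1st upload for C arrives while B is still pending -> C carries forward A's sessions
# 2nd upload for C arrives after B has completed    -> C carries forward B's sessions too
sessions_on_C = uploads_from_A | uploads_from_B

assert len(sessions_on_C) == 5  # M + N, instead of the sessions from a single parent
```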

Now, if you're thinking "why doesn't this happen all the time, since many people upload more than once per commit?", you're thinking in the right vein. For this weird bug to happen, you need to upload multiple times very quickly, and those uploads need to be very big. A commit doesn't stay processing for very long, so the window in which one upload sees the parent as processing and another sees it as finished is slim. It's not impossible, but it's very unlikely. On top of that, two different uploads won't both run this logic, because they first need to pass a check that is true only while there isn't a report object yet and false once one is populated (so technically, two different uploads should never get here, because the second run will evaluate false and keep going). And thirdly, upload tasks hold a lock per task + repo + commit combo, so two uploads for the same commit can't race through the same function.

So, how does this actually happen? Because there are two tasks that can run into this functionality: the upload and preprocessing tasks. These two can each call the logic that chooses the commit's parent for cffs, with one choosing A as the "parent" and the other choosing B.

So if we piece everything together: if you run the preprocess + upload tasks in quick succession for a commit with a lot of uploads, there is a chance that they choose different parents.

So what am I doing (finally)? I'm adding some logic at "upload insertion time" to determine whether there are already cffs from an older parent, and to delete those if they exist. In the example above, starting from C, I would delete the cffs from A when I see cffs coming from B.
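A minimal sketch of that idea, reusing the hypothetical helper shape from earlier and assuming a smaller parent_depth means a closer (fresher) parent; the Upload import path and the cleanup of associated records are left out:

```python
# Sketch only: drop cffs that came from a more distant parent than the one
# we are carrying forward from right now.
def _possibly_delete_existing_cffs(db_session, commit, carryforward_report_depth: int):
    cffs_and_depths_in_db = _determine_cffs_and_depths_in_db(db_session, commit)
    if not cffs_and_depths_in_db:
        return
    stale_upload_ids = [
        upload_id
        for parent_depth, upload_ids in cffs_and_depths_in_db.items()
        if parent_depth > carryforward_report_depth
        for upload_id in upload_ids
    ]
    if stale_upload_ids:
        # e.g. C already has cffs from A (depth 2) and is now carrying forward
        # from B (depth 1): delete the depth-2 sessions.
        db_session.query(Upload).filter(Upload.id_.in_(stale_upload_ids)).delete(
            synchronize_session=False
        )
        db_session.commit()
```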

Hopefully you enjoyed the read!

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@codecov-notifications

codecov-notifications bot commented Jun 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #497      +/-   ##
==========================================
- Coverage   97.26%   97.25%   -0.01%     
==========================================
  Files         412      413       +1     
  Lines       34424    34554     +130     
==========================================
+ Hits        33481    33607     +126     
- Misses        943      947       +4     
Flag Coverage Δ
integration 97.25% <100.00%> (-0.01%) ⬇️
latest-uploader-overall 97.25% <100.00%> (-0.01%) ⬇️
unit 97.25% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 94.40% <100.00%> (-0.02%) ⬇️
OutsideTasks 97.55% <100.00%> (+0.01%) ⬆️
Files Coverage Δ
services/delete/delete.py 100.00% <100.00%> (ø)
services/report/__init__.py 92.75% <100.00%> (+0.44%) ⬆️
services/tests/test_report.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


codecov bot commented Jun 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.28%. Comparing base (fb91608) to head (f12fe8f).

Changes have been made to critical files, which contain lines commonly executed in production. Learn more

✅ All tests successful. No failed tests found.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #497      +/-   ##
==========================================
- Coverage   97.28%   97.28%   -0.01%     
==========================================
  Files         443      444       +1     
  Lines       35153    35283     +130     
==========================================
+ Hits        34200    34326     +126     
- Misses        953      957       +4     
Flag Coverage Δ
integration 97.25% <100.00%> (-0.01%) ⬇️
latest-uploader-overall 97.25% <100.00%> (-0.01%) ⬇️
unit 97.25% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 94.45% <100.00%> (-0.02%) ⬇️
OutsideTasks 97.55% <100.00%> (+0.01%) ⬆️
Files Coverage Δ
services/delete/delete.py 100.00% <100.00%> (ø)
services/report/__init__.py Critical 92.76% <100.00%> (+0.44%) ⬆️
services/tests/test_report.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

Related Entrypoints
run/app.tasks.upload.Upload
run/app.tasks.upload.PreProcessUpload

Contributor

@giovanni-guidini left a comment

Thanks for the detailed explanation in the PR. Very helpful.

After reading through that and the proposed solution, it seems that you are trying to fix the effects of the issue, not the cause (the cause being that two different tasks might be doing the same thing at the same time).

Wouldn't it be preferable to fix the cause of the problem?
That would save us from having to do extra work (saving sessions and then deleting them).

I would actually go as far as suggesting that we remove the pre_process_upload task completely. I don't think we are getting any serious benefit from it at this time...

@adrian-codecov
Contributor Author

@giovanni-guidini I'm not versed enough in the preprocess task to know the implications of deleting it altogether; for example, what happens to people that use the preprocess CLI command (create-report)? Do we know how many people use this, and the origins of the task?

While I think this is 99% the cause, I'd like to see if it's indeed the case in prod, which is why I thought of trying this out first.

Other things we could do:

  • Have the pre-processing and upload locks be the same
  • Extend this timeout (timeout=60 * 5), because the task could also time out with the new deletions

@giovanni-guidini
Contributor

@adrian-codecov the main objective of the create-report command is to create a report (as the name implies). That process happens in the API, not in the task.

The pre-process upload task simply initializes the report that was created (carryforward is part of the initialization), but as we've seen, that is also done in the upload task. So for people using the create-report command there's no real change. The difference is that the initialization will happen in the Upload task, not the PreProcessUpload one.

To which you might be wondering: "what's the point of a redundant task?"
Because the report only needs to be initialized once, the idea of the PreProcessUpload task is to speed up report processing by doing the initialization early, so that when the upload comes the system is primed to process it.

Anyway, if we remove it, the upload task would just take care of the report initialization. It might increase the time to notify by a little bit, but that would be the extent of the impact, I think.

Using the same lock would indeed solve it, but having more tasks fighting for the same lock would increase coordination overhead (and we have a clear way to have fewer tasks fighting for locks).

I don't see how increasing the timeout would help with the issue.

@giovanni-guidini
Contributor

giovanni-guidini commented Jun 11, 2024

About this bit

While I think this is 99% the cause, I'd like to see if it's indeed the case in prod, which is why I thought of trying this out first.

That makes sense, but the code as-is seems quite "final" and "permanent". Assuming this solution fixes the carryforward issues we see (and it does seem that it would work; the change itself seems fine*), what would be the next steps to solve the cause?
Or would we just leave this fix in place and potentially drown out other sources of this issue (with more work by the system), instead of preventing the system from creating this error condition?

(Of course I might be missing a very obvious "this is urgent, we need to patch this asap and worry about not making it happen again later", which would be very valid)

*The sessions that belong to a report are also saved in the report_json object that goes to GCS. I do believe that the report json would list sessions for just one of the parents - "as if" there were no duplicates in the report itself - but have you verified that?

    if not cffs_and_depths_in_db:
        return

    # Gets first 'parent_depth' it finds, as all cffs should be carried forward
Contributor

Shouldn't it get the minimum 'parent_depth' instead of the first?

Contributor Author

The next four lines do that: they go through the Report object, which contains the uploads that will be carried forward, and determine the depth they have. I'm initializing carryforward_report_depth with an arbitrary depth of 1 in case line 1241 doesn't run, so the variable has a non-null value. I update it shortly after, on line 1243, if a depth for the carryforward reports is found. Does that make sense?

Contributor

Well... no. The question is whether the code should get the smallest 'parent_depth' from all the sessions in the report. Both the comment and the code are getting the first parent_depth that appears. Those values can be different depending on the order of the sessions that were carried forward, and you made no claims about the order of the sessions.

Asked differently: is the first session that appears always the smallest one?
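For illustration only, one way to make the "take the minimum" intent explicit, assuming each carried-forward session exposes its depth in session_extras (the attribute access below is an assumption, not the PR's actual code):

```python
# Illustrative only: use the smallest parent_depth across all carried-forward
# sessions instead of whichever one happens to appear first.
carryforward_report_depth = min(
    (
        (session.session_extras or {}).get("parent_depth", MIN_PARENT_COMMIT_DEPTH)
        for session in carryforward_report.sessions.values()
    ),
    default=MIN_PARENT_COMMIT_DEPTH,
)
```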

@@ -60,6 +62,8 @@
from services.repository import get_repo_provider_service
from services.yaml.reader import get_paths_from_flags, read_yaml_field

MIN_PARENT_COMMIT_DEPTH = 1
Contributor

Will this ever not be 1?

Contributor Author

This value itself will always be 1, but the parent depth a commit chooses is determined by get_appropriate_commit_to_carryforward_from(self, commit: Commit, max_parenthood_deepness: int = 10). I just made it into a constant because I was referencing it in two different places.

commit=commit.commitid,
),
)
# Potentially make task here - this would make the preprocess task likely timeout
Contributor

Would the upload task be likely to time out here too? Because that would be a way bigger problem.

Contributor Author

@adrian-codecov Jun 14, 2024

It could, depending on the amount of things to delete; hence the ask here to see whether it would make sense to create a separate deletion task. I could do that if we think it makes more sense.
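As a hedged sketch of what a separate deletion task could look like, using plain Celery for illustration; the task name, the queue/base-class conventions, and the delete_upload_records helper below are hypothetical, not the worker's actual API:

```python
from celery import shared_task


@shared_task(name="app.tasks.delete.StaleCarryforwardUploads")  # hypothetical name
def delete_stale_carryforward_uploads(upload_ids: list[int]) -> None:
    # Delete the DB rows outside the upload/preprocess critical path so those
    # tasks don't risk timing out on large deletions.
    delete_upload_records(upload_ids)  # hypothetical helper doing the actual deletes


# At upload-insertion time, instead of deleting inline, one could enqueue:
# delete_stale_carryforward_uploads.apply_async(kwargs=dict(upload_ids=stale_upload_ids))
```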

Contributor

@giovanni-guidini Jun 17, 2024

I think that would be a good idea, yes.
Especially if it only changes the database and not the Report object, it would make a lot of sense to move this to a different task.

[ps.: "I" can't speak for the team, obviously. You might want to gather more opinions]

for parent_depth, upload_ids in cffs_and_depths_in_db.items():
    if parent_depth > carryforward_report_depth:
        log.info(
            "Deleting upload from DB",
Contributor

[nit] Big chance that this will spam our logs. Might be better to accumulate the number of uploads deleted and just make a single log afterwards...
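A small sketch of that accumulate-then-log idea; the variable names follow the snippet above, and the log fields are illustrative:

```python
# Count what gets deleted per parent_depth and emit one summary log at the end.
deleted_counts: dict[int, int] = {}
for parent_depth, upload_ids in cffs_and_depths_in_db.items():
    if parent_depth > carryforward_report_depth:
        deleted_counts[parent_depth] = len(upload_ids)
        # ... existing per-depth deletion logic goes here ...
if deleted_counts:
    log.info(
        "Deleted stale carryforward uploads from DB",
        extra=dict(commit=commit.commitid, deleted_counts=deleted_counts),
    )
```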

db_session.query(Upload).filter(Upload.id_.in_(upload_ids)).delete(
    synchronize_session=False
)
db_session.commit()
Contributor

Shouldn't we use flush instead of commit, like we do in the sync_repos and sync_teams tasks? I don't really know the difference, though. Just going for consistency.
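For context, roughly how the two differ in SQLAlchemy; whether flush fits the surrounding transaction handling here is still a call for this codebase:

```python
# flush: emits the pending SQL (e.g. the DELETEs) to the database within the
# current transaction, so later queries on this session see the change, but
# nothing is permanent until the transaction commits (it can still roll back).
db_session.flush()

# commit: flushes any pending changes and then commits the transaction,
# making the deletes permanent right away.
db_session.commit()
```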
