
attempt to fix cff duplicates #497

Open

wants to merge 1 commit into base: main

Conversation

adrian-codecov
Contributor

The change isn't too bad (the length is mostly from tests), but it's important to understand the problem. More is described here as well: https://github.com/codecov/internal-issues/issues/478. These are the changes in summary; the proper explanation is below:

  • Added _possibly_delete_existing_cffs and _determine_cffs_and_depths_in_db functions. These determine whether there are existing carryforward sessions that should be deleted (a rough sketch of how they fit together follows this list)
  • Exposed the parent_depth in session_extras for uploads
  • Added a script to delete uploads + associated DB records
  • Added a couple of tests for the above ^
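A rough sketch of how these pieces might fit together. The helper name comes from this PR, but the body below, including the model attributes (upload.upload_extras, upload.upload_type) and the "carriedforward" type value, is an illustrative assumption rather than the actual diff:

```python
# Illustrative sketch only, not the exact code in this PR.
from collections import defaultdict


def _determine_cffs_and_depths_in_db(db_session, commit) -> dict[int, list[int]]:
    """Group the carryforward uploads already stored for this commit's report
    by the parent_depth recorded in their extras."""
    # db_session kept to mirror a DB-backed implementation; here we only walk
    # the (assumed) ORM relationships on the commit's report.
    cffs_and_depths = defaultdict(list)
    for upload in commit.report.uploads:  # assumed relationship name
        if upload.upload_type == "carriedforward":  # assumed type value
            extras = upload.upload_extras or {}
            # parent_depth == 1 means "carried forward from the direct parent"
            cffs_and_depths[extras.get("parent_depth", 1)].append(upload.id_)
    return dict(cffs_and_depths)
```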

We've been seeing some very inconsistent and strange behaviour where customers using carryforward flags (cffs) see a disproportionate growth of cffs over time. It doesn't happen on every commit, nor is there a specific pattern to it, but it is confirmed. First, a little on carryforward uploads.

What are cffs? They're a feature we offer where customers can persist upload coverage across commits without having to explicitly upload on every commit; customers enable this in their codecov yaml. Behind the scenes, we look for a commit's parent commit and "copy-paste" the relevant upload information into the current commit.

The problem we're seeing is that certain commits carry forward uploads from two different parent commits; this should never happen. (You'll see me use "upload" and "session" interchangeably here.) How does this happen? See this example:

Example

Imagine you have a chain of commits, A -> B -> C, where C is the current commit, B is C's parent, and A is B's parent/C's grandparent. In normal circumstances, if a customer has cffs on for, say, the "unit" flag and doesn't provide an upload with the "unit" flag in B or C, B would carry forward the upload from A, and C would carry forward the upload from B.

While we have a parent_commit field on the Commit model, we have a dedicated function to determine which parent a commit should carry forward from. This function finds the closest suitable parent commit (normally the immediate parent) and carries forward reports from it. The function is needed because sometimes a commit's direct parent might not have a valid report (or any report at all), it might be in an error state, there might not be a direct parent at all, etc. I won't go into too much detail on the different states (you can read the function if you're curious), but I want to focus on the scenario where a parent_commit is processing, i.e. when it is in the "pending" state. A parent commit that is processing isn't a suitable candidate to carry forward a report from, because for all we know it doesn't have a valid report yet, or it could still error out, so we advance to its parent and use that as a candidate. In the example above, C would look at B, which is processing, so C would carry forward from A.
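A minimal sketch of that parent-selection walk, under the simplifying assumptions that parent_commit resolves to the parent Commit object and that only "pending" and unusable parents are skipped; the real get_appropriate_commit_to_carryforward_from handles more states than this:

```python
# Sketch only: walk up the ancestry until a usable parent is found.
def choose_carryforward_parent(commit, max_parenthood_deepness: int = 10):
    parent, depth = commit.parent_commit, 1
    while parent is not None and depth <= max_parenthood_deepness:
        if parent.state == "pending":
            # Parent is still processing: not a safe candidate, go one level up.
            parent, depth = parent.parent_commit, depth + 1
            continue
        if parent.state == "complete" and parent.report is not None:
            # `depth` is what this PR records as parent_depth (1 == direct parent).
            return parent, depth
        # Errored parent, missing report, etc.: keep walking up as well.
        parent, depth = parent.parent_commit, depth + 1
    return None, None
```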

This is so far expected and normal. The funky part occurs if you upload two or more times for commit C: one upload while B is processing, and one after B has successfully completed. In the eyes of C, it would carry forward from A the first time and from B the second time, and if you consider A to have "N" uploads to carry forward and B to have "M" uploads to carry forward, then C would end up with "M + N" carryforward uploads. This is wrong: C should only have the uploads from a single parent ("N" in this case). To take the example even further, if D eventually happens after C and carries the uploads from C, it would end up with "M + N" uploads too, hence you see this "duplicate" effect every now and then. -- end of example
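To make the arithmetic concrete, here's a tiny illustration with made-up numbers (the session names are obviously hypothetical):

```python
# Made-up numbers to illustrate the duplicate effect described above.
uploads_from_A = {"unit-a1", "unit-a2"}             # N = 2 sessions to carry forward
uploads_from_B = {"unit-b1", "unit-b2", "unit-b3"}  # M = 3 sessions to carry forward

# 1st upload for C arrives while B is still pending -> C carries forward A's sessions
# 2nd upload for C arrives after B has completed    -> C carries forward B's sessions too
sessions_on_C = uploads_from_A | uploads_from_B

assert len(sessions_on_C) == 5  # M + N, instead of the sessions from a single parent
```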

Now, if you're thinking "why doesn't this happen all the time, since many people upload more than once per commit?", you're thinking in the right vein. For this weird bug to happen, you need to upload multiple times very quickly, and those uploads need to be very big. A commit doesn't stay processing for very long, so the window in which one upload sees the parent as processing and another sees it as finished is slim. It's not impossible, but it's very unlikely. On top of that, two different uploads won't both run this logic, because they first need to pass a check that is true only while there isn't a report object yet and false once one is populated (so technically, two different uploads should never get here, because the second run will evaluate false and keep going). And thirdly, upload tasks hold a lock per task + repo + commit combo, so two uploads for the same commit can't race through the same function.

So, how does this actually happen? Because there are two tasks that can run into this functionality: the upload and preprocessing tasks. These two can each call the logic that chooses the commit's parent for cffs, with one choosing A as the "parent" and the other choosing B.

So if we piece everything together: if you run the preprocess + upload tasks in quick succession for a commit with a lot of uploads, there is a chance that they choose different parents.

So what am I doing (finally)? I'm adding some logic at "upload insertion time" to determine whether there are already cffs from an older parent, and to delete those if they exist. In the example above, starting from C, I would delete the cffs from A when I see cffs coming from B.
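A minimal sketch of that idea, reusing the hypothetical helper shape from earlier and assuming a smaller parent_depth means a closer (fresher) parent; the Upload import path and the cleanup of associated records are left out:

```python
# Sketch only: drop cffs that came from a more distant parent than the one
# we are carrying forward from right now.
def _possibly_delete_existing_cffs(db_session, commit, carryforward_report_depth: int):
    cffs_and_depths_in_db = _determine_cffs_and_depths_in_db(db_session, commit)
    if not cffs_and_depths_in_db:
        return
    stale_upload_ids = [
        upload_id
        for parent_depth, upload_ids in cffs_and_depths_in_db.items()
        if parent_depth > carryforward_report_depth
        for upload_id in upload_ids
    ]
    if stale_upload_ids:
        # e.g. C already has cffs from A (depth 2) and is now carrying forward
        # from B (depth 1): delete the depth-2 sessions.
        db_session.query(Upload).filter(Upload.id_.in_(stale_upload_ids)).delete(
            synchronize_session=False
        )
        db_session.commit()
```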

Hopefully you enjoyed the read!

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

@codecov-notifications

codecov-notifications bot commented Jun 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #497      +/-   ##
==========================================
- Coverage   97.26%   97.25%   -0.01%     
==========================================
  Files         412      413       +1     
  Lines       34424    34554     +130     
==========================================
+ Hits        33481    33607     +126     
- Misses        943      947       +4     
Flag Coverage Δ
integration 97.25% <100.00%> (-0.01%) ⬇️
latest-uploader-overall 97.25% <100.00%> (-0.01%) ⬇️
unit 97.25% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 94.40% <100.00%> (-0.02%) ⬇️
OutsideTasks 97.55% <100.00%> (+0.01%) ⬆️
Files Coverage Δ
services/delete/delete.py 100.00% <100.00%> (ø)
services/report/__init__.py 92.75% <100.00%> (+0.44%) ⬆️
services/tests/test_report.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


codecov bot commented Jun 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.28%. Comparing base (fb91608) to head (f12fe8f).

Changes have been made to critical files, which contain lines commonly executed in production. Learn more

✅ All tests successful. No failed tests found.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #497      +/-   ##
==========================================
- Coverage   97.28%   97.28%   -0.01%     
==========================================
  Files         443      444       +1     
  Lines       35153    35283     +130     
==========================================
+ Hits        34200    34326     +126     
- Misses        953      957       +4     
Flag Coverage Δ
integration 97.25% <100.00%> (-0.01%) ⬇️
latest-uploader-overall 97.25% <100.00%> (-0.01%) ⬇️
unit 97.25% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
NonTestCode 94.45% <100.00%> (-0.02%) ⬇️
OutsideTasks 97.55% <100.00%> (+0.01%) ⬆️
Files Coverage Δ
services/delete/delete.py 100.00% <100.00%> (ø)
services/report/__init__.py Critical 92.76% <100.00%> (+0.44%) ⬆️
services/tests/test_report.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

Related Entrypoints
run/app.tasks.upload.Upload
run/app.tasks.upload.PreProcessUpload

Contributor

@giovanni-guidini left a comment

Thanks for the detailed explanation in the PR. Very helpful.

After reading through that and the proposed solution, it seems that you are trying to fix the effects of the issue, not the cause (the cause being that two different tasks might be doing the same thing at the same time).

Wouldn't it be preferable to fix the cause of the problem?
That would save us from having to do extra work (saving sessions and then deleting them).

I would actually go as far as suggesting that we remove the pre_process_upload task completely. I don't think we are getting any serious benefit from it at this time...

@adrian-codecov
Contributor Author

@giovanni-guidini I'm not versed enough in the preprocess task to know the implications of deleting it altogether; for example, what happens to people that use the preprocess CLI command (create-report)? Do we know how many people use this, and the origins of the task?

While I think this is 99% the cause, I'd like to see if it's indeed the case in prod, which is why I thought of trying this out first.

Other things we could do:

  • Have the pre-processing and upload locks be the same
  • Extend this timeout (timeout=60 * 5), because the task could also time out with the new deletions

@giovanni-guidini
Contributor

@adrian-codecov the main objective of the create-report command is to create a report (as the name implies). That process happens in the API, not in the task.

The pre-process upload task simply initializes the report that was created (carryforward is part of the initialization), but as we've seen, that is also done in the upload task. So for people using the create-report command there's no real change. The difference is that the initialization will happen in the Upload task, not the PreProcessUpload one.

To which you might be wondering: "what's the point of a redundant task?"
Because the report only needs to be initialized once, the idea of the PreProcessUpload task is to speed up report processing by doing the initialization early, so that when the upload comes the system is primed to process it.

Anyway, if we remove it, the upload task would just take care of the report initialization. It might increase the time to notify by a little bit, but that would be the extent of the impact, I think.

Using the same lock would indeed solve it, but having more tasks fighting for the same lock would increase coordination overhead (and we have a clear way to have fewer tasks fighting for locks).

I don't see how increasing the timeout would help with the issue.

@giovanni-guidini
Contributor

giovanni-guidini commented Jun 11, 2024

About this bit

While I think this is 99% the cause, I'd like to see if it's indeed the case in prod, which is why I thought of trying this out first.

That makes sense, but the code as-is seems quite "final" and "permanent". Assuming this solution fixes the carryforward issues we see (and it does seem that it would work; the change itself seems fine*), what would be the next steps to solve the cause?
Or would we just leave this fix in place and potentially drown out other sources of this issue (with more work by the system), instead of preventing the system from creating this error condition?

(Of course I might be missing a very obvious "this is urgent, we need to patch this asap and worry about not making it happen again later", which would be very valid)

*The sessions that belong to a report are also saved in the report_json object that goes to GCS. I do believe that the report json would list sessions for just one of the parents - "as if" there were no duplicates in the report itself - but have you verified that?

    if not cffs_and_depths_in_db:
        return

    # Gets first 'parent_depth' it finds, as all cffs should be carried forward
Contributor

Shouldn't it get the minimum 'parent_depth' instead of the first?

Contributor Author

The next four lines do that: they go through the Report object, which contains the uploads that will be carried forward, and determine the depth they have. I'm initializing carryforward_report_depth with an arbitrary depth of 1 in case line 1241 doesn't run, so the variable has a non-null value. I update it shortly after, on line 1243, if a depth for the carryforward reports is found. Does that make sense?

Contributor

Well... no. The question is whether the code should get the smallest 'parent_depth' from all the sessions in the report. Both the comment and the code are getting the first parent_depth that appears. Those values can be different depending on the order of the sessions that were carried forward, and you made no claims about the order of the sessions.

Asked differently: is the first session that appears always the smallest one?
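For illustration only, one way to make the "take the minimum" intent explicit, assuming each carried-forward session exposes its depth in session_extras (the attribute access below is an assumption, not the PR's actual code):

```python
# Illustrative only: use the smallest parent_depth across all carried-forward
# sessions instead of whichever one happens to appear first.
carryforward_report_depth = min(
    (
        (session.session_extras or {}).get("parent_depth", MIN_PARENT_COMMIT_DEPTH)
        for session in carryforward_report.sessions.values()
    ),
    default=MIN_PARENT_COMMIT_DEPTH,
)
```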

@@ -60,6 +62,8 @@
from services.repository import get_repo_provider_service
from services.yaml.reader import get_paths_from_flags, read_yaml_field

MIN_PARENT_COMMIT_DEPTH = 1
Contributor

Will this ever not be 1?

Contributor Author

This value itself will always be 1, but the parent depth a commit chooses is determined by get_appropriate_commit_to_carryforward_from(self, commit: Commit, max_parenthood_deepness: int = 10). I just made it into a constant because I was referencing it in two different places.

commit=commit.commitid,
),
)
# Potentially make task here - this would make the preprocess task likely timeout
Contributor

Would the upload task be likely to time out here too? Because that would be a way bigger problem.

Contributor Author

@adrian-codecov Jun 14, 2024

It could, depending on the amount of things to delete; hence the ask here to see whether it would make sense to create a separate deletion task. I could do that if we think it makes more sense.
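As a hedged sketch of what a separate deletion task could look like, using plain Celery for illustration; the task name, the queue/base-class conventions, and the delete_upload_records helper below are hypothetical, not the worker's actual API:

```python
from celery import shared_task


@shared_task(name="app.tasks.delete.StaleCarryforwardUploads")  # hypothetical name
def delete_stale_carryforward_uploads(upload_ids: list[int]) -> None:
    # Delete the DB rows outside the upload/preprocess critical path so those
    # tasks don't risk timing out on large deletions.
    delete_upload_records(upload_ids)  # hypothetical helper doing the actual deletes


# At upload-insertion time, instead of deleting inline, one could enqueue:
# delete_stale_carryforward_uploads.apply_async(kwargs=dict(upload_ids=stale_upload_ids))
```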

Contributor

@giovanni-guidini Jun 17, 2024

I think that would be a good idea, yes.
Especially if it only changes the database and not the Report object, it would make a lot of sense to move this to a different task.

[ps.: "I" can't speak for the team, obviously. You might want to gather more opinions]

for parent_depth, upload_ids in cffs_and_depths_in_db.items():
    if parent_depth > carryforward_report_depth:
        log.info(
            "Deleting upload from DB",
Contributor

[nit] Big chance that this will spam our logs. Might be better to accumulate the number of uploads deleted and just make a single log afterwards...
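A small sketch of that accumulate-then-log idea; the variable names follow the snippet above, and the log fields are illustrative:

```python
# Count what gets deleted per parent_depth and emit one summary log at the end.
deleted_counts: dict[int, int] = {}
for parent_depth, upload_ids in cffs_and_depths_in_db.items():
    if parent_depth > carryforward_report_depth:
        deleted_counts[parent_depth] = len(upload_ids)
        # ... existing per-depth deletion logic goes here ...
if deleted_counts:
    log.info(
        "Deleted stale carryforward uploads from DB",
        extra=dict(commit=commit.commitid, deleted_counts=deleted_counts),
    )
```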

db_session.query(Upload).filter(Upload.id_.in_(upload_ids)).delete(
    synchronize_session=False
)
db_session.commit()
Contributor

Shouldn't we use flush instead of commit, like we do in the sync_repos and sync_teams tasks? I don't really know the difference, though. Just going for consistency.
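For context, roughly how the two differ in SQLAlchemy; whether flush fits the surrounding transaction handling here is still a call for this codebase:

```python
# flush: emits the pending SQL (e.g. the DELETEs) to the database within the
# current transaction, so later queries on this session see the change, but
# nothing is permanent until the transaction commits (it can still roll back).
db_session.flush()

# commit: flushes any pending changes and then commits the transaction,
# making the deletes permanent right away.
db_session.commit()
```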
