Have the same lock for upload + preprocess task #502

Open · wants to merge 3 commits into main
9 changes: 7 additions & 2 deletions tasks/preprocess_upload.py
@@ -24,6 +24,7 @@
from services.yaml import save_repo_yaml_to_database_if_needed
from services.yaml.fetcher import fetch_commit_yaml_from_provider
from tasks.base import BaseCodecovTask
from tasks.upload import UPLOAD_LOCK_NAME

log = logging.getLogger(__name__)

@@ -48,7 +49,8 @@ def run_impl(
"Received preprocess upload task",
extra=dict(repoid=repoid, commit=commitid, report_code=report_code),
)
lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
lock_name = UPLOAD_LOCK_NAME(repoid, commitid)
Contributor

I think it might be useful to add a comment explaining why this task needs the same lock as the upload task

# lock_name = f"preprocess_upload_lock_{repoid}_{commitid}_{report_code}"
Contributor

Why keep it as a comment?

Contributor Author

caaaause I forgot 🙃

redis_connection = get_redis_connection()
# This task only needs to run once per commit (per report_code)
# to generate the report, so if one is already running we don't need another
@@ -62,7 +64,10 @@ def run_impl(
with redis_connection.lock(
lock_name,
timeout=60 * 5,
blocking_timeout=None,
# This is the timeout that this task will wait to acquire the lock. It should
# not be None, as otherwise the task waits indefinitely for the lock, and it should ideally be
# smaller than the blocking timeout for the upload task so this one goes first, although it can go second
Contributor

It's not the order that matters. The reason the blocking timeout should be smaller is to avoid the scenario in which the PreProcess tasks wait forever and get the locks. This can drain Upload tasks (because they wait a limited time, and retry a limited number of times).

If only PreProcess tasks run for a commit, that's a problem. If only Upload tasks run, it's OK.
We know that the number of Upload tasks is always >= the number of PreProcess tasks for a given commit.
We know that the Upload tasks retry if they can't get the lock, but the PreProcess task doesn't.
By forcing the PreProcess tasks to collectively wait less time for the lock we can guarantee that an Upload task will run and everything will be OK.
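
To make the asymmetry concrete, here's a minimal sketch (not the actual task code) of how the two blocking timeouts could interact, assuming redis-py's Lock API; the lock name and the 3s/5s values are illustrative:

```python
# Sketch only: why PreProcess should wait less for the lock than Upload does.
# Assumes a reachable Redis instance; the lock name mirrors UPLOAD_LOCK_NAME.
from redis import Redis
from redis.exceptions import LockError

redis_connection = Redis()
lock_name = "upload_lock_123_abc123"  # hypothetical repoid / commitid

def preprocess_once() -> None:
    # PreProcess: wait at most 3s for the lock, then give up for good (no retries).
    try:
        with redis_connection.lock(lock_name, timeout=60 * 5, blocking_timeout=3):
            ...  # build the initial report
    except LockError:
        pass  # fine: an Upload task holds the lock and will do the work

def upload_once() -> bool:
    # Upload: wait a bit longer (5s) and tell the caller to retry on failure,
    # so an Upload task eventually runs even if PreProcess tasks keep arriving.
    try:
        with redis_connection.lock(lock_name, timeout=60 * 5, blocking_timeout=5):
            ...  # process the uploaded reports
        return True
    except LockError:
        return False  # the real task re-schedules itself with a countdown
```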

Contributor

This is also the reason why nuking the PreProcess task makes sense.
It avoids this problem completely

Contributor

I'm kinda realizing now that the current setup could lead us into trouble...

Consider this scenario:
1 PreProcess and 1 Upload task execute at roughly the same time. PreProcess gets the lock. It can stay with the lock for a maximum of 5 minutes. Let's say it keeps the lock for 5 minutes.

Upload waits 5s, doesn't get the lock. It retries (after 20s). Then it waits 5 more seconds to get the lock and fails. It will retry again (after 40s), and wait 5 more seconds. Then it will not retry again.
In total the task waited 15s + 20s + 40s = 1m15s

In this scenario no Upload tasks would run for the commit and nothing would be processed

Contributor Author
@adrian-codecov Jun 14, 2024

That makes sense, so if we just ensure the upload task waits at least 5 minutes in the worst case scenario, that would fix this right?

Every unsuccessful run goes through blocking_timeout (5) + retry_countdown (20 * 2**self.request.retries), so it goes like

0 retries: 5 + 20 = 25
1 retry: 5 + 40 = 45 (total 70)
2 retries: 5 + 80 = 85 (total 155)
3 retries: 5 + 160 = 165 (total 320)

So if we allow this to retry at most 3 times, that could solve the problem, or we could tweak the specific retry values. That's one way.

Another option is to shorten the time the preprocess task can hold the lock and fit the upload retries around that.

There's also eliminating the preprocess task altogether, which I can ask other people/engs about, but I suspect that will be met with more friction + unforeseen things 'cause we're suddenly unsupporting a command. Although if we fundamentally think this task is more harmful than not, it's worth pushing to get rid of it.

Let me know what you think about this
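
For reference, a small sketch reproducing the arithmetic above, assuming the 5s blocking_timeout and the 20 * 2**retries countdown:

```python
# Cumulative worst-case seconds an Upload task spends before giving up,
# following the accounting above: 5s of lock waiting per attempt plus a
# 20 * 2**n countdown after each failed attempt.
def cumulative_wait(retries: int, blocking_timeout: int = 5) -> int:
    return sum(blocking_timeout + 20 * 2**n for n in range(retries + 1))

for n in range(4):
    print(f"{n} retries -> {cumulative_wait(n)}s total")
# 0 retries -> 25s total
# 1 retries -> 70s total
# 2 retries -> 155s total
# 3 retries -> 320s total (just over the 5-minute / 300s lock timeout)
```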

Contributor

I would push for the elimination of the preprocess task altogether. It would be a good idea to get other opinions from the team, definitely.

It is not true that "we're suddenly unsupporting a command" if we drop that task. The API would still create an empty report.

In the meantime we can go with the proposal in the other PR to alleviate customers' problems (although for the record I don't particularly like that solution, but that's just 1 opinion)

blocking_timeout=3,
):
return self.process_impl_within_lock(
db_session=db_session,
4 changes: 3 additions & 1 deletion tasks/upload.py
@@ -64,6 +64,8 @@

CHUNK_SIZE = 3

UPLOAD_LOCK_NAME = lambda repoid, commitid: f"upload_lock_{repoid}_{commitid}"


class UploadContext:
"""
@@ -91,7 +93,7 @@ def lock_name(self, lock_type: str):
if lock_type == "upload_processing":
return UPLOAD_PROCESSING_LOCK_NAME(self.repoid, self.commitid)
else:
return f"{lock_type}_lock_{self.repoid}_{self.commitid}"
return UPLOAD_LOCK_NAME(self.repoid, self.commitid)
else:
return f"{lock_type}_lock_{self.repoid}_{self.commitid}_{self.report_type.value}"
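
For illustration, a minimal sketch of how the preprocess task can reuse this shared name (assumes redis-py; preprocess_with_shared_lock is a hypothetical helper, not code from this PR):

```python
from redis import Redis

# Mirrors the lambda above: both tasks derive exactly the same key.
UPLOAD_LOCK_NAME = lambda repoid, commitid: f"upload_lock_{repoid}_{commitid}"

def preprocess_with_shared_lock(redis_connection: Redis, repoid: int, commitid: str) -> None:
    # Because the name matches the Upload task's lock, only one of the two
    # tasks can touch the commit's report at any given time.
    with redis_connection.lock(
        UPLOAD_LOCK_NAME(repoid, commitid),
        timeout=60 * 5,       # maximum time the holder may keep the lock
        blocking_timeout=3,   # give up quickly; Upload retries, PreProcess doesn't
    ):
        ...  # create the initial report for the commit
```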
