Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] - S3 Progress Tracker #3568

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

[feat] - S3 Progress Tracker #3568

wants to merge 18 commits into from

Conversation

ahrav
Copy link
Collaborator

@ahrav ahrav commented Nov 7, 2024

Description:

This PR adds a mechanism to track S3 scan progress, enabling resumption.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

@ahrav ahrav requested review from a team as code owners November 7, 2024 02:29
@ahrav ahrav marked this pull request as draft November 7, 2024 02:30
@ahrav ahrav changed the title add config option for s3 resumption [feat] - S3 Progress Tracker Nov 7, 2024
@ahrav ahrav requested a review from a team November 7, 2024 22:10
@ahrav ahrav marked this pull request as ready for review November 7, 2024 22:10
type ProgressTracker struct {
enabled bool

isResuming bool // Indicates if the tracker is resuming a previous scan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it gets set once during initialization and never updated. Is that right?

When I read it, I thought it was like a state flag, and once we finished resuming it would be set to false, but I haven't reviewed the whole ProgressTracker module yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is likely misleading—it was intended as a flag to indicate some progress, primarily used in the update flow for calculating SectionRemaining. However, I don’t think we need to manage progress at all, so I’ll likely remove this and any related logic.

ex:

pageSize := int32(len(pageContents))
    if p.currentPageSize == 0 {
        if !p.isResuming {
            // Only add to the total if this is a fresh scan.
            p.progress.SectionsRemaining += pageSize
        }
	p.currentPageSize = pageSize
}

Comment on lines 39 to 42
if progress == nil {
ctx.Logger().Info("Nil progress provided. Progress initialized.")
progress = new(sources.Progress)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually allow progress tracking? I think not, because progress is now only accessible internally.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh, I thought I removed this. 😢 Yea this doesn't work, we should just return an error here. If a source doesn't have Progress, we have bigger problems.

// enabling resumable scans by tracking which objects have been successfully processed.
// It provides checkpoints that can be used to resume interrupted scans without missing objects.
type ProgressTracker struct {
enabled bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If enabled does not change at runtime, have you considered a noop implementation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was being lazy, assuming this would only be temporary. Since progress tracking will eventually always be enabled, I just went with this approach. 😓

Comment on lines +111 to +114
// Preserve existing progress counters while clearing resume state.
p.progress.SetProgressComplete(
int(p.progress.SectionsCompleted),
int(p.progress.SectionsRemaining),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why preserve the counters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because Complete is only called at the end of the scan. Since progress was tracked at the object level, there’s no need to set these values. I’ll be removing all progress logic. Progress tracking will still happen at the bucket level, but resuming will remain at the object level.

@ahrav ahrav requested a review from mcastorina November 8, 2024 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants