LinZhihao-723
Member

Description

Before this PR, the CLP package compression only accepted a single S3 URL as the input. The S3 URL is decomposed into a bucket, a region code, and a key prefix. Everything under the key prefix will be compressed.
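As an illustration of the decomposition described above, here is a minimal sketch that splits a virtual-hosted-style S3 URL into its parts. The helper name `split_s3_url`, the `S3Location` type, and the regular expression are assumptions made for this example; the parser in the CLP package may accept other URL forms and is not reproduced here.

```python
import re
from typing import NamedTuple


class S3Location(NamedTuple):
    """Hypothetical container for the three parts of an S3 URL."""

    region: str
    bucket: str
    key_prefix: str


# Assumed URL shape: https://<bucket>.s3.<region>.amazonaws.com/<key-prefix>
_S3_URL_PATTERN = re.compile(
    r"^https://(?P<bucket>[a-z0-9.\-]+)\.s3\.(?P<region>[a-z0-9\-]+)"
    r"\.amazonaws\.com/(?P<key>.*)$"
)


def split_s3_url(url: str) -> S3Location:
    """Split a virtual-hosted-style S3 URL into region, bucket, and key prefix."""
    match = _S3_URL_PATTERN.match(url)
    if match is None:
        raise ValueError(f"Unsupported S3 URL: {url}")
    return S3Location(match["region"], match["bucket"], match["key"])
```

For example, `split_s3_url("https://my-bucket.s3.us-east-2.amazonaws.com/logs/2025/")` yields the region `us-east-2`, the bucket `my-bucket`, and the key prefix `logs/2025/`.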

This PR adds support for providing multiple S3 object keys as input for compression. Users can now specify a list of S3 URLs, where each URL is treated as the URL of an actual key in the bucket. Key-prefix-based ingestion is still supported, but users must now explicitly indicate that the input is a prefix by using the --s3-single-prefix option.

The current implementation has the following requirements for the given input URLs:

  • All the specified objects must belong to the same bucket.
  • All the specified objects must be in the same region.
  • The keys must share a common prefix.
  • There should be no duplicate keys.
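The four requirements above could be checked roughly as follows. This is a sketch, not the code in this PR: the function name `validate_s3_objects` and the `(region, bucket, key)` tuple layout are hypothetical, and "share a common prefix" is interpreted here as "the longest common string prefix of the keys is non-empty", which is an assumption.

```python
import os
from typing import List, Tuple


def validate_s3_objects(objects: List[Tuple[str, str, str]]) -> str:
    """
    Check the input requirements for a list of (region, bucket, key) tuples.

    Returns the common key prefix on success and raises ValueError otherwise.
    """
    if not objects:
        raise ValueError("At least one S3 object is required.")

    regions = {region for region, _, _ in objects}
    buckets = {bucket for _, bucket, _ in objects}
    keys = [key for _, _, key in objects]

    if len(buckets) != 1:
        raise ValueError(f"All objects must be in the same bucket, got {sorted(buckets)}.")
    if len(regions) != 1:
        raise ValueError(f"All objects must be in the same region, got {sorted(regions)}.")
    if len(set(keys)) != len(keys):
        raise ValueError("Duplicate keys are not allowed.")

    # os.path.commonprefix compares character by character, so it works on
    # S3 keys even though they are not filesystem paths.
    common_prefix = os.path.commonprefix(keys)
    if not common_prefix:
        raise ValueError("The keys do not share a common prefix.")
    return common_prefix
```

Returning the common prefix is a design choice for the sketch: the prefix could then be stored in the config alongside the explicit key list.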

To support multiple keys, we update S3InputConfig to store an optional keys field. If this field is not set, prefix-based ingestion is used as before. Otherwise, we traverse the bucket to collect the object metadata of the given keys and create compression jobs from it.
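The branch between the two ingestion modes can be sketched as below. Everything here is hypothetical and for illustration only: `ObjectMetadata`, `collect_objects`, and the in-memory `listing` stand in for the real bucket traversal, which is not shown in this conversation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectMetadata:
    """Hypothetical stand-in for the metadata collected per S3 object."""

    key: str
    size: int


def collect_objects(
    listing: List[ObjectMetadata],
    key_prefix: str,
    keys: Optional[List[str]] = None,
) -> List[ObjectMetadata]:
    """
    Select the objects to compress from a bucket listing.

    If `keys` is None, every object under `key_prefix` is selected
    (prefix-based ingestion). Otherwise only the given keys are selected,
    and any key missing from the listing is reported.
    """
    under_prefix = [obj for obj in listing if obj.key.startswith(key_prefix)]
    if keys is None:
        return under_prefix

    by_key = {obj.key: obj for obj in under_prefix}
    missing = [key for key in keys if key not in by_key]
    if missing:
        raise ValueError(f"Keys not found in the bucket: {missing}")
    return [by_key[key] for key in keys]
```

Filtering by the prefix first keeps the two modes on the same traversal path, so the key list only narrows what prefix-based ingestion would already have found.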

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensured all workflows pass.
  • Tested locally that prefix-based ingestion still works as before.
  • Tested locally that a list of valid keys is compressed as expected.
  • Tested locally that missing keys are properly reported.

Contributor

coderabbitai bot commented Oct 4, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment on lines +199 to +206
args_parser.add_argument(
    "--s3-single-prefix",
    action="store_true",
    help=(
        "Treat the S3 URL as a single prefix. If set, only a single S3 URL should be provided"
        " and it must be explicitly given as a positional argument."
    ),
)
Member Author

Please check whether this flag name makes sense @kirkrodrigues @junhaoliao
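To make the intended behaviour of the flag concrete, here is a small standalone sketch of how it might interact with the positional URLs. The `urls` positional argument, the `parse_args` wrapper, and the error message are assumptions for this example and are not taken from the PR's argument parser.

```python
import argparse
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse a simplified command line with positional URLs and the prefix flag."""
    parser = argparse.ArgumentParser(description="Sketch of S3 input arguments.")
    # Hypothetical positional argument holding one or more S3 URLs.
    parser.add_argument("urls", metavar="URL", nargs="*", help="S3 URLs to compress.")
    parser.add_argument(
        "--s3-single-prefix",
        action="store_true",
        help="Treat the S3 URL as a single prefix.",
    )
    args = parser.parse_args(argv)

    # With the flag set, exactly one positional URL is expected.
    if args.s3_single_prefix and len(args.urls) != 1:
        parser.error("--s3-single-prefix requires exactly one S3 URL.")
    return args
```

Without the flag, each URL in `args.urls` would be treated as an object key; with it, the single URL would be treated as a key prefix.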


class S3InputConfig(S3Config):
    type: Literal[InputType.S3.value] = InputType.S3.value
    keys: Optional[List[str]] = None
Member Author

Please check whether passing keys in this way makes sense @kirkrodrigues @junhaoliao.
In both key ingestion and prefix ingestion, the key prefix is required, so the S3Config base class is untouched.
