feat(Presto): Generate split filter file from set-up-config.sh command #1170
Conversation
Walkthrough
Adds a new CLI tool to generate split-filter JSON for datasets in CLP archives, and updates the setup script to also generate a metadata-filter JSON by invoking the metadata generator during setup.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant S as set-up-config.sh
    participant M as generate-metadata-filter-file.py
    U->>S: Run setup script
    S->>M: --clp-package-dir <path>
    M-->>S: Write ../coordinator/config-template/metadata-filter.json
    S-->>U: Setup complete
```

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant G as generate-split-filter-file.py
    participant FS as File System
    participant L as Logger
    U->>G: --clp-package-dir <dir> --output-file <file>
    G->>FS: Validate archives dir and output path
    FS-->>G: Exists / OK or error
    alt archives found
        G->>FS: List var/data/archives/*
        FS-->>G: [dataset names]
        loop For each dataset
            G->>U: Prompt timestamp key (default "timestamp")
            U-->>G: Input or empty
        end
        G->>FS: Write split-filter JSON to output-file
        FS-->>G: Success
        G-->>U: Exit 0
    else none found / validation error
        G->>L: Log error
        G-->>U: Exit 1
    end
```
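Based on the rule shape discussed in the review comments (a `columnName`, `customOptions`, and `required` field per rule, keyed by `catalog.schema.dataset` with a `clp.default` prefix), the generated split-filter file plausibly looks like the following. This is a sketch: the dataset names and the default field values are assumptions for illustration, not the script's verified output.

```python
import json

# Hypothetical dataset -> timestamp-key mapping, as the prompt loop would collect it.
timestamp_keys_by_dataset = {"web_logs": "timestamp", "app_logs": "ts"}

DEFAULT_CUSTOM_OPTIONS: dict = {}  # assumed default; the real defaults may differ
DEFAULT_REQUIRED = True            # assumed default

split_filters = {}
for dataset, key in timestamp_keys_by_dataset.items():
    rule = {
        "columnName": key,
        "customOptions": dict(DEFAULT_CUSTOM_OPTIONS),  # copy to avoid aliasing
        "required": DEFAULT_REQUIRED,
    }
    # The review comments show the key format f"clp.default.{dataset}".
    split_filters[f"clp.default.{dataset}"] = [rule]

print(json.dumps(split_filters, indent=2))
```

The script then writes this structure to the `--output-file` path instead of printing it.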
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 3 passed
Warning
There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 Ruff (0.12.2): ruff failed on tools/deployment/presto-clp/scripts/generate-split-filter-file.py
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (1 hunks)
- tools/deployment/presto-clp/scripts/set-up-config.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR #1136, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.
Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
📚 Learning: in the clp codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.
Applied to files:
tools/deployment/presto-clp/scripts/set-up-config.sh
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
📚 Learning: in clp project build scripts (specifically build.sh files in docker-images directories), maintain co...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.
Applied to files:
tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
Applied to files:
tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: in clp installation scripts, consistency across platform scripts is prioritized over defensive progr...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Applied to files:
tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: in the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .c...
Learnt from: anlowee
PR: y-scope/clp#925
File: taskfiles/deps/main.yaml:97-106
Timestamp: 2025-05-28T18:33:30.155Z
Learning: In the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .cmake settings files that are consumed by the main CMake build process. These files set variables like ANTLR_RUNTIME_HEADER to point to dependency locations for use during compilation.
Applied to files:
tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Applied to files:
tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: the column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-sp...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Applied to files:
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
📚 Learning: in the current clp codebase implementation, dataset validation using validate_dataset() is performed...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Applied to files:
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
🔇 Additional comments (6)
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (5)
Lines 1-22: LGTM! Well-structured imports and logging setup. The imports are appropriate for the functionality, and the logging configuration follows best practices with proper formatting and level settings.
Lines 24-36: Excellent use of TypedDict for type safety. The type definitions provide clear structure for the JSON output format and enhance code maintainability with proper typing.
Lines 78-105: LGTM! Robust dataset discovery with proper error handling. The function correctly validates directory paths, handles edge cases, and uses in-place list updates, which aligns with CLP codebase patterns.
Lines 140-190: Excellent error handling and input validation. The function properly validates inputs, handles missing timestamp keys, and includes robust exception handling with appropriate cleanup.
Lines 192-193: LGTM! Standard script execution pattern. The script execution follows Python best practices.
tools/deployment/presto-clp/scripts/set-up-config.sh (1)
Lines 30-33: LGTM! Clean integration following established patterns. The new step integrates seamlessly with the existing script flow, uses consistent formatting, and places the output file in an appropriate location.
@coderabbitai review
✅ Actions performed: Review triggered.
The PR claims that [...] I feel the real issue is that the config file we ask the user to edit is overly complicated, since it contains a lot of unrelated key-value pairs. What if we let the user specify a simple yaml, or even a plain text file, with only [...]
@haiqi96 fwiw, this PR proposes a process where the user only has to type out the timestamp; the dataset name is spoonfed. For example, if I had already compressed two datasets, [...] all I would have to type for each dataset is "timestamp", which the user will have to do anyway regardless of what method we use. That being said, I do agree that the user's inability to roll back is a downside of this method; that's a good point. And I think that your idea re. letting the user specify a simple yaml or something has merit. @kirkrodrigues what do you think?
Quick question: what will happen if the user hits ctrl+c, or anything else that forces the input() call to terminate? Will the script return a failure?
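For reference, `input()` raises `KeyboardInterrupt` on ctrl+c (and `EOFError` when stdin is closed, e.g. ctrl+d), so a prompt loop can catch both and signal failure to the caller. The sketch below is a generic illustration of that behaviour, not the PR's actual code; the injectable `reader` parameter is a hypothetical addition for testability.

```python
from typing import Callable, Dict, List, Optional

DEFAULT_TIMESTAMP_KEY = "timestamp"

def prompt_timestamp_keys(
    datasets: List[str], reader: Callable[[str], str] = input
) -> Optional[Dict[str, str]]:
    """Collect a timestamp key per dataset, falling back to the default on empty input.

    ctrl+c raises KeyboardInterrupt inside input(); ctrl+d raises EOFError.
    Returning None lets the caller log an error and exit with a nonzero status.
    """
    keys: Dict[str, str] = {}
    try:
        for dataset in datasets:
            entered = reader(f">>> {dataset} [default: {DEFAULT_TIMESTAMP_KEY}]: ").strip()
            keys[dataset] = entered if entered else DEFAULT_TIMESTAMP_KEY
    except (KeyboardInterrupt, EOFError):
        return None
    return keys
```

With a reader that raises `KeyboardInterrupt`, the function returns `None`; with empty input it falls back to the default key, matching the prompt's advertised behaviour.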
Actionable comments posted: 8
♻️ Duplicate comments (3)
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (3)
Lines 52-55: Resolve the output file path for consistency with clp_package_dir. Aligns with the earlier suggestion and avoids surprises with relative paths.

```diff
 parsed_args = args_parser.parse_args(argv[1:])
 clp_package_dir: Path = parsed_args.clp_package_dir.resolve()
-json_output_file: Path = parsed_args.output_file
+json_output_file: Path = parsed_args.output_file.resolve()
```

Lines 78-105: Return datasets instead of mutating an out-parameter (follow-up). Pythonic API: return Optional[List[str]] and drop the bool-plus-mutation pattern. This was raised earlier; consider a follow-up PR to do it repo-wide for consistency.

Lines 123-125: Mark ANSI control codes as constants. Avoids magic strings and clarifies intent.

```diff
-    BOLD = "\033[1m"
-    RESET = "\033[0m"
+    BOLD: Final[str] = "\033[1m"
+    RESET: Final[str] = "\033[0m"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-28T23:14:57.751Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.
Applied to files:
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
📚 Learning: 2025-08-15T15:21:51.607Z
Learnt from: haiqi96
PR: y-scope/clp#1186
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:40-40
Timestamp: 2025-08-15T15:21:51.607Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/archive_manager.py, the _validate_timestamps function correctly uses Optional[int] for end_ts parameter without a default value because it forces callers to explicitly pass the parsed argument value (which can be None), making the API contract clear and avoiding ambiguity about whether None came from defaults or actual parsed values.
Applied to files:
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
@haiqi96 this will happen: [...]
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/package-template/src/etc/clp-config.yml (1)
Lines 1-132: Template parameterisation and non-interactive setup. Given the new interactive flow for Presto filters, keep this file generic and drive env-specific values from set-up-config.sh or a simple "values.yml" (e.g., key: value pairs) to support non-interactive runs and easy rollback, as suggested in review.
Minimal approach:
- Replace site-specific literals here with CHANGE_ME placeholders.
- Extend set-up-config.sh to read an optional values.yml and render this template (envsubst/jinja2/yq), falling back to prompts only for missing keys.
- Persist the resolved values.yml alongside generated files for audit/rollback.
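The values-file idea can be sketched in a few lines: parse a flat key: value text file and substitute its entries into a template. Everything here is illustrative; the file name, keys, and `${VAR}` placeholder style are assumptions (the comment above names envsubst/jinja2/yq as real rendering options).

```python
from string import Template

def load_values(text: str) -> dict:
    """Parse a minimal 'key: value' file; blank lines and '#' comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        values[key.strip()] = value.strip()
    return values

# Hypothetical values file and template fragment.
values_text = "QUERY_ENGINE: presto\nBASE_PORT: 14010\n"
template_text = 'query_engine: "${QUERY_ENGINE}"\nbase_port: ${BASE_PORT}\n'

# safe_substitute leaves unknown placeholders intact instead of raising,
# which suits a fall-back-to-prompts flow for missing keys.
rendered = Template(template_text).safe_substitute(load_values(values_text))
print(rendered)
```

Persisting the parsed values alongside the rendered output would give the audit/rollback trail the comment asks for.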
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/package-template/src/etc/clp-config.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The tools/deployment/presto-clp/coordinator/config-template/ directory already exists in the CLP deployment structure and contains configuration files, so directory creation is not needed when writing metadata-filter.json to this location.
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (5)
components/package-template/src/etc/clp-config.yml (5)
Lines 13-16: Confirm the query engine selection aligns with the Presto flow. With Presto in this PR, is `query_engine: "clp-s"` intentional? If Presto should serve queries, switch to `"presto"`. Option if needed:

```diff
-query_engine: "clp-s"
+query_engine: "presto"
```

Lines 71-98: Archive path matches the Presto split-filter generator's expectations. The directory `var/data/archives` aligns with the scanning location the generator introduced in this PR uses. No changes requested.

Lines 100-109: Stream output configuration looks sane. Targets and path are consistent with archive settings. No action.

Lines 44-49: Confirm no runtime port collisions for base_port. The only occurrence of `14010` is in components/package-template/src/etc/clp-config.yml; still verify at deployment that no other host/container service uses or dynamically allocates ports overlapping 14010-14019.

Lines 34-37: Port 5673 is intentionally mapped to container port 5672; no change required.
Actionable comments posted: 11
♻️ Duplicate comments (2)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (2)
Line 90: Prefer `is None` over an equality test against None. Project style also discourages Yoda comparisons except for `== 0`.

```diff
-    if None == datasets:
+    if datasets is None:
```

Lines 57-66 and 94-99: Support non-interactive and file-driven key input to reduce error-prone prompts. Add optional flags to read dataset/key pairs from a simple file, or accept defaults for all. This addresses reviewer feedback and improves automation.

```diff
     args_parser.add_argument(
         "--clp-package-dir", help="CLP package directory.", required=True, type=Path
     )
     args_parser.add_argument(
         "--output-file", help="Path for the split filter file.", required=True, type=Path
     )
+    args_parser.add_argument(
+        "--keys-file",
+        type=Path,
+        help="Optional file mapping dataset to timestamp key (format: 'dataset key' per line).",
+    )
+    args_parser.add_argument(
+        "--accept-defaults",
+        action="store_true",
+        help=f"Use default key '{DEFAULT_TIMESTAMP_KEY}' for all datasets; skip prompts.",
+    )
```

```diff
-    try:
-        timestamp_keys_by_dataset = _prompt_timestamp_keys(datasets)
-    except KeyboardInterrupt:
-        logger.error("Interrupted while collecting timestamp keys.")
-        return 1
+    if parsed_args.keys_file:
+        try:
+            timestamp_keys_by_dataset = _load_timestamp_keys_from_file(parsed_args.keys_file, datasets)
+        except (OSError, ValueError) as e:
+            logger.error("Failed to load timestamp keys from %s: %s", parsed_args.keys_file, e)
+            return 1
+    elif parsed_args.accept_defaults:
+        timestamp_keys_by_dataset = {d: DEFAULT_TIMESTAMP_KEY for d in datasets}
+    else:
+        try:
+            timestamp_keys_by_dataset = _prompt_timestamp_keys(datasets)
+        except KeyboardInterrupt:
+            logger.error("Interrupted while collecting timestamp keys.")
+            return 1
```

Add this helper:

```python
def _load_timestamp_keys_from_file(keys_file: Path, datasets: List[str]) -> Dict[str, str]:
    """
    Load mapping from a simple text file:
    - Each non-empty, non-comment line: "<dataset> <timestamp_key>"
    - Whitespace separated. Lines starting with '#' are ignored.
    Raises ValueError if a listed dataset is unknown or if parsing fails.
    """
    mapping: Dict[str, str] = {}
    valid = set(datasets)
    for i, raw in enumerate(keys_file.read_text(encoding="utf-8").splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"{keys_file}:{i}: expected '<dataset> <timestamp_key>'")
        ds, key = parts[0].strip(), parts[1].strip()
        if ds not in valid:
            raise ValueError(f"{keys_file}:{i}: unknown dataset '{ds}'")
        mapping[ds] = key
    return mapping
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
📚 Learning: 2025-09-04T12:26:54.788Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-17T15:26:22.790Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
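The convention described in this learning can be illustrated with a minimal sketch: the literal goes on the left only for equality-with-zero checks, while other comparisons keep ordinary operand order. The function and its length limit are hypothetical.

```python
def classify(dataset_name: str) -> str:
    # Team convention: literal on the left only for the == 0 equality check;
    # ordinary operand order for other comparisons such as `>`.
    if 0 == len(dataset_name):
        return "empty"
    if len(dataset_name) > 64:  # hypothetical limit, for illustration only
        return "too long"
    return "ok"

print(classify(""))         # empty
print(classify("web_logs"))  # ok
```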
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
Actionable comments posted: 4
♻️ Duplicate comments (10)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (10)
Lines 14-27: Avoid configuring the root logger at import time; add a local init function. Side effects on import (double handlers, noisy logs) occur if this module is imported elsewhere. Initialise logging from main().

```diff
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
-
-# Create logger
-logger = logging.getLogger(__name__)
+# Create logger
+logger = logging.getLogger(__name__)
+
+def _configure_logging() -> None:
+    if logger.handlers:
+        return
+    handler = logging.StreamHandler()
+    formatter = logging.Formatter(
+        "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
+    handler.setFormatter(formatter)
+    logger.setLevel(logging.INFO)
+    logger.addHandler(handler)
```
Lines 72-79: Initialise logging inside main(). Call your logger setup once at runtime, not on import.

```diff
 def main(argv=None) -> int:
     if argv is None:
         argv = sys.argv
+    _configure_logging()
```
Lines 101-105: Use an identity test for None.

```diff
-    if datasets == None:
+    if datasets is None:
```
Lines 117-124: Specify the encoding and log success for observability.

```diff
-    try:
-        with open(json_output_file, "w") as json_file:
-            json.dump(split_filters, json_file, indent=2)
+    try:
+        with open(json_output_file, "w", encoding="utf-8") as json_file:
+            json.dump(split_filters, json_file, indent=2)
+        logger.info(
+            "Wrote split-filter JSON to %s for %d dataset(s).",
+            json_output_file,
+            len(datasets),
+        )
```
Lines 137-139: Guard against I/O errors when listing archives.

```diff
-    datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
-    return datasets if len(datasets) >= 1 else None
+    try:
+        datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
+    except OSError as e:
+        logger.error("Failed to read archives directory %s: %s", archives_dir, e)
+        return None
+    return datasets if len(datasets) >= 1 else None
```
Lines 154-162: Trim input and use a clearer mapping variable.

Prevents accidental trailing spaces and clarifies intent.

```diff
-    data_time_pairs: Dict[str, str] = {}
+    timestamp_keys_by_dataset: Dict[str, str] = {}
     for dataset in datasets:
         user_input = input(
             f">>> {ANSI_BOLD}{dataset}{ANSI_RESET} [default key: {ANSI_BOLD}{DEFAULT_TIMESTAMP_KEY}{ANSI_RESET}]: "
-        )
+        ).strip()
         key = DEFAULT_TIMESTAMP_KEY if 0 == len(user_input) else user_input
-        data_time_pairs[dataset] = key
+        timestamp_keys_by_dataset[dataset] = key
-    return data_time_pairs
+    return timestamp_keys_by_dataset
```
Lines 1-7: Avoid sharing mutable default dicts across rules; deepcopy options per rule.

Prevents cross-rule mutation bugs.

```diff
 import argparse
 import json
 import logging
 import sys
 from pathlib import Path
 from typing import Dict, Final, List, Optional, TypedDict
+from copy import deepcopy
@@
         rule: SplitFilterRule = {
             "columnName": timestamp_keys[dataset],
-            "customOptions": DEFAULT_CUSTOM_OPTIONS,
+            "customOptions": deepcopy(DEFAULT_CUSTOM_OPTIONS),
             "required": DEFAULT_REQUIRED,
         }
```

Also applies to: 186-191
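As an aside, the hazard this suggestion guards against is easy to reproduce in isolation. The sketch below is not the script's code; the names `defaults`, `rule_a`, etc. are hypothetical, chosen only to show why each rule should own its own `customOptions` dict.

```python
from copy import deepcopy

defaults = {"opt": "original"}

# Sharing one dict across rules: a change made through one rule is
# visible through every other rule that references the same object.
rule_a = {"customOptions": defaults}
rule_b = {"customOptions": defaults}
rule_a["customOptions"]["opt"] = "mutated"
print(rule_b["customOptions"]["opt"])  # -> mutated (cross-rule leak)

# deepcopy gives each rule an independent copy, so the shared default
# (and every other rule) stays untouched.
defaults2 = {"opt": "original"}
rule_c = {"customOptions": deepcopy(defaults2)}
rule_c["customOptions"]["opt"] = "mutated"
print(defaults2["opt"])  # -> original (unchanged)
```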
Lines 76-85: Don't hard-code "clp.default"; make catalog and schema configurable via flags.

```diff
     args_parser.add_argument(
         "--output-file", help="Path for the split filter file.", required=True, type=Path
     )
+    args_parser.add_argument("--catalog", help="Presto catalog.", default="clp")
+    args_parser.add_argument("--schema", help="Presto schema.", default="default")
@@
     parsed_args = args_parser.parse_args(argv[1:])
     clp_package_dir: Path = parsed_args.clp_package_dir.resolve()
     archives_dir = clp_package_dir / "var" / "data" / "archives"
     json_output_file: Path = parsed_args.output_file.resolve()
     out_dir = json_output_file.parent
+    catalog: str = parsed_args.catalog
+    schema: str = parsed_args.schema
@@
-        split_filters[f"clp.default.{dataset}"] = [rule]
+        split_filters[f"{catalog}.{schema}.{dataset}"] = [rule]
```

Also applies to: 86-91, 191-191
Lines 196-197: Use the conventional main-guard order.

```diff
-if "__main__" == __name__:
+if __name__ == "__main__":
     sys.exit(main(sys.argv))
```
Lines 170-177: Fix duplicated `:return:` lines in the docstring.

```diff
-    :return: A SplitFilterDict containing all the SplitFilterRule objects for the JSON file.
-    :return: A `SplitFilterDict` containing all the `SplitFilterRule` objects for the JSON file, or
-        None if there are any datasets that don't have an associated timestamp key.
+    :return: SplitFilterDict for all rules, or None if any dataset lacks a timestamp key.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
📚 Learning: 2025-09-04T12:26:54.788Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-17T15:26:22.790Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-19T14:41:28.901Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-07-28T23:14:57.751Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.
Applied to files:
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: package-image
🔇 Additional comments (1)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1)
Lines 92-100: Create parent directory for split-filter.json in set-up-config.sh.

set-up-config.sh calls generate-split-filter-file.py with `--output-file "${script_dir}/../coordinator/config-template/split-filter.json"` but does not create the parent directory; add a line like `mkdir -p "${script_dir}/../coordinator/config-template"` before the Python invocation.

Location: tools/deployment/presto-clp/scripts/set-up-config.sh (around lines 30–33).

⛔ Skipped due to learnings
Learnt from: quinntaylormitchell PR: y-scope/clp#1170 File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0 Timestamp: 2025-09-04T12:26:54.788Z Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
Learnt from: quinntaylormitchell PR: y-scope/clp#1170 File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0 Timestamp: 2025-09-04T12:26:54.788Z Learning: The tools/deployment/presto-clp/coordinator/config-template/ directory already exists in the CLP deployment structure and contains configuration files, so directory creation is not needed when writing metadata-filter.json to this location.
Actionable comments posted: 2
♻️ Duplicate comments (11)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (11)
Lines 14-27: Avoid root-logger side effects; scope logging to this module.

Configure a module logger via an idempotent helper; don't mutate the root logger at import-time.

```diff
-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
-
-# Create logger
-logger = logging.getLogger(__name__)
+logger = logging.getLogger(__name__)
+
+
+def _configure_logging() -> None:
+    if logger.handlers:
+        return
+    handler = logging.StreamHandler()
+    formatter = logging.Formatter(
+        "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
+    handler.setFormatter(formatter)
+    logger.setLevel(logging.INFO)
+    logger.addHandler(handler)
```
Lines 72-75: Initialise logging inside main().

One-time init; keeps import side-effect free.

```diff
 def main(argv=None) -> int:
     if argv is None:
         argv = sys.argv
+    _configure_logging()
```
Lines 101-105: Use `is None` for None checks.

Idiomatic and avoids edge cases.

```diff
-    if datasets == None:
+    if datasets is None:
```
Lines 106-111: Handle EOF (Ctrl+D) during prompts like Ctrl+C.

Clean exit on non-interactive/EOF.

```diff
-    except KeyboardInterrupt:
+    except (KeyboardInterrupt, EOFError):
         logger.error("Interrupted while collecting timestamp keys.")
         return 1
```
Lines 117-123: Write with UTF-8 and log success for observability.

```diff
     try:
-        with open(json_output_file, "w") as json_file:
-            json.dump(split_filters, json_file, indent=2)
+        with open(json_output_file, "w", encoding="utf-8") as json_file:
+            json.dump(split_filters, json_file, indent=2)
+        logger.info(
+            "Wrote split-filter file to %s for %d dataset(s).",
+            json_output_file,
+            len(datasets),
+        )
     except OSError as e:
```
Lines 137-139: Guard archive listing against I/O errors.

Avoids crash on permission/IO issues.

```diff
-    datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
-    return datasets if len(datasets) >= 1 else None
+    try:
+        datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
+    except OSError as e:
+        logger.error("Failed to read archives directory %s: %s", archives_dir, e)
+        return None
+    return datasets if len(datasets) >= 1 else None
```
Lines 146-162: Trim input and use a clearer variable name.

Prevents trailing spaces and improves readability.

```diff
     :param datasets:
-    :return: mapping of `dataset` -> `timestamp_keys`.
+    :return: Mapping of dataset -> timestamp key.
@@
-    data_time_pairs: Dict[str, str] = {}
+    timestamp_keys_by_dataset: Dict[str, str] = {}
     for dataset in datasets:
         user_input = input(
             f">>> {ANSI_BOLD}{dataset}{ANSI_RESET} [default key: {ANSI_BOLD}{DEFAULT_TIMESTAMP_KEY}{ANSI_RESET}]: "
-        )
+        ).strip()
         key = DEFAULT_TIMESTAMP_KEY if 0 == len(user_input) else user_input
-        data_time_pairs[dataset] = key
+        timestamp_keys_by_dataset[dataset] = key
-    return data_time_pairs
+    return timestamp_keys_by_dataset
```
Lines 1-7: Avoid sharing mutable defaults across rules; deepcopy options.

Prevents cross-rule mutation hazards.

```diff
 from typing import Dict, Final, List, Optional, TypedDict
+from copy import deepcopy
```

```diff
         rule: SplitFilterRule = {
             "columnName": timestamp_keys[dataset],
-            "customOptions": DEFAULT_CUSTOM_OPTIONS,
+            "customOptions": deepcopy(DEFAULT_CUSTOM_OPTIONS),
             "required": DEFAULT_REQUIRED,
         }
```

Also applies to: 186-191
Lines 196-197: Use the conventional main guard.

```diff
-if "__main__" == __name__:
+if __name__ == "__main__":
     sys.exit(main(sys.argv))
```
Lines 172-177: Deduplicate/condense the return description in the docstring.

```diff
-    :return: A SplitFilterDict containing all the SplitFilterRule objects for the JSON file.
-    :return: A `SplitFilterDict` containing all the `SplitFilterRule` objects for the JSON file, or
-        None if there are any datasets that don't have an associated timestamp key.
+    :return: A SplitFilterDict for the JSON file, or None if any dataset lacks a timestamp key.
```
Lines 53-62: Fix the validate_dir docstring (stale parameter, clarity).

Remove the non-existent param and tighten the return description.

```diff
     """
     Determines whether or not the path exists and whether or not it is a directory. If either of
     those are false, logs error message with path and returns False.

     :param path:
     :param label:
-    :param logger:
-    :return: True if path exists and is a directory; False if either of those are not true.
+    :return: True if path exists and is a directory; False otherwise.
     """
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py
(1 hunks)
🔇 Additional comments (1)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1)
Lines 92-96: Directory validation behaviour LGTM and matches team preference.

Fail fast if parent dirs are missing; aligns with prior guidance to avoid auto-creation.
Description
At the moment, a CLP + Presto user must configure Presto to use CLP's metadata database by opening `split-filter.json` and adding a filter for each dataset they wish to query (see the Presto setup instructions for more information). Furthermore, within each filter, the user must correctly replace the `<dataset>` and `<timestamp-key>` fields with the corresponding information. This process is error-prone, and it can become time-consuming if the user has compressed (and wishes to query) many datasets.

This PR streamlines the process for the user by generating the filters for the `split-filter.json` file with a new script called `generate-split-filter-file.py`, which is run from `set-up-config.sh`. The user now only has to run the `scripts/set-up-config.sh <clp-json-dir>` command specified in step 3 of the Presto setup instructions; they will then be prompted to enter the timestamp key for each of the datasets they've compressed (i.e., each of the first-level subdirectories in `clp-package/var/data/archives`).

Checklist
breaking change.
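To make the generator's output concrete, here is a minimal, self-contained sketch of assembling the split-filter mapping from dataset/timestamp-key pairs. The `clp.default.<dataset>` key format and the rule fields (`columnName`, `customOptions`, `required`) come from the review diffs above; the default option values and the helper's name are hypothetical, not the script's actual constants.

```python
import json
from copy import deepcopy
from typing import Dict, List

# Hypothetical defaults -- the real script defines its own DEFAULT_* constants.
DEFAULT_CUSTOM_OPTIONS: Dict[str, str] = {"rangeMapping": "true"}
DEFAULT_REQUIRED: bool = True


def build_split_filters(timestamp_keys: Dict[str, str]) -> Dict[str, List[dict]]:
    """Build a `clp.default.<dataset>` -> [rule] mapping, one rule per dataset."""
    split_filters: Dict[str, List[dict]] = {}
    for dataset, key in timestamp_keys.items():
        rule = {
            "columnName": key,
            # deepcopy so rules never share one mutable options dict.
            "customOptions": deepcopy(DEFAULT_CUSTOM_OPTIONS),
            "required": DEFAULT_REQUIRED,
        }
        split_filters[f"clp.default.{dataset}"] = [rule]
    return split_filters


if __name__ == "__main__":
    filters = build_split_filters({"default": "timestamp", "postgresql": "time"})
    print(json.dumps(filters, indent=2))
```

Each prompted timestamp key becomes the `columnName` of one rule, keyed by the Presto `catalog.schema.table` name for that dataset.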
Validation performed
After starting the package, I performed the following two compressions:
Then I ran `./scripts/set-up-config.sh /home/quinnmitchell/clp/build/clp-package` from `clp/tools/deployment/presto-clp`, and received the following output:

The `split-filter.json` that was generated from that command looks like this:

and the output from running the `SHOW TABLES;` command from within the Presto CLI is as follows:

Summary by CodeRabbit
New Features
Chores