Conversation

@quinntaylormitchell (Collaborator) commented Aug 6, 2025

Description

At the moment, a CLP + Presto user must configure Presto to use CLP’s metadata database by opening split-filter.json and adding a filter for each dataset they wish to query (see the Presto setup instructions for more information). Furthermore, within each filter, the user must replace the <dataset> and <timestamp-key> fields with the correct values. This process is error-prone, and it can become time-consuming if the user has compressed (and wishes to query) many datasets.

This PR streamlines this process by generating the filters for the split-filter.json file with a new script, generate-split-filter-file.py, which is run from set-up-config.sh. The user now only has to run the scripts/set-up-config.sh <clp-json-dir> command specified in step 3 of the Presto setup instructions; they will then be prompted to enter the timestamp key for each of the datasets they've compressed (i.e., each of the first-level subdirectories in clp-package/var/data/archives).
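
For reference, the core of the new script's flow is roughly the following (a minimal sketch, not the actual implementation; the function name is hypothetical, while the clp.default.<dataset> key format and rule fields match the sample output under "Validation performed" below):

import json
from pathlib import Path

DEFAULT_TIMESTAMP_KEY = "timestamp"


def generate_split_filters(clp_package_dir: Path, output_file: Path) -> None:
    # Each first-level subdirectory of var/data/archives is one compressed dataset.
    archives_dir = clp_package_dir / "var" / "data" / "archives"
    datasets = sorted(p.name for p in archives_dir.iterdir() if p.is_dir())

    split_filters = {}
    for dataset in datasets:
        # Prompt for the dataset's timestamp key; <Enter> accepts the default.
        user_input = input(f">>> {dataset} [default key: {DEFAULT_TIMESTAMP_KEY}]: ").strip()
        key = DEFAULT_TIMESTAMP_KEY if 0 == len(user_input) else user_input
        split_filters[f"clp.default.{dataset}"] = [
            {
                "columnName": key,
                "customOptions": {
                    "rangeMapping": {
                        "lowerBound": "begin_timestamp",
                        "upperBound": "end_timestamp",
                    }
                },
                "required": False,
            }
        ]

    with open(output_file, "w", encoding="utf-8") as json_file:
        json.dump(split_filters, json_file, indent=2)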

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

After starting the package, I performed the following two compressions:

./sbin/compress.sh --timestamp-key 'timestamp' --dataset 'postgresql' /home/quinnmitchell/logs/postgresql/postgresql.log \
&& ./sbin/compress.sh --timestamp-key 'timestamp' --dataset 'cockroachdb' /home/quinnmitchell/logs/cockroachdb/cockroach.node1.log

Then I ran ./scripts/set-up-config.sh /home/quinnmitchell/clp/build/clp-package from clp/tools/deployment/presto-clp, and received the following output:

Installing required Python packages...
Requirement already satisfied: python-dotenv in ./scripts/.venv/lib/python3.10/site-packages (from -r /home/quinnmitchell/clp/tools/deployment/presto-clp/scripts/requirements.txt (line 1)) (1.1.1)
Requirement already satisfied: PyYAML in ./scripts/.venv/lib/python3.10/site-packages (from -r /home/quinnmitchell/clp/tools/deployment/presto-clp/scripts/requirements.txt (line 2)) (6.0.2)
Generating config files corresponding to user-configured properties...
Generating split filter file for user-configured datasets...

Please enter the timestamp key that corresponds to each of your archived datasets.
Press <Enter> to accept the default key.

>>> cockroachdb [default key: timestamp]: timestamp
>>> postgresql [default key: timestamp]: timestamp

The split-filter.json that was generated from that command looks like this:

{
  "clp.default.cockroachdb": [
    {
      "columnName": "timestamp",
      "customOptions": {
        "rangeMapping": {
          "lowerBound": "begin_timestamp",
          "upperBound": "end_timestamp"
        }
      },
      "required": false
    }
  ],
  "clp.default.postgresql": [
    {
      "columnName": "timestamp",
      "customOptions": {
        "rangeMapping": {
          "lowerBound": "begin_timestamp",
          "upperBound": "end_timestamp"
        }
      },
      "required": false
    }
  ]
}

and the output of the SHOW TABLES; command from within the Presto CLI is as follows:

    Table    
-------------
 cockroachdb 
 postgresql  
(2 rows)

Summary by CodeRabbit

  • New Features

    • Added a CLI to generate per-dataset split‑filter JSONs: discovers configured datasets, prompts for per‑dataset timestamp keys, validates inputs, logs progress, and writes a formatted output.
  • Chores

    • Deployment setup now also generates a metadata‑filter configuration file alongside environment variables during setup.

@coderabbitai bot (Contributor) commented Aug 6, 2025

Walkthrough

Adds a new CLI tool to generate split-filter JSON for datasets in CLP archives and updates the setup script to also generate a metadata-filter JSON by invoking the metadata generator during setup.

Changes

  • Setup script update (tools/deployment/presto-clp/scripts/set-up-config.sh):
    After generating env vars, now invokes generate-metadata-filter-file.py --clp-package-dir <dir> and writes ../coordinator/config-template/metadata-filter.json (see the sketch after this list).
  • New split-filter generator CLI (tools/deployment/presto-clp/scripts/generate-split-filter-file.py):
    New Python CLI that discovers datasets under var/data/archives, prompts for per-dataset timestamp keys (default "timestamp"), constructs split-filter rules (TypedDict schema and defaults), validates inputs/paths, and writes a JSON output file. Exposes helper functions and main().
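
For concreteness, the added step in set-up-config.sh plausibly looks like the following (a sketch based on the invocation and output path described above; the clp_package_dir variable name and the plain python3 interpreter are assumptions, since the script installs its own venv):

# Sketch: generate the metadata-filter config from the package's datasets.
python3 "${script_dir}/generate-metadata-filter-file.py" \
    --clp-package-dir "${clp_package_dir}" \
    --output-file "${script_dir}/../coordinator/config-template/metadata-filter.json"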

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant S as set-up-config.sh
  participant M as generate-metadata-filter-file.py

  U->>S: Run setup script
  S->>M: --clp-package-dir <path>
  M-->>S: Write ../coordinator/config-template/metadata-filter.json
  S-->>U: Setup complete
sequenceDiagram
  autonumber
  participant U as User
  participant G as generate-split-filter-file.py
  participant FS as File System
  participant L as Logger

  U->>G: --clp-package-dir <dir> --output-file <file>
  G->>FS: Validate archives dir and output path
  FS-->>G: Exists / OK or error
  alt archives found
    G->>FS: List var/data/archives/*
    FS-->>G: [dataset names]
    loop For each dataset
      G->>U: Prompt timestamp key (default "timestamp")
      U-->>G: Input or empty
    end
    G->>FS: Write split-filter JSON to output-file
    FS-->>G: Success
    G-->>U: Exit 0
  else none found / validation error
    G->>L: Log error
    G-->>U: Exit 1
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks (3 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately and concisely describes the primary change: generating a split filter file by invoking that behavior from the set-up-config.sh command, which matches the PR that adds generate-split-filter-file.py and hooks it into the setup script. It is specific enough for a reviewer scanning history to understand the main intent and uses a standard feat(Presto) prefix.
  • Docstring Coverage (✅ Passed): Docstring coverage is 80.00%, which meets the required threshold of 80.00%.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.12.2)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py

ruff failed
Cause: Failed to load configuration /ruff.toml
Cause: Failed to parse /ruff.toml
Cause: TOML parse error at line 26, column 3
   |
26 |   "RSE100", # Use of assert detected
   |   ^^^^^^^^
Unknown rule selector: RSE100


@quinntaylormitchell (Collaborator, Author) commented:

@coderabbitai review

@coderabbitai bot (Contributor) commented Aug 6, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c72168a and c599ded.

📒 Files selected for processing (2)
  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (1 hunks)
  • tools/deployment/presto-clp/scripts/set-up-config.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#0
File: :0-0
Timestamp: 2025-07-29T14:04:13.769Z
Learning: User haiqi96 requested creating a GitHub issue to document a bug fix from PR #1136, which addressed MySQL compatibility issues with invalid SQL CAST operations in the WebUI component.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.
Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.
📚 Learning: in the clp codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Applied to files:

  • tools/deployment/presto-clp/scripts/set-up-config.sh
  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
📚 Learning: in clp project build scripts (specifically build.sh files in docker-images directories), maintain co...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Applied to files:

  • tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: in clp installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistenc...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

  • tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: in clp installation scripts, consistency across platform scripts is prioritized over defensive progr...
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: in the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .c...
Learnt from: anlowee
PR: y-scope/clp#925
File: taskfiles/deps/main.yaml:97-106
Timestamp: 2025-05-28T18:33:30.155Z
Learning: In the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .cmake settings files that are consumed by the main CMake build process. These files set variables like ANTLR_RUNTIME_HEADER to point to dependency locations for use during compilation.

Applied to files:

  • tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: when reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note th...
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • tools/deployment/presto-clp/scripts/set-up-config.sh
📚 Learning: the column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-sp...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
📚 Learning: in the current clp codebase implementation, dataset validation using validate_dataset() is performed...
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
🔇 Additional comments (6)
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (5)

1-22: LGTM! Well-structured imports and logging setup.

The imports are appropriate for the functionality, and the logging configuration follows best practices with proper formatting and level settings.


24-36: Excellent use of TypedDict for type safety.

The type definitions provide clear structure for the JSON output format and enhance code maintainability with proper typing.


78-105: LGTM! Robust dataset discovery with proper error handling.

The function correctly validates directory paths, handles edge cases, and uses in-place list updates which aligns with CLP codebase patterns.


140-190: Excellent error handling and input validation.

The function properly validates inputs, handles missing timestamp keys, and includes robust exception handling with appropriate cleanup.


192-193: LGTM! Standard script execution pattern.

The script execution follows Python best practices correctly.

tools/deployment/presto-clp/scripts/set-up-config.sh (1)

30-33: LGTM! Clean integration following established patterns.

The new step integrates seamlessly with the existing script flow, uses consistent formatting, and places the output file in an appropriate location.

@quinntaylormitchell (Collaborator, Author) commented:

@coderabbitai review

@coderabbitai bot (Contributor) commented Aug 6, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@quinntaylormitchell quinntaylormitchell marked this pull request as ready for review August 6, 2025 21:53
@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner August 6, 2025 21:53
@quinntaylormitchell quinntaylormitchell changed the title feat(presto): Generate metadata filter file from set-up-config.sh command feat(Presto): Generate metadata filter file from set-up-config.sh command Aug 6, 2025
@haiqi96 (Contributor) commented Sep 2, 2025

The PR claims that "This process is error-prone, and it can become time-consuming if the user has compressed (and wishes to query) many datasets."
From a high level, however, I feel the process proposed in this PR is also very error-prone. The user will have to type the dataset and timestamp key correctly one by one, and if they make one mistake, they can't roll back.

I feel the real issue is that the config file we ask the user to edit is overly complicated, since it contains a lot of unrelated key-value pairs.

What if we let the user specify a simple YAML, or even a plain-text file, with only dataset:timestamp-key pairs, and parse that file instead?
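
For example, the file could be as simple as this (hypothetical format):

# <dataset>: <timestamp-key>
cockroachdb: timestamp
postgresql: timestamp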

@quinntaylormitchell (Collaborator, Author) commented:

User will have to type dataset and timestamp key correctly one by one

@haiqi96 fwiw, this PR proposes a process where the user only has to type out the timestamp key; the dataset name is spoonfed. For example, if I had already compressed two datasets, cockroachdb and postgresql, then this is the prompt I'd get:

Please enter the timestamp key that corresponds to each of your archived datasets.
Press <Enter> to accept the default key.

>>> cockroachdb [default key: timestamp]: timestamp
>>> postgresql [default key: timestamp]: timestamp

All I would have to type for each dataset is "timestamp", which the user would have to do anyway regardless of what method we use.

That being said, I do agree that the user's inability to roll back is a downside of this method; that's a good point. And I think your idea re. letting the user specify a simple YAML or something has merit. @kirkrodrigues, what do you think?

@haiqi96 (Contributor) commented Sep 2, 2025

Quick question: what will happen if the user hits Ctrl+C, or anything else that forces the input() call to terminate? Will the script return a failure?

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

♻️ Duplicate comments (3)
tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (3)

52-55: Resolve output file path for consistency with clp_package_dir.

Aligns with earlier suggestion and avoids surprises with relative paths.

 parsed_args = args_parser.parse_args(argv[1:])
 clp_package_dir: Path = parsed_args.clp_package_dir.resolve()
-json_output_file: Path = parsed_args.output_file
+json_output_file: Path = parsed_args.output_file.resolve()

78-105: Return datasets instead of mutating an out-parameter (follow-up).

Pythonic API: return Optional[List[str]] and drop the bool + mutation pattern. This was raised earlier; consider a follow-up PR to do it repo-wide for consistency.


123-125: Mark ANSI control codes as constants.

Avoids magic strings and clarifies intent.

-    BOLD = "\033[1m"
-    RESET = "\033[0m"
+    BOLD: Final[str] = "\033[1m"
+    RESET: Final[str] = "\033[0m"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e16f4ff and 597d60b.

📒 Files selected for processing (1)
  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-28T23:14:57.751Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
📚 Learning: 2025-08-15T15:21:51.607Z
Learnt from: haiqi96
PR: y-scope/clp#1186
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:40-40
Timestamp: 2025-08-15T15:21:51.607Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/archive_manager.py, the _validate_timestamps function correctly uses Optional[int] for end_ts parameter without a default value because it forces callers to explicitly pass the parsed argument value (which can be None), making the API contract clear and avoiding ambiguity about whether None came from defaults or actual parsed values.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)

@quinntaylormitchell quinntaylormitchell changed the title feat(Presto): Generate metadata filter file from set-up-config.sh command feat(Presto): Generate split filter file from set-up-config.sh command Sep 3, 2025
@quinntaylormitchell (Collaborator, Author) commented:

What will happen if the user hits Ctrl+C, or anything else that forces the input() call to terminate?

@haiqi96 this will happen:

quinnmitchell@baker22:/home/quinnmitchell/clp/tools/deployment/presto-clp$ ./scripts/set-up-config.sh /home/quinnmitchell/clp/build/clp-package

... # other output not relevant to this

Please enter the timestamp key that corresponds to each of your archived datasets.
Press <Enter> to accept the default key.

>>> cockroachdb [default key: timestamp]: timestamp
>>> postgresql [default key: timestamp]: ^C2025-09-03T22:07:13.564 ERROR [generate-split-filter-file] Interrupted while collecting timestamp keys.
quinnmitchell@baker22:/home/quinnmitchell/clp/tools/deployment/presto-clp$ 

@quinntaylormitchell (Collaborator, Author) commented:

@coderabbitai review

@coderabbitai bot (Contributor) commented Sep 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/package-template/src/etc/clp-config.yml (1)

1-132: Template parameterisation and non-interactive setup.

Given the new interactive flow for Presto filters, keep this file generic and drive env-specific values from set-up-config.sh or a simple “values.yml” (e.g., key:value pairs) to support non-interactive runs and easy rollback, as suggested in review.

Minimal approach:

  • Replace site-specific literals here with CHANGE_ME placeholders.
  • Extend set-up-config.sh to read an optional values.yml and render this template (envsubst/jinja2/yq), falling back to prompts only for missing keys (see the sketch after this list).
  • Persist the resolved values.yml alongside generated files for audit/rollback.
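
A minimal sketch of that fallback logic, assuming a hypothetical flat key/value values.yml and PyYAML (which the setup scripts already install):

from pathlib import Path

import yaml  # PyYAML; already installed by set-up-config.sh


def load_values(values_file: Path) -> dict:
    # Read pre-supplied key/value pairs if values.yml exists; otherwise start empty.
    if values_file.is_file():
        with open(values_file, "r", encoding="utf-8") as f:
            return yaml.safe_load(f) or {}
    return {}


def resolve(values: dict, key: str, prompt: str) -> str:
    # Use the value from values.yml when present; fall back to an interactive prompt.
    if key in values:
        return str(values[key])
    return input(prompt).strip()

Persisting the merged values back to values.yml after the run would then provide the audit/rollback trail mentioned above.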
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597d60b and 5ef6d95.

📒 Files selected for processing (1)
  • components/package-template/src/etc/clp-config.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The tools/deployment/presto-clp/coordinator/config-template/ directory already exists in the CLP deployment structure and contains configuration files, so directory creation is not needed when writing metadata-filter.json to this location.
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (5)
components/package-template/src/etc/clp-config.yml (5)

13-16: Confirm query engine selection aligns with the Presto flow.

With Presto in this PR, is query_engine: "clp-s" intentional? If Presto should serve queries, switch to "presto".

Option if needed:

-  query_engine: "clp-s"
+  query_engine: "presto"

71-98: Archive path matches the Presto split-filter generator’s expectations.

Directory var/data/archives aligns with the generator scanning location introduced in this PR. No changes requested.


100-109: Stream output configuration looks sane.

Targets and path are consistent with archive settings. No action.


44-49: Confirm no runtime port collisions for base_port

Only occurrence of 14010 is in components/package-template/src/etc/clp-config.yml; still verify at deployment that no other host/container service uses or dynamically allocates ports overlapping 14010–14019.


34-37: Port 5673 is intentionally mapped to container port 5672; no change required.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 11

♻️ Duplicate comments (2)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (2)

90-90: Prefer is None over equality for None.

Project style also discourages Yoda comparisons except == 0.

-    if None == datasets:
+    if datasets is None:

57-66: Support non-interactive and file-driven key input to reduce error-prone prompts.

Add optional flags to read dataset:key pairs from a simple file or accept defaults for all. This addresses reviewer feedback and improves automation.

     args_parser.add_argument(
         "--clp-package-dir", help="CLP package directory.", required=True, type=Path
     )
     args_parser.add_argument(
         "--output-file", help="Path for the split filter file.", required=True, type=Path
     )
+    args_parser.add_argument(
+        "--keys-file",
+        type=Path,
+        help="Optional file mapping dataset to timestamp key (format: 'dataset key' per line).",
+    )
+    args_parser.add_argument(
+        "--accept-defaults",
+        action="store_true",
+        help=f"Use default key '{DEFAULT_TIMESTAMP_KEY}' for all datasets; skip prompts.",
+    )
-    try:
-        timestamp_keys_by_dataset = _prompt_timestamp_keys(datasets)
-    except KeyboardInterrupt:
-        logger.error("Interrupted while collecting timestamp keys.")
-        return 1
+    if parsed_args.keys_file:
+        try:
+            timestamp_keys_by_dataset = _load_timestamp_keys_from_file(parsed_args.keys_file, datasets)
+        except (OSError, ValueError) as e:
+            logger.error("Failed to load timestamp keys from %s: %s", parsed_args.keys_file, e)
+            return 1
+    elif parsed_args.accept_defaults:
+        timestamp_keys_by_dataset = {d: DEFAULT_TIMESTAMP_KEY for d in datasets}
+    else:
+        try:
+            timestamp_keys_by_dataset = _prompt_timestamp_keys(datasets)
+        except KeyboardInterrupt:
+            logger.error("Interrupted while collecting timestamp keys.")
+            return 1

Add this helper:

def _load_timestamp_keys_from_file(keys_file: Path, datasets: List[str]) -> Dict[str, str]:
    """
    Load mapping from a simple text file:
      - Each non-empty, non-comment line: "<dataset> <timestamp_key>"
      - Whitespace separated. Lines starting with '#' ignored.
    Raises ValueError if a listed dataset is unknown or if parsing fails.
    """
    mapping: Dict[str, str] = {}
    valid = set(datasets)
    for i, raw in enumerate(keys_file.read_text(encoding="utf-8").splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"{keys_file}:{i}: expected '<dataset> <timestamp_key>'")
        ds, key = parts[0].strip(), parts[1].strip()
        if ds not in valid:
            raise ValueError(f"{keys_file}:{i}: unknown dataset '{ds}'")
        mapping[ds] = key
    return mapping

Also applies to: 94-99

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ef6d95 and 7a3fc6d.

📒 Files selected for processing (1)
  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
📚 Learning: 2025-09-04T12:26:54.788Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-17T15:26:22.790Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

♻️ Duplicate comments (10)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (10)

14-27: Avoid configuring the root logger at import-time; add a local init function.

Side effects on import (double handlers, noisy logs) if this module is imported elsewhere. Initialise logging from main().

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
-
-# Create logger
-logger = logging.getLogger(__name__)
+# Create logger
+logger = logging.getLogger(__name__)
+
+def _configure_logging() -> None:
+    if logger.handlers:
+        return
+    handler = logging.StreamHandler()
+    formatter = logging.Formatter(
+        "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
+    handler.setFormatter(formatter)
+    logger.setLevel(logging.INFO)
+    logger.addHandler(handler)

72-79: Initialise logging inside main().

Call your logger setup once at runtime, not on import.

 def main(argv=None) -> int:
     if argv is None:
         argv = sys.argv
+    _configure_logging()

101-105: Use identity test for None.

-    if datasets == None:
+    if datasets is None:

117-124: Specify encoding and log success for observability.

-    try:
-        with open(json_output_file, "w") as json_file:
-            json.dump(split_filters, json_file, indent=2)
+    try:
+        with open(json_output_file, "w", encoding="utf-8") as json_file:
+            json.dump(split_filters, json_file, indent=2)
+        logger.info(
+            "Wrote split-filter JSON to %s for %d dataset(s).",
+            json_output_file,
+            len(datasets),
+        )

137-139: Guard against I/O errors when listing archives.

-    datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
-    return datasets if len(datasets) >= 1 else None
+    try:
+        datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
+    except OSError as e:
+        logger.error("Failed to read archives directory %s: %s", archives_dir, e)
+        return None
+    return datasets if len(datasets) >= 1 else None

154-162: Trim input and use a clearer mapping variable.

Prevents accidental trailing spaces and clarifies intent.

-    data_time_pairs: Dict[str, str] = {}
+    timestamp_keys_by_dataset: Dict[str, str] = {}
     for dataset in datasets:
-        user_input = input(
+        user_input = input(
             f">>> {ANSI_BOLD}{dataset}{ANSI_RESET} [default key: {ANSI_BOLD}{DEFAULT_TIMESTAMP_KEY}{ANSI_RESET}]: "
-        )
+        ).strip()
         key = DEFAULT_TIMESTAMP_KEY if 0 == len(user_input) else user_input
-        data_time_pairs[dataset] = key
+        timestamp_keys_by_dataset[dataset] = key
 
-    return data_time_pairs
+    return timestamp_keys_by_dataset

1-7: Avoid sharing mutable default dicts across rules; deepcopy options per rule.

Prevents cross-rule mutation bugs.

 import argparse
 import json
 import logging
 import sys
 from pathlib import Path
 from typing import Dict, Final, List, Optional, TypedDict
+from copy import deepcopy
@@
         rule: SplitFilterRule = {
             "columnName": timestamp_keys[dataset],
-            "customOptions": DEFAULT_CUSTOM_OPTIONS,
+            "customOptions": deepcopy(DEFAULT_CUSTOM_OPTIONS),
             "required": DEFAULT_REQUIRED,
         }

Also applies to: 186-191


76-85: Don’t hard-code "clp.default"; make catalog and schema configurable via flags.

     args_parser.add_argument(
         "--output-file", help="Path for the split filter file.", required=True, type=Path
     )
+    args_parser.add_argument("--catalog", help="Presto catalog.", default="clp")
+    args_parser.add_argument("--schema", help="Presto schema.", default="default")
@@
     parsed_args = args_parser.parse_args(argv[1:])
     clp_package_dir: Path = parsed_args.clp_package_dir.resolve()
     archives_dir = clp_package_dir / "var" / "data" / "archives"
     json_output_file: Path = parsed_args.output_file.resolve()
     out_dir = json_output_file.parent
+    catalog: str = parsed_args.catalog
+    schema: str = parsed_args.schema
@@
-        split_filters[f"clp.default.{dataset}"] = [rule]
+        split_filters[f"{catalog}.{schema}.{dataset}"] = [rule]

Also applies to: 86-91, 191-191


196-197: Conventional main guard order.

-if "__main__" == __name__:
+if __name__ == "__main__":
     sys.exit(main(sys.argv))

170-177: Fix duplicated :return: lines in docstring.

-    :return: A SplitFilterDict containing all the SplitFilterRule objects for the JSON file.
-    :return: A `SplitFilterDict` containing all the `SplitFilterRule` objects for the JSON file, or
-    None if there are any datasets that don't have an associated timestamp key.
+    :return: SplitFilterDict for all rules, or None if any dataset lacks a timestamp key.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a3fc6d and 906a2d8.

📒 Files selected for processing (1)
  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
📚 Learning: 2025-09-04T12:26:54.788Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-17T15:26:22.790Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-19T14:41:28.901Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-07-28T23:14:57.751Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (1)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1)

92-100: Create parent directory for split-filter.json in set-up-config.sh

set-up-config.sh calls generate-split-filter-file.py with --output-file "${script_dir}/../coordinator/config-template/split-filter.json" but does not create the parent directory; add a line like: mkdir -p "${script_dir}/../coordinator/config-template" before the Python invocation.
Location: tools/deployment/presto-clp/scripts/set-up-config.sh (around lines 30–33).

⛔ Skipped due to learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The tools/deployment/presto-clp/coordinator/config-template/ directory already exists in the CLP deployment structure and contains configuration files, so directory creation is not needed when writing metadata-filter.json to this location.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (11)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (11)

14-27: Avoid root-logger side effects; scope logging to this module.

Configure a module logger via an idempotent helper; don’t mutate the root logger at import-time.

-# Set up console logging
-logging_console_handler = logging.StreamHandler()
-logging_formatter = logging.Formatter(
-    "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
-)
-logging_console_handler.setFormatter(logging_formatter)
-
-# Set up root logger
-root_logger = logging.getLogger()
-root_logger.setLevel(logging.INFO)
-root_logger.addHandler(logging_console_handler)
-
-# Create logger
-logger = logging.getLogger(__name__)
+logger = logging.getLogger(__name__)
+
+def _configure_logging() -> None:
+    if logger.handlers:
+        return
+    handler = logging.StreamHandler()
+    formatter = logging.Formatter(
+        "%(asctime)s.%(msecs)03d %(levelname)s [%(module)s] %(message)s",
+        datefmt="%Y-%m-%dT%H:%M:%S",
+    )
+    handler.setFormatter(formatter)
+    logger.setLevel(logging.INFO)
+    logger.addHandler(handler)

72-75: Initialise logging inside main().

One-time init; keeps import side‑effect free.

 def main(argv=None) -> int:
     if argv is None:
         argv = sys.argv
+    _configure_logging()

101-105: Use ‘is None’ for None checks.

Idiomatic and avoids edge cases.

-    if datasets == None:
+    if datasets is None:

106-111: Handle EOF (Ctrl+D) during prompts like Ctrl+C.

Clean exit on non-interactive/EOF.

-    except KeyboardInterrupt:
+    except (KeyboardInterrupt, EOFError):
         logger.error("Interrupted while collecting timestamp keys.")
         return 1

117-123: Write with UTF‑8 and log success for observability.

-    try:
-        with open(json_output_file, "w") as json_file:
-            json.dump(split_filters, json_file, indent=2)
+    try:
+        with open(json_output_file, "w", encoding="utf-8") as json_file:
+            json.dump(split_filters, json_file, indent=2)
+        logger.info(
+            "Wrote split-filter file to %s for %d dataset(s).",
+            json_output_file,
+            len(datasets),
+        )
     except OSError as e:

137-139: Guard archive listing against I/O errors.

Avoids crash on permission/IO issues.

-    datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
-    return datasets if len(datasets) >= 1 else None
+    try:
+        datasets = sorted([p.name for p in archives_dir.iterdir() if p.is_dir()])
+    except OSError as e:
+        logger.error("Failed to read archives directory %s: %s", archives_dir, e)
+        return None
+    return datasets if len(datasets) >= 1 else None

146-162: Trim input and use a clearer variable name.

Prevents trailing spaces and improves readability.

-    :param datasets:
-    :return: mapping of `dataset` -> `timestamp_keys`.
+    :param datasets:
+    :return: Mapping of dataset -> timestamp key.
@@
-    data_time_pairs: Dict[str, str] = {}
+    timestamp_keys_by_dataset: Dict[str, str] = {}
     for dataset in datasets:
-        user_input = input(
+        user_input = input(
             f">>> {ANSI_BOLD}{dataset}{ANSI_RESET} [default key: {ANSI_BOLD}{DEFAULT_TIMESTAMP_KEY}{ANSI_RESET}]: "
-        )
+        ).strip()
         key = DEFAULT_TIMESTAMP_KEY if 0 == len(user_input) else user_input
-        data_time_pairs[dataset] = key
+        timestamp_keys_by_dataset[dataset] = key
 
-    return data_time_pairs
+    return timestamp_keys_by_dataset

1-7: Avoid sharing mutable defaults across rules; deepcopy options.

Prevents cross‑rule mutation hazards.

 from typing import Dict, Final, List, Optional, TypedDict
+from copy import deepcopy
         rule: SplitFilterRule = {
             "columnName": timestamp_keys[dataset],
-            "customOptions": DEFAULT_CUSTOM_OPTIONS,
+            "customOptions": deepcopy(DEFAULT_CUSTOM_OPTIONS),
             "required": DEFAULT_REQUIRED,
         }

Also applies to: 186-191


196-197: Conventional main guard.

-if "__main__" == __name__:
+if __name__ == "__main__":
     sys.exit(main(sys.argv))

172-177: Deduplicate/condense return description in docstring.

-    :return: A SplitFilterDict containing all the SplitFilterRule objects for the JSON file.
-    :return: A `SplitFilterDict` containing all the `SplitFilterRule` objects for the JSON file, or
-    None if there are any datasets that don't have an associated timestamp key.
+    :return: A SplitFilterDict for the JSON file, or None if any dataset lacks a timestamp key.

53-62: Fix validate_dir docstring (stale parameter, clarity).

Remove non-existent param and tighten return description.

     """
     Determines whether or not the path exists and whether or not it is a directory. If either of
     those are false, logs error message with path and returns False.
 
     :param path:
     :param label:
-    :param logger:
-    :return: True if path exists and is a directory; False if either of those are not true.
+    :return: True if path exists and is a directory; False otherwise.
     """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 906a2d8 and 7bf80b3.

📒 Files selected for processing (1)
  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
📚 Learning: 2025-09-04T12:26:54.788Z
Learnt from: quinntaylormitchell
PR: y-scope/clp#1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-17T15:26:22.790Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-08-19T14:41:28.901Z
Learnt from: junhaoliao
PR: y-scope/clp#1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
📚 Learning: 2025-07-28T23:14:57.751Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.

Applied to files:

  • tools/deployment/presto-clp/scripts/generate-split-filter-file.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (1)
tools/deployment/presto-clp/scripts/generate-split-filter-file.py (1)

92-96: Directory validation behaviour LGTM and matches team preference.

Fail fast if parent dirs are missing; aligns with prior guidance to avoid auto‑creation.
