Skip to content

Conversation

@junhaoliao
Copy link
Member

@junhaoliao junhaoliao commented Oct 28, 2025

Description

This change containerizes the package runtime, moving all user-facing CLI wrappers to run inside a Docker Compose–managed container. It introduces a unified environment bootstrap for consistent configuration and reduces package bloat by eliminating duplication of assets already included in the container image.

Summary of changes (2 BREAKING changes):

  • Runtime execution
    • BREAKING: ${CLP_HOME}/var/data, ${CLP_HOME}/var/log, ${CLP_HOME}/var/tmp will now be created with ownership ${USER}:docker instead of ${USER}:<user's default group>.
    • Add tools/deployment/package/docker-compose.runtime.yaml defining the clp-runtime service.
      • Maps the running user's HOME into the runtime environment for ~ in clp-config.yaml to work.
    • New components/package-template/src/sbin/.common-env.sh that:
      • Resolves CLP_HOME, image ref (CLP_PACKAGE_CONTAINER_IMAGE_REF), and CLP_FIRST_PARTY_SERVICE_UID_GID.
      • Exposes CLP_PWD_HOST
      • Detects Docker Compose plugins & socket for in-container use.
  • CLI wrappers
    • Update all src/sbin/*.sh wrappers (admin-tools and user tools) to source .common-env.sh and invoke:
      docker compose -f "$CLP_HOME/docker-compose.runtime.yaml" run --rm clp-runtime python3 -m ...
  • Python scripts (package utils)
    • Remove mounts.clp_home from necessary_mounts in:
      • archive_manager.py, compress.py, compress_from_s3.py, dataset_manager.py, decompress.py (both handlers), search.py.
    • decompress.py: default --extraction-dir to CLP_PWD_HOST (falls back to ".").
  • Build system
    • BREAKING: Remove all binaries from clp-package/bin.
    • Restructure taskfile.yaml:
      • New python-libs internal task producing build/python-libs.
      • docker-images:package now depends on image and copies settings/webui artifacts
      • default now runs task package again.
    • Rework tools/docker-images/clp-package/Dockerfile to assemble image directly from build outputs:
      • Copy core binaries, spider workers, Node 22, Python libs to lib/python3/site-packages/, and webui to var/www/webui/.
  • Template/package layout
    • Package now ships docker-compose.runtime.yaml and runs without a locally managed PYTHONPATH.

Rationale:

  • Standardize execution environment across hosts (Linux/WSL, and possibly Mac) via container.
  • Eliminate brittle dependency on host Python layout by running modules within the packaged image.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Observed size reduction of the package

junhao@ASUS-X870E:~/workspace/6-clp/build/clp-package$ du -sh .
180K    .

  1. task to build the package.
  2. cd build/clp-package/ && ./sbin/start-clp.sh
  3. ./sbin/compress.sh ~/samples/hive-24hr
  4. ./sbin/search.sh 1 which returned results
  5. cd /tmp && ./sbin/decompress.sh x which extracted the logs into /tmp with correct hierarchy set up.
  6. Launched the webui, performed a search with string "1" and observed results returned; clicked the link in the results which restored the log event context in the log viewer without failure.

  1. repeated above with clp-s

Summary by CodeRabbit

  • New Features

    • CLI tools now run inside a containerised runtime with a shared environment and improved host-bind handling.
  • Chores

    • Packaging adds Python library artifacts and web UI assets; entry scripts source a common environment.
    • Runtime image now includes additional runtime components and web UI content.
  • Bug Fixes

    • Improved host-path resolution for validations, mounts, file I/O and cleanup.
  • Documentation

    • Quick‑start: removed Python install guidance and Docker Desktop host‑networking note.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 28, 2025

Walkthrough

All host filesystem paths used by package utilities are resolved into the container-mounted host path before I/O and validation. CLI/sbin scripts now source a shared .common-env.sh and invoke Python modules inside a new clp-runtime container. Packaging and CI were adjusted to produce python-libs and include WebUI assets.

Changes

Cohort / File(s) Change Summary
Core host-path utility
components/clp-py-utils/clp_py_utils/core.py
Added CONTAINER_DIR_FOR_HOST_ROOT and resolve_host_path_in_container(path: pathlib.Path) to map host paths into the container mount space with tolerant single-level symlink handling.
Config & validation
components/clp-py-utils/clp_py_utils/clp_config.py
Validation methods (data/logs/tmp/archive/stream/inputs/aws config) gained use_host_mount: bool = False; when true they resolve paths via resolve_host_path_in_container and validate/respect the resolved host paths.
General helpers & CLP package utils
components/clp-package-utils/clp_package_utils/general.py
Now resolves host paths before file I/O and validation; removed standalone validate_logs_input_config; writes/reads defaults using resolved host paths.
Controller / service setup
components/clp-package-utils/clp_package_utils/controller.py
All filesystem operations (mkdir/chown/open/read/write/unlink for data, logs, tmp, archive, webui settings, instance-id, etc.) now use paths resolved via resolve_host_path_in_container.
CLI scripts (host path resolution + mounts)
components/clp-package-utils/clp_package_utils/scripts/*.py
(archive_manager.py, compress.py, compress_from_s3.py, dataset_manager.py, decompress.py, search.py, start_clp.py, stop_clp.py)
Load/generate configs and temporary lists using host-resolved paths; call validation variants with True; remove clp_home from many required mounts; perform host-side cleanup on resolved paths; read/write generated config via resolved host paths.
Shared environment script
components/package-template/src/sbin/.common-env.sh
New script exports CLP_HOME, CLP_PWD_HOST, CLP_PACKAGE_CONTAINER_IMAGE_REF, CLP_FIRST_PARTY_SERVICE_UID_GID; discovers Docker CLI plugin dir and Docker socket path for the runtime environment.
Shell CLI → container runtime
components/package-template/src/sbin/*.sh, components/package-template/src/sbin/admin-tools/*.sh
Scripts now source .common-env.sh and run Python modules inside clp-runtime via docker compose run --rm clp-runtime python3 -m ... instead of invoking local PYTHONPATH/direct python -m.
Runtime docker-compose service
tools/deployment/package/docker-compose.runtime.yaml
Added clp-runtime service using CLP_PACKAGE_CONTAINER_IMAGE_REF, user mapping and volumes (CLP_HOME, host root → /mnt/host, Docker plugin dir, Docker socket, docker binary) to enable containerized CLI execution with host access.
Packaging & CI tasks
taskfile.yaml, taskfiles/docker-images.yaml, .github/workflows/clp-artifact-build.yaml
Introduced G_PYTHON_LIBS_DIR, checksum vars and G_WEBUI_CHECKSUM_FILE; added python-libs packaging and package-build-deps task; default dependency changed to package; CI flow split to run package-build-deps and separately build/publish runtime image.
Docker image contents
tools/docker-images/clp-package/Dockerfile
Expanded COPY steps to include package-template sources, CLI/server binaries, mariadb libs, scheduler/worker binaries, Node runtime, Python site-packages, and WebUI assets into the image.
sbin scripts & admin tools
components/package-template/src/sbin/*, components/package-template/src/sbin/admin-tools/*
Removed inline PYTHONPATH usage, now source .common-env.sh, and defer execution into the clp-runtime container via docker compose.
Docs & quick-start
docs/src/dev-docs/building-package.md, docs/src/user-docs/quick-start/index.md
Updated packaging/build instructions and removed Python requirement and Docker Desktop host-networking note from quick-start.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Shell as CLI script
    participant Env as .common-env.sh
    participant Compose as docker compose
    participant Runtime as clp-runtime
    participant Module as Python module

    User->>Shell: invoke script (e.g., compress.sh)
    Shell->>Env: source .common-env.sh
    Env-->>Shell: export CLP_HOME, CLP_PWD_HOST, image ref, UID:GID, plugin/sock paths
    Shell->>Compose: docker compose run --rm clp-runtime python3 -m <module> ...
    Compose->>Runtime: start container (volumes & user mapping)
    Runtime->>Module: run module
    Module->>Module: call resolve_host_path_in_container() for host file ops (mkdir/open/chown/unlink)
    Module-->>Runtime: exit with status
    Runtime-->>Compose: container exits
    Compose-->>Shell: return status
    Shell-->>User: exit
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Review focus:
    • correctness and edge-case handling in resolve_host_path_in_container (absolute vs relative, symlink resolution, OSError fallback)
    • consistent use of resolved paths across controller and scripts for mkdir/chown/open/read/write/unlink
    • updated clp_config method signatures and all callsites passing the new use_host_mount flag
    • removal of clp_home from mounts and any remaining assumptions about it being mounted
    • clp-runtime compose bindings, UID:GID mapping and volume permission/SELinux considerations
    • packaging/taskfile and Dockerfile changes to verify python-libs and WebUI artifacts are built and placed where the Dockerfile expects

Possibly related issues

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary change: containerization of sbin scripts via Docker Compose and removal of redundant package assets, directly matching the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

)

necessary_mounts: List[Optional[DockerMount]] = [
mounts.clp_home,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mounts like this have to be removed, or they conflict with the contents inside the docker image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, can you explain more?

Copy link
Member Author

@junhaoliao junhaoliao Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, that comment was so unclear that it took me time to recall why it was done -

on the host, clp_home is essentially the clp-package directory root. previously we were mounting the package root into the container so the python modules in lib/python3/site-packages can be accessed in the container. however, we have now removed such contents from the package (recall "now the package is only 180K"); if we mount an empty directory into /opt/clp in the container, anything in /opt/clp is essentially gone. we should just use the assets in the docker image and remove such mounts then

file_extraction_parser.add_argument(
"-f", "--files-from", help="A file listing all files to extract."
)
default_extraction_dir = pathlib.Path(os.environ.get("CLP_PWD_HOST", "."))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't explicitly set and read this, . resolves to /opt/clp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't really have to use Docker Compose for this container definition, though Docker Compose helps it keep structured and avoids duplicate code.

@junhaoliao junhaoliao changed the title feat(deployment): Containerize CLP tools via Docker Compose. feat(clp-package)!: Containerize CLP tools via Docker Compose; Remove redundant assets from the package. Oct 28, 2025
…ainer environments.

Update components to utilize resolve_host_path for consistent path handling between host and container; adjust validations to support use of host-mounted paths.
@junhaoliao junhaoliao marked this pull request as ready for review October 29, 2025 01:02
@junhaoliao junhaoliao requested a review from a team as a code owner October 29, 2025 01:02
Copy link
Member Author

@junhaoliao junhaoliao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make sure if we don't miss resolving any paths and to avoid showing the prefixes in the error messages which may confuse users, we only

  1. "resolve" any host path right before their consumptions (e.g., .mkdir(), open(), .exists(), isdir())
  2. "resolve" only the paths that can be outside of the CLP_HOME directory, since we map CLP_HOME into the runtime container anyways

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tools/docker-images/clp-package/Dockerfile (1)

20-24: Split ENV directives into separate lines for readability.

Per the learning from PR #1414, ENV directives should be split into separate lines rather than consolidated. This improves readability and aligns with project conventions, especially when later assignments depend on earlier ones (e.g., PATH references CLP_HOME).

 ENV CLP_HOME="/opt/clp"
-ENV LD_LIBRARY_PATH="${CLP_HOME}/lib" \
-    PATH="${CLP_HOME}/sbin:${CLP_HOME}/bin:${PATH}" \
-    PYTHONPATH="${CLP_HOME}/lib/python3/site-packages" \
-    USER="clp-user"
+ENV LD_LIBRARY_PATH="${CLP_HOME}/lib"
+ENV PATH="${CLP_HOME}/sbin:${CLP_HOME}/bin:${PATH}"
+ENV PYTHONPATH="${CLP_HOME}/lib/python3/site-packages"
+ENV USER="clp-user"
components/clp-py-utils/clp_py_utils/clp_config.py (1)

697-731: Use keyword for boolean param and preserve exception context.

Name the boolean argument for readability and chain original exceptions.

Apply this diff:

@@ def validate_archive_output_config(self):
-        try:
-            validate_path_could_be_dir(self.archive_output.get_directory(), True)
-        except ValueError as ex:
-            raise ValueError(f"archive_output.storage's directory is invalid: {ex}")
+        try:
+            validate_path_could_be_dir(self.archive_output.get_directory(), use_host_mount=True)
+        except ValueError as ex:
+            raise ValueError("archive_output.storage's directory is invalid") from ex
@@ def validate_stream_output_config(self):
-        try:
-            validate_path_could_be_dir(self.stream_output.get_directory(), True)
-        except ValueError as ex:
-            raise ValueError(f"stream_output.storage's directory is invalid: {ex}")
+        try:
+            validate_path_could_be_dir(self.stream_output.get_directory(), use_host_mount=True)
+        except ValueError as ex:
+            raise ValueError("stream_output.storage's directory is invalid") from ex
@@ def validate_data_dir(self):
-        try:
-            validate_path_could_be_dir(self.data_directory, True)
-        except ValueError as ex:
-            raise ValueError(f"data_directory is invalid: {ex}")
+        try:
+            validate_path_could_be_dir(self.data_directory, use_host_mount=True)
+        except ValueError as ex:
+            raise ValueError("data_directory is invalid") from ex
@@ def validate_logs_dir(self):
-        try:
-            validate_path_could_be_dir(self.logs_directory, True)
-        except ValueError as ex:
-            raise ValueError(f"logs_directory is invalid: {ex}")
+        try:
+            validate_path_could_be_dir(self.logs_directory, use_host_mount=True)
+        except ValueError as ex:
+            raise ValueError("logs_directory is invalid") from ex
@@ def validate_tmp_dir(self):
-        try:
-            validate_path_could_be_dir(self.tmp_directory, True)
-        except ValueError as ex:
-            raise ValueError(f"tmp_directory is invalid: {ex}")
+        try:
+            validate_path_could_be_dir(self.tmp_directory, use_host_mount=True)
+        except ValueError as ex:
+            raise ValueError("tmp_directory is invalid") from ex
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (3)

225-233: Ensure logs dir exists before writing temp files.

dump_container_config writes under logs_directory; if the directory is missing, writes fail.

Apply this diff near config load success:

@@
     try:
         config_file_path = pathlib.Path(parsed_args.config)
         clp_config = load_config_file(config_file_path, default_config_file_path, clp_home)
-        clp_config.validate_logs_dir()
+        clp_config.validate_logs_dir()
+        # Ensure host logs directory exists for generated files
+        resolve_host_path(clp_config.logs_directory).mkdir(parents=True, exist_ok=True)

305-314: Propagate AWS env vars when using env-vars auth.

docker run only receives envs explicitly passed. Forward standard AWS vars if present.

Apply this diff:

@@
-    extra_env_vars = {
+    extra_env_vars = {
         CLP_DB_USER_ENV_VAR_NAME: clp_config.database.username,
         CLP_DB_PASS_ENV_VAR_NAME: clp_config.database.password,
     }
+    # Forward AWS env vars when set (env_vars auth or general SDK usage)
+    import os
+    for key in (
+        "AWS_ACCESS_KEY_ID",
+        "AWS_SECRET_ACCESS_KEY",
+        "AWS_SESSION_TOKEN",
+        "AWS_PROFILE",
+        "AWS_DEFAULT_REGION",
+        "AWS_REGION",
+    ):
+        val = os.getenv(key)
+        if val:
+            extra_env_vars[key] = val

37-51: Param naming no longer reflects usage.

container_url_list_path now receives a host-resolved path. Rename to url_list_path for clarity.

-def _generate_url_list(
-    subcommand: str,
-    container_url_list_path: pathlib.Path,
+def _generate_url_list(
+    subcommand: str,
+    url_list_path: pathlib.Path,
@@
-    with open(container_url_list_path, "w") as url_list_file:
+    with open(url_list_path, "w") as url_list_file:
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

231-236: Ensure logs dir exists before writing logs-list/config.

Without creating the host logs directory, subsequent writes may fail on fresh installs.

Apply this diff after dump_container_config:

@@
     generated_config_path_on_container, generated_config_path_on_host = dump_container_config(
         container_clp_config, clp_config, get_container_config_filename(container_name)
     )
+    # Ensure host logs directory exists for generated files
+    resolve_host_path(clp_config.logs_directory).mkdir(parents=True, exist_ok=True)

37-50: Param naming mismatch.

container_logs_list_path now points to a host-resolved path. Rename for clarity.

-def _generate_logs_list(
-    container_logs_list_path: pathlib.Path,
+def _generate_logs_list(
+    logs_list_path: pathlib.Path,
@@
-    with open(container_logs_list_path, "w") as container_logs_list_file:
+    with open(logs_list_path, "w") as container_logs_list_file:
components/clp-package-utils/clp_package_utils/general.py (1)

775-792: Use keyword for boolean arg and chain exceptions in validators.

Improves readability and preserves root cause.

Apply this diff:

@@ def _validate_data_directory(data_dir: pathlib.Path, component_name: str) -> None:
-    try:
-        validate_path_could_be_dir(data_dir, True)
-    except ValueError as ex:
-        raise ValueError(f"{component_name} data directory is invalid: {ex}")
+    try:
+        validate_path_could_be_dir(data_dir, use_host_mount=True)
+    except ValueError as ex:
+        raise ValueError(f"{component_name} data directory is invalid") from ex
@@ def _validate_log_directory(logs_dir: pathlib.Path, component_name: str):
-    try:
-        validate_path_could_be_dir(logs_dir, True)
-    except ValueError as ex:
-        raise ValueError(f"{component_name} logs directory is invalid: {ex}")
+    try:
+        validate_path_could_be_dir(logs_dir, use_host_mount=True)
+    except ValueError as ex:
+        raise ValueError(f"{component_name} logs directory is invalid") from ex
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

305-312: Good: default extraction dir honours CLP_PWD_HOST.

This addresses the earlier issue where "." resolved to /opt/clp inside containers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61349e9 and 347b181.

📒 Files selected for processing (24)
  • components/clp-package-utils/clp_package_utils/controller.py (17 hunks)
  • components/clp-package-utils/clp_package_utils/general.py (10 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (8 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (5 hunks)
  • components/clp-py-utils/clp_py_utils/core.py (2 hunks)
  • components/package-template/src/sbin/.common-env.sh (1 hunks)
  • components/package-template/src/sbin/admin-tools/archive-manager.sh (1 hunks)
  • components/package-template/src/sbin/admin-tools/dataset-manager.sh (1 hunks)
  • components/package-template/src/sbin/compress-from-s3.sh (1 hunks)
  • components/package-template/src/sbin/compress.sh (1 hunks)
  • components/package-template/src/sbin/decompress.sh (1 hunks)
  • components/package-template/src/sbin/search.sh (1 hunks)
  • components/package-template/src/sbin/start-clp.sh (1 hunks)
  • components/package-template/src/sbin/stop-clp.sh (1 hunks)
  • taskfile.yaml (6 hunks)
  • taskfiles/docker-images.yaml (1 hunks)
  • tools/deployment/package/docker-compose.runtime.yaml (1 hunks)
  • tools/docker-images/clp-package/Dockerfile (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
PR: y-scope/clp#1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

  • components/package-template/src/sbin/admin-tools/dataset-manager.sh
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
PR: y-scope/clp#1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • tools/docker-images/clp-package/Dockerfile
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
PR: y-scope/clp#1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-09-25T19:26:32.436Z
Learnt from: junhaoliao
PR: y-scope/clp#1335
File: taskfiles/docker-images.yaml:15-15
Timestamp: 2025-09-25T19:26:32.436Z
Learning: In the CLP project's Taskfile, the `:package` task creates the G_PACKAGE_BUILD_DIR directory structure, so any task that depends on `:package` (like `docker-images:package`) can safely assume this directory exists without needing additional mkdir commands.

Applied to files:

  • taskfiles/docker-images.yaml
🧬 Code graph analysis (10)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-package-utils/clp_package_utils/scripts/search.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
  • resolve_host_path (68-85)
  • validate_path_could_be_dir (88-95)
components/clp-package-utils/clp_package_utils/controller.py (3)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-package-utils/clp_package_utils/general.py (1)
  • validate_webui_config (587-597)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • get_directory (551-552)
  • get_directory (562-563)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
  • get_directory (551-552)
  • get_directory (562-563)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
  • resolve_host_path (68-85)
  • validate_path_could_be_dir (88-95)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path (68-85)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/core.py (4)
  • resolve_host_path (68-85)
  • read_yaml_config_file (59-65)
  • make_config_path_absolute (45-56)
  • validate_path_could_be_dir (88-95)
🪛 Checkov (3.2.334)
tools/docker-images/clp-package/Dockerfile

[low] 1-44: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)

🪛 Ruff (0.14.2)
components/clp-package-utils/clp_package_utils/scripts/compress.py

61-61: Unnecessary mode argument

Remove mode argument

(UP015)

components/clp-package-utils/clp_package_utils/scripts/decompress.py

84-84: Boolean positional value in function call

(FBT003)

components/clp-package-utils/clp_package_utils/controller.py

835-835: Unnecessary mode argument

Remove mode argument

(UP015)

components/clp-py-utils/clp_py_utils/clp_config.py

678-678: Avoid specifying long messages outside the exception class

(TRY003)


697-697: Boolean positional value in function call

(FBT003)


711-711: Boolean positional value in function call

(FBT003)


713-713: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


713-713: Avoid specifying long messages outside the exception class

(TRY003)


717-717: Boolean positional value in function call

(FBT003)


719-719: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


719-719: Avoid specifying long messages outside the exception class

(TRY003)


723-723: Boolean positional value in function call

(FBT003)


725-725: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


725-725: Avoid specifying long messages outside the exception class

(TRY003)


729-729: Boolean positional value in function call

(FBT003)

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

61-61: Unnecessary mode argument

Remove mode argument

(UP015)

components/clp-py-utils/clp_py_utils/core.py

88-88: Boolean-typed positional argument in function definition

(FBT001)

components/clp-package-utils/clp_package_utils/general.py

479-479: Avoid specifying long messages outside the exception class

(TRY003)


776-776: Boolean positional value in function call

(FBT003)


790-790: Boolean positional value in function call

(FBT003)

🪛 Shellcheck (0.11.0)
components/package-template/src/sbin/compress.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/admin-tools/dataset-manager.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/.common-env.sh

[info] 23-23: Command appears to be unreachable. Check usage (or ignore if invoked indirectly).

(SC2317)


[warning] 30-30: Declare and assign separately to avoid masking return values.

(SC2155)

components/package-template/src/sbin/search.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/stop-clp.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/decompress.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/start-clp.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/admin-tools/archive-manager.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

components/package-template/src/sbin/compress-from-s3.sh

[warning] 5-5: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

🔇 Additional comments (30)
components/package-template/src/sbin/compress.sh (1)

4-10: Approved—wrapper pattern aligns with containerization design.

The sourcing of .common-env.sh and docker compose invocation follow the containerization pattern established across CLI tools. The ShellCheck warning SC1090 is a false positive; the script_dir assignment is necessarily dynamic.

components/package-template/src/sbin/search.sh (1)

4-10: Approved—consistent containerization pattern.

Mirrors the established pattern from compress.sh and other wrapper scripts. No additional concerns.

components/package-template/src/sbin/compress-from-s3.sh (1)

4-10: Approved—follows established wrapper pattern.

tools/docker-images/clp-package/Dockerfile (1)

30-44: Approved—explicit COPY structure supports maintainability.

The granular COPY commands for each component (binaries, dependencies, spider, node, Python libs, WebUI) provide clarity over consolidated COPYs and align with the build restructuring described in the PR objectives.

tools/deployment/package/docker-compose.runtime.yaml (1)

1-19: Approved—comprehensive runtime configuration with appropriate defaults.

The docker-compose service is well-structured with:

  • Sensible defaults for image reference and user mapping
  • Proper volume bindings for Docker-in-Docker capability and host file access
  • Environment variables propagated from the host
  • Read-only protections on critical mounts (plugin dir, docker binary)
components/package-template/src/sbin/.common-env.sh (2)

6-24: Approved—robust image reference resolution.

Image reference fallback chain (clp-package-image.id → VERSION → error) is correct. The dual-mode error handling on line 23 (return 1 2>/dev/null || exit 1) is intentional and necessary for scripts that may be sourced or executed directly; ShellCheck warning SC2317 is a false positive here.


32-53: Approved—defensive Docker plugin and socket discovery.

The plugin directory search is well-ordered, and the socket path detection correctly verifies the socket exists before exporting. The warning message appropriately informs users if the plugin directory cannot be found. Error handling for missing docker binary (redirected to /dev/null) is acceptable; docker compose will fail gracefully if unavailable.

components/package-template/src/sbin/stop-clp.sh (1)

4-10: Approved—follows established containerization pattern.

Consistent with other wrapper scripts (compress.sh, search.sh). No additional concerns.

components/package-template/src/sbin/admin-tools/archive-manager.sh (1)

4-10: Approved—correct relative path for subdirectory location.

The relative path $script_dir/../.common-env.sh correctly references the common environment script from the admin-tools/ subdirectory. Follows the established pattern with appropriate path adjustment for directory depth.

components/package-template/src/sbin/start-clp.sh (1)

4-10: LGTM! Clean containerization of the start-clp script.

The script now sources a common environment configuration and delegates execution to the Docker Compose-managed runtime. This is consistent with the PR's objective of containerizing CLP tools.

Note: The ShellCheck SC1090 warning about non-constant source is expected and can be safely ignored—this is a standard pattern for shell wrappers.

components/package-template/src/sbin/decompress.sh (1)

4-10: LGTM! Consistent containerization pattern.

The changes mirror the approach used in other scripts, sourcing the common environment and delegating to the containerized runtime. The implementation is clean and consistent with the PR's containerization objectives.

taskfile.yaml (3)

58-58: Verify the default task dependency change.

The default task now depends on package instead of docker-images:package. This shifts the top-level build workflow to produce the package directory structure rather than just the Docker image.

Please confirm this workflow change is intentional and aligns with the expected developer experience when running task without arguments.


106-152: LGTM! Package task properly orchestrates containerized artifacts.

The restructured package task now:

  • Depends on docker-images:package to ensure the container image is built
  • Copies webui settings and the image ID into the package structure
  • Uses rsync with proper flags for directory synchronization

This aligns well with the PR's objective to containerize tools while maintaining the package structure for deployment.


472-510: LGTM! Python libraries task is well-structured.

The new python-libs internal task:

  • Properly depends on all Python component builds and the package venv
  • Uses checksum-based incremental builds
  • Installs wheels to an isolated output directory with -t flag
  • Follows the same pattern as other build tasks in the file

This cleanly separates Python library packaging from the Docker image build.

taskfiles/docker-images.yaml (1)

4-13: LGTM! Docker image build properly depends on individual components.

The restructured task now:

  • Depends directly on :init, :core, :deps:spider, :python-libs, and :webui
  • Removes the circular dependency pattern where the package depended on the image
  • Delegates artifact assembly to build.sh

This inversion of dependencies is cleaner and allows the Docker image to be composed directly from build outputs, consistent with the PR's containerization approach.

components/package-template/src/sbin/admin-tools/dataset-manager.sh (1)

4-10: LGTM! Consistent containerization with correct relative path.

The script properly sources the common environment from the parent directory (../.common-env.sh) since it resides in the admin-tools subdirectory. The Docker Compose invocation follows the established pattern.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (3)

17-17: LGTM! Host path resolution import added.

The resolve_host_path import is correctly used at line 284 to translate the generated config path before cleanup.


223-226: LGTM! Removed clp_home mount to avoid conflicts.

Removing mounts.clp_home from the necessary mounts list prevents conflicts with the Docker image contents, as noted in past review feedback. The container now only mounts the logs and archives directories that it needs to access.


284-285: LGTM! Host path resolution before cleanup.

The code correctly resolves the generated config path to its container-mount equivalent before unlinking, ensuring the cleanup operation targets the correct filesystem location in the containerized runtime.

components/clp-package-utils/clp_package_utils/scripts/search.py (3)

16-16: LGTM! Host path resolution import added.

The import is correctly used at line 184 for path resolution before cleanup.


132-132: LGTM! Streamlined mounts list.

Removing mounts.clp_home from the necessary mounts list follows the same pattern as other scripts in this PR, avoiding conflicts with containerized resources.


184-185: LGTM! Consistent host path resolution pattern.

The cleanup logic matches the pattern used in archive_manager.py and other scripts, ensuring correct path resolution in the containerized environment.

components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py (3)

17-17: LGTM! Host path resolution import added.

The import is used consistently with other scripts at line 188 for cleanup operations.


135-135: LGTM! Consistent mount configuration.

The necessary mounts list initialization follows the same pattern as search.py and archive_manager.py, starting with only logs_dir and conditionally adding other mounts as needed.


188-189: LGTM! Standard cleanup pattern.

The cleanup logic uses the same host path resolution approach as the other Python scripts, ensuring consistency across the codebase.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

7-7: LGTM on host-path resolution import.

Importing resolve_host_path here is consistent with the containerised runtime path model.

components/clp-py-utils/clp_py_utils/clp_config.py (1)

676-681: Correctly validating logs_input against the host-mounted path.

Using resolve_host_path for existence/type checks fixes container/host path mismatches.

components/clp-package-utils/clp_package_utils/controller.py (2)

446-452: Validation path usage is consistent.

Passing original paths to validate_webui_config is fine since that helper resolves to host‑mounted paths internally.


123-129: Verification confirms bind-mount configuration is correct; code pattern approved.

The clp-runtime service correctly bind-mounts the host root at /mnt/host, confirming that resolve_host_path() operations (mkdir, chown) properly affect the Docker host as intended. The selective application of _chown_paths_if_root() to infrastructure services only is appropriate and well-scoped.

components/clp-py-utils/clp_py_utils/core.py (1)

7-7: Mount configuration verified: /mnt/host binding is present and correct.

The docker-compose.runtime.yaml correctly defines the required volume mount at line 18: - "/:/mnt/host", which binds the entire host filesystem to /mnt/host in the clp-runtime container. The CONTAINER_HOST_ROOT_DIR constant is properly aligned with this runtime configuration, and resolve_host_path operations will function as intended.

Comment on lines +479 to 482
resolved_client_settings_json_path, client_settings_json_updates
)
with open(client_settings_json_path, "w") as client_settings_json_file:
with open(resolved_client_settings_json_path, "w") as client_settings_json_file:
client_settings_json_file.write(json.dumps(client_settings_json))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Prefer json.dump with indentation for readability.

Small DX improvement for troubleshooting settings.

-        with open(resolved_client_settings_json_path, "w") as client_settings_json_file:
-            client_settings_json_file.write(json.dumps(client_settings_json))
+        with open(resolved_client_settings_json_path, "w") as client_settings_json_file:
+            json.dump(client_settings_json, client_settings_json_file, indent=2, sort_keys=True)
@@
-        with open(resolved_server_settings_json_path, "w") as settings_json_file:
-            settings_json_file.write(json.dumps(server_settings_json))
+        with open(resolved_server_settings_json_path, "w") as settings_json_file:
+            json.dump(server_settings_json, settings_json_file, indent=2, sort_keys=True)

Also applies to: 536-537

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/controller.py around lines
479-482 (and also apply the same change at lines ~536-537), replace the manual
json.dumps write with json.dump using an indent parameter for readability; open
the file as currently done but call json.dump(client_settings_json,
client_settings_json_file, indent=2) (or another suitable indent) so the written
JSON is pretty-printed and easier to troubleshoot.

Comment on lines 831 to 842
instance_id_file_path = clp_config.logs_directory / "instance-id"
resolved_instance_id_file_path = resolve_host_path(instance_id_file_path)

if instance_id_file_path.exists():
with open(instance_id_file_path, "r") as f:
if resolved_instance_id_file_path.exists():
with open(resolved_instance_id_file_path, "r") as f:
instance_id = f.readline()
else:
instance_id = str(uuid.uuid4())[-4:]
with open(instance_id_file_path, "w") as f:
with open(resolved_instance_id_file_path, "w") as f:
f.write(instance_id)

return instance_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Trim minor style issue and reduce collision risk for instance IDs.

  • open(..., "r") mode is redundant; default is read (UP015).
  • Optional: use 8 hex chars to reduce collisions.
-    if resolved_instance_id_file_path.exists():
-        with open(resolved_instance_id_file_path, "r") as f:
+    if resolved_instance_id_file_path.exists():
+        with open(resolved_instance_id_file_path) as f:
             instance_id = f.readline()
     else:
-        instance_id = str(uuid.uuid4())[-4:]
+        instance_id = uuid.uuid4().hex[-8:]
         with open(resolved_instance_id_file_path, "w") as f:
             f.write(instance_id)
🧰 Tools
🪛 Ruff (0.14.2)

835-835: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/controller.py around lines 831
to 842, remove the redundant "r" mode when opening the instance-id file and trim
the read value, and generate a longer 8-hex-character instance id to reduce
collision risk: when reading, open with default read mode (omit "r") and use
.readline().strip() to remove trailing newlines; when creating a new id, use
uuid.uuid4().hex and take the last 8 characters (or otherwise produce 8 hex
chars) before writing it to the resolved path using the default write mode.

Comment on lines 359 to 361
resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
with open(resolved_config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Create parent directory before writing the container config file.

Without this, first-run writes fail if the logs dir hierarchy is missing.

Apply this diff:

-    resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
-    with open(resolved_config_file_path_on_host, "w") as f:
+    resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
+    resolved_config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
+    with open(resolved_config_file_path_on_host, "w") as f:
         yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
with open(resolved_config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
resolved_config_file_path_on_host = resolve_host_path(config_file_path_on_host)
resolved_config_file_path_on_host.parent.mkdir(parents=True, exist_ok=True)
with open(resolved_config_file_path_on_host, "w") as f:
yaml.safe_dump(container_clp_config.dump_to_primitive_dict(), f)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 359 to
361, the code opens the resolved config file path for writing without ensuring
its parent directory exists; create the parent directory before opening the file
by resolving the path (use Path or os.path), computing parent_dir =
resolved_path.parent (or os.path.dirname), and calling os.makedirs(parent_dir,
exist_ok=True) (skip if parent_dir is empty or root), then open and write the
YAML as before so first-run writes won't fail when the logs dir hierarchy is
missing.

Comment on lines +471 to 478
if not resolved_credentials_file_path.exists():
if (
make_config_path_absolute(clp_home, CLP_DEFAULT_CREDENTIALS_FILE_PATH)
== credentials_file_path
and generate_default_file
):
generate_credentials_file(credentials_file_path)
generate_credentials_file(resolved_credentials_file_path)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Also mkdir for generated default credentials file.

Generating the default file will fail if its parent dir does not exist.

Apply this diff:

-    resolved_credentials_file_path = resolve_host_path(credentials_file_path)
+    resolved_credentials_file_path = resolve_host_path(credentials_file_path)
@@
-            generate_credentials_file(resolved_credentials_file_path)
+            resolved_credentials_file_path.parent.mkdir(parents=True, exist_ok=True)
+            generate_credentials_file(resolved_credentials_file_path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not resolved_credentials_file_path.exists():
if (
make_config_path_absolute(clp_home, CLP_DEFAULT_CREDENTIALS_FILE_PATH)
== credentials_file_path
and generate_default_file
):
generate_credentials_file(credentials_file_path)
generate_credentials_file(resolved_credentials_file_path)
else:
if not resolved_credentials_file_path.exists():
if (
make_config_path_absolute(clp_home, CLP_DEFAULT_CREDENTIALS_FILE_PATH)
== credentials_file_path
and generate_default_file
):
resolved_credentials_file_path.parent.mkdir(parents=True, exist_ok=True)
generate_credentials_file(resolved_credentials_file_path)
else:
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around lines 471 to
478, the code generates a default credentials file but doesn't ensure its parent
directory exists; before calling
generate_credentials_file(resolved_credentials_file_path) create the parent
directory for resolved_credentials_file_path (e.g.,
resolved_credentials_file_path.parent.mkdir(parents=True, exist_ok=True)) so
file creation won't fail when the directory is missing.

Comment on lines 60 to 62
resolved_inputs_from_path = resolve_host_path(pathlib.Path(parsed_args.inputs_from))
with open(resolved_inputs_from_path, "r") as input_file:
for line in input_file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Good: read inputs_from via resolved host path.

This aligns file I/O with the container/host mapping. Minor: open(..., "r") is redundant; default is read.

🧰 Tools
🪛 Ruff (0.14.2)

61-61: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 60 to 62, the open(...) call uses an explicit "r" mode which is
redundant because open defaults to read mode; change the file open to omit the
mode argument (e.g., with open(resolved_inputs_from_path) as input_file:) to
simplify the code while keeping the resolved host path logic intact.

Comment on lines 68 to 86
def resolve_host_path(path: pathlib.Path) -> pathlib.Path:
"""
Translates a host path to its container-mount equivalent.
:param path: The host path.
:return: The translated container path.
"""
path = path.absolute()
resolved = CONTAINER_HOST_ROOT_DIR / path.relative_to("/")

try:
if resolved.is_symlink():
target_path = (resolved.parent / resolved.readlink()).resolve()
resolved = CONTAINER_HOST_ROOT_DIR / target_path.relative_to("/")
except OSError:
pass

return resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Container‑only semantics and resilience of resolve_host_path.

  • Works assuming /mnt/host is mounted. Consider documenting this explicitly in the docstring.
  • Swallowing OSError hides resolution issues; log at debug at least.
  • If portability matters, consider making the mount root configurable via an env (e.g., CLP_HOST_MOUNT_ROOT), defaulting to /mnt/host.
 def resolve_host_path(path: pathlib.Path) -> pathlib.Path:
-    """
-    Translates a host path to its container-mount equivalent.
+    """
+    Translates a host path to its container-mount equivalent.
+    Requires the container to bind‑mount the Docker host root at CONTAINER_HOST_ROOT_DIR.
@@
-    try:
-        if resolved.is_symlink():
-            target_path = (resolved.parent / resolved.readlink()).resolve()
-            resolved = CONTAINER_HOST_ROOT_DIR / target_path.relative_to("/")
-    except OSError:
-        pass
+    try:
+        if resolved.is_symlink():
+            target_path = (resolved.parent / resolved.readlink()).resolve()
+            resolved = CONTAINER_HOST_ROOT_DIR / target_path.relative_to("/")
+    except OSError as e:
+        # Best‑effort: unresolved symlinks shouldn't break call sites
+        import logging
+        logging.getLogger(__name__).debug("resolve_host_path symlink check failed: %s", e)

Comment on lines 88 to 90
def validate_path_could_be_dir(path: pathlib.Path, use_host_mount: bool):
part = resolve_host_path(path) if use_host_mount else path
while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Avoid boolean positional; make keyword‑only parameter (fixes FBT001/FBT003).

Make use_host_mount keyword‑only (with default), and update callers to pass by name.

-def validate_path_could_be_dir(path: pathlib.Path, use_host_mount: bool):
+def validate_path_could_be_dir(path: pathlib.Path, *, use_host_mount: bool = False):
     part = resolve_host_path(path) if use_host_mount else path

Follow‑up in callers (e.g., decompress.py): validate_path_could_be_dir(extraction_dir, use_host_mount=True).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def validate_path_could_be_dir(path: pathlib.Path, use_host_mount: bool):
part = resolve_host_path(path) if use_host_mount else path
while True:
def validate_path_could_be_dir(path: pathlib.Path, *, use_host_mount: bool = False):
part = resolve_host_path(path) if use_host_mount else path
while True:
🧰 Tools
🪛 Ruff (0.14.2)

88-88: Boolean-typed positional argument in function definition

(FBT001)

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/core.py around lines 88 to 90, the
function validate_path_could_be_dir currently accepts use_host_mount as a
positional boolean; change the signature to make use_host_mount a keyword-only
parameter with a default (e.g., *, use_host_mount: bool = False) so callers must
pass it by name. Update all call sites (e.g., decompress.py) to pass
use_host_mount=True or False by name (validate_path_could_be_dir(extraction_dir,
use_host_mount=True)) and run tests/lint to ensure FBT001/FBT003 violations are
resolved.

resolved = CONTAINER_HOST_ROOT_DIR / path.relative_to("/")

try:
if resolved.is_symlink():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is needed because somehow my WSL Ubuntu has this

(clp) junhao@ASUS-X870E:~$ ls -l .aws
lrwxrwxrwx 1 junhao junhao 24 Aug 26 12:07 .aws -> /mnt/c/Users/junhao/.aws

/mnt/c/Users/junhao/.aws is certainly not a valid path in the runtime container, to access, the bootstrap scripts should use the prefix + /mnt/c/Users/junhao/.aws

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a complete review but some things I noticed while reading so far.

Some of the shell scripts have lines longer than 100 chars as well as shellcheck warnings. Can you address?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/core.py (1)

87-89: Log OSError at debug level instead of silently swallowing.

Silent exception handling masks resolution issues. Past review comments suggested logging at debug level to aid troubleshooting.

Apply this diff to add debug logging:

+    import logging
+    
     try:
         if not translated_path.is_symlink():
             return translated_path
         link_target = translated_path.readlink()
         if link_target.is_absolute():
             return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/")
         else:
             # If the symlink points to a relative path, resolve it relative to the symlink's parent.
             return (translated_path.parent / link_target).resolve()
-    except OSError:
+    except OSError as e:
         # Ignore if reading the symlink fails (e.g., broken link or permission error).
-        pass
+        logging.getLogger(__name__).debug(
+            "resolve_host_path_in_container: symlink resolution failed for %s: %s", 
+            translated_path, 
+            e
+        )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3af2e4d and 198f398.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/core.py (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • components/clp-py-utils/clp_py_utils/core.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: ubuntu-jammy-deps-image
  • GitHub Check: musllinux_1_2-x86_64-deps-image
  • GitHub Check: centos-stream-9-deps-image
  • GitHub Check: manylinux_2_28-x86_64-deps-image
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/core.py (2)

6-7: LGTM! Clear constant definition.

The constant name follows the naming suggestion from past reviews and clearly indicates the container directory for the host root mount.


94-101: Verification complete — all call sites properly updated.

All eight call sites of validate_path_could_be_dir are using a single argument, and each correctly invokes resolve_host_path_in_container() before passing the path to the validation function. The removal of the use_host_mount parameter from the function signature has been fully propagated across the codebase:

  • In clp_config.py, the validation methods conditionally resolve paths based on their own use_host_mount parameter before calling the validator.
  • In general.py and decompress.py, paths are resolved unconditionally before validation.

The code changes are sound and complete.

Comment on lines 78 to 86
try:
if not translated_path.is_symlink():
return translated_path
link_target = translated_path.readlink()
if link_target.is_absolute():
return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/")
else:
# If the symlink points to a relative path, resolve it relative to the symlink's parent.
return (translated_path.parent / link_target).resolve()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Inconsistent symlink resolution depth.

The handling differs for absolute vs. relative symlink targets:

  • Absolute targets (lines 82-83): translated directly without following further symlinks.
  • Relative targets (line 86): .resolve() is called, which follows all symlinks in the target path.

This inconsistency means a relative symlink pointing to another symlink will be fully resolved, but an absolute symlink pointing to another symlink won't. Consider applying .resolve() consistently in both branches or documenting this behavior explicitly.

For consistent single-level resolution, you could refactor to avoid .resolve() and handle path normalization separately:

         if link_target.is_absolute():
             return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/")
         else:
-            # If the symlink points to a relative path, resolve it relative to the symlink's parent.
-            return (translated_path.parent / link_target).resolve()
+            # If the symlink points to a relative path, normalize it relative to the symlink's parent.
+            # Note: this doesn't follow further symlinks, only normalizes . and .. components.
+            normalized = (translated_path.parent / link_target)
+            # Since Path doesn't provide non-symlink-following normalization, we accept that
+            # relative symlinks may not be fully canonicalized or we accept full resolution.
+            return normalized.resolve()

Alternatively, document clearly that relative symlinks undergo full resolution while absolute ones do not.

@junhaoliao junhaoliao changed the title feat(clp-package)!: Containerize CLP tools via Docker Compose; Remove redundant assets from the package. feat(clp-package)!: Containerize CLP tools via Docker Compose; Remove redundant assets from the package (resolves #1358). Nov 4, 2025
# Validate and load config file.
config_file_path = pathlib.Path(parsed_args.config)
clp_config = load_config_file(config_file_path, default_config_file_path, clp_home)
clp_config = load_config_file(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside load_config_file, we print one of the arguments in a log, which I think is wrong, right? I'm fine leaving it as is and opening an issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the log would be confusing to general users when load_config_file is used in the "bootstrap" scripts. I didn't modify load_config_file directly because the "native" scripts also uses load_config_file - it's possible to use a flag to determine whether the method is called in a bootstrap script or a native script though.

@coderabbitai please create an issue for the inaccurate log and we will fix it as we clean up the bootstrap / native scripts in a future PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create GitHub issue for inaccurate log in load_config_file

gh issue create \
  --repo y-scope/clp \
  --title "Inaccurate log message in load_config_file when used in bootstrap scripts" \
  --body "## Description
When \`load_config_file\` is called from bootstrap scripts (containerized CLI wrappers), the log messages print container paths instead of host paths, which can confuse users.

## Context
The bootstrap scripts now resolve host paths before passing them to \`load_config_file\`, but the function logs one of the arguments directly. This creates misleading output for users when the scripts run inside containers.

The native scripts also use \`load_config_file\`, so a flag could be introduced to determine whether the method is called in a bootstrap script or a native script to control the logging behaviour.

## References
- Pull Request: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2490812330
- Requested by: @junhaoliao

## Follow-up
This will be addressed in a future PR when cleaning up the bootstrap/native scripts." \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1548 to track the inaccurate log message in load_config_file when used in bootstrap scripts. The issue has been assigned to you and includes references to this PR and comment thread.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

)

necessary_mounts: List[Optional[DockerMount]] = [
mounts.clp_home,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, can you explain more?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (6)
components/clp-package-utils/clp_package_utils/controller.py (5)

445-450: Consider moving these declarations closer to their use.

The resolved path variables are declared here but not used until lines 482 and 537, which reduces code clarity.


482-485: Prefer json.dump with indentation for readability.

Using json.dump with an indent parameter would make the output more human-readable for troubleshooting.


536-540: Prefer json.dump with indentation for readability.

Using json.dump with an indent parameter would make the output more human-readable for troubleshooting.


571-573: Remove unnecessary blank line for consistency.

Line 571 contains an unnecessary blank line. Other similar methods in this file don't have a blank line before resolving the logs directory.


841-852: Address the redundant mode, short instance ID, and missing strip.

Three issues:

  1. The "r" mode at line 845 is redundant (default is read)
  2. Using only 4 hex characters for instance ID increases collision risk; consider 8
  3. Should strip whitespace when reading the instance ID at line 846
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

179-179: Pass boolean as keyword argument for clarity.

Using a keyword argument makes the intent explicit and avoids the FBT003 style violation.

Apply this diff:

-        clp_config.validate_logs_dir(True)
+        clp_config.validate_logs_dir(use_host_mount=True)

As per coding guidelines.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d02051 and 3063e85.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/controller.py (16 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (3 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-08-08T21:15:10.905Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
📚 Learning: 2025-10-20T20:19:40.504Z
Learnt from: 20001020ycx
Repo: y-scope/clp PR: 1436
File: components/clp-mcp-server/clp_mcp_server/clp_mcp_server.py:22-27
Timestamp: 2025-10-20T20:19:40.504Z
Learning: In the clp-mcp-server project (components/clp-mcp-server/clp_mcp_server/clp_mcp_server.py), the --config Click option should use exists=True to validate the configuration file path at option processing time, even though this requires the default path to exist on all machines.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
🧬 Code graph analysis (2)
components/clp-package-utils/clp_package_utils/controller.py (2)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path_in_container (67-91)
components/clp-package-utils/clp_package_utils/general.py (1)
  • validate_webui_config (581-591)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (2)
components/clp-py-utils/clp_py_utils/core.py (1)
  • resolve_host_path_in_container (67-91)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • validate_logs_dir (737-743)
🪛 Ruff (0.14.3)
components/clp-package-utils/clp_package_utils/controller.py

845-845: Unnecessary mode argument

Remove mode argument

(UP015)

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py

179-179: Boolean positional value in function call

(FBT003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: manylinux_2_28-x86_64-deps-image
  • GitHub Check: ubuntu-jammy-deps-image
  • GitHub Check: musllinux_1_2-x86_64-deps-image
  • GitHub Check: centos-stream-9-deps-image
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build (macos-15)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (macos-15)
🔇 Additional comments (13)
components/clp-package-utils/clp_package_utils/controller.py (11)

40-40: LGTM!

The import of resolve_host_path_in_container is necessary for the containerization changes and is used consistently throughout the file.


123-128: LGTM!

The host-path resolution pattern is correctly applied for the database component: paths are resolved, directories created on resolved paths, and ownership adjusted for the infrastructure service.


173-175: LGTM!

The host-path resolution pattern is correctly applied for the queue component.


212-217: LGTM!

The host-path resolution pattern is correctly applied for the Redis component.


263-268: LGTM!

The host-path resolution pattern is correctly applied for the results cache component.


305-306: LGTM!

The host-path resolution pattern is correctly applied for the compression scheduler component. No ownership change is needed for worker log directories.


329-330: LGTM!

The host-path resolution pattern is correctly applied for the query scheduler component.


352-353: LGTM!

The host-path resolution pattern is correctly applied for the compression worker component.


382-383: LGTM!

The host-path resolution pattern is correctly applied for the query worker component.


410-411: LGTM!

The host-path resolution pattern is correctly applied for the reducer component.


610-611: LGTM!

The host-path resolution pattern is correctly applied for the garbage collector component.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (2)

225-228: Correct removal of clp_home from mounts.

The exclusion of clp_home from necessary mounts aligns with the PR's containerization strategy, where CLP assets are embedded in the Docker image rather than mounted from the host.


286-289: Host path resolution for cleanup is correct.

Resolving the generated config path before unlinking ensures the cleanup operation targets the correct mounted location when running inside the container.

Comment on lines +175 to 178
resolve_host_path_in_container(config_file_path),
resolve_host_path_in_container(default_config_file_path),
clp_home,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Guard config path resolution to preserve container-local defaults.

The current implementation unconditionally translates both config_file_path and default_config_file_path to host-mounted paths. This breaks when the default config exists only inside the container (e.g., /opt/clp/etc/clp-config.yml), because translating it yields a non-existent /mnt/host/opt/clp/etc/clp-config.yml path, causing load_config_file to fail.

For config_file_path: check whether the path exists in the container before resolving; if it exists, use it as-is (supports both container-local defaults and host overrides).
For default_config_file_path: apply the same pattern—preserve the path if it exists in the container, otherwise resolve it.

Apply this diff:

+    import os
+
     # Validate and load config file
     try:
         config_file_path: Path = Path(parsed_args.config)
+        # Check if config_file_path exists in container; if not, resolve it as a host path
+        if config_file_path.exists():
+            resolved_config_file_path = config_file_path
+        else:
+            resolved_config_file_path = resolve_host_path_in_container(config_file_path)
+
+        # For default config, use container path if it exists; otherwise resolve as host path
+        if default_config_file_path.exists():
+            resolved_default_config_file_path = default_config_file_path
+        else:
+            resolved_default_config_file_path = resolve_host_path_in_container(
+                default_config_file_path
+            )
+
         clp_config: CLPConfig = load_config_file(
-            resolve_host_path_in_container(config_file_path),
-            resolve_host_path_in_container(default_config_file_path),
+            resolved_config_file_path,
+            resolved_default_config_file_path,
             clp_home,
         )

Based on learnings.

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/archive_manager.py
around lines 175 to 178, the code always calls resolve_host_path_in_container on
both config_file_path and default_config_file_path which breaks when a default
config only exists inside the container; instead, check whether each path exists
in the container first (e.g., os.path.exists(container_path)); if it exists,
pass the original container-local path through, otherwise call
resolve_host_path_in_container to convert it to the host-mounted path; apply
this logic to both config_file_path and default_config_file_path before passing
them (and leave clp_home unchanged).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75ada35 and 969fde0.

📒 Files selected for processing (1)
  • tools/deployment/package/docker-compose.runtime.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across different platforms (e.g., using separate update and install commands like `apk update && apk add`, `apt update && apt install`, `yum update && yum install`) rather than platform-specific optimizations, to ensure uniform script structure and readability.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • tools/deployment/package/docker-compose.runtime.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • tools/deployment/package/docker-compose.runtime.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package/docker-compose.runtime.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: ubuntu-jammy-deps-image
  • GitHub Check: manylinux_2_28-x86_64-deps-image
  • GitHub Check: musllinux_1_2-x86_64-deps-image
  • GitHub Check: centos-stream-9-deps-image
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
tools/deployment/package/docker-compose.runtime.yaml (1)

1-38: Well-structured clp-runtime service definition.

The Docker Compose service is clearly configured with appropriate environment variable forwarding (HOME, CLP_HOME, CLP_PWD_HOST) and defensive defaults/validation. The volume strategy correctly forwards the Docker daemon bridge for in-container compose operations and binds CLP_HOME for state and configs.

Comment on lines +36 to +38
- type: "bind"
source: "/"
target: "/mnt/host"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider documenting the root filesystem mount strategy.

Mounting the entire root filesystem (/) to /mnt/host is a significant security and design choice that enables flexible host path resolution but provides broad container access. The PR objectives indicate this is intentional for resolving arbitrary host paths during I/O operations, but consider adding a deployment-guide note or inline comment clarifying that paths must be validated in calling code to prevent unintended access.

For future clarity, consider adding a comment like:

      # Host filesystem
+     # NOTE: Mounting "/" → "/mnt/host" enables flexible host path access within the container.
+     # Calling code must validate and sanitize paths to prevent unintended cross-mount access.
      - type: "bind"
        source: "/"
        target: "/mnt/host"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- type: "bind"
source: "/"
target: "/mnt/host"
# Host filesystem
# NOTE: Mounting "/" → "/mnt/host" enables flexible host path access within the container.
# Calling code must validate and sanitize paths to prevent unintended cross-mount access.
- type: "bind"
source: "/"
target: "/mnt/host"
🤖 Prompt for AI Agents
In tools/deployment/package/docker-compose.runtime.yaml around lines 36-38, add
an inline comment next to the bind mount explaining why the host root (/) is
mounted, the intended use (resolving arbitrary host paths for I/O), and the
security implications; explicitly instruct callers to validate/whitelist host
paths before using them and reference the repo deployment guide or README where
a short section documents the mount strategy, risk mitigation, and recommended
validation patterns.

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

feat(clp-package)!: Containerize the scripts in `sbin` via Docker Compose; Remove redundant assets from the package (resolves #1358).

@junhaoliao junhaoliao changed the title feat(clp-package)!: Containerize CLP tools via Docker Compose; Remove redundant assets from the package (resolves #1358). feat(clp-package)!: Containerize the scripts in sbin via Docker Compose; Remove redundant assets from the package (resolves #1358). Nov 5, 2025
@junhaoliao junhaoliao merged commit d4168d7 into y-scope:main Nov 5, 2025
31 checks passed
Comment on lines +34 to +49
for compose_plugin_dir in \
"$HOME/.docker/cli-plugins" \
"/mnt/wsl/docker-desktop/cli-tools/usr/local/lib/docker/cli-plugins" \
"/usr/local/lib/docker/cli-plugins" \
"/usr/libexec/docker/cli-plugins"; do

compose_plugin_path="$compose_plugin_dir/docker-compose"
if [[ -f "$compose_plugin_path" ]]; then
export CLP_DOCKER_PLUGIN_DIR="$compose_plugin_dir"
break
fi
done
if [[ -z "${CLP_DOCKER_PLUGIN_DIR:-}" ]]; then
echo >&2 "Warning: Docker plugin directory not found;" \
"Docker Compose may not work inside container."
fi
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plugin upgrades that are installed through Docker Desktop on Windows seems to create symlinks in $HOME/.docker/cli-plugins. The symlinks can link to some Windows host path, such as /mnt/c/Users/junhao/.docker/modules/cli-plugins-wsl/docker-compose. If we don't map the link target's parent path into the container, the compose plugin would not be found in the container.

@coderabbitai create an issue to track this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the Docker Compose plugin symlink resolution problem

gh issue create \
  --repo y-scope/clp \
  --title "Docker Compose plugin symlinks not resolved when link targets are outside mounted paths" \
  --body "## Problem Description

When using Docker Desktop on Windows with WSL, Docker Compose plugin upgrades create symlinks in \`\$HOME/.docker/cli-plugins\` that point to Windows host paths (e.g., \`/mnt/c/Users/junhao/.docker/modules/cli-plugins-wsl/docker-compose\`).

Currently, if the symlink target's parent path is not mapped into the container, the compose plugin will not be found inside the container runtime.

## Context

This issue was discovered during the containerization of CLP tools in PR #1512.

## Example

- Symlink location: \`\$HOME/.docker/cli-plugins/docker-compose\`
- Symlink target: \`/mnt/c/Users/junhao/.docker/modules/cli-plugins-wsl/docker-compose\`
- Issue: The target path \`/mnt/c/Users/junhao/.docker/modules/cli-plugins-wsl/\` needs to be mounted for the plugin to work inside the container.

## Impact

The Docker Compose plugin may not be available inside the container when users have upgraded their plugins through Docker Desktop on Windows.

---
Reported by @junhaoliao in https://github.com/y-scope/clp/pull/1512#discussion_r2493058818" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1554 to track the Docker Compose plugin symlink resolution problem on Windows with Docker Desktop. The issue documents the scenario where symlinks in $HOME/.docker/cli-plugins point to Windows host paths that need to be mounted for the plugin to work inside the container.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/package-template/src/sbin/del-archives.sh:6-9
Timestamp: 2024-11-15T16:28:08.644Z
Learning: For wrapper scripts in the `components/package-template/src/sbin/` directory, keep them simple and avoid adding additional validation code.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

@@ -1,9 +1,12 @@
#!/usr/bin/env bash
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should

set -o errexit
set -o nounset
set -o pipefail

or

set -eu
set -o pipefail

so the scripts early exit on any error or on any unset variables, and make a pipeline fail if any command in it fails

@coderabbitai create an issue to track the improvement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Create GitHub issue to track adding error handling flags to shell scripts
gh issue create \
  --title "Add error handling flags to shell scripts in package" \
  --body "## Context

User @junhaoliao requested tracking an improvement to add error handling flags to shell scripts in the package template.

## Description

Shell scripts in \`components/package-template/src/sbin/\` should include error handling flags at the beginning to make them more robust:

\`\`\`bash
set -eu
set -o pipefail
\`\`\`

These flags will:
- Exit early on any error (\`-e\` / \`errexit\`)
- Exit on unset variables (\`-u\` / \`nounset\`)
- Make pipelines fail if any command in them fails (\`pipefail\`)

Note: Use separate \`set -eu\` and \`set -o pipefail\` commands for consistency with the established pattern in CLP project build scripts.

## Affected Scripts

Scripts in \`components/package-template/src/sbin/\` directory, including:
- admin-tools/archive-manager.sh
- admin-tools/dataset-manager.sh
- compress-from-s3.sh
- compress.sh
- decompress.sh
- search.sh
- start-clp.sh
- stop-clp.sh
- .common-env.sh

## References

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2493126446
- Requested by: @junhaoliao" \
  --assignee junhaoliao \
  --repo y-scope/clp

Length of output: 42


@junhaoliao, I've created issue #1556 to track adding error handling flags to the shell scripts in the package template. The issue covers all the scripts in components/package-template/src/sbin/ and follows the established CLP project pattern of using separate set -eu and set -o pipefail commands for consistency with existing build scripts.


🧠 Learnings used
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to make sure the .resolve() calls in this script can properly resolve relative and symlinks in the host path space; otherwise, the "mounted_path"s might point to some invalid file in the Package services containers

@coderabbitai create an issue to track this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the path resolution concern

gh issue create \
  --repo y-scope/clp \
  --title "Ensure .resolve() calls in compress.py properly handle host path resolution" \
  --body "## Description

In the containerized runtime environment introduced in PR #1512, the \`.resolve()\` calls in \`components/clp-package-utils/clp_package_utils/scripts/compress.py\` need to properly resolve relative paths and symlinks in the host path space. Otherwise, the \`mounted_path\`s might point to invalid files in the Package services containers.

## Context

The \`_generate_logs_list\` function uses \`pathlib.Path(...).resolve()\` to convert user-provided paths into absolute paths before mounting them in containers. However, in the containerized runtime, these resolve calls may operate in the container's filesystem space rather than the host's filesystem space.

## Related

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2495876730
- Requested by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1558 to track this concern about ensuring .resolve() calls in compress.py properly handle host path resolution.

You're right to flag this—the .resolve() calls at lines 48 and 69 (in _generate_logs_list) currently resolve paths in whatever filesystem context the script runs in. Since the script now executes inside the container, these calls would resolve symlinks and relative paths against the container's filesystem instead of the host's, which could lead to incorrect mounted_path values that don't correspond to valid files in the container mounts.


🧬 Code Graph Analysis Results

[content below identifies relevant snippets to address the user comment; do not perform the user comment analysis here]

  1. File: components/clp-py-utils/clp_py_utils/core.py
    Lines: 67-92
    Content:
    def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path:
    """
    Translates a host path to its container-mount equivalent. It also resolves a single level of
    symbolic link if the host path itself is a symlink.

    :param host_path: The host path.
    :return: The translated path.
    """
    host_path = host_path.absolute()
    translated_path = CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/")
    
    try:
        if not translated_path.is_symlink():
            return translated_path
    
        link_target = translated_path.readlink()
        if link_target.is_absolute():
            return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/")
        else:
            # If the symlink points to a relative path, resolve it relative to the symlink's parent.
            return (translated_path.parent / link_target).resolve()
    except OSError:
        # Ignore if reading the symlink fails (e.g., broken link or permission error).
        pass
    
    return translated_path
    
  2. File: components/clp-package-utils/clp_package_utils/scripts/compress.py
    Lines: 31-54 (approximate)
    Content (relevant excerpt):
    def _generate_logs_list(
    container_logs_list_path: pathlib.Path,
    parsed_args: argparse.Namespace,
    ) -> bool:
    """
    Generates logs list file for the native compression script.

    :param container_logs_list_path: Path to write logs list.
    :param parsed_args: Parsed command-line arguments.
    :return: Whether any paths were written to the logs list.
    """
    host_logs_list_path = parsed_args.path_list
    with open(container_logs_list_path, "w") as container_logs_list_file:
        if host_logs_list_path is None:
            for path in parsed_args.paths:
                resolved_path = pathlib.Path(path).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                    resolved_path.anchor
                )
                container_logs_list_file.write(f"{mounted_path}\n")
            return len(parsed_args.paths) != 0
    
        no_path_found = True
        resolved_host_logs_list_path = resolve_host_path_in_container(
            pathlib.Path(host_logs_list_path)
        )
        with open(resolved_host_logs_list_path, "r") as host_logs_list_file:
            for line in host_logs_list_file:
                stripped_path_str = line.rstrip()
                if "" == stripped_path_str:
                    # Skip empty paths
                    continue
                no_path_found = False
                resolved_path = pathlib.Path(stripped_path_str).resolve()
                mounted_path = CONTAINER_INPUT_LOGS_ROOT_DIR / resolved_path.relative_to(
                    resolved_path.anchor
                )
                container_logs_list_file.write(f"{mounted_path}\n")
        return not no_path_found
    

Note: The user comment mentions ensuring .resolve() calls properly resolve relative and symlinks in host path space so that "mounted_path"s are valid inside containers. The above snippets illustrate:

  • How host paths are translated to container paths with potential single-level symlink handling (resolve_host_path_in_container).
  • How the script resolves host paths on the host side and computes mounted container paths (Path.resolve() + relative_to("/")) to form the mounted paths written to the logs list. These are the relevant points to audit and adjust for host-space resolution behavior.

🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host network mode should be used or the start_clp.py script cannot properly check what IP addresses it can bind to on the host.

@coderabbitai file an issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for enabling host network mode in docker-compose.runtime.yaml

gh issue create \
  --repo y-scope/clp \
  --title "Enable host network mode for clp-runtime service in docker-compose.runtime.yaml" \
  --body "## Problem

The \`clp-runtime\` service in \`tools/deployment/package/docker-compose.runtime.yaml\` should use host network mode to allow the \`start_clp.py\` script to properly check what IP addresses it can bind to on the host.

## Context

Currently, the service runs with bridge networking, which isolates the container's network stack from the host. This prevents \`start_clp.py\` from correctly detecting available host IP addresses for binding.

## Backlinks

- PR: https://github.com/y-scope/clp/pull/1512
- Comment: https://github.com/y-scope/clp/pull/1512#discussion_r2496275743
- Reported by: @junhaoliao" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1560 to track enabling host network mode for the clp-runtime service. The issue includes the context about start_clp.py needing proper access to host IP addresses and references this PR and comment.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Learnt from: 20001020ycx
Repo: y-scope/clp PR: 1368
File: components/clp-mcp-server/clp_mcp_server/__init__.py:11-15
Timestamp: 2025-10-02T15:48:58.961Z
Learning: In the clp-mcp-server component (components/clp-mcp-server/clp_mcp_server/__init__.py), the default host binding of 0.0.0.0 is intentional because the server is designed to be deployed in Docker containers where this binding is necessary to accept external connections.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Comment on lines +7 to +8
stdin_open: true
tty: true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's found that when requesting to detach stdin from the shell using &, no output is seen in the console. e.g.,

cd build/clp-package
./sbin/start-clp.sh

# this command prints compression speed to the console
./sbin/compress.sh ~/samples/hive-24hr

# with the `&` added, no command output is observed in the console
./sbin/compress.sh ~/samples/hive-24hr &

i believe we should simply remove tty: true and stdin_open: true from the docker-compose.runtime.yaml to avoid allocating a TTY. however, removing those lines have no effects - when & is used, there's still no output printed to the console.

It was later discovered that the docker compose run command overrides the tty and stdin_open options defined in service configurations (source), even when neither the --no-TTY nor the --interactive flag is specified. In other words, if these flags are omitted, docker compose run will automatically set tty: true when it detects a TTY-capable terminal, and will always set stdin_open: true, regardless of the values defined in docker-compose.yaml. (The official documentation is unfortunately confusing and misleading on this behavior.)

then i think we should remove these two lines (rather than changing them to false) in docker-compose.runtime.yaml to avoid confusions, and add --no-TTY to all of the sbin scripts.

@coderabbitai create an issue to track the fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the TTY handling fix

gh issue create \
  --repo y-scope/clp \
  --title "Fix TTY allocation causing missing output when backgrounding commands" \
  --body "## Issue Description

When running CLP package commands in the background using \`&\`, no output is printed to the console. For example:

\`\`\`bash
./sbin/compress.sh ~/samples/hive-24hr &
\`\`\`

## Root Cause

The \`docker compose run\` command overrides the \`tty\` and \`stdin_open\` options defined in service configurations ([source](https://github.com/docker/compose/blob/c37ede62dbb6485c4d5e297db9134631e9b2095e/cmd/compose/run.go#L88-L89)), even when neither the \`--no-TTY\` nor the \`--interactive\` flag is specified. This means \`docker compose run\` automatically sets \`tty: true\` when it detects a TTY-capable terminal and always sets \`stdin_open: true\`, regardless of the values defined in \`docker-compose.yaml\`.

## Proposed Solution

1. Remove \`tty: true\` and \`stdin_open: true\` from \`tools/deployment/package/docker-compose.runtime.yaml\` to avoid confusion
2. Add \`--no-TTY\` flag to all \`docker compose run\` commands in the sbin scripts

## Context

Reported by @junhaoliao in PR #1512: https://github.com/y-scope/clp/pull/1512#discussion_r2501082935" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao I've created issue #1566 to track this fix. The issue documents the root cause (docker compose run overriding TTY settings) and the proposed solution (removing the lines from docker-compose.runtime.yaml and adding --no-TTY to all sbin scripts).


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants