
Conversation

@junhaoliao (Member) commented Sep 30, 2025

Description

This pull request migrates the type system and validation in clp-py-utils/clp_config.py and related modules to use Pydantic V2's new Annotated syntax and stronger type checking. Key changes include:

  • Replacing legacy validator functions with Pydantic Field constraints and Annotated type aliases like NonEmptyStr, PositiveInt, PositiveFloat, Host, Port, and ZstdCompressionLevel (sketched after this list).
  • Introducing a strict DatabaseEngine enum based on KebabCaseStrEnum.
  • Adding a LoggingLevel Literal that replaces the previous map-based logging-level validation.
  • Refactoring constructor arguments and class attributes to these stricter types, improving type safety across all config objects.
  • Removing legacy field_validator and model_validator code that can now be handled by Pydantic's built-in validation.
  • Updating all classes (Database, Package, CompressionScheduler, QueryScheduler, Redis, ResultsCache, etc.) to use new type aliases and enums.
  • Adjusting utility and dump methods to work with the new type structure.
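
For reference, a minimal sketch of the new aliases. The Host, Port, and ZstdCompressionLevel definitions match the diff hunks quoted in the review thread below; the NonEmptyStr constraint and the exact LoggingLevel members are assumptions.

from typing import Annotated, Literal

from pydantic import Field

NonEmptyStr = Annotated[str, Field(min_length=1)]  # assumption: non-emptiness enforced via min_length
PositiveFloat = Annotated[float, Field(gt=0)]
PositiveInt = Annotated[int, Field(gt=0)]

# Specific types
Host = NonEmptyStr
Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]

# Logging level names accepted by set_logging_level (exact member set is an assumption)
LoggingLevel = Literal["DEBUG", "INFO", "WARN", "WARNING", "ERROR", "CRITICAL"]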

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  1. To test the new logging level validation / setting utilities:
     1. Set logging_level to "DEBUG" in all applicable components in clp-config.yml. Started the package successfully and observed in the clp-compression_scheduler Docker container logs that debug logs were printed, e.g.:

        2025-09-30 06:20:48,808 compression_scheduler [DEBUG] Search and schedule new tasks
        2025-09-30 06:20:48,808 compression_scheduler [DEBUG] Poll running jobs

     2. Set logging_level to "INVALID" in all applicable components and observed that the package failed to start due to a validation error.
  2. Set out-of-bounds / invalid values, ran start-clp.sh, and observed that the script fails with validation errors (see the sketch after this list):
     • NonEmptyStr: ""
     • PositiveFloat: 0, -1, -0.1, false. (Also tried 1, which is an integer; the input was considered valid, as expected.)
     • PositiveInt: 0, -1, -0.1, false
     • Host: ""
     • Port: 0, -1, 65536. (Also tried 1023 as a non-root user: the script was able to start, but port binding failed due to lack of permission.)
     • ZstdCompressionLevel: 0, 20
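
As a minimal illustration of the behaviour checked above (not code from the PR; the model and field names below are made up), out-of-range values fail Pydantic validation while in-range ones pass:

from typing import Annotated

from pydantic import BaseModel, Field, ValidationError

Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]

class DemoConfig(BaseModel):  # hypothetical model, for illustration only
    port: Port = 3306
    compression_level: ZstdCompressionLevel = 3

DemoConfig(port=65535, compression_level=19)  # accepted: both values are in range

try:
    DemoConfig(port=65536, compression_level=20)
except ValidationError as e:
    print(e)  # reports both out-of-range fields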

Summary by CodeRabbit

  • New Features

    • Added PRESTO support for the query engine.
    • Standardized config serialisation to primitive values (enums emit their values).
    • Strongly typed config fields (e.g., host, port, compression level) for clearer validation.
  • Refactor

    • Simplified logging: set the logging level directly with a level string; invalid inputs now fall back to INFO.
    • Removed legacy logging helpers and in-model validators in favour of type-based constraints across configs.

…ost type and removing duplicated validators.
@junhaoliao requested a review from a team as a code owner on September 30, 2025 06:47
coderabbitai bot (Contributor) commented Sep 30, 2025

Walkthrough

Refactors configuration and logging utilities. Adds strongly typed aliases, enums, and consistent primitive-serialisation methods across config models; extends QueryEngine with PRESTO and introduces DatabaseEngine. Updates validation to type-driven constraints. Simplifies logging by introducing a LoggingLevel Literal and removing legacy validation helpers.

Changes

Cohort / File(s) | Summary

Typed config, enums, and serialisation
components/clp-py-utils/clp_py_utils/clp_config.py
Adds NonEmptyStr/PositiveInt/PositiveFloat aliases; Host, Port, ZstdCompressionLevel; a DatabaseEngine enum; extends QueryEngine with PRESTO. Updates models (Package, Database, S3Config, Redis, Queue, etc.) to use typed fields and adds dump_to_primitive_dict emitting enum values. Adjusts validators to model/type-based constraints; execution_container is now Optional[NonEmptyStr].

Logging level literal and validation simplification
components/clp-py-utils/clp_py_utils/clp_logging.py
Introduces a LoggingLevel Literal; removes the LEVEL map, get_valid_logging_level, and is_valid_logging_level. Updates set_logging_level to validate against the Literal and default to INFO on invalid input; sets the logger level directly from the provided string.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Caller
  participant CFG as Config Model
  participant E as Enum Field(s)

  U->>CFG: dump_to_primitive_dict()
  activate CFG
  loop For each field
    alt Field is Enum
      CFG->>E: .value
      E-->>CFG: primitive value
    else Field is nested model
      CFG->>CFG: recurse dump_to_primitive_dict()
    else Primitive/alias
      CFG->>CFG: include as-is
    end
  end
  CFG-->>U: dict of primitives (enums as values)
  deactivate CFG
sequenceDiagram
  autonumber
  participant U as Caller
  participant L as set_logging_level(level)
  participant T as LoggingLevel Literal

  U->>L: level (str)
  L->>T: validate level in get_args(T)
  alt valid
    L-->>U: set logger level to level
  else invalid
    Note over L: Fallback behaviour
    L-->>U: set logger level to "INFO"
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 5.56%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately summarizes the core refactoring of the configuration types and validation logic in the clp-package to use Pydantic V2, and it clearly references the related issue number for context.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.13.1)
components/clp-py-utils/clp_py_utils/clp_config.py

ruff failed
Cause: Failed to load configuration /ruff.toml
Cause: Failed to parse /ruff.toml
Cause: TOML parse error at line 26, column 3
   |
26 |   "RSE100", # Use of assert detected
   |   ^^^^^^^^
Unknown rule selector: RSE100

components/clp-py-utils/clp_py_utils/clp_logging.py

ruff failed
Cause: Failed to load configuration /ruff.toml
Cause: Failed to parse /ruff.toml
Cause: TOML parse error at line 26, column 3
   |
26 |   "RSE100", # Use of assert detected
   |   ^^^^^^^^
Unknown rule selector: RSE100



@junhaoliao changed the title refactor(config): Migrate config types and validation to Pydantic V2 (fixes #1342). refactor(clp-package): Migrate config types and validation to Pydantic V2 (fixes #1342). Sep 30, 2025
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

210-231: Align CLP connection “type” with configured engine or document override
Hardcoding "type": DatabaseEngine.MYSQL.value in get_clp_connection_params_and_type ignores the configured self.type; either switch to self.type.value or add a note explaining why CLP core only recognizes MySQL.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62bea59 and 6a584e0.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (17 hunks)
  • components/clp-py-utils/clp_py_utils/clp_logging.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (12)
components/clp-py-utils/clp_py_utils/clp_config.py (11)

102-109: LGTM! Well-defined type aliases with appropriate constraints.

The type aliases provide clear, reusable constraints:

  • NonEmptyStr, PositiveFloat, PositiveInt for generic validation
  • Host, Port, ZstdCompressionLevel for domain-specific constraints

The port constraint gt=0, lt=2**16 correctly validates the range [1, 65535], and the Zstd compression level range [1, 19] aligns with the library's valid levels.


117-120: LGTM! DatabaseEngine enum properly defined.

The DatabaseEngine enum using KebabCaseStrEnum correctly defines MARIADB and MYSQL as the supported database types, consistent with the PR's migration to stronger type validation.


144-164: LGTM! Comprehensive engine compatibility validation.

The validate_query_engine_package_compatibility validator ensures:

  • CLP/CLP_S query engines match their corresponding storage engines
  • PRESTO query engine only works with CLP_S storage
  • Unsupported combinations are rejected with clear error messages

This prevents invalid configuration combinations at validation time.


166-170: LGTM! Consistent primitive serialization.

The dump_to_primitive_dict method correctly serializes enum fields to their primitive string values, enabling JSON/YAML export without Pydantic-specific types.
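
A sketch of that pattern, pieced together from the hunk quoted further down in this thread (lines +168 to +169); the surrounding model_dump() call and the return are assumptions:

def dump_to_primitive_dict(self):
    d = self.model_dump()
    # Enum members aren't primitives by default, so emit their string values
    d["storage_engine"] = d["storage_engine"].value
    d["query_engine"] = d["query_engine"].value
    return d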


258-277: LGTM! Scheduler and worker classes properly migrated.

All scheduler and worker classes correctly use:

  • PositiveFloat for polling delays
  • PositiveInt for numeric parameters
  • LoggingLevel for logging level validation
  • Host and Port where applicable

The type migration is consistent and well-structured.


279-353: LGTM! Redis and Queue classes properly migrated.

Both classes correctly:

  • Use Host and Port type aliases for network parameters
  • Apply NonEmptyStr to username fields where appropriate
  • Exclude credentials from dump_to_primitive_dict for security
  • Maintain credential loading logic from files and environment

The migration preserves security-sensitive serialization logic.


356-404: LGTM! AWS authentication classes properly migrated.

The AWS-related classes correctly:

  • Use NonEmptyStr for credentials and config fields to prevent empty strings
  • Maintain comprehensive validation logic in validate_authentication (lines 372-397)
  • Ensure mutual exclusivity and conditional requirements for auth types

The validation logic properly handles all auth type combinations.


508-542: LGTM! Output configuration classes properly migrated.

Both ArchiveOutput and StreamOutput correctly:

  • Use PositiveInt for size parameters
  • Apply ZstdCompressionLevel for compression validation (line 514)
  • Use Optional[PositiveInt] for nullable numeric fields
  • Implement dump_to_primitive_dict for nested storage serialization

The type constraints ensure valid compression and sizing parameters.


545-567: LGTM! UI and service classes properly migrated.

The WebUi, SweepInterval, GarbageCollector, and Presto classes correctly apply:

  • Host and Port for network configuration
  • NonEmptyStr for collection names
  • PositiveInt for rates and intervals
  • LoggingLevel for logging configuration

All type migrations are appropriate for their use cases.


734-757: LGTM! CLPConfig serialization properly handles nested objects.

The dump_to_primitive_dict method correctly:

  • Identifies fields requiring custom serialization (enums, nested configs)
  • Delegates to nested dump_to_primitive_dict methods for complex objects
  • Converts pathlib.Path objects to strings for primitive export
  • Handles optional paths appropriately

This provides consistent primitive serialization across the entire config hierarchy.
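
Roughly, the pattern looks like the sketch below; the field names are illustrative assumptions, not the actual CLPConfig attributes:

def dump_to_primitive_dict(self):
    d = self.model_dump()
    # Nested configs implement their own primitive dumps; delegate to them
    d["package"] = self.package.dump_to_primitive_dict()  # illustrative field name
    d["database"] = self.database.dump_to_primitive_dict()  # illustrative field name
    # pathlib.Path values aren't primitives; stringify them for YAML/JSON export
    d["logs_directory"] = str(self.logs_directory)  # illustrative field name
    if self.aws_config_directory is not None:  # illustrative optional path
        d["aws_config_directory"] = str(self.aws_config_directory)
    return d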


759-767: LGTM! Presto configuration validation enforces requirement.

The validate_presto_config validator correctly ensures that the presto configuration is non-null when the query engine is set to PRESTO, preventing incomplete configuration at validation time.

components/clp-py-utils/clp_py_utils/clp_logging.py (1)

4-11: No issue with "WARN" literal. Testing confirms logger.setLevel("WARN") is valid and maps to the WARNING level.

Likely an incorrect or invalid review comment.
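
A quick standard-library check of this claim:

import logging

logger = logging.getLogger("warn-demo")
logger.setLevel("WARN")  # string levels are accepted since Python 3.2
assert logger.level == logging.WARNING  # "WARN" maps to the WARNING constant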

Comment on lines 29 to +35

 def set_logging_level(logger: logging.Logger, level: str):
-    if not is_valid_logging_level(level):
+    if level not in get_args(LoggingLevel):
         logger.warning(f"Invalid logging level: {level}, using INFO as default")
         logger.setLevel(logging.INFO)
         return
-    logger.setLevel(LOGGING_LEVEL_MAPPING[level])
+
+    logger.setLevel(level)
@coderabbitai bot (Contributor) commented Sep 30, 2025

🧹 Nitpick | 🔵 Trivial

Consider mapping string levels to logging constants.

The current implementation passes the level string directly to logger.setLevel(). While this works for standard level names, it relies on Python's internal string-to-level mapping. For robustness and clarity, consider explicitly mapping to logging constants.

Apply this diff to use explicit constant mapping:

 def set_logging_level(logger: logging.Logger, level: str):
     if level not in get_args(LoggingLevel):
         logger.warning(f"Invalid logging level: {level}, using INFO as default")
         logger.setLevel(logging.INFO)
         return
-
-    logger.setLevel(level)
+    
+    level_map = {
+        "DEBUG": logging.DEBUG,
+        "INFO": logging.INFO,
+        "WARN": logging.WARNING,
+        "WARNING": logging.WARNING,
+        "ERROR": logging.ERROR,
+        "CRITICAL": logging.CRITICAL,
+    }
+    logger.setLevel(level_map[level])

This approach provides explicit control over level mapping and handles the "WARN" → logging.WARNING conversion explicitly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (replacing lines 29-35 shown above with):

def set_logging_level(logger: logging.Logger, level: str):
    if level not in get_args(LoggingLevel):
        logger.warning(f"Invalid logging level: {level}, using INFO as default")
        logger.setLevel(logging.INFO)
        return

    level_map = {
        "DEBUG": logging.DEBUG,
        "INFO": logging.INFO,
        "WARN": logging.WARNING,
        "WARNING": logging.WARNING,
        "ERROR": logging.ERROR,
        "CRITICAL": logging.CRITICAL,
    }
    logger.setLevel(level_map[level])
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_logging.py around lines 29 to 35,
the function currently calls logger.setLevel(level) with a string; change this
to explicitly map the validated string to the logging module constants (e.g.,
{"CRITICAL": logging.CRITICAL, "ERROR": logging.ERROR, "WARNING":
logging.WARNING, "WARN": logging.WARNING, "INFO": logging.INFO, "DEBUG":
logging.DEBUG, "NOTSET": logging.NOTSET}), normalize the input to upper() before
lookup, use logging.INFO as the fallback for unknown values, and then call
logger.setLevel(mapped_constant) instead of passing the raw string.

@junhaoliao (Member Author) commented:

The setLevel method accepts both str and int logging levels. When the argument is a str, the method maps it to the corresponding numeric logging value before setting it, so we should not need to do such mapping in our code.

@coderabbitai bot (Contributor) replied:

@junhaoliao, you're absolutely right! The logging.Logger.setLevel() method does accept string level names directly and handles the internal mapping to numeric values. The current implementation is correct and doesn't require explicit constant mapping. Thank you for the clarification!

@junhaoliao (Member Author) replied:

https://docs.python.org/3/library/logging.html#logging.Logger.setLevel

Changed in version 3.2: The level parameter now accepts a string representation of the level such as ‘INFO’ as an alternative to the integer constants such as INFO. Note, however, that levels are internally stored as integers, and methods such as e.g. getEffectiveLevel() and isEnabledFor() will return/expect to be passed integers.

@coderabbitai bot (Contributor) replied:

@junhaoliao, thank you for providing the official documentation link! It confirms that the current implementation is correct and leverages Python's built-in string-to-level mapping that has been available since version 3.2. The PR's approach is the right one.

PositiveFloat = Annotated[float, Field(gt=0)]
PositiveInt = Annotated[int, Field(gt=0)]
# Specific types
Host = NonEmptyStr
@junhaoliao (Member Author) commented:

The alias is not strictly necessary in this PR, but it serves as a marker for the host type. In a future PR we can replace it with DomainStr: https://github.com/pydantic/pydantic-extra-types/blob/main/pydantic_extra_types/domain.py
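
For reference, a sketch of that future swap, assuming pydantic-extra-types is added as a dependency:

from pydantic_extra_types.domain import DomainStr

Host = DomainStr  # would replace the current Host = NonEmptyStr alias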

# Specific types
Host = NonEmptyStr
Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]
@junhaoliao (Member Author) commented:

cc @Eden-D-Zhang @kirkrodrigues

Technically, zstd compression levels range from 1 up to ZSTD_maxCLevel(), which is currently 22. I believe we can also use 0 to specify a "no compression" level.

Since this PR is strictly about Pydantic validation refactoring, the range is directly inherited from #774. If we want to update the range, we can file an issue and do it in a separate PR.

Comment on lines +168 to +169
d["storage_engine"] = d["storage_engine"].value
d["query_engine"] = d["query_engine"].value
@junhaoliao (Member Author) commented:

Pydantic by default doesn't know how to serialize these values, so we have to add this.

Comment on lines +118 to +119
MARIADB = auto()
MYSQL = auto()
@junhaoliao (Member Author) commented:

I can't be sure whether it's a good idea to write MARIA_DB and MY_SQL, so I didn't.

@junhaoliao (Member Author) replied:

Actually, we shouldn't, or the enum values would become maria-db and my-sql.

 def dump_to_primitive_dict(self):
-    return self.model_dump(exclude={"username", "password"})
+    d = self.model_dump(exclude={"username", "password"})
+    d["type"] = d["type"].value
@junhaoliao (Member Author) commented:

Pydantic by default doesn't know how to serialize the DatabaseEngine type, so we have to add this.

host: Host = "localhost"
port: Port = 7000
jobs_poll_delay: PositiveFloat = 0.1 # seconds
num_archives_to_search_per_sub_job: PositiveInt = 16
@junhaoliao (Member Author) commented:

Should an upper bound be added (in a future PR)?

auto_commit: bool = False
compress: bool = True

username: Optional[str] = None
@junhaoliao (Member Author) commented:

I didn't change this to Optional[NonEmptyStr] because it's possible for a MySQL DB username to be an empty string (though it's usually not a good idea).

Comment on lines +555 to +556
archive: PositiveInt = 60
search_result: PositiveInt = 30
@junhaoliao (Member Author) commented:

This should be a better syntax for specifying shared types, with default values then assigned to individual fields as needed.

@junhaoliao removed the request for review from Eden-D-Zhang on October 1, 2025 19:09