
Conversation

@junhaoliao (Member) commented Sep 30, 2025

Description

This pull request migrates the type system and validation in clp-py-utils/clp_config.py and related modules to use Pydantic V2's new Annotated syntax and stronger type checking. Key changes include:

  • Replacing legacy validator functions with Pydantic Field constraints and Annotated type aliases like NonEmptyStr, PositiveInt, PositiveFloat, Host, Port, and ZstdCompressionLevel (sketched after this list).
  • Introducing a strict DatabaseEngine enum based on KebabCaseStrEnum.
  • Adding a LoggingLevel Literal that replaces the previous map-based logging-level validation.
  • Refactoring constructor arguments and class attributes to these stricter types, improving type safety across all config objects.
  • Removing legacy field_validator and model_validator code that can now be handled by Pydantic's built-in validation.
  • Updating all classes (Database, Package, CompressionScheduler, QueryScheduler, Redis, ResultsCache, etc.) to use new type aliases and enums.
  • Adjusting utility and dump methods to work with the new type structure.
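
For reference, a minimal sketch of the new aliases. The Host, Port, and ZstdCompressionLevel definitions match the diff hunks quoted in the review thread below; the NonEmptyStr constraint and the exact LoggingLevel members are assumptions.

from typing import Annotated, Literal

from pydantic import Field

NonEmptyStr = Annotated[str, Field(min_length=1)]  # assumption: non-emptiness enforced via min_length
PositiveFloat = Annotated[float, Field(gt=0)]
PositiveInt = Annotated[int, Field(gt=0)]

# Specific types
Host = NonEmptyStr
Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]

# Logging level names accepted by set_logging_level (exact member set is an assumption)
LoggingLevel = Literal["DEBUG", "INFO", "WARN", "WARNING", "ERROR", "CRITICAL"]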

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  1. To test the new logging level validation / setting utilities:
     1. Set logging_level to "DEBUG" in all applicable components in clp-config.yml. Started the package successfully and observed in the clp-compression_scheduler Docker container logs that debug logs were printed, e.g.:

        2025-09-30 06:20:48,808 compression_scheduler [DEBUG] Search and schedule new tasks
        2025-09-30 06:20:48,808 compression_scheduler [DEBUG] Poll running jobs

     2. Set logging_level to "INVALID" in all applicable components and observed that the package failed to start due to a validation error.
  2. Set out-of-bounds / invalid values, ran start-clp.sh, and observed that the script fails with validation errors (see the sketch after this list):
     • NonEmptyStr: ""
     • PositiveFloat: 0, -1, -0.1, false. (Also tried 1, which is an integer; the input was considered valid, as expected.)
     • PositiveInt: 0, -1, -0.1, false
     • Host: ""
     • Port: 0, -1, 65536. (Also tried 1023 as a non-root user: the script was able to start, but port binding failed due to lack of permission.)
     • ZstdCompressionLevel: 0, 20
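
As a minimal illustration of the behaviour checked above (not code from the PR; the model and field names below are made up), out-of-range values fail Pydantic validation while in-range ones pass:

from typing import Annotated

from pydantic import BaseModel, Field, ValidationError

Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]

class DemoConfig(BaseModel):  # hypothetical model, for illustration only
    port: Port = 3306
    compression_level: ZstdCompressionLevel = 3

DemoConfig(port=65535, compression_level=19)  # accepted: both values are in range

try:
    DemoConfig(port=65536, compression_level=20)
except ValidationError as e:
    print(e)  # reports both out-of-range fields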

Summary by CodeRabbit

  • New Features

    • Added PRESTO support for the query engine.
    • Standardized config serialisation to primitive values (enums emit their values).
    • Strongly typed config fields (e.g., host, port, compression level) for clearer validation.
  • Refactor

    • Simplified logging: set the logging level directly with a level string; invalid inputs now fall back to INFO.
    • Removed legacy logging helpers and in-model validators in favour of type-based constraints across configs.

…ost type and removing duplicated validators.
@junhaoliao requested a review from a team as a code owner on September 30, 2025 06:47
coderabbitai bot (Contributor) commented Sep 30, 2025

Walkthrough

Refactors configuration and logging utilities. Adds strongly typed aliases, enums, and consistent primitive-serialisation methods across config models; extends QueryEngine with PRESTO and introduces DatabaseEngine. Updates validation to type-driven constraints. Simplifies logging by introducing a LoggingLevel Literal and removing legacy validation helpers.

Changes

Cohort / File(s) | Summary

Typed config, enums, and serialisation
components/clp-py-utils/clp_py_utils/clp_config.py
Adds NonEmptyStr/PositiveInt/PositiveFloat aliases; Host, Port, ZstdCompressionLevel; a DatabaseEngine enum; extends QueryEngine with PRESTO. Updates models (Package, Database, S3Config, Redis, Queue, etc.) to use typed fields and adds dump_to_primitive_dict emitting enum values. Adjusts validators to model/type-based constraints; execution_container is now Optional[NonEmptyStr].

Logging level literal and validation simplification
components/clp-py-utils/clp_py_utils/clp_logging.py
Introduces a LoggingLevel Literal; removes the LEVEL map, get_valid_logging_level, and is_valid_logging_level. Updates set_logging_level to validate against the Literal and default to INFO on invalid input; sets the logger level directly from the provided string.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Caller
  participant CFG as Config Model
  participant E as Enum Field(s)

  U->>CFG: dump_to_primitive_dict()
  activate CFG
  loop For each field
    alt Field is Enum
      CFG->>E: .value
      E-->>CFG: primitive value
    else Field is nested model
      CFG->>CFG: recurse dump_to_primitive_dict()
    else Primitive/alias
      CFG->>CFG: include as-is
    end
  end
  CFG-->>U: dict of primitives (enums as values)
  deactivate CFG
sequenceDiagram
  autonumber
  participant U as Caller
  participant L as set_logging_level(level)
  participant T as LoggingLevel Literal

  U->>L: level (str)
  L->>T: validate level in get_args(T)
  alt valid
    L-->>U: set logger level to level
  else invalid
    Note over L: Fallback behaviour
    L-->>U: set logger level to "INFO"
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 5.56%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately summarizes the core refactoring of the configuration types and validation logic in the clp-package to use Pydantic V2, and it clearly references the related issue number for context.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.13.1)
components/clp-py-utils/clp_py_utils/clp_config.py

ruff failed
Cause: Failed to load configuration /ruff.toml
Cause: Failed to parse /ruff.toml
Cause: TOML parse error at line 26, column 3
   |
26 |   "RSE100", # Use of assert detected
   |   ^^^^^^^^
Unknown rule selector: RSE100

components/clp-py-utils/clp_py_utils/clp_logging.py

ruff failed
Cause: Failed to load configuration /ruff.toml
Cause: Failed to parse /ruff.toml
Cause: TOML parse error at line 26, column 3
   |
26 |   "RSE100", # Use of assert detected
   |   ^^^^^^^^
Unknown rule selector: RSE100



@junhaoliao changed the title refactor(config): Migrate config types and validation to Pydantic V2 (fixes #1342). refactor(clp-package): Migrate config types and validation to Pydantic V2 (fixes #1342). Sep 30, 2025
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)

210-231: Align CLP connection “type” with configured engine or document override
Hardcoding "type": DatabaseEngine.MYSQL.value in get_clp_connection_params_and_type ignores the configured self.type; either switch to self.type.value or add a note explaining why CLP core only recognizes MySQL.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62bea59 and 6a584e0.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/clp_config.py (17 hunks)
  • components/clp-py-utils/clp_py_utils/clp_logging.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (12)
components/clp-py-utils/clp_py_utils/clp_config.py (11)

102-109: LGTM! Well-defined type aliases with appropriate constraints.

The type aliases provide clear, reusable constraints:

  • NonEmptyStr, PositiveFloat, PositiveInt for generic validation
  • Host, Port, ZstdCompressionLevel for domain-specific constraints

The port constraint gt=0, lt=2**16 correctly validates the range [1, 65535], and the Zstd compression level range [1, 19] aligns with the library's valid levels.


117-120: LGTM! DatabaseEngine enum properly defined.

The DatabaseEngine enum using KebabCaseStrEnum correctly defines MARIADB and MYSQL as the supported database types, consistent with the PR's migration to stronger type validation.


144-164: LGTM! Comprehensive engine compatibility validation.

The validate_query_engine_package_compatibility validator ensures:

  • CLP/CLP_S query engines match their corresponding storage engines
  • PRESTO query engine only works with CLP_S storage
  • Unsupported combinations are rejected with clear error messages

This prevents invalid configuration combinations at validation time.


166-170: LGTM! Consistent primitive serialization.

The dump_to_primitive_dict method correctly serializes enum fields to their primitive string values, enabling JSON/YAML export without Pydantic-specific types.
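
A sketch of that pattern, pieced together from the hunk quoted further down in this thread (lines +168 to +169); the surrounding model_dump() call and the return are assumptions:

def dump_to_primitive_dict(self):
    d = self.model_dump()
    # Enum members aren't primitives by default, so emit their string values
    d["storage_engine"] = d["storage_engine"].value
    d["query_engine"] = d["query_engine"].value
    return d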


258-277: LGTM! Scheduler and worker classes properly migrated.

All scheduler and worker classes correctly use:

  • PositiveFloat for polling delays
  • PositiveInt for numeric parameters
  • LoggingLevel for logging level validation
  • Host and Port where applicable

The type migration is consistent and well-structured.


279-353: LGTM! Redis and Queue classes properly migrated.

Both classes correctly:

  • Use Host and Port type aliases for network parameters
  • Apply NonEmptyStr to username fields where appropriate
  • Exclude credentials from dump_to_primitive_dict for security
  • Maintain credential loading logic from files and environment

The migration preserves security-sensitive serialization logic.


356-404: LGTM! AWS authentication classes properly migrated.

The AWS-related classes correctly:

  • Use NonEmptyStr for credentials and config fields to prevent empty strings
  • Maintain comprehensive validation logic in validate_authentication (lines 372-397)
  • Ensure mutual exclusivity and conditional requirements for auth types

The validation logic properly handles all auth type combinations.


508-542: LGTM! Output configuration classes properly migrated.

Both ArchiveOutput and StreamOutput correctly:

  • Use PositiveInt for size parameters
  • Apply ZstdCompressionLevel for compression validation (line 514)
  • Use Optional[PositiveInt] for nullable numeric fields
  • Implement dump_to_primitive_dict for nested storage serialization

The type constraints ensure valid compression and sizing parameters.


545-567: LGTM! UI and service classes properly migrated.

The WebUi, SweepInterval, GarbageCollector, and Presto classes correctly apply:

  • Host and Port for network configuration
  • NonEmptyStr for collection names
  • PositiveInt for rates and intervals
  • LoggingLevel for logging configuration

All type migrations are appropriate for their use cases.


734-757: LGTM! CLPConfig serialization properly handles nested objects.

The dump_to_primitive_dict method correctly:

  • Identifies fields requiring custom serialization (enums, nested configs)
  • Delegates to nested dump_to_primitive_dict methods for complex objects
  • Converts pathlib.Path objects to strings for primitive export
  • Handles optional paths appropriately

This provides consistent primitive serialization across the entire config hierarchy.
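
Roughly, the pattern looks like the sketch below; the field names are illustrative assumptions, not the actual CLPConfig attributes:

def dump_to_primitive_dict(self):
    d = self.model_dump()
    # Nested configs implement their own primitive dumps; delegate to them
    d["package"] = self.package.dump_to_primitive_dict()  # illustrative field name
    d["database"] = self.database.dump_to_primitive_dict()  # illustrative field name
    # pathlib.Path values aren't primitives; stringify them for YAML/JSON export
    d["logs_directory"] = str(self.logs_directory)  # illustrative field name
    if self.aws_config_directory is not None:  # illustrative optional path
        d["aws_config_directory"] = str(self.aws_config_directory)
    return d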


759-767: LGTM! Presto configuration validation enforces requirement.

The validate_presto_config validator correctly ensures that the presto configuration is non-null when the query engine is set to PRESTO, preventing incomplete configuration at validation time.

components/clp-py-utils/clp_py_utils/clp_logging.py (1)

4-11: No issue with "WARN" literal. Testing confirms logger.setLevel("WARN") is valid and maps to the WARNING level.

Likely an incorrect or invalid review comment.
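
A quick standard-library check of this claim:

import logging

logger = logging.getLogger("warn-demo")
logger.setLevel("WARN")  # string levels are accepted since Python 3.2
assert logger.level == logging.WARNING  # "WARN" maps to the WARNING constant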

Comment on lines 29 to +35

 def set_logging_level(logger: logging.Logger, level: str):
-    if not is_valid_logging_level(level):
+    if level not in get_args(LoggingLevel):
         logger.warning(f"Invalid logging level: {level}, using INFO as default")
         logger.setLevel(logging.INFO)
         return
-    logger.setLevel(LOGGING_LEVEL_MAPPING[level])
+
+    logger.setLevel(level)
@coderabbitai bot (Contributor) commented Sep 30, 2025

🧹 Nitpick | 🔵 Trivial

Consider mapping string levels to logging constants.

The current implementation passes the level string directly to logger.setLevel(). While this works for standard level names, it relies on Python's internal string-to-level mapping. For robustness and clarity, consider explicitly mapping to logging constants.

Apply this diff to use explicit constant mapping:

 def set_logging_level(logger: logging.Logger, level: str):
     if level not in get_args(LoggingLevel):
         logger.warning(f"Invalid logging level: {level}, using INFO as default")
         logger.setLevel(logging.INFO)
         return
-
-    logger.setLevel(level)
+    
+    level_map = {
+        "DEBUG": logging.DEBUG,
+        "INFO": logging.INFO,
+        "WARN": logging.WARNING,
+        "WARNING": logging.WARNING,
+        "ERROR": logging.ERROR,
+        "CRITICAL": logging.CRITICAL,
+    }
+    logger.setLevel(level_map[level])

This approach provides explicit control over level mapping and handles the "WARN" → logging.WARNING conversion explicitly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (replacing lines 29-35 shown above with):

def set_logging_level(logger: logging.Logger, level: str):
    if level not in get_args(LoggingLevel):
        logger.warning(f"Invalid logging level: {level}, using INFO as default")
        logger.setLevel(logging.INFO)
        return

    level_map = {
        "DEBUG": logging.DEBUG,
        "INFO": logging.INFO,
        "WARN": logging.WARNING,
        "WARNING": logging.WARNING,
        "ERROR": logging.ERROR,
        "CRITICAL": logging.CRITICAL,
    }
    logger.setLevel(level_map[level])
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_logging.py around lines 29 to 35,
the function currently calls logger.setLevel(level) with a string; change this
to explicitly map the validated string to the logging module constants (e.g.,
{"CRITICAL": logging.CRITICAL, "ERROR": logging.ERROR, "WARNING":
logging.WARNING, "WARN": logging.WARNING, "INFO": logging.INFO, "DEBUG":
logging.DEBUG, "NOTSET": logging.NOTSET}), normalize the input to upper() before
lookup, use logging.INFO as the fallback for unknown values, and then call
logger.setLevel(mapped_constant) instead of passing the raw string.

@junhaoliao (Member Author) commented:

The setLevel method accepts both str and int logging levels. When the argument is a str, the method maps it to the corresponding numeric logging value before setting it, so we should not need to do such mapping in our code.

@coderabbitai bot (Contributor) replied:

@junhaoliao, you're absolutely right! The logging.Logger.setLevel() method does accept string level names directly and handles the internal mapping to numeric values. The current implementation is correct and doesn't require explicit constant mapping. Thank you for the clarification!

@junhaoliao (Member Author) replied:

https://docs.python.org/3/library/logging.html#logging.Logger.setLevel

Changed in version 3.2: The level parameter now accepts a string representation of the level such as ‘INFO’ as an alternative to the integer constants such as INFO. Note, however, that levels are internally stored as integers, and methods such as e.g. getEffectiveLevel() and isEnabledFor() will return/expect to be passed integers.

@coderabbitai bot (Contributor) replied:

@junhaoliao, thank you for providing the official documentation link! It confirms that the current implementation is correct and leverages Python's built-in string-to-level mapping that has been available since version 3.2. The PR's approach is the right one.

PositiveFloat = Annotated[float, Field(gt=0)]
PositiveInt = Annotated[int, Field(gt=0)]
# Specific types
Host = NonEmptyStr
@junhaoliao (Member Author) commented:

The alias is not strictly necessary in this PR, but it serves as a marker for the host type. In a future PR we can replace it with DomainStr: https://github.com/pydantic/pydantic-extra-types/blob/main/pydantic_extra_types/domain.py
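
For reference, a sketch of that future swap, assuming pydantic-extra-types is added as a dependency:

from pydantic_extra_types.domain import DomainStr

Host = DomainStr  # would replace the current Host = NonEmptyStr alias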

# Specific types
Host = NonEmptyStr
Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]
@junhaoliao (Member Author) commented:

cc @Eden-D-Zhang @kirkrodrigues

Technically, zstd compression levels range from 1 up to ZSTD_maxCLevel(), which is currently 22. I believe we can also use 0 to specify a "no compression" level.

Since this PR is strictly about Pydantic validation refactoring, the range is directly inherited from #774. If we want to update the range, we can file an issue and do it in a separate PR.

Comment on lines +168 to +169
d["storage_engine"] = d["storage_engine"].value
d["query_engine"] = d["query_engine"].value
@junhaoliao (Member Author) commented:

Pydantic by default doesn't know how to serialize these values, so we have to add this.

Comment on lines +118 to +119
MARIADB = auto()
MYSQL = auto()
@junhaoliao (Member Author) commented:

I can't be sure whether it's a good idea to write MARIA_DB and MY_SQL, so I didn't.

@junhaoliao (Member Author) replied:

Actually, we shouldn't, or the enum values would become maria-db and my-sql.

 def dump_to_primitive_dict(self):
-    return self.model_dump(exclude={"username", "password"})
+    d = self.model_dump(exclude={"username", "password"})
+    d["type"] = d["type"].value
@junhaoliao (Member Author) commented:

Pydantic by default doesn't know how to serialize the DatabaseEngine type, so we have to add this.

host: Host = "localhost"
port: Port = 7000
jobs_poll_delay: PositiveFloat = 0.1 # seconds
num_archives_to_search_per_sub_job: PositiveInt = 16
@junhaoliao (Member Author) commented:

Should an upper bound be added (in a future PR)?

auto_commit: bool = False
compress: bool = True

username: Optional[str] = None
@junhaoliao (Member Author) commented:

I didn't change this to Optional[NonEmptyStr] because it's possible for a MySQL DB username to be an empty string (though it's usually not a good idea).

Comment on lines +555 to +556
archive: PositiveInt = 60
search_result: PositiveInt = 30
@junhaoliao (Member Author) commented:

This should be a better syntax for specifying shared types, with default values then assigned to individual fields as needed.

@junhaoliao removed the request for review from Eden-D-Zhang on October 1, 2025 19:09