refactor(clp-package): Migrate config types and validation to Pydantic V2 (fixes #1342). #1356
base: main
Conversation
Walkthrough
Refactors configuration and logging utilities. Adds strong-typed aliases, enums, and consistent primitive-serialization methods across config models; extends QueryEngine with PRESTO and introduces DatabaseEngine. Updates validation to type-driven constraints. Simplifies logging by introducing a LoggingLevel Literal and removing legacy validation helpers.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant U as Caller
participant CFG as Config Model
participant E as Enum Field(s)
U->>CFG: dump_to_primitive_dict()
activate CFG
loop For each field
alt Field is Enum
CFG->>E: .value
E-->>CFG: primitive value
else Field is nested model
CFG->>CFG: recurse dump_to_primitive_dict()
else Primitive/alias
CFG->>CFG: include as-is
end
end
CFG-->>U: dict of primitives (enums as values)
deactivate CFG
sequenceDiagram
autonumber
participant U as Caller
participant L as set_logging_level(level)
participant T as LoggingLevel Literal
U->>L: level (str)
L->>T: validate level in get_args(T)
alt valid
L-->>U: set logger level to level
else invalid
Note over L: Fallback behaviour
L-->>U: set logger level to "INFO"
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Warning
There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 Ruff (0.13.1)
components/clp-py-utils/clp_py_utils/clp_config.py: ruff failed
components/clp-py-utils/clp_py_utils/clp_logging.py: ruff failed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
210-231: Align CLP connection “type” with configured engine or document override
Hardcoding `"type": DatabaseEngine.MYSQL.value` in `get_clp_connection_params_and_type` ignores the configured `self.type`; either switch to `self.type.value` or add a note explaining why CLP core only recognizes MySQL.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/clp-py-utils/clp_py_utils/clp_config.py (17 hunks)
components/clp-py-utils/clp_py_utils/clp_logging.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (12)
components/clp-py-utils/clp_py_utils/clp_config.py (11)
102-109: LGTM! Well-defined type aliases with appropriate constraints.
The type aliases provide clear, reusable constraints:
- `NonEmptyStr`, `PositiveFloat`, `PositiveInt` for generic validation
- `Host`, `Port`, `ZstdCompressionLevel` for domain-specific constraints
The port constraint `gt=0, lt=2**16` correctly validates the range [1, 65535], and the Zstd compression level range [1, 19] aligns with the library's valid levels.
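To illustrate how these Annotated aliases behave at validation time, here is a minimal, self-contained sketch; the `ExampleConfig` model is hypothetical and `NonEmptyStr` is shown with one plausible definition, while the `Port` and `ZstdCompressionLevel` constraints are quoted from the diff:

from typing import Annotated

from pydantic import BaseModel, Field, ValidationError

NonEmptyStr = Annotated[str, Field(min_length=1)]  # one plausible definition
Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]

class ExampleConfig(BaseModel):  # hypothetical model, not from the PR
    host: NonEmptyStr = "localhost"
    port: Port = 3306
    compression_level: ZstdCompressionLevel = 3

ExampleConfig(port=65535)  # accepted: the upper bound lt=2**16 keeps 65535 valid
try:
    ExampleConfig(port=65536, compression_level=20)  # both values violate their constraints
except ValidationError as e:
    print(e.error_count())  # 2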
117-120: LGTM! DatabaseEngine enum properly defined.
The `DatabaseEngine` enum using `KebabCaseStrEnum` correctly defines MARIADB and MYSQL as the supported database types, consistent with the PR's migration to stronger type validation.
144-164: LGTM! Comprehensive engine compatibility validation.
The `validate_query_engine_package_compatibility` validator ensures:
- CLP/CLP_S query engines match their corresponding storage engines
- PRESTO query engine only works with CLP_S storage
- Unsupported combinations are rejected with clear error messages
This prevents invalid configuration combinations at validation time.
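A minimal sketch of the validation pattern being described; the enums and field names below are simplified stand-ins, not the package's actual definitions:

from enum import Enum

from pydantic import BaseModel, model_validator

class StorageEngine(str, Enum):  # simplified stand-in
    CLP = "clp"
    CLP_S = "clp-s"

class QueryEngine(str, Enum):  # simplified stand-in
    CLP = "clp"
    CLP_S = "clp-s"
    PRESTO = "presto"

class PackageSketch(BaseModel):  # hypothetical model for illustration
    storage_engine: StorageEngine = StorageEngine.CLP_S
    query_engine: QueryEngine = QueryEngine.CLP_S

    @model_validator(mode="after")
    def validate_query_engine_package_compatibility(self) -> "PackageSketch":
        compatible = {
            QueryEngine.CLP: {StorageEngine.CLP},
            QueryEngine.CLP_S: {StorageEngine.CLP_S},
            QueryEngine.PRESTO: {StorageEngine.CLP_S},
        }
        if self.storage_engine not in compatible[self.query_engine]:
            raise ValueError(
                f"query_engine '{self.query_engine.value}' is incompatible with "
                f"storage_engine '{self.storage_engine.value}'"
            )
        return self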
166-170: LGTM! Consistent primitive serialization.
The `dump_to_primitive_dict` method correctly serializes enum fields to their primitive string values, enabling JSON/YAML export without Pydantic-specific types.
258-277: LGTM! Scheduler and worker classes properly migrated.
All scheduler and worker classes correctly use:
- `PositiveFloat` for polling delays
- `PositiveInt` for numeric parameters
- `LoggingLevel` for logging level validation
- `Host` and `Port` where applicable
The type migration is consistent and well-structured.
279-353: LGTM! Redis and Queue classes properly migrated.
Both classes correctly:
- Use `Host` and `Port` type aliases for network parameters
- Apply `NonEmptyStr` to username fields where appropriate
- Exclude credentials from `dump_to_primitive_dict` for security
- Maintain credential loading logic from files and environment
The migration preserves security-sensitive serialization logic.
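The credential-exclusion pattern described above can be sketched as follows (field names and defaults are illustrative, not copied from the PR):

from typing import Optional

from pydantic import BaseModel

class RedisSketch(BaseModel):  # simplified; the real class also loads credentials from files/env
    host: str = "localhost"
    port: int = 6379
    username: Optional[str] = None
    password: Optional[str] = None

    def dump_to_primitive_dict(self) -> dict:
        # Credentials are excluded so they never end up in serialized config output.
        return self.model_dump(exclude={"username", "password"})

print(RedisSketch(username="admin", password="secret").dump_to_primitive_dict())
# {'host': 'localhost', 'port': 6379}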
356-404: LGTM! AWS authentication classes properly migrated.
The AWS-related classes correctly:
- Use `NonEmptyStr` for credentials and config fields to prevent empty strings
- Maintain comprehensive validation logic in `validate_authentication` (lines 372-397)
- Ensure mutual exclusivity and conditional requirements for auth types
The validation logic properly handles all auth type combinations.
508-542: LGTM! Output configuration classes properly migrated.
Both `ArchiveOutput` and `StreamOutput` correctly:
- Use `PositiveInt` for size parameters
- Apply `ZstdCompressionLevel` for compression validation (line 514)
- Use `Optional[PositiveInt]` for nullable numeric fields
- Implement `dump_to_primitive_dict` for nested storage serialization
The type constraints ensure valid compression and sizing parameters.
545-567: LGTM! UI and service classes properly migrated.
The `WebUi`, `SweepInterval`, `GarbageCollector`, and `Presto` classes correctly apply:
- `Host` and `Port` for network configuration
- `NonEmptyStr` for collection names
- `PositiveInt` for rates and intervals
- `LoggingLevel` for logging configuration
All type migrations are appropriate for their use cases.
734-757: LGTM! CLPConfig serialization properly handles nested objects.
The `dump_to_primitive_dict` method correctly:
- Identifies fields requiring custom serialization (enums, nested configs)
- Delegates to nested `dump_to_primitive_dict` methods for complex objects
- Converts `pathlib.Path` objects to strings for primitive export
- Handles optional paths appropriately
This provides consistent primitive serialization across the entire config hierarchy.
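As a rough illustration of the delegation-plus-Path-conversion pattern (the field names here are assumptions; only the overall shape follows the review comment):

from pathlib import Path
from typing import Optional

from pydantic import BaseModel

class NestedOutputSketch(BaseModel):  # stand-in for a nested config such as ArchiveOutput
    target_archive_size: int = 256 * 1024 * 1024

    def dump_to_primitive_dict(self) -> dict:
        return self.model_dump()

class TopLevelSketch(BaseModel):  # stand-in for CLPConfig
    archive_output: NestedOutputSketch = NestedOutputSketch()
    logs_directory: Path = Path("var") / "log"
    data_staging_directory: Optional[Path] = None

    def dump_to_primitive_dict(self) -> dict:
        d = self.model_dump()
        # Delegate nested models to their own primitive serializers.
        d["archive_output"] = self.archive_output.dump_to_primitive_dict()
        # Convert Path objects to plain strings; leave unset optional paths as None.
        d["logs_directory"] = str(self.logs_directory)
        if self.data_staging_directory is not None:
            d["data_staging_directory"] = str(self.data_staging_directory)
        return d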
759-767: LGTM! Presto configuration validation enforces requirement.
The `validate_presto_config` validator correctly ensures that the `presto` configuration is non-null when the query engine is set to `PRESTO`, preventing incomplete configuration at validation time.
components/clp-py-utils/clp_py_utils/clp_logging.py (1)
4-11: No issue with the `"WARN"` literal. Testing confirms `logger.setLevel("WARN")` is valid and maps to the WARNING level.
Likely an incorrect or invalid review comment.
 def set_logging_level(logger: logging.Logger, level: str):
-    if not is_valid_logging_level(level):
+    if level not in get_args(LoggingLevel):
         logger.warning(f"Invalid logging level: {level}, using INFO as default")
         logger.setLevel(logging.INFO)
         return
-    logger.setLevel(LOGGING_LEVEL_MAPPING[level])
+
+    logger.setLevel(level)
🧹 Nitpick | 🔵 Trivial
Consider mapping string levels to logging constants.
The current implementation passes the level string directly to `logger.setLevel()`. While this works for standard level names, it relies on Python's internal string-to-level mapping. For robustness and clarity, consider explicitly mapping to logging constants.
Apply this diff to use explicit constant mapping:
def set_logging_level(logger: logging.Logger, level: str):
if level not in get_args(LoggingLevel):
logger.warning(f"Invalid logging level: {level}, using INFO as default")
logger.setLevel(logging.INFO)
return
-
- logger.setLevel(level)
+
+ level_map = {
+ "DEBUG": logging.DEBUG,
+ "INFO": logging.INFO,
+ "WARN": logging.WARNING,
+ "WARNING": logging.WARNING,
+ "ERROR": logging.ERROR,
+ "CRITICAL": logging.CRITICAL,
+ }
+ logger.setLevel(level_map[level])
This approach provides explicit control over level mapping and handles the `"WARN"` → `logging.WARNING` conversion explicitly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def set_logging_level(logger: logging.Logger, level: str):
    if level not in get_args(LoggingLevel):
        logger.warning(f"Invalid logging level: {level}, using INFO as default")
        logger.setLevel(logging.INFO)
        return

    level_map = {
        "DEBUG": logging.DEBUG,
        "INFO": logging.INFO,
        "WARN": logging.WARNING,
        "WARNING": logging.WARNING,
        "ERROR": logging.ERROR,
        "CRITICAL": logging.CRITICAL,
    }
    logger.setLevel(level_map[level])
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_logging.py around lines 29 to 35,
the function currently calls logger.setLevel(level) with a string; change this
to explicitly map the validated string to the logging module constants (e.g.,
{"CRITICAL": logging.CRITICAL, "ERROR": logging.ERROR, "WARNING":
logging.WARNING, "WARN": logging.WARNING, "INFO": logging.INFO, "DEBUG":
logging.DEBUG, "NOTSET": logging.NOTSET}), normalize the input to upper() before
lookup, use logging.INFO as the fallback for unknown values, and then call
logger.setLevel(mapped_constant) instead of passing the raw string.
The `setLevel` method accepts both str and int logging levels. When the argument is a str, the method maps it to the corresponding numeric logging value before setting it, so we should not need to do such mapping in our code.
@junhaoliao, you're absolutely right! The `logging.Logger.setLevel()` method does accept string level names directly and handles the internal mapping to numeric values. The current implementation is correct and doesn't require explicit constant mapping. Thank you for the clarification!
https://docs.python.org/3/library/logging.html#logging.Logger.setLevel
Changed in version 3.2: The level parameter now accepts a string representation of the level such as ‘INFO’ as an alternative to the integer constants such as INFO. Note, however, that levels are internally stored as integers, and methods such as e.g. getEffectiveLevel() and isEnabledFor() will return/expect to be passed integers.
@junhaoliao, thank you for providing the official documentation link! It confirms that the current implementation is correct and leverages Python's built-in string-to-level mapping that has been available since version 3.2. The PR's approach is the right one.
PositiveFloat = Annotated[float, Field(gt=0)]
PositiveInt = Annotated[int, Field(gt=0)]
# Specific types
Host = NonEmptyStr
The alias is not strictly necessary in this PR, but it serves as a marker for the host type. In a future PR we can replace it with `DomainStr`: https://github.com/pydantic/pydantic-extra-types/blob/main/pydantic_extra_types/domain.py
# Specific types
Host = NonEmptyStr
Port = Annotated[int, Field(gt=0, lt=2**16)]
ZstdCompressionLevel = Annotated[int, Field(ge=1, le=19)]
cc @Eden-D-Zhang @kirkrodrigues
Technically, zstd compression levels go from 1 up to `ZSTD_maxCLevel()`, which is currently 22. I believe we can also use `0` to specify a "no compression" level.
Since this PR is strictly about Pydantic validation refactoring, the range is directly inherited from #774. If we want to update the range, we can file an issue and do it in a separate PR.
d["storage_engine"] = d["storage_engine"].value | ||
d["query_engine"] = d["query_engine"].value |
Pydantic by default doesn't know how to serialize these enum values to primitives, so we have to add this.
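To make this concrete: with the default (Python-mode) `model_dump()`, Pydantic keeps the enum member rather than its primitive value, which is why the manual `.value` conversion above is needed before writing YAML. A small sketch with simplified stand-in types:

from enum import Enum

from pydantic import BaseModel

class QueryEngineSketch(str, Enum):  # stand-in for the package's KebabCaseStrEnum subclasses
    CLP_S = "clp-s"

class PackageSketch(BaseModel):  # hypothetical model for illustration
    query_engine: QueryEngineSketch = QueryEngineSketch.CLP_S

cfg = PackageSketch()
d = cfg.model_dump()
print(d)  # {'query_engine': <QueryEngineSketch.CLP_S: 'clp-s'>}
d["query_engine"] = d["query_engine"].value
print(d)  # {'query_engine': 'clp-s'}
print(cfg.model_dump(mode="json"))  # {'query_engine': 'clp-s'}; JSON mode also yields primitives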
MARIADB = auto()
MYSQL = auto()
I can't be sure if it's a good idea to write `MARIA_DB` and `MY_SQL`, so I didn't.
Actually, we shouldn't, or the enum values would become `maria-db` and `my-sql`.
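To make the naming concern concrete, here is a self-contained sketch of how a kebab-casing StrEnum derives values from member names; this mimics the behaviour of `KebabCaseStrEnum` but is not the package's actual implementation:

from enum import Enum, auto

class KebabCaseSketch(str, Enum):
    # auto() derives the value from the member name: lower-case it and map '_' to '-'.
    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        return name.lower().replace("_", "-")

class DatabaseEngineSketch(KebabCaseSketch):
    MARIADB = auto()   # -> "mariadb"
    MYSQL = auto()     # -> "mysql"

class DatabaseEngineUnderscored(KebabCaseSketch):
    MARIA_DB = auto()  # -> "maria-db" (why the underscored spellings were avoided)
    MY_SQL = auto()    # -> "my-sql"

print(DatabaseEngineSketch.MARIADB.value, DatabaseEngineUnderscored.MARIA_DB.value)
# mariadb maria-db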
 def dump_to_primitive_dict(self):
-    return self.model_dump(exclude={"username", "password"})
+    d = self.model_dump(exclude={"username", "password"})
+    d["type"] = d["type"].value
Pydantic by default doesn't know how to serialize type `DatabaseEngine`, so we have to add this.
host: Host = "localhost" | ||
port: Port = 7000 | ||
jobs_poll_delay: PositiveFloat = 0.1 # seconds | ||
num_archives_to_search_per_sub_job: PositiveInt = 16 |
Should an upper bound be added (in a future PR)?
auto_commit: bool = False
compress: bool = True

username: Optional[str] = None
I didn't change this to `Optional[NonEmptyStr]` because it's possible for a MySQL DB username to be an empty string (though usually it's not a good idea).
archive: PositiveInt = 60
search_result: PositiveInt = 30
This should be a better syntax for specifying shared types; default values can then be assigned specifically to individual fields.
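A small sketch of what this looks like in practice: the constrained type is declared once and reused, while each field keeps its own default (the model name here is hypothetical):

from typing import Annotated

from pydantic import BaseModel, Field

PositiveInt = Annotated[int, Field(gt=0)]  # shared constrained type, declared once

class SweepIntervalSketch(BaseModel):  # hypothetical model mirroring the fields above
    archive: PositiveInt = 60        # defaults stay field-specific
    search_result: PositiveInt = 30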
Description
This pull request migrates the type system and validation in `clp-py-utils/clp_config.py` and related modules to use Pydantic V2's new `Annotated` syntax and stronger type checking. Key changes include:
- `Field` constraints and `Annotated` type aliases like `NonEmptyStr`, `PositiveInt`, `PositiveFloat`, `Host`, `Port`, and `ZstdCompressionLevel`.
- A `DatabaseEngine` enum using `KebabCaseStrEnum`.
- A `LoggingLevel` `Literal`, which replaces the previous logging level validation method with a map.
Checklist
breaking change.
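A minimal sketch of the `LoggingLevel` Literal approach described above; the function body follows the diff shown earlier in the review, while the exact list of accepted level names is an assumption:

import logging
from typing import Literal, get_args

LoggingLevel = Literal["DEBUG", "INFO", "WARN", "WARNING", "ERROR", "CRITICAL"]

def set_logging_level(logger: logging.Logger, level: str) -> None:
    # Unknown names fall back to INFO with a warning, as exercised in the validation notes below.
    if level not in get_args(LoggingLevel):
        logger.warning(f"Invalid logging level: {level}, using INFO as default")
        logger.setLevel(logging.INFO)
        return

    logger.setLevel(level)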
Validation performed
- Verified in the `clp-compression_scheduler` docker container logs that debug logs were printed.
- Set `logging_level`s to "INVALID" in all applicable components and observed the package failed to start due to a validation error.
- Supplied invalid values, ran `start-clp.sh`, and observed the script fails with validation errors:
  - `NonEmptyStr`: `""`
  - `PositiveFloat`: `0`, `-1`, `-0.1`, `false`. (Also tried `1`, which is an integer; the input was considered valid, which is expected.)
  - `PositiveInt`: `0`, `-1`, `-0.1`, `false`
  - `Host`: `""`
  - `Port`: `0`, `-1`, `65536`. (Also tried `1023` with a non-root user: the script was able to start, but port binding fails due to lack of permission.)
  - `ZstdCompressionLevel`: `0`, `20`
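These manual checks could also be captured as automated tests; a hedged pytest-style sketch using stand-in models (not the package's actual config classes):

from typing import Annotated

import pytest
from pydantic import BaseModel, Field, ValidationError

Port = Annotated[int, Field(gt=0, lt=2**16)]
Host = Annotated[str, Field(min_length=1)]

class EndpointSketch(BaseModel):  # stand-in model for illustration
    host: Host = "localhost"
    port: Port = 7000

@pytest.mark.parametrize("port", [0, -1, 65536])
def test_invalid_ports_are_rejected(port):
    with pytest.raises(ValidationError):
        EndpointSketch(port=port)

def test_empty_host_is_rejected():
    with pytest.raises(ValidationError):
        EndpointSketch(host="")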
Summary by CodeRabbit
New Features
Refactor