pre-insert schema validation #500
base: main
PR Overview
This PR introduces schema validation for online ingestion by implementing a base DataFrameValidator along with specialized validators for Pandas, Polars, and PySpark dataframes. It also integrates the new validation mechanism into the feature group engine, ensuring that the schema is validated before saving metadata or inserting data when online mode is enabled.
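The base-plus-specialized-validators layout described above can be sketched roughly as follows. This is an illustration only, not the code in `schema_validation.py`: the `Feature` dataclass, the `column_values` and `validate_schema` method names, the `DictOfListsValidator` subclass, and the single rule shown (string length checked against a declared `varchar(N)` online type) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Feature:
    # Hypothetical stand-in for a feature group feature.
    name: str
    online_type: str


class DataFrameValidator:
    """Hypothetical base class: holds the rule, subclasses read the columns."""

    def column_values(self, df, name: str) -> Optional[list]:
        # Each dataframe flavour (Pandas, Polars, PySpark) would override this.
        raise NotImplementedError

    def validate_schema(self, df, features: List[Feature]) -> Dict[str, str]:
        errors: Dict[str, str] = {}
        for feature in features:
            values = self.column_values(df, feature.name)
            match = re.fullmatch(r"varchar\((\d+)\)", feature.online_type)
            if values is None or match is None:
                continue  # column absent or not a varchar: nothing to check here
            limit = int(match.group(1))
            longest = max((len(v) for v in values if isinstance(v, str)), default=0)
            if longest > limit:
                errors[feature.name] = (
                    f"string length {longest} exceeds varchar({limit})"
                )
        return errors


class DictOfListsValidator(DataFrameValidator):
    """Toy 'dataframe' backend: a dict mapping column name to a list of values."""

    def column_values(self, df, name):
        return df.get(name)


df = {"id": [1, 2], "name": ["ab", "abcdef"]}
features = [Feature("id", "bigint"), Feature("name", "varchar(5)")]
errors = DictOfListsValidator().validate_schema(df, features)
print(errors)
```

Keeping the rule in the base class and only the column access in the subclasses means a new dataframe type needs one small override; whether the PR draws the line in the same place is not stated on this page.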
Reviewed Changes
| File | Description |
|---|---|
| python/hsfs/core/schema_validation.py | Adds a base validator and specific implementations for different DF types. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation before saving feature group metadata and during data insertion. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
python/hsfs/core/schema_validation.py:98
- Consider adding a check to ensure that extract_numbers returns a non-empty list to avoid a potential IndexError when accessing the first element.
return int(self.extract_numbers(feature.online_type)[0])
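One way to act on this suggestion is to check the list before indexing it. The sketch below is a guess at the shape of such a guard: `extract_numbers` here is a hypothetical regex-based stand-in for the helper named in the comment, and `declared_length` is an invented name.

```python
import re
from typing import List


def extract_numbers(type_string: str) -> List[str]:
    # Hypothetical stand-in for the helper referenced in the review comment.
    return re.findall(r"\d+", type_string)


def declared_length(online_type: str) -> int:
    # Fail with a descriptive error instead of an IndexError on an empty list.
    numbers = extract_numbers(online_type)
    if not numbers:
        raise ValueError(f"no length found in online type {online_type!r}")
    return int(numbers[0])
```

With this guard, an online type such as `varchar(100)` yields `100`, while a type without any digits raises a `ValueError` that names the offending type.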
PR Overview
This PR introduces pre-insert schema validation to ensure that incoming data conforms to the expected schema before insertion. Key changes include:
- Implementation of a generic base DataFrameValidator along with Pandas, Polars, and PySpark-specific validators.
- Addition of unit tests that cover various schema validation scenarios.
- Integration of schema validation checks in the feature group engine when saving or inserting data.
Reviewed Changes
| File | Description |
|---|---|
| python/hsfs/core/schema_validation.py | Introduces DataFrameValidator and its implementations for Pandas, Polars, and PySpark. |
| python/tests/test_schema_validator.py | Adds comprehensive unit tests for validating schema rules under different scenarios. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation into the feature group save/insert workflows when enabled. |
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (2)
python/hsfs/core/schema_validation.py:107
- [nitpick] Consider renaming 'i_feature' to a more descriptive variable name such as 'feature' to improve clarity.
for i_feature in dataframe_features:
python/tests/test_schema_validator.py:177
- [nitpick] It might be more robust to verify the updated feature by iterating over the features and matching by feature name rather than relying on a fixed index.
assert df_features[2].online_type == "varchar(101)"
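A name-based lookup along the lines of this nitpick might look like the sketch below. The `feature_by_name` helper and the feature names (`id`, `price`, `string_col`) are hypothetical, and `SimpleNamespace` objects stand in for the real feature objects used by the test.

```python
from types import SimpleNamespace


def feature_by_name(features, name):
    # Return the first feature with a matching name, regardless of its position.
    for feature in features:
        if feature.name == name:
            return feature
    raise KeyError(f"feature {name!r} not found")


# Stand-in feature list; in the real test this would be the validator's output.
df_features = [
    SimpleNamespace(name="id", online_type="bigint"),
    SimpleNamespace(name="price", online_type="double"),
    SimpleNamespace(name="string_col", online_type="varchar(101)"),
]

assert feature_by_name(df_features, "string_col").online_type == "varchar(101)"
```

The assertion then keeps passing if the feature order changes, and a missing feature surfaces as a `KeyError` with the feature name rather than as a mismatch on the wrong element.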
This PR adds/fixes/changes...
JIRA Issue: -
Priority for Review: -
Related PRs: -
How Has This Been Tested?
Checklist For The Assigned Reviewer: