pre-insert schema validation #500

Draft
wants to merge 15 commits into
base: main
from

Conversation

dhananjay-mk
Contributor

This PR adds/fixes/changes...

  • please summarize your changes to the code
  • and make sure to include all changes to user-facing APIs

JIRA Issue: -

Priority for Review: -

Related PRs: -

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

dhananjay-mk requested review from vatj and Copilot (March 4, 2025 14:41)
dhananjay-mk marked this pull request as draft (March 4, 2025 14:41)

PR Overview

This PR introduces schema validation for online ingestion by implementing a base DataFrameValidator along with specialized validators for Pandas, Polars, and PySpark dataframes. It also integrates the new validation mechanism into the feature group engine, ensuring that the schema is validated before saving metadata or inserting data when online mode is enabled.

Reviewed Changes

| File | Description |
| --- | --- |
| python/hsfs/core/schema_validation.py | Adds a base validator and specific implementations for different DF types. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation before saving feature group metadata and during data insertion. |

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

python/hsfs/core/schema_validation.py:98

  • Consider adding a check to ensure that extract_numbers returns a non-empty list to avoid a potential IndexError when accessing the first element.
return int(self.extract_numbers(feature.online_type)[0])
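
One way the guard suggested above could look is sketched below. This is a minimal illustration, not the PR's code: the body of `extract_numbers` is not shown in the review, so the version here is a hypothetical stand-in that collects digit runs with a regular expression, and `get_declared_length` is an invented helper name.

```python
import re
from typing import List, Optional


def extract_numbers(type_string: str) -> List[str]:
    """Hypothetical stand-in: return every run of digits in a type string."""
    return re.findall(r"\d+", type_string)


def get_declared_length(online_type: str) -> Optional[int]:
    """Return the first number in the online type, or None if there is none.

    Checking for an empty list first avoids the IndexError that
    `extract_numbers(...)[0]` would raise for a type such as "varchar".
    """
    numbers = extract_numbers(online_type)
    if not numbers:
        return None
    return int(numbers[0])
```

With this shape, the caller decides what a missing length means (skip the check, or raise a descriptive error) instead of failing on the index access.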
dhananjay-mk requested a review from Copilot (March 5, 2025 18:26)

PR Overview

This PR introduces pre-insert schema validation to ensure that incoming data conforms to the expected schema before insertion. Key changes include:

  • Implementation of a generic base DataFrameValidator along with Pandas, Polars, and PySpark-specific validators.
  • Addition of unit tests that cover various schema validation scenarios.
  • Integration of schema validation checks in the feature group engine when saving or inserting data.
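
The base-plus-backend split described in the list above could be arranged roughly as follows. This is a sketch under stated assumptions rather than the PR's implementation: the class and method names (`BaseValidator`, `max_string_lengths`, `validate`) are hypothetical, the only rule shown is a `varchar(N)` length check, and a list of dicts stands in for a real Pandas, Polars, or PySpark dataframe so that the example has no third-party dependencies.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseValidator(ABC):
    """Hypothetical base class: shared rule logic, backend-specific measuring."""

    @abstractmethod
    def max_string_lengths(self, data: Any) -> Dict[str, int]:
        """Return the longest string length observed per column."""

    def validate(self, data: Any, online_types: Dict[str, str]) -> List[str]:
        """Return one message per column whose data exceeds its varchar size."""
        errors = []
        observed = self.max_string_lengths(data)
        for column, online_type in online_types.items():
            match = re.fullmatch(r"varchar\((\d+)\)", online_type)
            if match is None or column not in observed:
                continue  # not a sized string column, or no string data seen
            limit = int(match.group(1))
            if observed[column] > limit:
                errors.append(
                    f"{column}: length {observed[column]} exceeds varchar({limit})"
                )
        return errors


class RowListValidator(BaseValidator):
    """Stand-in backend that treats a list of dicts as the 'dataframe'."""

    def max_string_lengths(self, data: List[Dict[str, Any]]) -> Dict[str, int]:
        lengths: Dict[str, int] = {}
        for row in data:
            for column, value in row.items():
                if isinstance(value, str):
                    lengths[column] = max(lengths.get(column, 0), len(value))
        return lengths
```

A dataframe-specific subclass would override only `max_string_lengths`, so the comparison against the declared online types stays in one place.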

Reviewed Changes

| File | Description |
| --- | --- |
| python/hsfs/core/schema_validation.py | Introduces DataFrameValidator and its implementations for Pandas, Polars, and PySpark. |
| python/tests/test_schema_validator.py | Adds comprehensive unit tests for validating schema rules under different scenarios. |
| python/hsfs/core/feature_group_engine.py | Integrates schema validation into the feature group save/insert workflows when enabled. |

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

python/hsfs/core/schema_validation.py:107

  • [nitpick] Consider renaming 'i_feature' to a more descriptive variable name such as 'feature' to improve clarity.
for i_feature in dataframe_features:

python/tests/test_schema_validator.py:177

  • [nitpick] It might be more robust to verify the updated feature by iterating over the features and matching by feature name rather than relying on a fixed index.
assert df_features[2].online_type == "varchar(101)"
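
The name-based lookup suggested in this nitpick might be sketched as below. The `Feature` dataclass is a minimal stand-in for the real feature object, `find_feature` is a hypothetical helper, and the feature names are invented for illustration, since the test's actual column names are not shown in the review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    """Minimal stand-in exposing only the two attributes the assertion needs."""
    name: str
    online_type: str


def find_feature(features: List[Feature], name: str) -> Feature:
    """Return the single feature with the given name, failing loudly otherwise."""
    matches = [feature for feature in features if feature.name == name]
    if len(matches) != 1:
        raise AssertionError(f"expected exactly one feature named {name!r}")
    return matches[0]


# Hypothetical feature list; the order no longer matters to the assertion.
df_features = [
    Feature("string_col", "varchar(101)"),
    Feature("id", "bigint"),
    Feature("amount", "double"),
]
assert find_feature(df_features, "string_col").online_type == "varchar(101)"
```

Looking the feature up by name keeps the test valid if the column order changes, and the explicit failure message points at a missing or duplicated feature rather than at an unrelated index.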