pre-insert schema validation #500

dhananjay-mk · 2025-03-04T14:41:10Z

This PR adds/fixes/changes...

please summarize your changes to the code
and make sure to include all changes to user-facing APIs

JIRA Issue: -

Priority for Review: -

Related PRs: -

How Has This Been Tested?

Unit Tests
Integration Tests
Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

PR Overview

This PR introduces schema validation for online ingestion by implementing a base DataFrameValidator along with specialized validators for Pandas, Polars, and PySpark dataframes. It also integrates the new validation mechanism into the feature group engine, ensuring that the schema is validated before saving metadata or inserting data when online mode is enabled.

Reviewed Changes

File	Description
python/hsfs/core/schema_validation.py	Adds a base validator and specific implementations for different DF types.
python/hsfs/core/feature_group_engine.py	Integrates schema validation before saving feature group metadata and during data insertion.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

python/hsfs/core/schema_validation.py:98

Consider adding a check to ensure that extract_numbers returns a non-empty list to avoid a potential IndexError when accessing the first element.

return int(self.extract_numbers(feature.online_type)[0])
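A minimal sketch of the guard suggested above. The real `extract_numbers` is not shown in this PR page, so the version here is a hypothetical stand-in that returns the digit runs in a type string, and `declared_length` is an illustrative name rather than the PR's method.

```python
import re
from typing import List, Optional


def extract_numbers(online_type: str) -> List[str]:
    """Hypothetical stand-in: return the digit runs found in a type string."""
    return re.findall(r"\d+", online_type)


def declared_length(online_type: str) -> Optional[int]:
    """Return the first number in a type such as 'varchar(100)', or None if there is none."""
    numbers = extract_numbers(online_type)
    if not numbers:
        # An unparameterised type (e.g. 'varchar') has no length to read,
        # so return None instead of indexing into an empty list.
        return None
    return int(numbers[0])
```

Returning `None` is one option; raising a descriptive error would fit equally well if a missing length should be treated as invalid input.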

PR Overview

This PR introduces pre-insert schema validation to ensure that incoming data conforms to the expected schema before insertion. Key changes include:

- Implementation of a generic base DataFrameValidator along with Pandas, Polars, and PySpark-specific validators.
- Addition of unit tests that cover various schema validation scenarios.
- Integration of schema validation checks in the feature group engine when saving or inserting data.
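The base-plus-specialized layout described above could look roughly like the sketch below. This is not the PR's code: the class names, the method names, and the choice to check string lengths against `varchar(n)` limits are all assumptions made for illustration, and a list of dicts stands in for a real dataframe so the example needs no third-party libraries.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseValidator(ABC):
    """Shared comparison logic; subclasses only measure their own dataframe type."""

    @abstractmethod
    def max_string_lengths(self, df: Any) -> Dict[str, int]:
        """Return the longest string length found in each string column."""

    def validate(self, df: Any, online_types: Dict[str, str]) -> List[str]:
        """Return one message per column whose data exceeds its varchar(n) limit."""
        errors = []
        for column, length in self.max_string_lengths(df).items():
            match = re.fullmatch(r"varchar\((\d+)\)", online_types.get(column, ""))
            if match and length > int(match.group(1)):
                errors.append(f"{column}: length {length} exceeds {match.group(0)}")
        return errors


class RowListValidator(BaseValidator):
    """Toy backend (assumption): the 'dataframe' is a list of dicts. A Pandas,
    Polars, or PySpark subclass would build the same mapping with its own API."""

    def max_string_lengths(self, df: List[Dict[str, Any]]) -> Dict[str, int]:
        lengths: Dict[str, int] = {}
        for row in df:
            for column, value in row.items():
                if isinstance(value, str):
                    lengths[column] = max(lengths.get(column, 0), len(value))
        return lengths
```

Keeping the comparison in the base class means each backend only has to answer one question about its data, which is one way to avoid repeating the rule logic three times.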

Reviewed Changes

File	Description
python/hsfs/core/schema_validation.py	Introduces DataFrameValidator and its implementations for Pandas, Polars, and PySpark.
python/tests/test_schema_validator.py	Adds comprehensive unit tests for validating schema rules under different scenarios.
python/hsfs/core/feature_group_engine.py	Integrates schema validation into the feature group save/insert workflows when enabled.

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

python/hsfs/core/schema_validation.py:107

[nitpick] Consider renaming 'i_feature' to a more descriptive variable name such as 'feature' to improve clarity.

for i_feature in dataframe_features:

python/tests/test_schema_validator.py:177

[nitpick] It might be more robust to verify the updated feature by iterating over the features and matching by feature name rather than relying on a fixed index.

assert df_features[2].online_type == "varchar(101)"
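A sketch of the name-based lookup this nitpick suggests. The `Feature` dataclass and the feature names are hypothetical stand-ins; the test file's actual feature objects and column names are not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    """Hypothetical stand-in for a feature with a name and an online type."""
    name: str
    online_type: str


def online_type_of(features: List[Feature], name: str) -> str:
    """Return the online type of the feature called `name`; raise KeyError if absent."""
    by_name = {f.name: f.online_type for f in features}
    return by_name[name]


df_features = [
    Feature("id", "int"),
    Feature("price", "double"),
    Feature("comment", "varchar(101)"),
]

assert online_type_of(df_features, "comment") == "varchar(101)"
# Reordering the features does not affect the name-based lookup.
assert online_type_of(list(reversed(df_features)), "comment") == "varchar(101)"
```

A missing feature then fails with a `KeyError` naming the feature, which is easier to diagnose than an assertion against whichever feature happens to sit at index 2.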

dhananjay-mk added 12 commits February 10, 2025 12:21

init

c6902b9

init

35b59d0

refactor common methods to utils

eee0bcf

Merge remote-tracking branch 'upstream/main' into schemaval

876fac0

modify raising error conditions

b4d42c0

major refactor-switching to class

3a40b8e

Merge remote-tracking branch 'upstream/main' into schemaval

405fdda

revert engine changes

34a53ee

rminor cleanup

90f8d51

minor cleanup

53621ca

refactor and cleanup

3748ee4

Merge remote-tracking branch 'upstream/main' into schemaval

9370856

dhananjay-mk requested review from vatj and Copilot March 4, 2025 14:41

dhananjay-mk marked this pull request as draft March 4, 2025 14:41

Copilot AI reviewed Mar 4, 2025


add tests

26d388e

dhananjay-mk requested a review from Copilot March 5, 2025 18:26

Copilot AI reviewed Mar 5, 2025


dhananjay-mk added 2 commits March 11, 2025 11:22

Merge remote-tracking branch 'upstream/main' into schemaval

790c2c6

update docs

1ccee82
