Bugfix/1677 Fix Pandera DataFrame - Pydantic compatibility #1904

Open
wants to merge 5 commits into
base: main

Contributor

Jarek-Rolski commented Feb 2, 2025

An update to pydantic-core requires passing an additional parameter, json_schema_input_schema, to core_schema.no_info_plain_validator_function.

From my testing, the value passed for this parameter doesn't seem to matter, as long as it is one of the valid core_schema schema types. The generated schema also doesn't contain the field checks. However, pydantic model validation still includes the pandera submodel with all of its checks.

I'm not sure if this is correct. I based the changes on #1677 and #1704.

I had to modify one test, because this change surfaces the schema issue earlier than before.

codecov bot commented Feb 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.38%. Comparing base (812b2a8) to head (0747fc9).
Report is 188 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1904      +/-   ##
==========================================
- Coverage   94.28%   93.38%   -0.90%     
==========================================
  Files          91      121      +30     
  Lines        7013     9304    +2291     
==========================================
+ Hits         6612     8689    +2077     
- Misses        401      615     +214     

Comment on lines 190 to 192
with config_context(validation_enabled=False):
    schema_model = _source_type().__orig_class__.__args__[0]
    schema = schema_model.to_schema()

Collaborator

hey @Jarek-Rolski this PR looks almost ready to merge!

quick question: why is this config_context block needed here? I don't think the nested code does any validation

Contributor Author

pandera runs validation while the schema_model is being extracted, and that validation fails. Two tests were raising an error, e.g.:

tests\fastapi\test_app.py:14: in <module>
    from tests.fastapi.models import Transactions, TransactionsOut
tests\fastapi\models.py:46: in <module>
    class ResponseModel(BaseModel):
venv\Lib\site-packages\pydantic\_internal\_model_construction.py:205: in __new__
    complete_model_class(
venv\Lib\site-packages\pydantic\_internal\_model_construction.py:534: in complete_model_class
    schema = cls.__get_pydantic_core_schema__(cls, handler)
venv\Lib\site-packages\pydantic\main.py:643: in __get_pydantic_core_schema__
    return handler(source)
venv\Lib\site-packages\pydantic\_internal\_schema_generation_shared.py:83: in __call__
    schema = self._handler(source_type)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:512: in generate_schema
    schema = self._generate_schema_inner(obj)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:784: in _generate_schema_inner
    return self._model_schema(obj)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:591: in _model_schema
    {k: self._generate_md_field_schema(k, v, decorators) for k, v in fields.items()},
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:947: in _generate_md_field_schema
    common_field = self._common_field_schema(name, field_info, decorators)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:1134: in _common_field_schema
    schema = self._apply_annotations(
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:1890: in _apply_annotations
    schema = get_inner_schema(source_type)
venv\Lib\site-packages\pydantic\_internal\_schema_generation_shared.py:83: in __call__
    schema = self._handler(source_type)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:1871: in inner_handler
    schema = self._generate_schema_inner(obj)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:789: in _generate_schema_inner
    return self.match_type(obj)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:871: in match_type
    return self._match_generic_type(obj, origin)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:890: in _match_generic_type
    from_property = self._generate_schema_from_property(origin, obj)
venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:679: in _generate_schema_from_property
    schema = get_schema(
pandera\typing\pandas.py:189: in __get_pydantic_core_schema__
    schema_model = _source_type().__orig_class__.__args__[0]
pandera\typing\common.py:129: in __patched_generic_alias_call
    result.__orig_class__ = self
pandera\typing\common.py:181: in __setattr__
    self.__dict__ = schema_model.validate(self).__dict__
pandera\api\dataframe\model.py:289: in validate
    cls.to_schema().validate(
pandera\api\pandas\container.py:126: in validate
    return self._validate(
pandera\api\pandas\container.py:147: in _validate
    return self.get_backend(check_obj).validate(
pandera\backends\pandas\container.py:104: in validate
    error_handler = self.run_checks_and_handle_errors(
pandera\backends\pandas\container.py:182: in run_checks_and_handle_errors
    error_handler.collect_error(
pandera\api\base\error_handler.py:54: in collect_error
    raise schema_error from original_exc
E   pandera.errors.SchemaError: column 'id' not in dataframe. Columns in dataframe: []

Collaborator

okay cool I can take a look at this in a separate PR

Comment on lines 193 to 199
type_map = {
    "str": core_schema.str_schema(),
    "int64": core_schema.int_schema(),
    "float64": core_schema.float_schema(),
    "bool": core_schema.bool_schema(),
    "datetime64[ns]": core_schema.datetime_schema(),
}

Collaborator

this would be limited to just the numpy datatypes right?

will we need to create a follow-up PR to support the pyarrow datatypes?

Contributor Author

I made some changes to enable pyarrow. I used pandera's to_json_schema() function to get generic type names. I tested it with various numpy/pandas/pyarrow types and it seems to work. The only problem was with more exotic types like pyarrow.large_string, which to_json_schema() labels as "any".
Are you happy with this change, or should I revert it and add pyarrow types instead?

Collaborator

cool, this looks good to me for now, we can make further investments in a future PR
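
To illustrate the trade-off discussed in this thread, here is a rough sketch of the two lookup strategies. It uses only the standard library and returns the *names* of `core_schema` constructors as strings rather than calling them. The generic type names (`"string"`, `"integer"`, and so on) are assumptions about what `to_json_schema()` reports; only `"any"` is confirmed by the comment above. `schema_name_for` is a hypothetical helper, not part of the PR.

```python
# Keys copied from the dtype-keyed mapping quoted in the review comment.
DTYPE_KEYED = {
    "str": "str_schema",
    "int64": "int_schema",
    "float64": "float_schema",
    "bool": "bool_schema",
    "datetime64[ns]": "datetime_schema",
}

# Assumed generic type names; the real output of to_json_schema() may differ.
GENERIC_KEYED = {
    "string": "str_schema",
    "integer": "int_schema",
    "number": "float_schema",
    "boolean": "bool_schema",
    "datetime": "datetime_schema",
}


def schema_name_for(generic_type: str) -> str:
    """Map a generic type name to a core_schema constructor name.

    Anything unrecognised, including types reported as "any",
    falls back to the permissive any_schema.
    """
    return GENERIC_KEYED.get(generic_type, "any_schema")
```

A dtype-keyed table has no entry for a string such as `"int64[pyarrow]"`, whereas a lookup keyed on generic names handles every backend that reports `"integer"`, and degrades to `any_schema` for types reported as `"any"`.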
