
Roundtripping inconsistent for objects of shape=() and numpy scalars... #98

Open
hmaarrfk opened this issue Jan 1, 2024 · 3 comments · May be fixed by #99

hmaarrfk (Contributor) commented Jan 1, 2024

It seems that there are some edge cases when serializing numpy scalars and 0-dimensional numpy arrays.

This has always confused me, so for precision, I quote numpy's documentation:

An array scalar is an instance of the types/classes float32, float64, etc., whereas a 0-dimensional array is an ndarray instance containing precisely one array scalar.
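
For concreteness, a quick illustration of the distinction (a sketch of my own, not from the docs):

import numpy as np

scalar = np.float32(1.0)          # array scalar: an instance of np.float32
zero_d = np.zeros((), 'float32')  # 0-dimensional array: an ndarray with shape ()

print(isinstance(scalar, np.ndarray))  # False
print(isinstance(zero_d, np.ndarray))  # True
print(zero_d.shape, zero_d.ndim)       # () 0
print(type(zero_d[()]))                # indexing a 0-d array with () yields an array scalar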

So I added the following test to the file tests/test_np.py to see if things serialize correctly:

# imports added so the snippet runs standalone
from numpy import zeros
from json_tricks import dumps, loads

def test_nd_array_shape_empty():
    to_dump = zeros((), dtype='uint32')
    to_dump[...] = 123

    the_dumps = dumps(to_dump)
    the_double_dumps = dumps(loads(dumps(to_dump)))

    assert the_dumps == the_double_dumps
_________________________________ test_nd_array_shape_empty __________________________________

    def test_nd_array_shape_empty():
        to_dump = zeros((), dtype='uint32')
        to_dump[...] = 123
    
        the_dumps = dumps(to_dump)
        the_double_dumps = dumps(loads(dumps(to_dump)))
    
>       assert the_dumps == the_double_dumps
E       assert '{"__ndarray_... "shape": []}' == '123'
E         - 123
E         + {"__ndarray__": 123, "dtype": "uint32", "shape": []}

tests/test_np.py:197: AssertionError

After round-tripping, the "0-dimensionality" is not preserved: the value is downcast to a numpy scalar.

This strange behavior leads to the test suite using encode_scalars_inplace as an attempt to work around the warning that "scalars cannot be reliably encoded", and then relying on the strange "automatic downcast" behavior to recover the original structure.
https://github.com/mverleg/pyjson_tricks/blob/master/tests/test_np.py#L170

However, this would be incorrect if the user mixes in genuine "0-dimensional" numpy arrays, since after the downcast those are indistinguishable from converted scalars.

Now, I originally came here to address a pretty specific use case of ours: serializing numpy datetime64 scalars. I tried to promote them to 0-dimensional arrays, but this breaks two assumptions made in pyjson_tricks:

  • numpy.datetime64 objects are not among the numpy.generic scalar types the encoder handles
  • numpy.datetime64 constructors cannot be obtained with the getattr used in the decoder, since their dtype names carry units (annoying, I know; dates are complicated); see the sketch below
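
To illustrate the second point, here is a small sketch (my own, not pyjson_tricks code) of why a unit-qualified dtype name defeats a plain getattr lookup on the numpy module:

import numpy as np

dt = np.datetime64('2024-01-01', 'ns')
print(str(dt.dtype))               # 'datetime64[ns]': the name carries a unit

print(getattr(np, 'datetime64'))   # exists, but loses the unit information
# getattr(np, 'datetime64[ns]')    # AttributeError: no such module attribute

print(np.dtype('datetime64[ns]'))  # parsing the full dtype string does work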

My proposal unfortunately requires breaking anybody using encode_scalars_inplace, such as the test suite.


mverleg (Owner) commented Jan 2, 2024

Nice, thanks for all the effort you put into this!

Mathematically the distinction between a 0-d array and a scalar seems weird, but I guess from a type-system perspective it makes sense. And in any case, numpy's eccentricities are not in our control.

This strange behavior leads to the test suite using encode_scalars_inplace as an attempt to work around the warning that "scalars cannot be reliably encoded", and then relying on the strange "automatic downcast" behavior to recover the original structure.

The test suite isn't trying to work around the warning; it's testing the function encode_scalars_inplace itself. It's a hacky function, but necessary because some scalars cannot be serialized correctly (which ones depends on the Python version). This is described in #18. The function is a workaround, of course, but part of the documented API:

So if you really want to encode numpy scalars, you’ll have to do the conversion beforehand. For that purpose you can use encode_scalars_inplace, which mutates a nested data structure (in place!) to replace any numpy scalars by their representation. If you serialize this result, it can subsequently be loaded without further adaptations.
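
For reference, a usage sketch based on that documentation (the import path is my assumption from the docs):

import numpy as np
from json_tricks import dumps, loads
from json_tricks.np_utils import encode_scalars_inplace  # path assumed from the docs

data = {'value': np.float64(1.5)}
encode_scalars_inplace(data)  # mutates `data` in place, replacing numpy scalars
print(loads(dumps(data)))     # loads back without further adaptations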

The easiest way to see this behaviour is to add float64 to your new test test_scalar_roundtrip.

But it seems good to support 0-dimensional arrays, and with a few adjustments I think it'll work great. Thanks again!

hmaarrfk (Contributor, Author) commented Jan 2, 2024

I'm really not sure how to reconcile the fact that the default json encoder will simply encode float64 objects as float.

The issue seems to stem from the fact that numpy's float64 scalar is a subclass of the Python float type:

import numpy as np
np.float64.mro()
[numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object]

Eventually the json encoder will simply check isinstance(o, float), and since that returns True, the encoder will just encode it:
https://github.com/python/cpython/blob/b0fb074d5983f07517cec76a37268f13c986d314/Lib/json/encoder.py#L426
I think that for backward-compatibility reasons, numpy will also refuse to change the MRO.
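
A quick demonstration, using only the standard library:

import json
import numpy as np

# np.float64 passes the isinstance(o, float) check, so the stock encoder
# serializes it as a plain number, silently dropping the dtype:
print(json.dumps(np.float64(1.5)))                     # 1.5
print(type(json.loads(json.dumps(np.float64(1.5)))))   # <class 'float'>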

We can

  • either ask CPython to allow overriding the default encoder for such objects (I've never had direct discussions with them), or
  • extend the TricksEncoder to consult default first, before the isinstance(o, float) check is reached (see the sketch below).
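
The catch with the second option is that default is only consulted for types the stock encoder does not already handle, so a plain JSONEncoder subclass never sees float subclasses at all. A minimal demonstration (the class name is mine):

import json
import numpy as np

class FloatAwareEncoder(json.JSONEncoder):
    def default(self, o):
        print('default() called for', type(o))  # never reached for np.float64
        return super().default(o)

# isinstance(o, float) matches first, so default() is bypassed entirely:
print(json.dumps(np.float64(1.5), cls=FloatAwareEncoder))  # prints just 1.5

This is why the second option ends up copying the encoder internals rather than just overriding default, as the next comment describes.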

hmaarrfk (Contributor, Author) commented Jan 2, 2024

extend the TricksEncoder to consult default first, before the isinstance(o, float) check is reached

I implemented this in commit 70bacca.

It copies a lot of code from CPython; at the very least we should split this off into a separate file to keep correct attribution to their license. I changed 3 lines that had if isinstance(obj, float).
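
A minimal runnable sketch of the idea behind that change (my own illustration, not the actual diff): treat only exact floats as JSON floats, so that subclasses such as numpy.float64 fall through to custom handling.

import numpy as np

def float_token(o):
    # exact float: stock JSON behavior
    if type(o) is float:
        return repr(o)
    # float subclass, e.g. np.float64: hypothetical wrapper echoing the
    # __ndarray__ format from the failing test above
    if isinstance(o, float):
        return '{"__ndarray__": %s, "dtype": "%s", "shape": []}' % (repr(float(o)), o.dtype)
    raise TypeError(type(o))

print(float_token(1.5))              # 1.5
print(float_token(np.float64(1.5))) # {"__ndarray__": 1.5, "dtype": "float64", "shape": []}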
