Describe the bug, including details regarding any error messages, version, and platform.
According to the documentation, a NumPy array passed to `pyarrow.array(..)` is converted to either a `pyarrow.Array` or a `pyarrow.ChunkedArray`, depending on its size.
`pa.DictionaryArray.from_arrays(..)` is also documented to accept ndarrays, but when the ndarray is large enough that the conversion produces a `ChunkedArray`, an internal exception is raised because `from_arrays` cannot handle the `ChunkedArray`.
```python
import pyarrow as pa
import numpy as np

# 300 MB binary blobs × 10 = 3 GB of total data
blob = b'x' * (300 * 1024 * 1024)
data = [blob] * 10
a = np.array(data)

# test conversion to pyarrow array
print(type(pa.array(a)))  # -> pyarrow.lib.ChunkedArray

indices = np.array(range(10))
# this throws an error, even though a is a legitimate numpy array
pa.DictionaryArray.from_arrays(indices, a)
```

```
------- RESULT ----------
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyarrow\array.pxi", line 4091, in pyarrow.lib.DictionaryArray.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
```
### Component(s)
Python