
[Python] pa.DictionaryArray.from_arrays(..) can't convert large numpy / ndarrays #47246


Description

@r-matejko

### Describe the bug, including details regarding any error messages, version, and platform.

According to the documentation, numpy arrays passed to pyarrow.array(..) are converted either to a pyarrow.Array or to a pyarrow.ChunkedArray, depending on their size.

pa.DictionaryArray.from_arrays(..) is also documented to accept ndarrays, but when the ndarray is large enough that this internal conversion produces a ChunkedArray, from_arrays raises an exception because it cannot handle the ChunkedArray.

```python
import pyarrow as pa
import numpy as np

# 300 MB binary blobs × 10 = 3 GB of total data
blob = b'x' * (300 * 1024 * 1024)
data = [blob] * 10
a = np.array(data)

# test conversion to pyarrow array
print(type(pa.array(a)))  # -> pyarrow.lib.ChunkedArray

indices = np.arange(10)

# this throws an error, even though a is a valid numpy array
pa.DictionaryArray.from_arrays(indices, a)
```

```
------- RESULT ----------
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pyarrow\array.pxi", line 4091, in pyarrow.lib.DictionaryArray.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
```
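A possible workaround sketch, assuming pyarrow's large_binary type (64-bit offsets) keeps the ~3 GB of values in a single Array rather than a ChunkedArray:

```python
import pyarrow as pa
import numpy as np

blob = b'x' * (300 * 1024 * 1024)
a = np.array([blob] * 10)
indices = np.arange(10)

# Requesting large_binary explicitly uses 64-bit offsets, so the values
# should stay in one pa.Array instead of being split into a ChunkedArray.
values = pa.array(a, type=pa.large_binary())
print(type(values))

# With a plain Array as the dictionary, from_arrays should succeed.
dict_arr = pa.DictionaryArray.from_arrays(pa.array(indices), values)
print(dict_arr.type)
```

This only side-steps the chunked conversion; it does not address from_arrays rejecting ChunkedArray inputs.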

### Component(s)

Python
