Pyarrow data type, default to small type and fix large type override #1859
base: main
Conversation
```diff
@@ -626,7 +626,7 @@ def field(self, field: NestedField, field_result: pa.DataType) -> pa.Field:

     def list(self, list_type: ListType, element_result: pa.DataType) -> pa.DataType:
         element_field = self.field(list_type.element_field, element_result)
-        return pa.large_list(value_type=element_field)
+        return pa.list_(value_type=element_field)
```
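For reference, here is a small standalone sketch (not part of the diff) of what the two constructors return; the element field is made up for illustration:

```python
import pyarrow as pa

# Hypothetical element field standing in for a converted Iceberg list element.
element_field = pa.field("element", pa.int32(), nullable=False)

# New default: 32-bit list offsets.
print(pa.list_(value_type=element_field))       # list<element: int32 not null>
# Previous default: 64-bit list offsets.
print(pa.large_list(value_type=element_field))  # large_list<element: int32 not null>
```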
I'm not convinced that we need to change this. We use schema_to_pyarrow in many places:

- Schema.as_arrow(): this can be problematic when people have already allocated buffers that are larger than what fits in the small ones.
- _ConvertToArrowExpression.{visit_in,visit_not_in}: I checked manually, and it looks like we can mix large and normal types here :)
- ArrowProjectionVisitor: has an issue similar to what you've described in Arrow: Infer the types when reading #1669 (comment). I think the other way around is also an issue: if you promote a large_string, it would now produce a binary and not a large_binary (see the sketch after this list).
- ArrowScan.to_table(): will return the schema when there is no data; both small and large are okay.
- DataScan.to_arrow_batch_reader(): I think we should always update to the large type. Since this is streaming, we don't know upfront if the small buffers are big enough, therefore it is safe to go with the large ones.
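To make the ArrowProjectionVisitor point above concrete, here is a minimal pyarrow sketch (not from the PR) of promoting a large_string column to the small versus the large binary type:

```python
import pyarrow as pa

# A column that was read with 64-bit offsets.
arr = pa.array(["a", "bb"], type=pa.large_string())

# Promoting to the small binary type drops the 64-bit offsets...
print(arr.cast(pa.binary()).type)        # binary
# ...while casting to the large counterpart keeps them.
print(arr.cast(pa.large_binary()).type)  # large_binary
```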
@Fokko Just coming back to this PR. Is there a reason why we'd want to default to large_list?

The difference between list_ and large_list is the number of elements supported by the list. According to the large_list docs:

> Unless you need to represent data larger than 2**31 elements, you should prefer list_().

2**31 is 2_147_483_648; 2 billion items in a list seems pretty rare.
I did a small experiment; this works with list_:

```python
import pyarrow as pa
import numpy as np

# 2**31 - 2 elements still fit within the 32-bit offsets of list_.
size = 2**31 - 2
pa.array([np.zeros(size, dtype=np.int8)], type=pa.list_(pa.int8()))
```
but this will crash Python, and would require large_list:

```python
import pyarrow as pa
import numpy as np

# One more element than above: with list_ this crashes the Python process.
size = 2**31 - 1
pa.array([np.zeros(size, dtype=np.int8)], type=pa.list_(pa.int8()))
```
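For completeness, a hedged follow-up (not in the original comment): the same payload should be accepted once the outer type uses 64-bit offsets. Note that this allocates roughly 2 GB:

```python
import pyarrow as pa
import numpy as np

# With large_list's 64-bit offsets, 2**31 - 1 elements fit.
size = 2**31 - 1
pa.array([np.zeros(size, dtype=np.int8)], type=pa.large_list(pa.int8()))
```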
Rationale for this change
#1669 made the change to infer the type when reading, rather than defaulting PyArrow data types to the large type. Originally, defaulting to the large type was introduced by #986.

I found a bug in #1669 where type promotion from string -> binary defaults to large_binary (#1669 (comment)), which led me to find that we still use the large type in _ConvertToArrowSchema. Furthermore, I found that we did not respect PYARROW_USE_LARGE_TYPES_ON_READ=True when reading.

This PR is a continuation of #1669.
This PR:

- Keeps pyarrow.use-large-types-on-read with the default value False
- Changes _ConvertToArrowSchema to use the small data type instead of the large one
- When PYARROW_USE_LARGE_TYPES_ON_READ is enabled (set to True), ArrowScan and ArrowProjectionVisitor should cast to the large type (see the sketch below)
- Deprecates setting PYARROW_USE_LARGE_TYPES_ON_READ to True
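As a pyarrow-only sketch of what "cast to the large type" means for the override path (the column name and data are made up):

```python
import pyarrow as pa

# A table read with the new small-type default...
small = pa.table({"col": pa.array(["a", "b"], type=pa.string())})

# ...cast to its large-offset counterpart when the override is enabled.
large_schema = pa.schema([pa.field("col", pa.large_string())])
print(small.cast(large_schema).schema)  # col: large_string
```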
This PR should help us infer the data type when reading while keeping the PYARROW_USE_LARGE_TYPES_ON_READ override behavior until deprecation.

Are these changes tested?
Yes
Are there any user-facing changes?
No