Arrow Support #8329
The Variable-size Binary View Layout supports multiple data buffers, though it seems like that's designed more for a list of strings, so I'm not sure how it would handle image data.
I don't see where a variable length structure would really gain us anything -- we'd have to construct an offset buffer, we'd lose actual types, and we still wouldn't be able to splice multiple allocation blocks together.
Well, like I said, I'm not sure how it would handle image data. I just noticed that that seems to be the only way to provide multiple data buffers. Arrow requiring all data to be in a single contiguous buffer just seems absurd to me.
It looks like PyArrow has a way to handle that: https://arrow.apache.org/docs/python/data.html#tables Also, it might not be efficient, but there's a way to convert a NumPy array to an Arrow array. Since Pillow already supports NumPy, this might be an easy way to get something working before doing things in C to make it faster.
@Yay295 I think from a utility point of view, we'd want to be exposing band-level values. Binary chunks aren't going to be nearly as useful if they have to be interpreted. There are also some alignment issues that would come from that, at least for large binaries (64-byte boundaries). It also wouldn't solve the core issue of the storage needing to be contiguous. At the moment, the np array calls require a memory copy. It looks like what PyArrow is doing with the table is effectively a chunked collection of arrays.
Would there ever be a future where we might account for chroma subsampling?
I'd think the best way to accomplish that would be with planar image storage. My understanding of subsampling is that the resolution of one of the channels is effectively 1/2 or 1/4 of the resolution of the other bands. If we did this with planar storage, chroma would just be a uint8 image with 1/4 of the pixels. Alternatively, it could be stored as a null mapping in the validity buffer (which we're not currently handling, but which would probably be appropriate for the two- and three-channel image formats: PA/LA/RGB/HSV). For subsampling, we could null out every nth item in a particular channel.
I think the first approach might be complicated a bit for 10- and 12-bit images (or maybe not, beyond the fact that it wouldn't be a uint8 image). In case it is at all useful or relevant: libavutil in ffmpeg uses two structs to describe pixel formats, including their subsampling.
For multi-channel images (assuming each channel has the same data type and dimensions) you could represent that as an array with type Fixed Shape Tensor.
I've just put in a comment on that in here: apache/arrow#43831 (comment) -- what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list arrays (format string `+w:N` in the C data interface).
Yeah, that's it. Plus extra extension metadata on the field.
Following on to some of the discussion in #1888, specifically here: #1888 (comment)
Rationale
Arrow is the emerging memory layout for zero-copy sharing of data in the new data ecosystem. It is an uncompressed columnar format, specifically designed for interop between different implementations and languages. It can be viewed as the spiritual successor to the existing NumPy array interface that we provide. The Arrow format is supported by NumPy 2, pandas 2, Polars, PyArrow, arro3, and others in the Python ecosystem.
What Support means
Technical Details
(Apache docs are here: https://arrow.apache.org/docs/format/Columnar.html)
An Arrow Schema is a set of metadata containing type information and, potentially, child schemas. An Arrow Array has an (implicitly) associated schema, metadata about the length of the storage, and a buffer of contiguously allocated memory for the data. The Arrow Array will generally have the same parent/child arrangement as the schema structure.
- `obj.__arrow_c_schema__()` must return a PyCapsule with an `arrow_schema` name and an Arrow schema struct.
- `obj.__arrow_c_array__(schema=None)` must return a tuple of the schema above and a PyCapsule with an `arrow_array` name and an Arrow array struct. The schema is advisory; the caller may request a format.

The lifetime of the schema and array structures is dependent on the caller -- so there are release callbacks that must be called when the caller is done with the memory. This complicates the lifetime of our image storage.
We have two cases at the moment:
- A single channel image can be encoded as a single array of `height*width` items, using the type of the underlying storage (e.g., uint8/int32/float32).
- A multichannel image can be encoded in a similar manner, using `4*height*width` items in the array. The caller would be responsible for knowing that it's 4 elements per pixel. It's also possible to use a parent type of a `FixedWidthArray` of 4 elements and a child array of `4*height*width` elements. The fixed width arrays are statically defined, so the underlying array is still the same contiguous block of memory.

Flat:
Nested:
An alternate encoding of a multichannel image would be to use a struct of channels, e.g. `Struct[r,g,b,a]`. This would require 4 child arrays, each allocated in a contiguous chunk, as in planar image storage. This is not compatible with our current storage.
While our core storage is generally compatible with this layout, there are three issues:
Implementation Notes
The PR #8330 implements Pillow->Arrow for images that don't trip the above caveats.
There are no additional build or runtime dependencies. The Arrow structures are designed to be copied into a header and used from there (licensing is not an issue, as those fragments are under an Apache license). There is an additional test dependency on PyArrow at the moment. In theory, NumPy 2 could be used for this, but I'm not sure if we'd be testing the legacy array access or the Arrow access.
The lifetime of the core imaging struct is now separated from the imaging Python object. There's effectively a refcount implemented for this: there's an initial 1 for the image->im reference, every Arrow array that references an image increments it, and calling ImagingDelete decrements it.
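A toy model of that refcount scheme (the real one lives in C inside the PR; the class and method names here are made up for illustration):

```python
class ImagingCore:
    """Toy model of the lifetime scheme: the struct outlives the Python
    image object as long as any exported Arrow array references it."""

    def __init__(self):
        self.refcount = 1      # the initial image->im reference
        self.freed = False

    def export_arrow_array(self):
        self.refcount += 1     # each exported Arrow array takes a reference

    def release(self):
        # Called by ImagingDelete and by each array's release callback.
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # storage actually deallocated here

im = ImagingCore()
im.export_arrow_array()        # an Arrow consumer grabs the buffers
im.release()                   # the Python Image is deleted...
im.release()                   # ...but memory lives until the last release
```

The key property is that deleting the Python object no longer frees the pixel storage while an exported array is outstanding.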
Outstanding Questions
For consumers of data -- what's the most useful format?

- `arr[(y*width+x)*4 + channel]`?
- `arr[y*width+x][channel]`?
- `arr[y][x][channel]`?
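The three candidate layouts address the same underlying pixel data; a quick plain-Python sketch (dimensions and values made up) of how the indices line up:

```python
# Row-major 2x2 image with 4 channels per pixel.
width, height, channels = 2, 2, 4
flat = list(range(width * height * channels))       # arr[(y*width+x)*4 + channel]

# Group into per-pixel lists of 4:                  # arr[y*width+x][channel]
per_pixel = [flat[i:i + channels] for i in range(0, len(flat), channels)]

# Group pixels into rows:                           # arr[y][x][channel]
rows = [per_pixel[y * width:(y + 1) * width] for y in range(height)]

y, x, c = 1, 0, 2
value = flat[(y * width + x) * channels + c]
```

All three resolve to the same element; the question is only which shape is most ergonomic for consumers.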