Inconsistent Filesizes with .awkd Files #246
I think the problem is that this serialization is a somewhat naive snapshot of what's in memory: if there are unreachable elements, they're still written, and there isn't an additional pass to look for what can be compacted before writing.

Serialization is one of the things that's lagging in Awkward 1. There are a lot of serialization protocols for this sort of data, and it might be a mistake for me to introduce another one. The .awkd file format is the only protocol guaranteed to save all data about an Awkward Array, but, as you've noted, you don't always want to save all data.

Does the Parquet file format save everything that you need? If you have Lorentz vectors or something, it currently won't save that, but I'm figuring out how to use "application metadata" to include such things. The Parquet format is considerably more compact. There's also Arrow, but that's a line protocol, not a file format (though there's nothing stopping you from putting the serialized Arrow data in files).
I've never used Parquet, so I don't know much about it.
No, you can … The arrays need to have the same length (…).

Another thing that I thought of after having sent yesterday's answer is that ROOT's RNTuple would also be capable of storing this information, but the reader/writer of RNTuple in Python is still under development, so that doesn't help you now. Also in development: Awkward 1's …
Thanks for the response! I may give that a try later. For now, it seems that just creating a new JaggedArray from the old one and saving the new one does indeed filter out what I'm looking for and reduces filesizes in an expected manner.
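The rebuild-to-compact workaround can be illustrated with plain NumPy, using the same starts/stops-over-flat-content layout that awkward 0.x JaggedArrays use internally (the helper names here are mine, not library API). Row selection only re-points starts and stops, so a naive dump of all three arrays stays as large as the original; copying the reachable slices into a fresh content buffer is what actually shrinks it:

```python
# Sketch of why filtering alone doesn't shrink a jagged payload,
# and how rebuilding ("compacting") does. Not awkward API; a stand-in.
import numpy as np

content = np.arange(1000, dtype=np.float64)  # flat payload: 8000 bytes
counts = np.full(100, 10)                    # 100 events x 10 items each
stops = np.cumsum(counts)
starts = stops - counts

def select(rows, starts, stops, content):
    # Selection re-points starts/stops only; content is untouched,
    # so serializing all three arrays stays as big as the original.
    return starts[rows], stops[rows], content

def compact(starts, stops, content):
    # Copy out only the slices that are still reachable and rebuild
    # starts/stops against the new, smaller content buffer.
    pieces = [content[a:b] for a, b in zip(starts, stops)]
    new_content = np.concatenate(pieces) if pieces else content[:0]
    new_counts = stops - starts
    new_stops = np.cumsum(new_counts)
    return new_stops - new_counts, new_stops, new_content

s, e, c = select(np.array([3, 41, 77]), starts, stops, content)
cs, ce, cc = compact(s, e, c)
print(c.nbytes, cc.nbytes)  # 8000 vs 240: only 3 of 100 events survive
```

This is the same effect as constructing a new JaggedArray from the filtered one before saving, as described above.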
So I think I at least have a solution to my original struggle, but I am still slightly curious about the behavior when passing event masks: mainly, why contiguous event numbers do seem to drop the filesize. I checked whether it was related to the highest index selected, but that doesn't seem to be the case. From before:

…

vs.

…

Not the biggest deal, but I don't know if it's expected.
It's expected; in some cases, it's a feature, not a bug. But we might want to call out a specific "how to compact an array" recipe for this very common case of filtering in order to make the data size smaller, rather than filtering for statistical significance (or for both reasons).
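The "feature, not a bug" point is that many filtered selections can share one payload without copying it. NumPy itself makes the same trade-off, so the view-versus-copy distinction can be sketched there (this is plain NumPy behavior, not awkward code, but the memory-sharing idea is analogous):

```python
# Views share the underlying buffer (cheap, no copy); advanced
# indexing materializes a copy. Sharing is why a filtered structure
# can still reference the full original payload.
import numpy as np

content = np.arange(1_000_000)

view_a = content[10:20]   # basic slice: a view, no copy
view_b = content[::2]     # also a view of the same buffer
assert view_a.base is content and view_b.base is content

picked = content[np.array([1, 500, 999_999])]  # fancy index: a copy
assert picked.base is None
```

Sharing makes repeated filtering fast and memory-cheap while you work; it only becomes a problem at serialization time, which is exactly when a compaction pass is wanted.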
Beware that …
I am seeing behavior that I don't understand when saving collections of JaggedArrays with the `awkward.save()` function.

I have an initial set of arrays, each with an outer dimension of 10,000. I save those arrays with

…
The resulting filesize is about 280 MB.
I then want to filter out events from those arrays. As an example, let's say I want the first 10 events.
This produces a filesize of about 280 kB, which makes sense: I've selected 1/1000 of the events, so the filesize is about 1000x smaller.
However, now I instead select a more distributed set of 10 events.
The resulting filesize is now back to the original 280 MB.
Is this behavior expected, or am I doing something wrong? When I load the data back, I only seem to have access to the events I filtered, but the increased filesize is giving me memory issues (on larger files) as I try to concatenate only the filtered events.
How can I achieve saving a small subset of events for later concatenation?
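One answer to this last question, sketched in plain NumPy as a stand-in for the jagged counts-plus-content layout (the function and variable names are illustrative assumptions, not awkward API): compact each filtered subset before concatenating, so memory scales with the events you keep rather than with the original files.

```python
# Compact each filtered subset first, then concatenate the small
# pieces. Pure-NumPy stand-in for jagged (counts, flat content) data.
import numpy as np

def compacted_subset(counts, content, keep):
    """Return (counts, content) holding only the rows listed in `keep`."""
    stops = np.cumsum(counts)
    starts = stops - counts
    rows = [content[starts[i]:stops[i]] for i in keep]
    return counts[keep], (np.concatenate(rows) if rows else content[:0])

# Two "files", each with 10_000 events of 5 items:
files = [(np.full(10_000, 5), np.arange(50_000)) for _ in range(2)]

# Keep 3 scattered events per file, compacting before concatenation:
subsets = [compacted_subset(c, d, np.array([0, 123, 4567])) for c, d in files]
all_counts = np.concatenate([c for c, _ in subsets])
all_content = np.concatenate([d for _, d in subsets])
print(all_counts.size, all_content.size)  # 6 events, 30 items total
```

The peak cost is then the sum of the kept subsets, not the sum of the full payloads, which avoids the concatenation memory blow-up described above.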