How do we deal with datasets that are too large to be read into an `Array`? Something in the 5-50 GB range, for example. Are there any existing tools for this, or earlier discussion?
I thought about:
- Iterators instead of arrays (see the first sketch after this list). Pros: simple. Cons: some tools (e.g. from `MLDataUtils`) may require random access to elements of a dataset.
- New array type with lazy data loading. Maybe a memory-mapped array, maybe something more custom (see the `Mmap` sketch after this list). Pros: exposes the `AbstractArray` interface, so existing tools will work. Cons: some tools and algorithms may expect the data to be in memory, and for disk-backed arrays their performance will drop drastically.
- Completely custom interface. PyTorch's datasets/dataloaders may be a good example (see the last sketch after this list). Pros: flexible, easy to provide fast access. Cons: most functions from `MLDataUtils` will break.
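
For the iterator option, here is a minimal sketch of streaming a large flat binary file in fixed-size batches. The file name, the `Float32` column-major layout, and the `FileBatches` type are all assumptions made up for illustration, not an existing API:

```julia
# Hypothetical sketch: stream a large flat binary file of Float32 features
# in fixed-size batches, never holding more than one batch in memory.
struct FileBatches
    path::String
    nfeatures::Int   # values per observation (column-major layout assumed)
    batchsize::Int   # observations per batch
end

Base.IteratorSize(::Type{FileBatches}) = Base.SizeUnknown()
Base.eltype(::Type{FileBatches}) = Matrix{Float32}

function Base.iterate(it::FileBatches, io = open(it.path, "r"))
    remaining = filesize(it.path) - position(io)
    rows = min(it.batchsize, remaining ÷ (4 * it.nfeatures))  # 4 bytes per Float32
    if rows == 0
        close(io)
        return nothing
    end
    batch = Matrix{Float32}(undef, it.nfeatures, rows)
    read!(io, batch)   # fill the batch straight from the stream
    return batch, io
end

# for batch in FileBatches("features.bin", 128, 1_000)
#     # feed `batch` to the training step
# end
```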
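
For the lazy-loading option, the standard library's `Mmap` already gives a disk-backed array that satisfies the `AbstractArray` interface. A small sketch, assuming a raw `Float32` matrix written column-major to a file (file name and dimensions are made up; 128 × 10M `Float32` is roughly 5 GB):

```julia
using Mmap

# Map the file as a 128 × 10_000_000 Float32 matrix (~5 GB on disk).
io = open("features.bin", "r")
X = Mmap.mmap(io, Matrix{Float32}, (128, 10_000_000))
close(io)   # the mapping stays valid after closing the stream

# X behaves like a normal matrix; pages are faulted in lazily by the OS.
sample = X[:, 42]              # random access to a single observation
m = sum(@view X[:, 1:1_000])   # reductions over views avoid extra copies
```

Whether this is fast enough depends on the access pattern: sequential scans are cheap, but anything that shuffles or repeatedly revisits random columns ends up dominated by disk seeks, which is exactly the con listed above.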
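
And a rough sketch of what a PyTorch-style custom interface could look like in Julia. All type and function names here (`AbstractLazyDataset`, `nobs`, `getobs`, `eachbatch`, `DirDataset`) are placeholders invented for illustration, not the API of any existing package:

```julia
using Random

# Minimal dataset contract: how many observations, and how to load one.
abstract type AbstractLazyDataset end
nobs(::AbstractLazyDataset) = error("not implemented")
getobs(::AbstractLazyDataset, i::Integer) = error("not implemented")

# Example implementation: one serialized observation per file in a directory.
struct DirDataset <: AbstractLazyDataset
    files::Vector{String}
end
DirDataset(dir::AbstractString) = DirDataset(readdir(dir; join = true))

nobs(d::DirDataset) = length(d.files)
getobs(d::DirDataset, i::Integer) = read(d.files[i])   # raw bytes; deserialize as needed

# A trivial "dataloader": shuffled minibatches of lazily loaded observations.
function eachbatch(d::AbstractLazyDataset, batchsize::Integer)
    idx = shuffle(1:nobs(d))
    return ([getobs(d, i) for i in part]
            for part in Iterators.partition(idx, batchsize))
end

# for batch in eachbatch(DirDataset("samples/"), 32)
#     # `batch` is a Vector of loaded observations
# end
```

The pro/con split from the list shows up directly here: each `getobs` can prefetch, cache, or hit a database, so it is easy to make fast, but nothing in `MLDataUtils` knows how to consume such a dataset without an adapter.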