
Large datasets #10

Open

Description

@dfdx

How do we deal with datasets that are too large to be read into an Array? Something in the range of 5–50 GB, for example. Are there any existing tools for this, or earlier discussions?

I thought about the following (rough sketches for each option are below the list):

  • Iterators instead of arrays. Pros: simple. Cons: some tools (e.g. from MLDataUtils) may require random access to the elements of a dataset.
  • A new array type with lazy data loading (maybe a memory-mapped array, maybe something more custom). Pros: exposes the AbstractArray interface, so existing tools keep working. Cons: some tools and algorithms may assume the data is in memory, and their performance will drop drastically on disk-backed arrays.
  • A completely custom interface. PyTorch's datasets/dataloaders may be a good example. Pros: flexible, easy to make fast. Cons: most functions from MLDataUtils would break.
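
For option 1, here is a minimal sketch of an iterator-based dataset. The file layout (a flat binary file of Float32 values, `nfeatures` per observation) and all names are assumptions for illustration, not a proposal:

```julia
# Option 1 sketch: iterate over fixed-size batches read lazily from disk.
# Assumes a flat binary file of Float32 values, `nfeatures` per observation.
struct ChunkedDataset
    path::String
    nfeatures::Int
    batchsize::Int
end

function Base.iterate(ds::ChunkedDataset, io = open(ds.path, "r"))
    if eof(io)
        close(io)
        return nothing
    end
    buf = Array{Float32}(undef, ds.nfeatures, ds.batchsize)
    nread = 0
    while nread < ds.batchsize && !eof(io)
        read!(io, view(buf, :, nread + 1))  # read one observation
        nread += 1
    end
    return buf[:, 1:nread], io
end

Base.IteratorSize(::Type{ChunkedDataset}) = Base.SizeUnknown()
Base.eltype(::Type{ChunkedDataset}) = Matrix{Float32}

# usage: only one batch is ever held in memory
# for batch in ChunkedDataset("features.bin", 100, 1024)
#     ...
# end
```

Only one batch lives in memory at a time, but as noted in the list, anything that needs random access (shuffling, k-fold splits) won't work on this directly.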
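For option 2, the simplest variant is probably just Mmap from the standard library, which already returns an AbstractArray backed by the file. The binary layout below (column-major Float32, nfeatures × nobs) and the file name are assumptions:

```julia
# Option 2 sketch: a disk-backed array via memory mapping.
using Mmap

function mmap_dataset(path::AbstractString, nfeatures::Int)
    nobs = filesize(path) ÷ (sizeof(Float32) * nfeatures)
    # returns a Matrix{Float32} whose pages are loaded from disk on demand
    return Mmap.mmap(path, Matrix{Float32}, (nfeatures, nobs))
end

X = mmap_dataset("features.bin", 100)  # hypothetical file
X[:, 42]   # random access works, so array-based tools should too
```

A more custom lazy type would implement `Base.getindex`/`Base.size` for a new AbstractArray subtype, but the trade-off from the list stays the same: code that assumes in-memory data will silently become very slow.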
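For option 3, a rough sketch of a PyTorch-like split between a dataset (random access to single observations) and a loader (batching and shuffling). All names here are hypothetical, not an API proposal:

```julia
# Option 3 sketch: custom dataset/dataloader interface, loosely mirroring PyTorch.
using Random: randperm

abstract type AbstractDataset end

# a dataset only has to answer two questions:
nobs(ds::AbstractDataset)           = error("not implemented")
getobs(ds::AbstractDataset, i::Int) = error("not implemented")

struct DataLoader{D<:AbstractDataset}
    dataset::D
    batchsize::Int
    shuffle::Bool
end

function Base.iterate(dl::DataLoader,
                      idxs = dl.shuffle ? randperm(nobs(dl.dataset)) :
                                          collect(1:nobs(dl.dataset)))
    isempty(idxs) && return nothing
    n = min(dl.batchsize, length(idxs))
    batch = [getobs(dl.dataset, i) for i in idxs[1:n]]  # e.g. read rows from disk
    return batch, idxs[n+1:end]
end
```

The downside from the list stands: MLDataUtils functions that expect arrays wouldn't work on this without some adapter layer.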
