
Large datasets #10

Open · dfdx opened this issue Nov 19, 2017 · 24 comments

@dfdx commented Nov 19, 2017

How do we deal with datasets that are too large to be read into an Array? Something like 5-50Gb, for example. Are there any tools for this, or earlier discussion?

I thought about:

  • Iterators instead of arrays. Pros: simple. Cons: some tools (e.g. from MLDataUtils) may require random access to elements of a dataset.
  • A new array type with lazy data loading. Maybe a memory-mapped array, maybe something more custom. Pros: exposes the AbstractArray interface, so existing tools will work. Cons: some tools and algorithms may expect data to be in memory, and for disk-based arrays their performance will drop drastically.
  • A completely custom interface. PyTorch's datasets/dataloaders may be a good example. Pros: flexible, easy to provide fast access. Cons: most functions from MLDataUtils will break.
@Evizero (Member) commented Nov 19, 2017

From a design standpoint MLDataUtils, or more specifically the backend MLDataPattern, is equipped for iterators: http://mldatapatternjl.readthedocs.io/en/latest/introduction/design.html#what-about-streaming-data. It's just that we don't have any example implementations for something like that yet.
Other than RandomObs etc., that is, but those are more of a "data container decorator" that transforms data containers into iterators.

Alternatively it would also be possible to just implement custom data container types for these large datasets that delay data access until it is actually needed. The caveat is that this assumes that at least the indices fit in memory. The upside is that they work seamlessly with partitioning etc. http://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html
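For illustration, a minimal sketch of such a lazy data container for a folder of images (the type, its field, and the use of FileIO.load are my own assumptions following the linked docs, not an existing implementation):

using FileIO
import LearnBase: nobs, getobs

# Only the file paths (the "indices") live in memory; pixel data is
# read from disk when an observation is actually requested.
struct LazyImages
    paths::Vector{String}
end

nobs(data::LazyImages) = length(data.paths)
getobs(data::LazyImages, i::Integer) = load(data.paths[i])
getobs(data::LazyImages, idx::AbstractVector) = [getobs(data, i) for i in idx]

With just these definitions, things like datasubset, splitobs, and batchview should work without ever materializing the full dataset.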

My goal with MLDataPattern was to implement the package in such a way that it can deal with exactly these use cases cleanly, in a "Julian" manner. It can even deal with big labeled datasets where the labels themselves would be cheap to store and access (and thus available for resampling) http://mldatapatternjl.readthedocs.io/en/latest/documentation/targets.html

Maybe the MLDataPattern documentation would be a good start for a discussion? I realize it's quite verbose and a lot to read, but I tried to be very specific with my definitions there.

@dfdx (Author) commented Nov 20, 2017

Excellent, thanks! I need some time to go through the docs, so I'll get back later.

@Evizero (Member) commented Nov 20, 2017

Maybe @Sheemon7 has some opinions to share on this topic, given that he actually worked with large datasets in combination with MLDataPattern.

@simonmandlik commented:

My only experience with large datasets and MLDataPattern was one use case where the feature array couldn't fit into memory but a simple vector of indices could (the dataset comprised around 10^10 examples). I must say that MLDataPattern was smartly designed to cope with such situations.

Are your labels multi-dimensional? If not, I am really curious which type of dataset we're talking about (a dataset with so many observations that even the same number of plain numbers doesn't fit into memory). In that case functionality similar to PyTorch's dataloaders is necessary (it isn't supported now, but shouldn't be that hard to implement). Otherwise, have you considered a similar strategy as with the features? You can "index" examples by integers and generate their real "values" on demand. There is a targets method that takes an arbitrary function and can thus transform the index of an example into its label (similar to how getobs works for features). But this is just an idea; if you could provide more specific information, maybe it would help.
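To make that concrete, a hedged sketch of the on-demand label idea (label_for_index is a hypothetical function, e.g. one that parses the label out of a file name):

using MLDataPattern

# Observations are plain integer indices; only this range lives in memory.
idx = 1:10^6

# targets accepts a function, so labels can be derived on demand,
# which also makes label-aware resampling possible.
y   = targets(i -> label_for_index(i), idx)
bal = undersample(i -> label_for_index(i), idx)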

@dfdx (Author) commented Nov 20, 2017

Oh, I didn't mean labels of that size, only the feature array. I can certainly recall a couple of datasets of dozens of terabytes with labels of several thousand gigabytes, but for those I would use Hadoop or at least some external database, because even reading that data from a single disk would be terribly slow.

Here I mean reasonably large data, labeled or unlabeled. I mostly work with text and images. The dataset that brought me here is Food-101, which contains 101,000 images totaling 5Gb. I can work with it as is with my 16Gb of RAM, but there might be people who can't. I also have a couple of face datasets of 10-24Gb which can only be loaded lazily.

(I'm in the middle of reading the docs, so no useful comments yet)

@oxinabox (Member) commented Nov 21, 2017

Some musings:

The goal of MLDatasets (at its heart) is to get the data into a form that can be processed by MLDataUtils.
Since if MLDataUtils can get at the data, that (proof by construction) shows that a Julia program can get at the data.

We can think of data as coming in 3 sizes; the cutoffs depend on the hardware commonly in use in ML.

  • Small Data: it fits in your RAM (right now this is <8Gb on disk size, corresponding to <32Gb in memory)
  • Medium Data: it fits on one computer (so like 8 hard disks for 8TB)
  • Big Data: It does not fit on one computer.

From that:

  • Small data is easy, and covered by MLDatasets
  • Big data is out of scope: it needs a whole bunch of networking magic, loading it is a concern for a different package, and even working with it on a single machine may not make sense anyway.
  • Medium data is thus the question.

Julia provides memory-mapped arrays in Base (in 0.7, as the Mmap stdlib).
I've not tried them.
I think, though, that they make a lot of sense as the best way to load data that naturally maps well onto arrays.
Though they can only map single files, and the file needs to be formatted as the right kind of binary data.
Any data that is small enough to fit in a single file is probably small enough to fit in RAM, so it is small data.
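For reference, a minimal sketch of such a mapping (the file name and dimensions are made up; the file must contain raw Float32 values in column-major order):

using Mmap

# Map a hypothetical binary file as 128 features by 10_000_000 observations.
X = open("features.bin") do io
    Mmap.mmap(io, Matrix{Float32}, (128, 10_000_000))
end

# X behaves like an ordinary array, but pages are loaded from disk on
# demand, so only the columns you actually touch ever hit memory.
x = X[:, 42]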

I think there might be scope for a MediumData.jl (or OnDiskDatasets.jl),
which handles the shenanigans of providing an array view onto multi-file datasets that live on disk.
It shouldn't actually be too insane.
Something like

  • Base Memory Mapped Arrays,
  • plus MappedArrays.jl to handle the data not looking right,
  • plus CatViews.jl to bang all the different files of data into one array.

Seems viable, and seems like it deserves its own package separate from MLDataUtils,
one that MLDataUtils REQUIREs, with MLDataUtils using its domain information to say what kind of mappings need to be applied to the data to get it into a good form for use.
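A rough sketch of how those three pieces might compose (file names are illustrative, and I have not verified this end to end):

using Mmap, MappedArrays, CatViews

shards = ["shard1.bin", "shard2.bin", "shard3.bin"]
mmaps  = [Mmap.mmap(f) for f in shards]           # each a lazy Vector{UInt8}
raw    = CatView(mmaps...)                        # one lazy vector over all shards
data   = mappedarray(b -> b / Float32(255), raw)  # lazy UInt8 -> Float32 rescaling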

Now handling this data with an iterator is easy.
All you need is an Iterator of FileIDs, then use coroutines to express how data comes out of those.
CorpusLoaders.jl is packed full of that.
I think this type of solution is completely in-scope for MLDatasets.
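In Julia that coroutine pattern is just a Channel. A minimal sketch (the directory and buffer size are illustrative):

using FileIO

# Stream decoded images through a buffered Channel (a coroutine);
# only about `buffer` decoded images are held in memory at once.
function eachimage(paths; buffer = 16)
    Channel{Any}(buffer) do ch
        for p in paths
            put!(ch, load(p))   # decode one file at a time
        end
    end
end

for img in eachimage(joinpath.("images", readdir("images")))
    # consume one image at a time
end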

(CorpusLoaders.jl is basically my own shot at MLDatasets for natural language datasets (though there is some overlap); it needs updating at the moment since it was not made with MLDataUtils in mind, and I might want to reconsider some design decisions based on that.)

Since MLDataUtils can process things that are iteration-based or things that are index-based, it is fine to provide either.
Providing an iteration-based source means people with more RAM can collect it and then have it index-based.
Providing an index-based source leads to a more feature-ful MLDataUtils experience right now, right?
But as said, it is work to do, and I don't think there are any really good current solutions for it.
I could be wrong though.

@dfdx (Author) commented Nov 22, 2017

@oxinabox Thanks for the detailed comment!

Though they can only map single files, and the file needs to be formatted as the right kind of binary data.

I immediately see 2 limitations here:

  • the file system should support large files, so FAT32 is out of scope (yes, I still use it, e.g. on a USB stick, since it's the only one compatible with all the systems I meet)
  • most likely you aren't going to delete the original data (e.g. a bunch of images), so basically you have to double the used space on disk - one copy for the original images and one for the array (for compressed image formats like JPG the array file will be even larger)

What I like about MLDataPattern is that data containers (which give you access to most if not all of the cool features) require you to define only 2 functions, nobs() and getobs(), and both can be efficiently implemented for on-disk data (at least for images). I will try to implement this for my dataset and see how it works.

@oxinabox (Member) commented:

Yes and yes, I agree.

the file system should support large files, so FAT32 is out of scope (yes, I still use it, e.g. on a USB stick, since it's the only one compatible with all the systems I meet)

Multi-file data is a must.
Data is almost always broken up into files <4GB, because all kinds of things break if it isn't,
and it is faster to download in pieces.

One supports multiple files in iteration trivially.
A theoretical OnDiskDataSets.jl package would support multi-file mmapped arrays via something like CatViews.jl.

most likely you aren't going to delete the original data (e.g. a bunch of images), so basically you have to double the used space on disk - one copy for the original images and one for the array (for compressed image formats like JPG the array file will be even larger)

Yes.
For some formats maybe something lazy along the lines of MappedArrays.jl could be done;
a kind of MappedReducedArrays.jl may be sufficient.
It would look at a file as a mmapped array, apply a map and a reduce function (something like a kernel, ideally one that only needs local information, though this might not be possible for all file types), and expose the result as an array "view" of different dimensions onto the file on disk.
(I don't know that all useful reductions are invertible though, so it might be read-only.)

This is basically easy in the iteration case, of course.

@dfdx (Author) commented Nov 25, 2017

I made a pull request for the above-mentioned dataset, but the results are different from what I expected. Basically, I created 2 new data types, Food101Data and Food101Targets, and defined the methods:

getobs(::Food101Data, ::Integer)::Array{UInt8,3}
getobs(::Food101Data, ::Vector)::Array{UInt8,4}
nobs(::Food101Data)::Int

# and the same for Food101Targets

Then I called batchview(data) and looked at the first batch. I expected it to be what getobs(::Food101Data, ::Vector) returns, i.e. an Array{UInt8,4} which I can then pass to my training functions. But instead I get:

DataSubset(::Food101.Food101Data, ::UnitRange{Int64}, ObsDim.Undefined())
 100 observations

Is this the intended result, or am I doing something wrong?

@Evizero (Member) commented Nov 25, 2017

Is this the intended result, or am I doing something wrong?

That is on purpose. batchview is still lazy and returns subsets. You can get the data by calling getobs on such a subset.

Here is why: if you want to iterate over all batches, you can use eachbatch instead, which can make use of a pre-allocated buffer that it reuses for every batch. For the pre-allocation to work, your Food101Data has to implement getobs!, though.
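For reference, a hedged sketch of what that could look like for Food101Data (the buffer layout, the use of getobs to decode a single image, and the data/labels container variables are my assumptions):

import LearnBase: getobs!

# Fill a pre-allocated buffer in place instead of allocating a new array.
function getobs!(buffer::Array{UInt8,4}, data::Food101Data, idx::AbstractVector)
    for (k, i) in enumerate(idx)
        buffer[:, :, :, k] = getobs(data, i)   # decode image i into slot k
    end
    return buffer
end

# eachbatch can then reuse a single buffer for every batch:
for (x, y) in eachbatch((data, labels), size = 100)
    # x is the same Array{UInt8,4} on each iteration, refilled in place
end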

@dfdx (Author) commented Nov 25, 2017

Got it, I'll add getobs! and try eachbatch instead of batchview.

Another question about testing. For smaller datasets we can download all the data to check if everything works fine, but a couple of 5-10Gb datasets probably won't make Travis happy. Any other options?

@Evizero (Member) commented Nov 25, 2017

oh, that's a good question. I don't know. I'll think about it though. I am very open to ideas

@dfdx (Author) commented Nov 25, 2017

Basically, the code for reading a particular dataset is unlikely to change often, so we could write testing code, run it locally, and comment it out before merging. This way it would be easy to retest later, but it wouldn't affect Travis or any users not interested in this specific dataset. The only risk factor I see is when 3rd-party libraries (like Images.jl in my case) break and we can't see it because the tests aren't running automatically.

@oxinabox (Member) commented Nov 25, 2017

My plan (and what is partially implemented in CorpusLoaders) is:

  • Unit tests / Tests on subdatasets that I create by hand, running with Travis every time
  • An environment variable controlling if full downloads and tests run

Longer term I plan to set up a separate CI server (either self-hosted, or using another service like CircleCI) to run integration tests with the full datasets (by setting that environment variable).
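For what it's worth, the environment-variable gate could be as small as this in test/runtests.jl (the variable and file names are just examples):

# Cheap unit tests always run; full multi-GB downloads are opt-in.
if get(ENV, "DATASETS_FULL_TESTS", "false") == "true"
    include("full_download_tests.jl")   # hypothetical integration tests
end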

@dfdx (Author) commented Nov 26, 2017

Using environment variables sounds like a perfect solution!

Regarding subdatasets and custom CI, I guess it's much more involved and may not work for all the users. Or maybe I just don't have experience in doing such things.

@Evizero (Member) commented Nov 26, 2017

I think / hope that separation of concerns will help us out here in the long run. For example, once we switch the download logic to use DataDeps.jl as the backend, we know that as long as the URL is not dead, at least the downloading part should work (if not, then it's likely a new issue in DataDeps and not here). Also, given that DataDeps uses checksums, any change to a dataset will trigger visible errors for users, which are likely to be reported quickly.

Concerning other upstream changes (e.g. Images.jl, as you mentioned): if we design the helper functions separately (see the MNIST submodule for an example of how I want to structure datasets in the future), then these helper functions should be easily testable with toy data of the right structure (like some Array{UInt8,3}).

All in all I think we need not be too concerned with CI right now (unless it interests you). I'd rather focus on increasing the utility of this package by 1.) adding more diverse datasets and 2.) improving the design/interface to nicely deal with issues that come up in 1.

@Evizero (Member) commented Nov 26, 2017

To elaborate on my reference to the MNIST submodule. I think the cleanest approach to provide datasets to users is to simply return the data in the most sensible native form.

As an example, MNIST.traintensor() returns a 28x28x60000 Array{Float64,3} in the native memory layout (which means width*height*N, which is not how Images needs it). This means the feature order is the same as in all other examples/papers that use MNIST. In order to get a proper Julia image one can use MNIST.convert2image(array), which returns an Array{Gray{Float64}} with the "Julian" memory layout needed to display an image properly in a notebook etc.
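For concreteness, usage looks something like this:

using MLDatasets

X   = MNIST.traintensor()               # 28x28x60000 Array{Float64,3}, native layout
img = MNIST.convert2image(X[:, :, 1])   # Array{Gray{Float64}}, ready for display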

As a side note: I went back and forth between returning an Array{Float64} or a native Array{UInt8}, but I decided that UInt8 is pretty much never what a user wants. Plus the dataset is tiny. I am less certain which direction to go with such a big dataset, but somehow I also don't expect an Array{UInt8} to be of much use.

@dfdx (Author) commented Nov 29, 2017

Sorry for the infrequent replies - I've been overwhelmed with other projects I'm in charge of.

I agree that UInt8 is rarely useful in practice, but it's unclear what to use instead:

  • converting each UInt8 to Float64 would result in an 8x increase in size, and in the case of Food-101 that means 40Gb instead of only 5Gb
  • MNIST has only one channel, RGB images have 3; it's much less common to use Float64 for them
  • often in image recognition mini-batches are immediately copied to GPU memory as arrays of Float32, not Float64 (at least for CUDA, not sure about OpenCL)

Regarding the native layout: in the case of MNIST there are a lot of papers using it, but for a less well-known dataset like Food-101 there are almost none, so aligning with others doesn't make much sense.

I agree about CI - I've got several much more important issues with this example dataset. In particular, not all images are good (see JuliaIO/ImageMagick.jl#106), and even if they were perfect, loading just one from disk takes ~5ms, so reading a mini-batch of 100 images takes nearly as long as an average single convnet iteration. Maybe memory-mapped arrays and/or caching will help to reduce it.

@Evizero (Member) commented Dec 28, 2017

Now that I am working on integrating DataDeps (and with it updating CIFAR-10 and CIFAR-100), I am revisiting the Float64 vs UInt8 topic.

I am currently leaning to the following solution, which I will also adopt for MNIST:

  • by default CIFAR10.traintensor(::Int) will return an Array{N0f8,3} of dimensions (width,height,channels), i.e. the way they are stored. I think N0f8 is a better default than UInt8, and it causes no additional overhead. Similarly, CIFAR10.traintensor([::Vector]) will return an Array{N0f8,4} of dimensions (width,height,channels,nobs).

  • All functions accept an eltype as the first parameter, i.e. CIFAR10.traintensor(UInt8, ...) or CIFAR10.traintensor(Float32, ...). The upside of this is that by default no additional work needs to be done, and the optional conversion can be done efficiently. Also nice is that specifying a floating-point type will return the values in 0.0-1.0, while specifying an integer type will return the values in 0-255.
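In use, the proposal would look something like this (a sketch of the proposed API, not something implemented yet):

X  = CIFAR10.traintensor()              # Array{N0f8,4}, (width,height,channels,nobs), no conversion work
Xf = CIFAR10.traintensor(Float32)       # Array{Float32,4}, values in 0.0-1.0
Xi = CIFAR10.traintensor(UInt8, 1:100)  # Array{UInt8,4}, values in 0-255, first 100 observations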

Concerning dimensions: I think using the native dimension layout is a good default. I'll provide some util functions to change that (like CIFAR10.convert2image(...) or some keyword parameter). For Food-101 I don't think it matters, because the data doesn't seem to be in a special binary format like CIFAR or MNIST, but simple JPGs, right? In that case it's fair enough to just use whatever FileIO gives you.

@dfdx (Author) commented Dec 28, 2017

Does it mean a new dataset will need to implement traintensor in addition to traindata?

For Food-101 I don't think it matters, because the data doesn't seem to be in a special binary format like CIFAR or MNIST, but simple JPGs, right?

Exactly, but it's not really good. I found that reading images from disk, decoding them into a matrix and resizing them (in Food-101 images may have different sizes) is about 3x slower than running a single pass of a 3-layer ConvNet on the same data. So a large portion of the time during learning is spent not on actual training but on getting the data, which is unacceptable. This is the main reason for not finishing this PR, although CIFAR and MNIST shouldn't have these problems.

Also, in PyTorch I found an excellent ImageFolder dataset class. Using it, you specify a path to a folder with images like this:

root/
    label1/
        img1.jpg
        img2.jpg
    label2/
        img1.jpg
        img2.jpg
        img3.jpg

and a transformation function to apply to each image. I think it's a much better design, and specific datasets may simply add automatic downloading on top. The downside is that, given a custom transformation function, it's much harder to preserve type stability - something your design is good at.

@Evizero (Member) commented Dec 29, 2017

Does it mean a new dataset will need to implement traintensor in addition to traindata?

Right now, yes, but I am not completely sold on it. It is convenient if you just want to display some image etc. Also, it makes sense for MNIST, where the labels are in a separate file. The story isn't that clear once we look at CIFAR, where the labels and images are stored alternating in the same file(s). There I simply implemented traindata only and made traintensor(...) = traindata(...)[1].

@Evizero (Member) commented Dec 29, 2017

What do you think about the idea that after download we repack the data into a HDF5 file? HDF5 supports compression as well as reading individual "datasets" (in our case individual images within that file)

I do like the idea of providing some convenient reader for image folders, but within the scope of this package we also have the option of repackaging the data after it's downloaded. The downside would be a binary dependency on HDF5, and I'll admit I am not even sure if it buys performance or memory.
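A hedged sketch of that repacking step (file and dataset names are made up; collect-ing the raw channel view is one way to get plain UInt8 pixel data):

using HDF5, FileIO, ImageCore

# Repack each decoded image into its own HDF5 dataset so single images
# can be read back later without touching the rest of the file.
h5open("food101.h5", "w") do file
    for (i, path) in enumerate(image_paths)   # image_paths is assumed
        img = load(path)
        file["img/$i"] = collect(rawview(channelview(img)))   # raw UInt8 pixels
    end
end

img42 = h5read("food101.h5", "img/42")   # read one image in isolation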

I think it's a much better design, and specific datasets ...

Better in relation to what? It sounds to me like the difference is a.) the folder structure on the disk, and b.) the option to specify a custom transformation function. I think both ideas are orthogonal to most performance issues that you found in your current PR.

@dfdx (Author) commented Dec 29, 2017

What do you think about the idea that after download we repack the data into a HDF5 file? HDF5 supports compression as well as reading individual "datasets" (in our case individual images within that file)

Yes, I thought about it, but the downsides are:

  • it increases the space used on disk, since you most likely don't want to delete the original images; also, I don't think HDF5 compression is better than JPG for images, so the resulting file may be pretty large
  • in the Food-101 dataset, for example, images may have different sizes; to put them into HDF5 and increase performance we would need to resize them to a common size, but which one? The default for Food-101 seems to be 512x512, which is larger than needed for most applications

I think preprocessing images and using a better storage format is the right direction, but the devil is in the details. Even if we decide to delete the original images and figure out a good default size, the resulting HDF5 file will be > 4Gb for many datasets, so we would need to split it into parts, which brings another set of questions.

Better in relation to what? [...] I think both ideas are orthogonal to most performance issues that you found in your current PR.

This part isn't about performance but the convenience of adding new datasets: with a custom solution in Julia I spent several days plugging in Food-101; with PyTorch's ImageFolder it took only a few minutes. We could go the same way and provide a set of functions like:

getobs(dataset::ImageFolder, ...)
nobs(dataset::ImageFolder, ...)
traintensor(dataset::ImageFolder, ...)

and concrete datasets may then reuse it simply by inheriting (or including) ImageFolder:

struct Food101Dataset <: ImageFolder
     root::AbstractString
end
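For illustration, a hedged sketch of how that generic piece could work (all names are mine; it assumes each concrete subtype has a root field, and caching of the file list is omitted for brevity):

using FileIO
import LearnBase: nobs, getobs

abstract type ImageFolder end   # concrete datasets subtype this

# One (path, label) pair per image, with the label taken from the name
# of the sub-folder, as in PyTorch's ImageFolder. (In practice this list
# should be cached instead of rebuilt on every call.)
imagepairs(root) = [(joinpath(root, label, file), label)
                    for label in readdir(root)
                    for file in readdir(joinpath(root, label))]

nobs(d::ImageFolder) = length(imagepairs(d.root))

function getobs(d::ImageFolder, i::Integer)
    path, label = imagepairs(d.root)[i]
    (load(path), label)
end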

@Evizero (Member) commented Dec 30, 2017

Yes, I thought about it, but the downsides are

right. all very good points. I will think on this a little

This part isn't about performance but the convenience of adding new datasets: with a custom solution in Julia I spent several days plugging in Food-101; with PyTorch's ImageFolder it took only a few minutes.

I agree with you that a folder image source is a good generic idea. I had something like that a long time ago actually (https://github.com/Evizero/AugmentorDeprecated.jl/blob/master/src/dirimagesource.jl), but after my use case was done I didn't really work with filesystem-based datasets directly anymore (occasional offline resizing and repacking was just more convenient).

The question is where it should live nowadays. MLDatasets seems like a good candidate if it really needs to reason about colors and such. If it doesn't actually have to understand more than FileIO.load then it could also reside in MLDataPattern or MLDataUtils.
