Large datasets #10
From a design standpoint MLDataUtils, or more specifically the backend MLDataPattern, is equipped for iterators: http://mldatapatternjl.readthedocs.io/en/latest/introduction/design.html#what-about-streaming-data. It's just that we don't have any example implementations for something like that yet. Alternatively, it would also be possible to implement custom data container types for these large datasets that delay data access until it is actually needed. The caveat is that this assumes that at least the indices fit into memory; the upside is that such containers work seamlessly with partitioning etc.: http://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html My goal with MLDataPattern was to implement the package in such a way that it can deal with exactly these use cases cleanly in a "Julian" manner. It can even deal with big labeled datasets where the labels themselves are cheap to store and access (and thus available for resampling): http://mldatapatternjl.readthedocs.io/en/latest/documentation/targets.html#support-for-custom-types Maybe the MLDataPattern documentation would be a good start for a discussion? I realize it's quite verbose and a lot to read, but I tried to be very specific with my definitions there.
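For concreteness, here is a minimal sketch of such a lazy data container, following the two-function container interface described in the MLDataPattern documentation. In the real package one would extend `LearnBase.nobs` and `LearnBase.getobs`; the functions are defined standalone here (with made-up names for the loading logic) so the sketch runs without dependencies:

```julia
# Hypothetical sketch: a data container that keeps only file paths in
# memory and produces each observation lazily on access.
struct LazyFileDataset
    paths::Vector{String}   # only the indices/paths live in memory
end

# number of observations: just the number of stored paths
nobs(d::LazyFileDataset) = length(d.paths)

# load a single observation only when it is requested
# (stand-in for an expensive read such as an image load)
getobs(d::LazyFileDataset, i::Integer) = string("loaded:", d.paths[i])

# load a batch of observations given a collection of indices
getobs(d::LazyFileDataset, idx) = [getobs(d, i) for i in idx]
```

With these two functions in place, subsetting and partitioning utilities can operate on the cheap index vector and defer the expensive reads until the data is actually consumed.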
Excellent, thanks! I need some time to go through the docs, so I will get back later.
Maybe @Sheemon7 has some opinions to share on this topic, given that he actually worked with large datasets in combination with MLDataPattern?
My only experience with large datasets and MLDataPattern was one use case, where the feature array couldn't fit into memory but a simple vector of indices could (the dataset comprised around 10^10 examples). I must say that MLDataPattern was smartly designed to cope with such situations. Are your labels multi-dimensional? If not, I am really curious which type of dataset we're talking about (a dataset with so many observations that even the same number of plain numbers doesn't fit into memory). In that case functionality similar to PyTorch's dataloaders is necessary (it isn't supported now, but shouldn't be that hard to implement). In the other case, have you considered a similar strategy as with the features? You can "index" examples by integers and generate their real "values" on demand. There is a …
Oh, I didn't mean labels of that size, only the feature array. Although I can certainly recall a couple of datasets of dozens of terabytes with labels of several thousand gigabytes, for those I would use Hadoop or at least some external database, because even reading such data from a single disk would be terribly slow. Here I mean reasonably large data, labeled or unlabeled. I mostly work with text and images. The dataset that brought me here is Food-101, which contains 101,000 images totaling 5Gb. I can work with it as is with my 16Gb of RAM, but there might be people who can't. I also have a couple of face datasets that are 10-24Gb, which can only be loaded lazily. (I'm in the middle of reading the docs, so no useful comments yet.)
Some musings: the goal of MLDatasets (at its heart) is to get the data into a form that can be processed by MLDataUtils. We can think of data as coming in 3 sizes; these sizes actually depend on the quality of hardware commonly in use in ML.
From that:
Julia provides memory-mapped arrays in Base (in 0.7, in the Mmap stdlib). I think there might be scope for a MediumData.jl (or OnDiskDatasets.jl),
Seems viable, and it seems like it deserves its own package separate from MLDataUtils. Now, handling this data with an iterator is easy. (CorpusLoaders.jl is basically my own shot at MLDatasets for natural language datasets (though there is some overlap); it needs updating at the moment since it was not made with MLDataUtils in mind, and I might want to reconsider some design decisions based on that.) Since MLDataUtils can process things that are iteration-based, or things that are index-based, that means all is good to provide either.
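As a hedged illustration of the memory-mapping option mentioned above (a toy file and toy sizes, not tied to any particular dataset):

```julia
using Mmap

# Sketch: write a small Float32 array to disk, then memory-map it so the
# OS pages data in on demand instead of loading the whole array up front.
path = tempname()
open(path, "w") do io
    write(io, Float32.(1:12))   # 12 raw Float32 values on disk
end

A = open(path, "r") do io
    # behaves like a normal 4×3 Matrix{Float32}, backed by the file
    Mmap.mmap(io, Matrix{Float32}, (4, 3))
end

A[1, 1]   # touching an element faults in only the pages it needs
```

For "medium" data this gives array semantics without requiring the whole dataset to fit in RAM, though random access is bounded by disk speed.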
@oxinabox Thanks for the detailed comment!
I immediately see 2 limitations here:
What I like about MLDataPattern is that data containers (which give you access to most if not all the cool features) require you to define only 2 functions.
Yes and yes, I agree.
Multifile data is a must. One supports multiple files in iteration trivially.
Yes. This is basically easy in the iteration case, of course.
I made a pull request for the above-mentioned dataset, but the results are different from what I expected. Basically, I created 2 new data types:
Then I called
Is this the intended result, or am I doing something wrong?
That is on purpose. Here is why: if you want to iterate over all batches you can use …
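The iterator being referred to is presumably MLDataPattern's batch iterator. A minimal stand-in in plain Julia (the name is mine, and the real implementation is lazier and far more general) shows the behavior under discussion: only full batches are produced, and the incomplete trailing batch is silently dropped.

```julia
# Minimal sketch of a batch iterator: lazily yield contiguous views of
# `sz` observations each, dropping the incomplete trailing remainder.
eachbatch_sketch(data, sz) =
    (view(data, (i - 1) * sz + 1 : i * sz) for i in 1:div(length(data), sz))

batches = collect(eachbatch_sketch(collect(1:10), 4))
length(batches)   # 2 full batches; the last two observations are dropped
```

Using views keeps the iteration allocation-light, which matters once each "observation" is an expensive-to-copy array.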
Got it, I'll add it. Another question, about testing: for smaller datasets we can download all the data to check that everything works fine, but a couple of 5-10Gb datasets probably won't make Travis happy. Any other options?
Oh, that's a good question. I don't know. I'll think about it, though. I am very open to ideas.
Basically, the code for reading a particular dataset is unlikely to change often, so we can write testing code, run it locally, and comment it out before merging. This way it will be easy to retest later, but it won't affect Travis or any users not interested in this specific dataset. The only risk factor I see is when third-party libraries (like Images.jl in my case) break and we can't see it because the tests aren't running automatically.
My plan (and what is partially implemented in CorpusLoaders) is:
Longer term I plan to set up a CI server separately (either self-hosted, or using another service like CircleCI) to run integration tests with the full dataset (by setting that environment variable).
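The environment-variable gate could look roughly like this; the variable name is made up for illustration, and the actual one used in CorpusLoaders may differ:

```julia
# Sketch of an env-var switch for tests: run cheap subset tests by
# default, and the full multi-gigabyte download only when a CI server
# (or a developer) explicitly opts in.
run_full_tests() = get(ENV, "DATASETS_FULL_TEST", "false") == "true"

if run_full_tests()
    println("downloading the full dataset for integration tests")
else
    println("testing against a small bundled subset")
end
```

This keeps the default `Pkg.test` run fast and Travis-friendly while still allowing the expensive integration path to be exercised on demand.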
Using environment variables sounds like a perfect solution! Regarding subdatasets and custom CI, I guess it's much more involved and may not work for all users. Or maybe I just don't have experience doing such things.
I think / hope that separation of concerns will help us out here in the long run, for example once we switch the download logic to use DataDeps … Concerning other upstream changes (e.g. …): all in all I think we need not be too concerned with CI right now (unless it interests you). I'd rather focus on increasing the utility of this package by 1.) adding more diverse datasets and 2.) improving the design/interface to nicely deal with issues that come up in 1.
To elaborate on my reference to the MNIST submodule: I think the cleanest approach for providing datasets to users is to simply return the data in the most sensible native form. As an example, the … As a side note, I went back and forth between returning a …
Sorry for the infrequent replies - I've gotten overwhelmed with other projects I'm in charge of. I agree that …
Regarding the native layout: in the case of MNIST there are a lot of papers using it, but for less well-known datasets like … I agree about CI - I've got several much more important issues with this example dataset. In particular, not all images are good (see JuliaIO/ImageMagick.jl#106), and even if they were perfect, loading just one image from disk takes ~5ms, so reading a mini-batch of 100 images takes nearly as much time as an average single …
Now that I am working on integrating DataDeps (and with it updating CIFAR-10 and CIFAR-100), I am revisiting the … I am currently leaning towards the following solution, which I will also adopt for MNIST:
Concerning dimensions: I think using the native dimension layout is a good default. I'll provide some util functions to change that (like …).
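Such a layout-changing utility could be sketched as follows; the function name and the target layout are hypothetical, chosen only to illustrate the idea of converting a native layout into what a deep-learning framework expects:

```julia
# Hypothetical helper: convert a batch stored in its native
# height×width×N layout into a width×height×channel×N layout.
function to_whcn(native::AbstractArray{T,3}) where T
    h, w, n = size(native)
    # swap the first two dims, then add a singleton channel dim
    reshape(permutedims(native, (2, 1, 3)), w, h, 1, n)
end

imgs = rand(Float32, 28, 28, 5)   # e.g. 5 grayscale 28×28 images
size(to_whcn(imgs))               # (28, 28, 1, 5)
```

Keeping the native layout as the default and offering explicit converters avoids silently transposing data behind the user's back.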
Does that mean a new dataset will need to implement …
Exactly, but it's not really good. I found that reading images from disk, decoding them into a matrix, and resizing them (in Food-101 images may have different sizes) is about 3x slower than running a single pass of a 3-layer ConvNet on the same data. So a large portion of time during learning is spent not on actual training but on getting the data, which is unacceptable. This is the main reason for not finishing this PR, although CIFAR and MNIST shouldn't have these problems. Also, in PyTorch I found an excellent ImageFolder dataset class. Using it, you specify a path to a folder with images like this:
and a transformation function to apply to each image. I think it's a much better design, and specific datasets may simply add automatic downloading on top. The downside is that given a custom transformation function it's much harder to preserve type stability - something that your design is good at.
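An ImageFolder-style container could be sketched in Julia along these lines. The type and function names are made up, and the real loading logic (image decoding, resizing) is stubbed by a user-supplied transform:

```julia
# Hypothetical ImageFolder-style container: class labels are inferred
# from sub-directory names; files are loaded lazily through a transform.
struct ImageFolder
    files::Vector{String}
    labels::Vector{String}
    transform::Function       # e.g. path -> decoded & resized image
end

function ImageFolder(root::AbstractString; transform = identity)
    files, labels = String[], String[]
    for class in readdir(root)             # one sub-directory per class
        dir = joinpath(root, class)
        isdir(dir) || continue
        for f in readdir(dir)
            push!(files, joinpath(dir, f))
            push!(labels, class)
        end
    end
    ImageFolder(files, labels, transform)
end

nobs(d::ImageFolder) = length(d.files)
# the transform runs only on access, keeping memory use at path-level
getobs(d::ImageFolder, i::Integer) = (d.transform(d.files[i]), d.labels[i])
```

Because only paths and labels are held in memory, this composes with index-based partitioning while keeping the expensive decoding on the access path, which is exactly where the performance concern above bites.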
Right now, yes, but I am not completely sold on it. It is convenient if you just want to display some image, etc. Also, it makes sense for MNIST, where the labels are in a separate file. The story isn't that clear once we look at CIFAR, where the labels and images are stored alternating in the same file(s). There I simply implemented …
What do you think about the idea that after download we repack the data into an HDF5 file? HDF5 supports compression as well as reading individual "datasets" (in our case, individual images within that file). I do like the idea of providing some convenient reader for image folders, but within the scope of this package we have the option of repackaging the data after it is downloaded. The downside would be a binary dependency on HDF5, and I'll admit I am not even sure it buys much performance or memory.
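As a rough sketch of the repacking idea using HDF5.jl (the dataset paths inside the file are illustrative, and whether this actually buys performance is exactly the open question above):

```julia
using HDF5

# Repack a handful of "images" (random arrays standing in for decoded
# image data) into one HDF5 file, then read a single one back without
# touching the others.
path = tempname() * ".h5"
h5open(path, "w") do f
    for i in 1:3
        write(f, "images/$(lpad(i, 4, '0'))", rand(UInt8, 8, 8))
    end
end

img = h5read(path, "images/0001")   # loads only this one dataset
size(img)                            # (8, 8)
```

One file on disk with random access to individual observations sidesteps the per-file open/decode overhead, at the cost of the binary dependency mentioned above.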
Better in relation to what? It sounds to me like the difference is a.) the folder structure on the disk, and b.) the option to specify a custom transformation function. I think both ideas are orthogonal to most performance issues that you found in your current PR. |
Yes, I thought about it, but the downsides are:
I think preprocessing the images and using a better storage format is the right direction, but the devil is in the details. Even if we decide to delete the original images and figure out a good default size, the resulting HDF5 file will be > 4Gb for many datasets, and so we'd need to split it into parts, which brings another set of questions.
This part isn't about performance, but about the convenience of adding new datasets: with a custom solution in Julia I spent several days plugging in Food-101; with PyTorch's …
and concrete datasets may then reuse it simply by inheriting (or including) …
Right, all very good points. I will think on this a little.
I agree with you that a folder image source is a good generic idea. I had something like that a long time ago actually (https://github.com/Evizero/AugmentorDeprecated.jl/blob/master/src/dirimagesource.jl), but after my use case was done I didn't really work with filesystem-based datasets directly anymore (offline resizing and repacking was occasionally just more convenient). The question is where it should live nowadays. MLDatasets seems like a good candidate if it really needs to reason about colors and such. If it doesn't actually have to understand more than …
How do we deal with datasets that are too large to be read into an Array? Something of 5-50Gb, for example. Are there any tools for it, or earlier discussion? I thought about: implementing the AbstractArray interface, so existing tools will work. Cons: some tools and algorithms may expect data to be in memory, while for disk-based arrays their performance will drop drastically.
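The AbstractArray option could be sketched like this; names are hypothetical, and the performance caveat from the thread (array-expecting algorithms hammering a slow disk) still applies:

```julia
# Sketch: a read-only AbstractVector whose elements are fetched from disk
# on demand, so existing array-based tools keep working while only the
# path/index vector lives in memory.
struct DiskVector{T} <: AbstractVector{T}
    paths::Vector{String}
    loader::Function          # path -> T, e.g. an image decoder
end

Base.size(v::DiskVector) = (length(v.paths),)
Base.getindex(v::DiskVector, i::Int) = v.loader(v.paths[i])
```

Every generic function written against `AbstractVector` (iteration, `map`, views, shuffling of indices) works unchanged; the cost is that each `getindex` is a disk read, which is exactly the trade-off listed under "Cons" above.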