[Feature] lance support mosaicml streaming #3461
Comments
Yeah, I think that's a great idea. We've talked about that with the Mosaic folks in the past. Do you want to cross-post there, and maybe we can strike up the collaboration conversation again? Likely the integration needs to live in the Mosaic streaming repo at the end of the day. Would you have interest in helping develop that integration?
Well, I am interested in participating in this development. I'd like to confirm whether there is a design document already.
FWIW, since I first brought this up, I have actually implemented a version of this within an internal system. Unfortunately this means I cannot open-source it; however, I can describe the rough approach I took. The integration path was to dynamically generate the "fragments" that mosaicml streaming needs, as just the metadata necessary to make the lance.take calls for the custom reader. The main difficulties of this approach are …
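For illustration, a minimal sketch of what dynamically generated shard metadata plus a take-backed fetch could look like. The shard-size constant and helper names are hypothetical; the only real APIs used are lance.dataset, LanceDataset.count_rows, and LanceDataset.take.

```python
import lance
import pyarrow as pa

SAMPLES_PER_SHARD = 4096  # hypothetical shard size


def build_shard_metadata(uri: str) -> list[dict]:
    """Partition the dataset's row-id space into pseudo-shards.

    Each entry is just the metadata a streaming-style reader would need
    to issue the corresponding `take` call later.
    """
    ds = lance.dataset(uri)
    n = ds.count_rows()
    return [
        {"uri": uri, "start": start, "end": min(start + SAMPLES_PER_SHARD, n)}
        for start in range(0, n, SAMPLES_PER_SHARD)
    ]


def fetch_shard(shard: dict) -> pa.Table:
    """Materialize one pseudo-shard with a single batched random read."""
    ds = lance.dataset(shard["uri"])  # could also be cached per worker
    return ds.take(list(range(shard["start"], shard["end"])))
```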
I'd be curious to know more about what you mean by this. Our intention was that Lance should work fine in multithreaded environments. Everything should be thread-safe, and most heavy operations should release the GIL. Are there specific things that don't work in threads?
@wjones127 Yes, I basically hit a bunch of GIL locks when those two threads try to access the Lance dataset (not at the same time, just at all); the lance calls just stall. It's very possible this has to do with the PyTorch DataLoader fork vs. spawn behavior, but I seem to recall that even when using spawn, if I didn't additionally spawn an extra worker just for Lance, the process still didn't work.
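For reference, a hedged sketch of forcing spawn-based workers in the PyTorch DataLoader, which is the fork-vs-spawn knob mentioned above. The dataset class here is a trivial placeholder; `multiprocessing_context="spawn"` is a standard DataLoader argument.

```python
from torch.utils.data import DataLoader, Dataset


class PlaceholderDataset(Dataset):
    """Stand-in for a dataset whose workers each open Lance themselves."""

    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return idx


# Spawned workers start from a clean interpreter state instead of a forked
# copy of the parent's, which is what the fork-vs-spawn discussion refers to.
loader = DataLoader(PlaceholderDataset(), num_workers=2,
                    multiprocessing_context="spawn")
```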
@oceanusxiv Have you tried pickling and unpickling the dataset? Then I guess the two threads would share nothing.
@chenkovsky the "dataset" we're referring to is the …
I think lance.dataset.LanceDataset supports pickle (see lance/python/python/lance/dataset.py, line 245 at f73398a).
Actually, the Ray support depends on pickle/unpickle.
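A quick check of that pickle round-trip could look like the sketch below; the URI is a placeholder, and only lance.dataset, pickle.dumps/loads, and LanceDataset.take are assumed.

```python
import pickle

import lance

ds = lance.dataset("s3://bucket/path/to/table.lance")  # placeholder URI

# Round-trip the dataset handle; the unpickled copy reopens the dataset
# in the receiving worker, so nothing is shared across threads/processes.
blob = pickle.dumps(ds)
ds_copy = pickle.loads(blob)

print(ds_copy.take([0, 1, 2]))  # each worker issues its own reads
```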
@oceanusxiv …
@Jay-ju Yes, it's a performance thing. None of this is necessary if the Lance dataset exists locally, but when the Lance dataset is remote and you must make network calls, it just isn't performant enough to do a row-id pull for every sample. Pulling the remote source into a local cache as shards is a central conceit of the streaming dataset, specifically for performance reasons.
@chenkovsky Oh, I didn't realize that works for the dataset itself; yeah, this might be worth a try instead of trying to spawn a separate process.
@oceanusxiv …
@Jay-ju Correct, it's just not fast enough if you have to do a network call every time you want to retrieve a single sample. It's plenty fast if the Lance dataset is on the local filesystem.
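To make the performance point concrete, a hedged comparison sketch: one take call per sample versus one batched take per block of samples. The block size and URI are made up; only lance.dataset and LanceDataset.take are assumed.

```python
import lance

ds = lance.dataset("s3://bucket/path/to/table.lance")  # remote dataset, placeholder URI
indices = list(range(10_000))

# Per-sample pulls: one network round trip per row -- the slow pattern
# described above.
# rows = [ds.take([i]) for i in indices]

# Block-sized pulls: amortize the round trips by fetching many rows at once,
# roughly what caching the data locally as shards buys you.
BLOCK = 1024  # hypothetical block size
tables = [ds.take(indices[i:i + BLOCK]) for i in range(0, len(indices), BLOCK)]
```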
Like this issue: mosaicml/streaming#832
Lance can perform random reads very well, and the primary key enables good shuffling. However, streaming can provide a better shuffle algorithm for training.
So I am bringing up the possibility of this combination again and would like to discuss ideas with the community here.