Add optional bulk_remote_exists #637
base: main
Conversation
Definitely interested, and open to contributions. I actually implemented something similar a few months ago in DVC. That approach, however, broke compatibility, which I'd like to maintain.
Regarding the implementation: for bulk checks, we should leverage the underlying filesystem's async support, which can make batched asyncio calls to the remote. Using that, I think we can implement a batched remote-exists check that would be fast enough for all cases.
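For illustration (not part of the original comment): a minimal sketch of that kind of batched call, assuming an fsspec-style async filesystem where `_info` is the async counterpart of `info`; the helper name `batch_info` is made up here.

```python
import asyncio
from typing import Any

from fsspec.asyn import AsyncFileSystem


async def batch_info(fs: AsyncFileSystem, paths: list[str]) -> list[Any]:
    # One `_info` coroutine per path, gathered concurrently. With
    # return_exceptions=True a single FileNotFoundError does not cancel the
    # rest of the batch; each result is either an info dict or the exception.
    return await asyncio.gather(
        *(fs._info(path) for path in paths),
        return_exceptions=True,
    )
```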
Do you use DVC pushes?
- use return_exceptions=True for batch retrieval
- skip unnecessary network calls by accepting cached_info
- do a single fs.info call, then pass that info to build_entry
- group storage instances by their underlying ODB path to unify batches and perform the fs.info call for the entire batch
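Roughly, the pieces listed above could fit together as in the following sketch. Only `bulk_exists` and `cached_info` follow names from this thread; the function name and input shape are hypothetical, not the actual diff.

```python
from collections import defaultdict
from typing import Any


def bulk_remote_exists(entries_with_storage: list[tuple[Any, Any]]) -> dict[Any, bool]:
    # Group storage instances by their underlying ODB path so that all
    # entries sharing a remote location are checked as one batch.
    by_path: dict[str, list[tuple[Any, Any]]] = defaultdict(list)
    for entry, storage in entries_with_storage:
        by_path[storage.odb.path].append((entry, storage))

    cached_info: dict[str, Any] = {}  # remote path -> info dict (None if missing)
    results: dict[Any, bool] = {}
    for grouped in by_path.values():
        storage = grouped[0][1]
        batch = [entry for entry, _ in grouped]
        # One batched call per ODB path; cached_info is threaded through so a
        # path already stat'ed for another storage is not fetched again.
        results.update(storage.bulk_exists(batch, cached_info=cached_info))
    return results
```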
Hi @skshetry! Sami asked me to take over his PR for now. I've significantly rewritten the code to fit with your suggestions. So to summarize:
Changes are available here (these are links to diffs from the current respective branches).
I can make a new PR with the other two repos later on, but I first wanted to comment here and see if you'd accept this approach at all or if there are any bigger changes I should implement first.
Contributions are always welcome. Please go ahead and open the pull requests, and we can discuss details during review.
src/dvc_data/index/index.py
Outdated
```python
else:
    for entry in callback.wrap(storage_entries):
        results[entry] = storage.exists(entry, **kwargs)
```
Please create `bulk_exists` with this naive implementation on the base class. We can optimize this in the future, and it will also clean up the code.
I've added a naive bulk_exists to the base class.
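For reference, a naive base-class fallback of the kind requested above might look roughly like this. This is a sketch rather than the merged code: the `exists` call and the callback usage are assumed from the snippet above, and the class name is a stand-in.

```python
from fsspec.callbacks import DEFAULT_CALLBACK


class Storage:  # stand-in for the actual base class in dvc_data
    def bulk_exists(self, entries, callback=DEFAULT_CALLBACK, **kwargs):
        # Naive default: check each entry one by one, reporting progress via
        # the callback. Backends that can batch network requests can override
        # this; callers get the same {entry: bool} mapping either way.
        return {
            entry: self.exists(entry, **kwargs)  # `exists` comes from subclasses
            for entry in callback.wrap(entries)
        }
```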
```python
# Maps from path to info
cached_info: dict[str, Any] = {
    p: info if not isinstance(info, Exception) else None
    for p, info in zip(all_paths, batch_info)
}
```
Why are we caching?
This is so that when we call bulk_exists for the other storage instances that have the same ODB path, we don't have to call fs.info again and instead re-use the info we've already retrieved.
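In other words, something along these lines (an illustrative sketch; the helper name and signature are made up, and only the `cached_info` idea comes from the discussion):

```python
from typing import Any


def info_for(fs, path: str, cached_info: dict[str, Any]) -> Any:
    # Re-use info fetched by an earlier batch for a storage that shares the
    # same ODB path; only hit the remote for paths not seen before.
    if path in cached_info:
        return cached_info[path]
    try:
        info = fs.info(path)
    except FileNotFoundError:
        info = None  # a missing object is cached as None, not as an error
    cached_info[path] = info
    return info
```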
> for the other storage instances that have the same ODB path
What is the use case? When would those different instances have the same path?
Short disclaimer: I have to admit I'm not that familiar with DVC internals here, so it's entirely possible I'm misunderstanding some part of this 😅
But I noticed that it does sometimes happen that separate storage instances have the same remote path. As a concrete example, when I tried out the example repo I didn't actually see a speed-up with the bulk changes here (before implementing the caching part) when running `dvc data status --not-in-remote`. When I looked into it with the help of `pdb`, I saw that there were multiple remotes with the same path (in this case due to different outputs):
Output of `p list(by_storage.keys())` in index.py:546
```
[ObjectStorage(key=('model.pkl',), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f062ccd0>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f07f3680>, read_only=False),
 ObjectStorage(key=('eval',), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f05b63f0>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f060c6b0>, read_only=False),
 ObjectStorage(key=('data', 'prepared'), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f0a4a900>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f05b5940>, read_only=False),
 ObjectStorage(key=('data', 'features'), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f062c7d0>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f05d96d0>, read_only=False)]
```
src/dvc_data/index/index.py
Outdated
```python
value = cast("str", entry.hash_info.value)
key = self.odb._oid_parts(value)
...
if isinstance(info, Exception) or info is None:
```
I think we should only handle FileNotFoundError, and fail on other cases.
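A sketch of the distinction being asked for, assuming `info` is one result of a `gather(..., return_exceptions=True)` call as in the earlier snippet (the helper name is made up):

```python
def exists_from_info(info: object) -> bool:
    # Interpret one result of asyncio.gather(..., return_exceptions=True).
    if info is None or isinstance(info, FileNotFoundError):
        return False  # missing object: simply "not on the remote"
    if isinstance(info, BaseException):
        # Auth errors, timeouts, etc. should propagate instead of being
        # silently treated as a missing file.
        raise info
    return True
```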
Agreed, changed it accordingly.
NOTE: The code below was entirely LLM-written. Will gladly clean up / rewrite if a contribution of this type would be accepted. Please read on for context.
Our org uses DVC for a bunch of stuff. In most of our pipeline repos we have a CI check that verifies that the pipeline has been fully reproduced (`dvc repro --dry --allow-missing`) and that all data has been pushed to the remote (`dvc data status --not-in-remote`) before merging to main. Recently, this CI check on one of our repos started timing out because of how long it takes. I thought I knew what the bottleneck was (inefficient remote checking), confirmed it with a profiler, then stuck models on the problem. I've included flamegraphs and cProfile logs from before and after for comparison.
So, is there a version of this change that you'd accept? Note that it also requires a small change to DVC itself.
Cheers!
BEFORE

AFTER

dvc_cprofile.zip