Conversation

@sjawhar sjawhar commented Dec 3, 2025

NOTE: The code below was entirely LLM-written. Will gladly clean up / rewrite if a contribution of this type would be accepted. Please read on for context.

Our org uses DVC for a bunch of stuff. In most of our pipeline repos we have a CI check that verifies that the pipeline has been fully reproduced (dvc repro --dry --allow-missing) and all data has been pushed to the remote (dvc data status --not-in-remote) before merging to main. Recently, this CI check on one of our repos started taking so long that it times out. I thought I knew what the bottleneck was (inefficient remote checking), confirmed it with a profiler, then stuck models on the problem. I've included flamegraphs and cProfile logs from before and after for comparison.

So, is there a version of this change that you'd accept? Note that it also requires a small change to DVC itself.

Cheers!

BEFORE
dvc_flamegraph (flamegraph image)

AFTER
dvc_profile_final (flamegraph image)

dvc_cprofile.zip (attached cProfile logs)

@github-project-automation github-project-automation bot moved this to Backlog in DVC Dec 3, 2025

CLAassistant commented Dec 3, 2025

CLA assistant check
All committers have signed the CLA.

skshetry commented Dec 3, 2025

Definitely interested, and open to contributions. I actually implemented something similar a few months ago in DVC:

That approach, however, broke --no-remote-refresh, and I didn't want to break it in a minor release. The full implementation turned out to be more complex, so I ended up reverting the PR. Removing the --no-remote-refresh flag would make the implementation simpler, but it breaks compatibility.

And I'd like to maintain compatibility.

build_entry() internally makes an fs.info() call, so if we can pass the info to it, it won't call fs.info() a second time (which is what happens in your implementation).
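For illustration, a minimal sketch of that idea, assuming build_entry() grows an optional info parameter; the names and the returned shape below are illustrative, not the actual dvc_data API:

from typing import Any, Optional


def build_entry(path: str, fs: Any, info: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build an index entry for `path`, reusing pre-fetched info when given.

    If the caller already has `info` (e.g. from a batched fs.info() pass),
    no extra request is made; otherwise fall back to a single fs.info().
    """
    if info is None:
        info = fs.info(path)  # only hit the filesystem when nothing was pre-fetched
    return {
        "path": path,
        "size": info.get("size"),
        "isdir": info.get("type") == "directory",
    }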

Regarding the implementation, for bulk checks, we should leverage fs.info() in batches. This functionality already exists in some form:

https://github.com/treeverse/dvc-objects/blob/0c04cec4c0d97416fad9535e19d0de39f288556a/src/dvc_objects/fs/base.py#L587

It can make batched asyncio calls to fs._info(), or fall back to using fs.info() in a threadpool executor. The only issue is that it currently raises an error if a file is missing, even in batch mode, which is something we'd need to handle, maybe by extending fs.info() with return_exceptions=True|False or some other mechanism.

Utilizing that, I think we can implement a batched remote exists check that would be fast enough for all cases.
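A rough sketch of that direction, assuming an fsspec-style async filesystem that exposes _info() and surfaces a missing file as FileNotFoundError:

import asyncio
from typing import Any


async def bulk_exists(fs: Any, paths: list[str]) -> dict[str, bool]:
    """Check many remote paths with one batch of _info() calls.

    return_exceptions=True keeps a single missing file from aborting the
    whole batch; only FileNotFoundError is treated as "missing", anything
    else (auth, network) is re-raised.
    """
    infos = await asyncio.gather(
        *(fs._info(path) for path in paths),
        return_exceptions=True,
    )
    results: dict[str, bool] = {}
    for path, info in zip(paths, infos):
        if isinstance(info, FileNotFoundError):
            results[path] = False
        elif isinstance(info, BaseException):
            raise info
        else:
            results[path] = True
    return results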

skshetry commented Dec 3, 2025

all data has been pushed to the remote (dvc data status --not-in-remote) before merging to main. Recently, this CI check on one of our repos started timing out because of how long it's taking.

Do you use --not-in-remote with --granular?
If not, how many .dvc files or outputs do you have? Without --granular, dvc makes only a single request per output. For tracked directories, it won't check the files inside, just the .dir file.

DVC pushes the .dir file last, after all the entries tracked by that .dir have been pushed, so the result should be the same under ideal conditions.

- use return_exceptions=True for batch retrieval
- skip unnecessary network calls by accepting cached_info
- do a single fs.info call, then pass that info to build_entry
- we group storage instances by their underlying ODB path to unify
  batches and perform the fs.info call for the entire batch

falko17 commented Dec 7, 2025

Hi @skshetry! Sami asked me to take over his PR for now.

I've significantly rewritten the code¹ to fit your suggestions. To summarize:

  • fs.info now has a return_exceptions parameter, which is used by the bulk_*_exists methods
  • We group storage instances by their underlying ODB path to unify batches, then perform the fs.info call for the entire batch and pass the resulting info to build_entry.
  • This also uses the existing batch functionality from fs.info instead of using another ThreadPool on top of it.
  • Finally, there are some smaller fixes/changes; for example, the progress bar now updates correctly for bulk calls.

Changes are available here (these are links to diffs from the current respective treeverse:main):

I can make a new PR with the other two repos later on, but I first wanted to comment here and see if you'd accept this approach at all or if there are any bigger changes I should implement first.

Footnotes

  1. It's still a bit messy and could be improved, but I wanted to get your opinion on the approach first.

skshetry commented Dec 7, 2025

Contributions are always welcome. Please go ahead and open the pull requests, and we can discuss details during review.

Comment on lines 553 to 555
else:
    for entry in callback.wrap(storage_entries):
        results[entry] = storage.exists(entry, **kwargs)
Collaborator

Please create bulk_exists with this naive implementation on the base class. We can optimize this in the future, and will also clean up the code.
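A minimal sketch of such a naive default on the base class (the class and method names here are illustrative stand-ins, not the real dvc_data classes):

from typing import Any


class Storage:
    """Illustrative stand-in for the storage base class."""

    def exists(self, entry: Any, **kwargs: Any) -> bool:
        raise NotImplementedError

    def bulk_exists(self, entries: list[Any], **kwargs: Any) -> dict[Any, bool]:
        """Naive default: one exists() call per entry.

        Backends with a batched primitive (e.g. async fs.info) can override
        this with a single batched request.
        """
        return {entry: self.exists(entry, **kwargs) for entry in entries}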

I've added a naive bulk_exists to the base class.

Comment on lines +576 to +580
# Maps from path to info
cached_info: dict[str, Any] = {
    p: info if not isinstance(info, Exception) else None
    for p, info in zip(all_paths, batch_info)
}
Collaborator

Why are we caching?

This is so that when we call bulk_exists for the other storage instances that have the same ODB path, we don't have to call fs.info again and instead re-use the info we've already retrieved.

Collaborator

for the other storage instances that have the same ODB path

What is the use case? When would those different instances have the same path?

@falko17 falko17 Dec 7, 2025

Short disclaimer: I have to admit I'm not that familiar with DVC internals here, so it's entirely possible I'm misunderstanding some part of this 😅

But I noticed that it does sometimes happen that separate storage instances have the same remote path. As a concrete example, when I tried out the example repo I didn't actually see a speed-up with the bulk changes here (before implementing the caching part) when running dvc data status --not-in-remote. When I looked into it with the help of pdb, I saw that there were multiple remotes with the same path (in this case due to different outputs):

Output of `p list(by_storage.keys())` in index.py:546
[ObjectStorage(key=('model.pkl',), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f062ccd0>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f07f3680>, read_only=False), ObjectStorage(key=('eval',), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f05b63f0>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f060c6b0>, read_only=False), ObjectStorage(key=('data', 'prepared'), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f0a4a900>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f05b5940>, read_only=False), ObjectStorage(key=('data', 'features'), odb=HashFileDB(fs=<dvc_http.HTTPSFileSystem object at 0x7de6f062c7d0>, path='https://remote.dvc.org/get-started/files/md5', read_only=False), index=<dvc_data.index.index.DataIndex object at 0x7de6f05d96d0>, read_only=False)]
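The grouping idea amounts to something like this rough sketch, assuming each ObjectStorage exposes its ODB's remote path via storage.odb.path, as in the dump above:

from collections import defaultdict
from typing import Any


def group_by_odb_path(storages: list[Any]) -> dict[str, list[Any]]:
    """Group storage instances by the path of their underlying ODB.

    Several ObjectStorage instances (one per tracked output) can point at
    the same remote path, so batching fs.info per group and caching the
    results avoids asking the remote about the same object files twice.
    """
    groups: dict[str, list[Any]] = defaultdict(list)
    for storage in storages:
        groups[storage.odb.path].append(storage)
    return dict(groups)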

value = cast("str", entry.hash_info.value)
key = self.odb._oid_parts(value)

if isinstance(info, Exception) or info is None:
Collaborator

I think we should only handle FileNotFoundError and fail in all other cases.

Agreed, changed it accordingly.
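For reference, the narrowed handling amounts to something like this sketch (hypothetical helper name; it mirrors the return_exceptions handling sketched earlier):

from typing import Any


def info_to_exists(info: Any) -> bool:
    """Map a batched info result to an exists flag.

    Only a missing file counts as "not in remote"; any other exception is
    re-raised so real failures (auth, network) are not silently swallowed.
    """
    if isinstance(info, FileNotFoundError):
        return False
    if isinstance(info, BaseException):
        raise info
    return info is not None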
