Cache and remote structure in the next major version #6702
iesahin
started this conversation in
New Features & Ideas
Replies: 0 comments
This is a more detailed discussion of the cache and remote structure that I started in #6653. Please look for technical and UI-related downsides, and for implementation difficulties, in this proposal.
Requirements
`dvc init` and tracked in Git.
Preliminary design
This is preliminary and will possibly undergo drastic changes after review.
Content directories
Content directories are those that contain file contents as a whole, in parts, or merged into TAR files.
They are named after the hash value of the content or TAR file they contain.
When the file is large, it looks like
When the file is small, the directory looks like
When the file is an archive file, the directory looks like
Content directory hierarchies
These content directories can all reside in hierarchies of various depth:
to
are possible.
When a DVC command looks for a file in the contents, it can either look for the exact directory or for the 2-digit directory that may contain it.
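The lookup described above could be sketched roughly as follows. This is only an illustration, not part of the proposal: the function name `find_content_dir` is hypothetical, and the sketch assumes that the 2-digit directory is named after the first two characters of the hash and that the content directory inside it still carries the full hash as its name.

```python
from pathlib import Path
from typing import Optional


def find_content_dir(root: Path, hash_value: str) -> Optional[Path]:
    """Return the content directory for hash_value, or None if it is absent.

    Assumed layout (hypothetical): the content directory is named after the
    full hash and sits either directly under root or inside a directory
    named after the first two characters of the hash.
    """
    candidates = [
        root / hash_value,                   # the exact directory
        root / hash_value[:2] / hash_value,  # inside the 2-digit directory
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Deeper hierarchies would need more candidates, or a walk down the prefix directories, but the idea stays the same: try the flat location first, then the prefixed one.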
Operation Logs
An `oplog` is similar to Git's reflog, but it records the operations performed by a DVC repository on a remote.
These logs contain `GET`/`PUT`/`DELETE` instructions, the path of the file in the DVC remote, and the hash value that identifies the content directory. They could be in JSON for easier parsing; here I used a flat structure.
These logs are named per-DVC-project-GUID + per-access-id + timestamp. So, when I access a remote from `192.168.1.1`, with a project GUID of `ZZ90-FF11`, it will create two files in `/META/logs/`.
`dvc-begin==ZZ90-FF11==192.168.1.1==189798739837.log` may contain some information about the transaction, such as the command that initiated it. This is the first file created by DVC in `dvc push` and similar operations. It signals the beginning of a transaction. It may also contain a plan for the transaction if it's known.
`dvc-end==ZZ90-FF11==192.168.1.1==189798729837.log` contains all the operations successfully performed by DVC at the end of the transaction.
When a `dvc-begin...` file is present in `/META/logs/` but the corresponding `dvc-end...` file is not, the transaction wasn't properly completed. DVC can try to rerun the command, or delete this `dvc-begin-...` file if the operation was cancelled intentionally.
Another kind of file, named `dvc-base==...`, can be created from time to time to list all the previous operations and the files. A command like `dvc remote fsck` could merge all `dvc-begin...` and `dvc-end...` files into `dvc-base=...` files. DVC could then load the most recent `dvc-base=...` file and any later `dvc-begin=...` and `dvc-end=...` files to get all the files available in a remote, with their hash values and paths in the repository.
These oplogs are duplicated in all caches and remotes, which their unique filenames make possible. If there are N users of a repository with M remotes, it's possible to get a list of each of the N users' cache status and the files in the M remotes without making a network request. So if a certain file is known to be available in remote A but not in remote B, DVC can automatically request the file from remote A.
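To make the naming scheme and the detection of unfinished transactions more concrete, here is a minimal sketch. All helper names (`LogName`, `build_log_name`, `parse_log_name`, `unfinished_transactions`) are hypothetical. The text above does not say how a `dvc-end` file is matched to its `dvc-begin` file, so the sketch assumes that both share the same project GUID, access id and timestamp; a real design might pair them differently.

```python
from typing import Iterable, List, NamedTuple, Optional

SEPARATOR = "=="  # separator taken from the example file names
KINDS = ("dvc-begin", "dvc-end", "dvc-base")


class LogName(NamedTuple):
    kind: str       # one of KINDS
    project: str    # project GUID
    access_id: str  # e.g. the address the remote was accessed from
    timestamp: str  # kept as a string, exactly as written in the name


def build_log_name(kind: str, project: str, access_id: str, timestamp) -> str:
    """Join the four fields into an oplog file name."""
    return SEPARATOR.join([kind, project, access_id, str(timestamp)]) + ".log"


def parse_log_name(filename: str) -> Optional[LogName]:
    """Split an oplog file name into its fields, or return None if it is not one."""
    if not filename.endswith(".log"):
        return None
    parts = filename[: -len(".log")].split(SEPARATOR)
    if len(parts) != 4 or parts[0] not in KINDS:
        return None
    return LogName(*parts)


def unfinished_transactions(filenames: Iterable[str]) -> List[LogName]:
    """Return the dvc-begin logs that have no matching dvc-end log.

    Assumption: begin and end logs of one transaction share the same
    project GUID, access id and timestamp.
    """
    parsed = [p for p in map(parse_log_name, filenames) if p is not None]
    ended = {
        (p.project, p.access_id, p.timestamp)
        for p in parsed
        if p.kind == "dvc-end"
    }
    return [
        p
        for p in parsed
        if p.kind == "dvc-begin"
        and (p.project, p.access_id, p.timestamp) not in ended
    ]
```

Calling `unfinished_transactions` on a listing of `/META/logs/` would give the transactions that DVC could either retry or clean up. Since everything needed is encoded in the file names, this check needs only a directory listing and does not have to read the log contents.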
TBC