
Discussion: Using shallow git clones #48

Open
jmholla opened this issue Jan 28, 2025 · 8 comments

Comments

@jmholla

jmholla commented Jan 28, 2025

I don't have any hard data, but from experience and from reading the code, it looks like an entire git clone is performed. I imagine it should be sufficient to make shallow clones of the various branches one needs to request, and that should provide enough information to generate the commits necessary for locking the repositories. But there could also be a gap here in my understanding of the git tool, the SSH/HTTP APIs, and/or git internals.
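
For illustration, a minimal sketch of what I mean (the repository URL and branch name are placeholders):

git clone --depth=1 --single-branch --branch main https://example.com/org/state-repo.git

That fetches only the tip commit of the one branch instead of the full history of every branch.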

My motivation for starting this discussion is that my own state repository has grown to about 70 MB, so I really notice the slowness during wrapper runs. Another option I've considered is running the backend as a local service, though I really like the CLI wrapper style of usage. I also considered patching the code to cache repositories to disk locally for reuse, but shallow cloning felt like a potentially more durable and widely useful solution, so I wanted to open this discussion before delving deeper into what that would look like.

I hope this is the right place for this.

@jmholla
Author

jmholla commented Jan 28, 2025

I believe this could be taken as far as only grabbing the relevant files from the repository, reducing the necessary bandwidth even further. I have this example lying around:

git clone -n --depth=1 --filter=tree:0 <repository>
cd ./<repo>
git sparse-checkout set <path>
git checkout

That only requests the files in <path> from the tip of the default branch of <repository>.

@dee-kryvenko
Member

Hi - thanks for the interest in this project. Cloning the repo for reads would be one thing; I am not aware that you can write new commits back from a shallow clone. Can you? I honestly have no idea, never put much thought into it. If you can, then yeah - it's a no-brainer really, and the only reason it's not that way is simply that it never occurred to me that it was possible.

Another option would be to create another implementation for the storage layer and talk directly to the APIs, but that would be a different implementation per provider. Out of curiosity - are you using GitHub or something else?

@jmholla
Author

jmholla commented Jan 29, 2025

Hi - thanks for the interest in this project.

I LOVE this project! I can't imagine having to use any other backend.

Can you? I honestly have no idea, never put much thought into it. If you can, then yeah - it's a no-brainer really, and the only reason it's not that way is simply that it never occurred to me that it was possible.

This is where I figured my own knowledge might be lacking. I was literally playing with git mktree and other plumbing last night, and maybe I've gotten a bit too big for my britches. I'll look into it and report back. And if it works, I'll report back with a pull request in the hopefully not-too-distant future.

Another option would be to create another implementation for the storage layer and talk directly to the APIs, but that would be a different implementation per provider. Out of curiosity - are you using GitHub or something else?

Thanks. That's a route I hadn't considered, and it definitely feels like a good alternative. I'm using GitLab. But I think much of the API is similar. [citation needed]

Thank you for the quick response and the insight!

@dee-kryvenko
Member

They are similar but not exactly the same, which would mean one implementation per provider to maintain. Authentication is different too. It also comes with another problem - APIs are typically rate throttled. But - it would be fast no matter how big the repo is.
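
As a rough, hypothetical sketch of what one such provider-specific implementation might boil down to, here is how a state file could be written through GitHub's "create or update file contents" API (the owner, repo, path, branch, and token are placeholders, and updating an existing file also requires the current blob's sha):

content=$(base64 < terraform.tfstate | tr -d '\n')

curl -X PUT \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/OWNER/REPO/contents/path/to/terraform.tfstate" \
  -d "{\"message\": \"Update state\", \"branch\": \"main\", \"content\": \"$content\", \"sha\": \"$current_blob_sha\"}"

The GitLab equivalent would be a different endpoint and payload, which is exactly the per-provider maintenance cost mentioned above.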

@jmholla
Author

jmholla commented Jan 29, 2025

They are similar but not exactly the same, which would mean one implementation per provider to maintain.

For sure. And even where they agree you can't depend on them to remain aligned.

Authentication is different too. It also comes with another problem - APIs are typically rate throttled. But - it would be fast no matter how big the repo is.

Also great points.


But I've seen very promising results with shallow clones and being able to push changes upstream. It does get into the nitty-gritty of plumbing commands.

Here's a one-liner I put together to test it. It changes the contents of the README.md in a repository to "Testing custom plumbing", with the commit message "Shallow commit". And I was able to push it.

git reset --hard $(echo "Shallow commit" | git commit-tree "$(cat <(echo -e "100644 blob $(echo "Testing custom plumbing" | git hash-object -w --stdin)\tREADME.md") <(git cat-file -p $(git cat-file -p HEAD | grep tree | sed -e 's_.* __') | grep -v "README.md") | git mktree)" -p HEAD)

Basically, it (as a reference for myself and whoever else stumbles across this):

  1. Creates a blob to represent the new contents of the README.md using git hash-object.
  2. Then it gets the tree of the current commit using two invocations of git cat-file -p and removes the line referencing the file we are modifying.
  3. Using cat and echo, it stitches 1 and 2 together into a single tree listing and feeds that to git mktree to create the new tree object.
  4. Then it uses git commit-tree with -p HEAD to create the actual commit from that tree. The -p sets HEAD as the parent, and that can be modified as needed.
  5. Then we use git reset --hard to make the current branch point to the new commit. If we didn't have the commit checked out (I bet git fetch instead of git checkout in the shallow cloning from above would let us do that), we could use git update-ref.

But then you should be able to push your current branch to the remote. (I bet there's even a way to bypass updating the local reference and just push the new commit directly to a remote ref; I just don't know it offhand.)
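
For readability, here is a hypothetical step-by-step expansion of the same idea (the file name, contents, commit message, and branch are taken from the example above; git ls-tree stands in for the two git cat-file -p invocations and produces the same listing):

# 1. Create a blob for the new file contents and capture its hash.
blob=$(echo "Testing custom plumbing" | git hash-object -w --stdin)

# 2. Build a new tree: the current commit's entries minus README.md, plus the new blob.
new_tree=$({ git ls-tree HEAD | grep -v "README.md"; printf '100644 blob %s\tREADME.md\n' "$blob"; } | git mktree)

# 3. Create a commit object pointing at the new tree, with HEAD as its parent.
new_commit=$(echo "Shallow commit" | git commit-tree "$new_tree" -p HEAD)

# 4. Point the current branch at the new commit (git update-ref would also work),
#    then push; no force push is needed because HEAD is the new commit's parent.
git reset --hard "$new_commit"
git push origin HEAD

# Or skip the local reference entirely and push the new commit straight to the
# remote branch ("main" is a placeholder here):
# git push origin "$new_commit":refs/heads/main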

This feels promising, and I'd like to dive into it more, but I do have to table this for the next day or so.

@dee-kryvenko
Member

This looks very promising. Thank you! And it wouldn't require a force push at the end, and it would keep the remote branch clean and linear?

@jmholla
Author

jmholla commented Jan 29, 2025

Yup. The only difference from what git normally does is the file sorting in the tree. I do want to look into that for consistency's sake.

@jmholla
Author

jmholla commented Jan 29, 2025

Ok, I went overboard at first. This is even easier than I expected. You can do that sparse checkout, only change that file, use git add and git commit, and everything works the same. So I think this is mostly going to be a matter of changing the cloning commands.
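
To spell that out, a minimal end-to-end sketch of the porcelain-only flow (the repository URL, branch, and state path are placeholders):

git clone -n --depth=1 --filter=tree:0 --branch main https://example.com/org/state-repo.git
cd state-repo
git sparse-checkout set path/to/state
git checkout

# Edit only the sparse-checked-out file, then commit and push as usual.
git add path/to/state/terraform.tfstate
git commit -m "Update state"
git push origin HEAD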
