Optimize Git usage in CAS #3976

yhakbar · 2025-03-06T15:49:05Z

I got good feedback from @apparentlymart regarding an optimization we might be able to make in the CAS system we're introducing in #3929.

To my understanding, it boils down to preserving a single bare git repository as a database, and fetching references from it to minimize the work done over the network when fetching Git content.

I believe this optimization wouldn't remove the need for the existing CAS store, which lets repositories be recreated using hard links to a central store. It would, however, mean that cache misses no longer require a full shallow bare clone of the repository; instead, only the relevant missing objects would be fetched from the remote.
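To make the idea concrete, here is a rough sketch of what fetching into one shared bare repository could look like. It is written in Python only for illustration and is not the implementation from #3929; the function names and the store path are hypothetical. It relies only on the standard `git init --bare`, `git fetch --depth=1`, and `git rev-parse` commands.

```python
import subprocess
from pathlib import Path


def init_cmd(store: Path) -> list[str]:
    """Command that creates the shared bare repository (safe to re-run)."""
    return ["git", "init", "--bare", "--quiet", str(store)]


def fetch_cmd(store: Path, url: str, ref: str) -> list[str]:
    """Command that shallow-fetches a single ref into the shared store."""
    return ["git", "-C", str(store), "fetch", "--quiet", "--depth=1", url, ref]


def fetch_into_store(store: Path, url: str, ref: str) -> str:
    """Fetch `ref` from `url` into `store` and return the fetched commit hash.

    Hypothetical helper: assumes `git` is on PATH and `url` is reachable.
    """
    subprocess.run(init_cmd(store), check=True)
    subprocess.run(fetch_cmd(store, url, ref), check=True)
    # FETCH_HEAD records what the fetch above just retrieved.
    out = subprocess.run(
        ["git", "-C", str(store), "rev-parse", "FETCH_HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()
```

Because every fetch lands in the same object database, Git's fetch negotiation can skip objects the store already holds, which is where the network savings would come from. How much is saved with shallow fetches would need to be measured.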

Adding support for this might require some sort of locking to prevent concurrent updates to the shared git database, but I'm not sure that it does. At a surface level, I'd expect the content in the DB to be immutable and safe to access concurrently, but testing would need to be done to validate that.
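If locking does turn out to be necessary, one simple option is an exclusive lock file held around each update to the shared database. The sketch below is only an assumption about how that could look; `store_lock`, its parameters, and the `.lock` naming are hypothetical, and it uses nothing beyond the Python standard library.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def store_lock(store: Path, timeout: float = 30.0, poll: float = 0.1):
    """Hold an exclusive lock file next to `store` for the duration of the block."""
    lock_path = store.with_name(store.name + ".lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {lock_path}")
            time.sleep(poll)
    try:
        yield lock_path
    finally:
        # Release the lock so that other processes can proceed.
        os.close(fd)
        os.unlink(lock_path)
```

A caller would wrap the fetch in `with store_lock(store): ...`. One known weakness of this approach is that a process that crashes while holding the lock leaves a stale lock file behind, so a real implementation would need some recovery strategy.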

apparentlymart · 2025-03-06T16:52:59Z

In the little rough sketch I shared with you, I think the main point of concurrency contention was the use of the FETCH_HEAD pseudo-ref as a representation of "the commit we most recently fetched", since of course concurrent processes doing that fetch-and-update-ref step would clobber each other's FETCH_HEAD.
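One way to sidestep the FETCH_HEAD clobbering, sketched here as an assumption rather than something tested against the CAS, is to have each process fetch into its own uniquely named ref and skip writing FETCH_HEAD altogether. The `refs/cas/fetch/` namespace and the function names are hypothetical; `--no-write-fetch-head` is a real `git fetch` option but requires Git 2.29 or newer.

```python
import uuid
from pathlib import Path


def private_ref() -> str:
    """Return a ref name that no other process is expected to use."""
    return f"refs/cas/fetch/{uuid.uuid4().hex}"


def isolated_fetch_cmd(store: Path, url: str, ref: str, dest: str) -> list[str]:
    """Command that fetches `ref` into `dest` without touching FETCH_HEAD."""
    return [
        "git", "-C", str(store), "fetch", "--quiet", "--depth=1",
        "--no-write-fetch-head",  # Git >= 2.29
        url, f"+{ref}:{dest}",    # refspec: remote ref -> private local ref
    ]


def resolve_cmd(store: Path, dest: str) -> list[str]:
    """Command that prints the commit hash the private ref points at."""
    return ["git", "-C", str(store), "rev-parse", "--verify", dest]
```

Each process would generate its own `dest` with `private_ref()`, run the fetch, and then resolve `dest` instead of FETCH_HEAD. Whether concurrent fetches into the same object database are otherwise safe would still need testing.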

You might be able to mitigate even that by peeling off one abstraction layer and using git fetch-pack instead, but I've not experimented with that.

yhakbar added the enhancement and preserved labels on Mar 6, 2025
