-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid reopening packfile on every object access #852
Conversation
c25ae03
to
d0a9a68
Compare
d0a9a68
to
aaeeb64
Compare
// It copies the relevant logic from git.PlainOpen, and enables the filesystem KeepDescriptors option. This | ||
// caches the packfile handles, preventing the packfile from being opened then closed on every object access. | ||
func openRepo(repoDir string) (*git.Repository, io.Closer, error) { | ||
fs := osfs.New(repoDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I included all the parts from git.PlainOpen
that seem relevant. I also checked the go-git
examples for KeepDescriptors
, and this is almost identical: https://github.com/go-git/go-git/blob/3a210b5157e7d2a57eeac29e492efa510d616a24/_examples/ls/main.go#L32-L42
However git.PlainOpen
is quite complex and I could be missing something. Here's what I purposefully omitted:
- Special handling for user home shortcuts ... I tested and this was not necessary 🤔
- Detecting when
.git
is a file that points to another git directory (like if you're trying to directly index submodule directory). I don't think we do this in indexing, or ever would.
I've tested zoekt-git-index
with all the following:
- Repos with and without
.git
subdirectories - Relative paths like
../megarepo
- Shortcut for user home like
~/code/megarepo
- Updating Sourcegraph to use this version and reindexing a bunch of repos locally
} | ||
|
||
s := filesystem.NewStorageWithOptions(fs, cache.NewObjectLRUDefault(), filesystem.Options{ | ||
KeepDescriptors: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the point of the whole change :) This also sets me up to set LargeObjectThreshold
in a follow-up PR, which could prove even more important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! This aligns with earlier research I did on go-git when analysing it.
On your macbook, could you maybe share the output of zoekt-git-index on the megarepo before and after this change, but prefixed with /usr/bin/time -al
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great tests!
This command is great, thanks for the pointer. My interpretation of the results below:
Before
After
|
Latest changes:
|
executeCommand(t, repoDir, exec.Command("git", "config", "user.name", "Thomas")) | ||
executeCommand(t, repoDir, exec.Command("git", "config", "user.email", "[email protected]")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice way to do it in a repo. FYI in the sourcegraph repo what we often do is set these envvars, which also allow you to get deterministic commits since it controls the timestamps:
c.Env = []string{
"GIT_CONFIG=" + path.Join(dir, ".git", "config"),
"GIT_COMMITTER_NAME=a",
"[email protected]",
"GIT_COMMITTER_DATE=2006-01-02T15:04:05Z",
"GIT_AUTHOR_NAME=a",
"[email protected]",
"GIT_AUTHOR_DATE=2006-01-02T15:04:05Z",
}
The setting of GIT_CONFIG is useful for the init command, since then it also avoids reading the users git config which can affect things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer, this is nicer.
By default go-git maintains an LRU cache of git objects of size 96MB. When an object's contents are loaded, it's stored as a MemoryObject in this cache. This cache is not super useful in the indexing access pattern, which accesses each file only once. And in many profiles, we see a substantial number of allocations from these memory objects. This PR disables caching for most git objects by setting LargeObjectThreshold: 1. go-git still proactively caches packfile objects under 16KB (see smallObjectThreshold here). Follow up to #852. This change is also gated by the ZOEKT_ENABLE_GOGIT_OPTIMIZATION feature flag.
By default, the
go-git
library will open the packfile on every call toRepository.BlobObject
, then close it. During indexing, we collect the list of files to index, then iterate through each one callingRepository.BlobObject
. So on every object access the packfile reopened, andgo-git
reallocates some in-memory buffers.This PR bypasses
git.PlainOpen
to allow us to enable theKeepDescriptors
option. This option keeps packfile files open, and caches wrappers for them. The files then need to be explicitly closed when done with the repo.Benefits:
Relates to SPLF-636