Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

describe and diff: implement --no-optional-locks #1872

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

bgw
Copy link

@bgw bgw commented Mar 5, 2025

git describe and git diff may update the index in the background for
similar performance reasons to git-status. These patches implement the
--no-optional-locks option for these commands, which allows scripts to
bypass this behavior.

I'm implementing this as a solution to a stale lockfile issue we've
sporadically encountered due to a build script that runs git describe.
We're mitigating this issue by using libgit2 in our build script,
which does not have the same background update behavior for its
git_describe_workdir implementation, but it would be nice to have this
option supported in the git CLI.

Tests and documentation changes are included!

cc: Jeff King [email protected]

Copy link

gitgitgadget bot commented Mar 5, 2025

Welcome to GitGitGadget

Hi @bgw, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that either:

  • Your Pull Request has a good description, if it consists of multiple commits, as it will be used as cover letter.
  • Your Pull Request description is empty, if it consists of a single commit, as the commit message should be descriptive enough by itself.

You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <[email protected]>, Ill Takalook <[email protected]>

NOTE: DO NOT copy/paste your CC list from a previous GGG PR's description,
because it will result in a malformed CC list on the mailing list. See
example.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@gitgitgadget gitgitgadget bot added the new user label Mar 5, 2025
Copy link

gitgitgadget bot commented Mar 5, 2025

There are issues in commit 77077aa:
describe,diff: Implement --no-optional-locks
Prefixed commit message must be in lower case

@bgw bgw force-pushed the bgw/diff-describe-optional-locks branch 3 times, most recently from 908b66e to 4b980d2 Compare March 5, 2025 00:47
@dscho
Copy link
Member

dscho commented Mar 5, 2025

/allow

Copy link

gitgitgadget bot commented Mar 5, 2025

User bgw is now allowed to use GitGitGadget.

bgw added 2 commits March 5, 2025 21:11
When used with `--dirty`, describe may update the index in the
background for similar performance reasons to `git-status`.

This commit implements the `--no-optional-locks` option for
`git describe`, which allows scripts to bypass this behavior.

Signed-off-by: Benjamin Woodruff <[email protected]>
When used with `autoRefreshIndex`, `git diff` may update the index in
the background for similar performance reasons to `git-status`.

This commit implements the `--no-optional-locks` option for `git diff`,
which allows scripts to bypass this behavior.

Signed-off-by: Benjamin Woodruff <[email protected]>
@bgw bgw force-pushed the bgw/diff-describe-optional-locks branch from 4b980d2 to 3ba3113 Compare March 6, 2025 05:13
@bgw
Copy link
Author

bgw commented Mar 6, 2025

/preview

Copy link

gitgitgadget bot commented Mar 6, 2025

Preview email sent as [email protected]

@bgw bgw changed the title describe,diff: Implement --no-optional-locks describe,diff: implement --no-optional-locks Mar 6, 2025
@bgw bgw changed the title describe,diff: implement --no-optional-locks describe and diff: implement --no-optional-locks Mar 6, 2025
@bgw
Copy link
Author

bgw commented Mar 6, 2025

/submit

Copy link

gitgitgadget bot commented Mar 6, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1872/bgw/bgw/diff-describe-optional-locks-v1

To fetch this version to local tag pr-1872/bgw/bgw/diff-describe-optional-locks-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1872/bgw/bgw/diff-describe-optional-locks-v1

Copy link

gitgitgadget bot commented Mar 6, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Benjamin Woodruff via GitGitGadget" <[email protected]>
writes:

> git describe and git diff may update the index in the background for similar
> performance reasons to git-status.

That is a wrong reasoning that is completely opposite, though.

The commands at the Porcelain level, like "status" and "diff",
refresh the index for the CORRECTNESS purposes.

The commands at the plumbing level, which are designed to be used in
your own scripts, like "diff-files" and "diff-index", do not refresh
the index for the performance purposes.  If your own script wants to
produce correct result, your script IS responsible for refreshing
the index after its last modification to working tree files before
it starts to use the plumbing commands to inspect which ones are
modified and which ones are not.  This is so that your script has
more control over when the index is refreshed.  It does not have to
pay cost to run refresh for each Git command it invokes, if it knows
that it does not make any modification between the two invocations;
it can instead refresh just once and then run these two plumbing
commands.

So, it would be absolute no-no to make the Porcelain commands like
"describe" and "diff" not to refresh the index before they work by
default.  As an optional behaviour, it might be acceptable if there
is no other reasonable solutions (like, using plumbing commands if
the callers are not you typing but your scripts calling them).

Copy link

gitgitgadget bot commented Mar 9, 2025

On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Mar 06, 2025 at 08:11:09AM -0800, Junio C Hamano wrote:

> "Benjamin Woodruff via GitGitGadget" <[email protected]>
> writes:
> 
> > git describe and git diff may update the index in the background for similar
> > performance reasons to git-status.
> 
> That is a wrong reasoning that is completely opposite, though.
> 
> The commands at the Porcelain level, like "status" and "diff",
> refresh the index for the CORRECTNESS purposes.

Right, but "status" supports --no-optional-locks already.

So I think these patches are wrong, but the goal is reasonable. What
git-status does with --no-optional-locks is to update the index
internally for its _own_ use (giving it the correct results), but not to
lock nor write out the resulting index (to avoid conflicting with other
running programs). So it's pessimal (losing the opportunity to share
what it learned) but prevents lock contention.

And I think other programs could do that. I even wrote back in
27344d6a6c (git: add --no-optional-locks option, 2017-09-27):

  I've punted here on finding more callers to convert, since "status" is
  the obvious one to call as a repeated background job. But "git diff"'s
  opportunistic refresh of the index may be a good candidate.

I must admit that I didn't imagine "describe" is something that somebody
would run a lot in the background, but there's not really any harm in
having it support optional locks if somebody cares about that case.

> The commands at the plumbing level, which are designed to be used in
> your own scripts, like "diff-files" and "diff-index", do not refresh
> the index for the performance purposes.  If your own script wants to
> produce correct result, your script IS responsible for refreshing
> the index after its last modification to working tree files before
> it starts to use the plumbing commands to inspect which ones are
> modified and which ones are not.  This is so that your script has
> more control over when the index is refreshed.  It does not have to
> pay cost to run refresh for each Git command it invokes, if it knows
> that it does not make any modification between the two invocations;
> it can instead refresh just once and then run these two plumbing
> commands.

Assuming --no-optional-locks is being used as intended, the idea is that
the script cannot touch the index at all, because it is running in the
background and does not want lock contention with things the user is
doing in the foreground. So it cannot naively do a single index refresh
followed by plumbing commands like diff-files. Either it must:

  - accept that diff-files might return stat-dirty results (yuck)

  - use its own index that is separate from the regular .git/index file.
    But that may be overly slow, since the index "update" would rewrite
    the whole thing from scratch. Of course all of our index writes are
    from scratch, but you'd pay the price even when there is nothing to
    update.

  - use a command which operates all in a single process, with an
    in-memory index that is updated but not written out (e.g., "git
    --no-optional-locks status").

-Peff

Copy link

gitgitgadget bot commented Mar 9, 2025

User Jeff King <[email protected]> has been added to the cc: list.

Copy link

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes:

> .... What
> git-status does with --no-optional-locks is to update the index
> internally for its _own_ use (giving it the correct results), but not to
> lock nor write out the resulting index (to avoid conflicting with other
> running programs). So it's pessimal (losing the opportunity to share
> what it learned) but prevents lock contention.

Yup, that sounds somewhat sensible.  I also have to wonder, other
than commands that are clearly about changing the repository state
like "add", the inspection commands like diff and status should
always opportunistically write the index back, without even being
asked?

> ... Either it must:
>
>   - accept that diff-files might return stat-dirty results (yuck)
>
>   - use its own index that is separate from the regular .git/index file.
>     But that may be overly slow, since the index "update" would rewrite
>     the whole thing from scratch. Of course all of our index writes are
>     from scratch, but you'd pay the price even when there is nothing to
>     update.
>
>   - use a command which operates all in a single process, with an
>     in-memory index that is updated but not written out (e.g., "git
>     --no-optional-locks status").

Copy link

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Mar 10, 2025 at 05:25:32AM -0700, Junio C Hamano wrote:

> Jeff King <[email protected]> writes:
> 
> > .... What
> > git-status does with --no-optional-locks is to update the index
> > internally for its _own_ use (giving it the correct results), but not to
> > lock nor write out the resulting index (to avoid conflicting with other
> > running programs). So it's pessimal (losing the opportunity to share
> > what it learned) but prevents lock contention.
> 
> Yup, that sounds somewhat sensible.  I also have to wonder, other
> than commands that are clearly about changing the repository state
> like "add", the inspection commands like diff and status should
> always opportunistically write the index back, without even being
> asked?

Yeah, certainly it is a potential source of confusion to have a
conceptually read-only operation take locks or modify the repo state.

But I'm not sure we have a sense of how valuable that optimization is in
practice. After touching some files, every git-status, git-diff, etc
would end up re-hashing those files instead of using the stat cache.
But maybe that is lost in the noise of reading the files to actually do
diffs, etc? I dunno. I expect it is more important for status, which
probably does not need to read the whole file contents in most cases
(and which may be run a lot from the user's prompt, etc).

It seems like a big and possibly risky departure from what we've done
for so many years. I'm inclined not to rock the boat too much. ;)

-Peff

Copy link

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes:

> But maybe that is lost in the noise of reading the files to actually do
> diffs, etc? I dunno. I expect it is more important for status, which
> probably does not need to read the whole file contents in most cases
> (and which may be run a lot from the user's prompt, etc).

Yeah, and old timers who run "diff --raw" as if it were a quick
analogue for "status" also would notice.

> It seems like a big and possibly risky departure from what we've done
> for so many years. I'm inclined not to rock the boat too much. ;)

Certainly not right now.  But adding a command line option is even
worse as we would have to carry the support for it for practically
forever X-<.

Copy link

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, "Benjamin Woodruff" wrote (reply to this):

On Mon, Mar 10, 2025, at 11:53 AM, Junio C Hamano wrote:
> Jeff King <[email protected]> writes:
>
>> But maybe that is lost in the noise of reading the files to actually do
>> diffs, etc? I dunno. I expect it is more important for status, which
>> probably does not need to read the whole file contents in most cases
>> (and which may be run a lot from the user's prompt, etc).
>
> Yeah, and old timers who run "diff --raw" as if it were a quick
> analogue for "status" also would notice.

Thanks for your reviews, Junio and Jeff.

>>>> I must admit that I didn't imagine "describe" is something that
>>>> somebody would run a lot in the background

It might help if I start with some more context of why I wrote this
patch. We've got a tool that uses the Rust `vergen-gitcl` crate to call
`git describe --dirty`. You can see that code here:
https://github.com/vercel/next.js/pull/76889/files

We use `vergen-gitcl` to generate version identifiers for an on-disk
cache. This cache stores results of thousands of functions and has no
backwards compatibility. We want to invalidate it when *any* of the code
changes. `git-describe` felt like a good fit for that, as it gives us a
unique identifier that's still reasonably user-friendly.

However, we discovered that we'd frequently end up with stale git
lockfiles. This appeared to be due some combination of IDE tools that
run the build in the background (i.e. the rust-analyzer LSP), behavior
that causes builds to sometimes get killed before completion, and the
fact that `git describe --dirty` takes a lock.

I've worked around this on our end, by re-implementing `describe`'s
`--dirty` flag on top of `status`:
<https://github.com/rustyhorde/vergen/pull/406>

So, from my end, there's no urgency to get this change in, or to add
support for `diff` (we only care about `describe`). Rather, I felt like
this might be a footgun for other users, and wanted to do the right
thing by submitting an upstream change.

>>> git describe and git diff may update the index in the background for
>>> similar performance reasons to git-status.
>>
>> That is a wrong reasoning that is completely opposite, though.
>> 
>> The commands at the Porcelain level, like "status" and "diff",
>> refresh the index for the CORRECTNESS purposes.
>
> Right, but "status" supports --no-optional-locks already.

Does this mean the documentation in `git-status` is incorrect? It
implies that the background refresh is only for performance reasons.
That's where I got this idea from:
<https://git-scm.com/docs/git-status#_background_refresh>

It's also worth noting that libgit2 does not do this background refresh
by default (`GIT_DIFF_UPDATE_INDEX` and `GIT_STATUS_OPT_UPDATE_INDEX`).
I think that makes sense for libgit2's typical use-cases, but it is a
divergence in behavior.

>> Yeah, certainly it is a potential source of confusion to have a
>> conceptually read-only operation take locks or modify the repo state.
>>
>> [...]
>>
>> It seems like a big and possibly risky departure from what we've done
>> for so many years. I'm inclined not to rock the boat too much. ;)
>
> Certainly not right now.  But adding a command line option is even
> worse as we would have to carry the support for it for practically
> forever X-<.

What about if we reduce this to documentation changes? That alone
would've saved me a lot of pain of trying to figure out what does and
doesn't take a lock:

- Explicitly document that `--no-optional-locks` only changes behavior
  for `git status`.

- Document that the `--dirty` flag on `describe` will lead to
  `status`-like background refresh behavior.

- Change the documentation for status's background refresh to indicate
  that there is a subtle difference in behavior caused by
  `--no-optional-locks`? I lack sufficient context on how best to
  describe or explain this.

- Leave `git-diff` documentation as-is. I think the current
  documentation of `diff.autoRefreshIndex` is sufficient?

Copy link

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Benjamin Woodruff" <[email protected]> writes:

>>> The commands at the Porcelain level, like "status" and "diff",
>>> refresh the index for the CORRECTNESS purposes.
>>
>> Right, but "status" supports --no-optional-locks already.
>
> Does this mean the documentation in `git-status` is incorrect? It
> implies that the background refresh is only for performance reasons.

It has to do, and it does, refresh the in-core index for correctness
reasons.  Writing the result out to the on-disk file for _other_
processes that will use the index later is a performance hack, and
may or may not happen (it is up to what repo_update_index_if_able()
does).

If we do not care about redundant refreshing these other processes
need to do before they start, we do not have to write the resulting
in-core index out to the disk.

Writing the in-core index out for other processes has one downside,
which is that for correctness it has to be done under lock.  We
cannot say "we tried to be nice to others by writing the index file
out, but we just found out that there is a competing process trying
to open/write the index so let's skip writing it and let them do
their thing---they are responsible for refreshing the index for
their own use".  There needs some coordination, i.e. when we start
writing the result of our refreshing the index for others, if other
processes need to pay attention to that fact and wait for a while
before reading the index, then they benefit because they do not have
to refresh the index themselves.  Even though our programs, by
calling hold_lock_file_for_update(), know how to wait for a bit when
this happens, not everybody cooperates.

Copy link

gitgitgadget bot commented Mar 11, 2025

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Mar 10, 2025 at 01:50:36PM -0700, Benjamin Woodruff wrote:

> It might help if I start with some more context of why I wrote this
> patch. We've got a tool that uses the Rust `vergen-gitcl` crate to call
> `git describe --dirty`. You can see that code here:
> https://github.com/vercel/next.js/pull/76889/files
> 
> We use `vergen-gitcl` to generate version identifiers for an on-disk
> cache. This cache stores results of thousands of functions and has no
> backwards compatibility. We want to invalidate it when *any* of the code
> changes. `git-describe` felt like a good fit for that, as it gives us a
> unique identifier that's still reasonably user-friendly.
> 
> However, we discovered that we'd frequently end up with stale git
> lockfiles. This appeared to be due some combination of IDE tools that
> run the build in the background (i.e. the rust-analyzer LSP), behavior
> that causes builds to sometimes get killed before completion, and the
> fact that `git describe --dirty` takes a lock.

Yeah, that is not quite the original use case that --no-optional-locks
was designed for (i.e., simultaneous contention), but I think it is a
reasonable application of the flag.

> >>> git describe and git diff may update the index in the background for
> >>> similar performance reasons to git-status.
> >>
> >> That is a wrong reasoning that is completely opposite, though.
> >> 
> >> The commands at the Porcelain level, like "status" and "diff",
> >> refresh the index for the CORRECTNESS purposes.
> >
> > Right, but "status" supports --no-optional-locks already.
> 
> Does this mean the documentation in `git-status` is incorrect? It
> implies that the background refresh is only for performance reasons.
> That's where I got this idea from:
> <https://git-scm.com/docs/git-status#_background_refresh>

I think Junio gave an explanation here, so I won't repeat that. But I
also think both of us may have been a bit confused about the changes
your patches are making, because there's some subtlety.

The important thing to keep in mind is that there are _two_ steps:
refreshing the in-core index and writing the result out to the on-disk
file. With --no-optional-locks we must continue to do the first step
(for correctness), and skip the second step.

So looking at your patch 1/2 for git-describe, it is doing the right
thing: we still call refresh_index() always, and only skip the calls to
repo_hold_locked_index() and repo_update_index_if_able().

But one thing that puzzles me is that we read and refresh the index
first and only _then_ take a lock. Which seems wrong to me, as we could
racily overwrite an intermediate write from somebody else that we never
even saw (e.g., imagine you call "git add" at just the wrong moment).

That is not a bug in your code, but an existing problem that I think
made it harder to understand your change (and probably one we should
fix regardless).

Your patch 2/2 for git-diff is what I thought was actually wrong, but
after digging further, I'm not so sure.

In your patch we return early from refresh_index_quietly(), without
actually refreshing the in-core index. So I _thought_ that meant we'd
produce a wrong answer for something like this:

  $ touch git.c
  $ ./git --no-optional-locks diff

where we should report "no changes", but would instead find the
stat-dirty git.c (just like a plumbing "git diff-files" would). But
that doesn't happen!

That's because refresh_index_quietly() runs after the diff has completed
anyway. The real magic is in diffcore_skip_stat_unmatch(), which
processes individual stat-dirty entries and suppresses them (when
there's no actual content change).

So the call in refresh_index_quietly() really is just about updating
what we're about to write out, and your patch is correct to bail from
the whole function (if we are not writing it out, there is no purpose in
refreshing at that point).

So as far as I can tell the patches are doing the right thing. But I
think the commit messages probably need to describe those subtleties and
argue that the change is correct. Bonus points if a preparatory patch
fixes the race in git-describe. ;)

> It's also worth noting that libgit2 does not do this background refresh
> by default (`GIT_DIFF_UPDATE_INDEX` and `GIT_STATUS_OPT_UPDATE_INDEX`).
> I think that makes sense for libgit2's typical use-cases, but it is a
> divergence in behavior.

Yes, I think that is probably a reasonable default for libgit2, where
you'd expect everything to happen in a single process (that can share
the in-core index).

-Peff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants