Git.execute's kill_after_timeout callback assumes procps #1756
Comments
Thanks a lot for bringing this up, and for all the detective-work that went into this incredibly thorough analysis! From today's point of view, I think trying to kill child processes like that is utterly unacceptable as it's obviously racy. From my experience, on macOS, sending a signal to the parent process also sends it to child processes. Thus, I truly hope there are ways to use signals properly instead of trying to be even more elaborate here. If for some reason it's not possible to send signals to the parent process, one could at least get the IDs of spawned child processes for killing them later. The Rust standard library makes it as easy as calling All in all, I really hope that
Just wanted to amend that I love this approach - reverse shells are so powerful and even though I never tried it, I absolutely will once an opportunity presents itself. Thanks so much for sharing! |
I don't think sending a signal to a specific process should automatically cause its children to receive it. Is it possible that you were sending it in a way that really sent it to a process group? For example, pressing Ctrl+C in a terminal sends This small C program demonstrates that, with a process tree P → Q → R, when P terminates Q, R continues running. This seems to work the same on macOS as other Unix-like systems. However, none of what I am saying here is necessarily applicable to Windows.
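The same P → Q → R experiment can also be sketched in Python. This is not the C program mentioned above, only a rough equivalent that assumes a POSIX system; `Q_SOURCE` and `grandchild_survives` are names made up for this illustration.

```python
import os
import signal
import subprocess
import sys

# Q: starts R (a long sleep), reports R's PID on stdout, then sleeps itself.
Q_SOURCE = """
import subprocess, sys, time
r = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"],
                     stdout=subprocess.DEVNULL)
print(r.pid, flush=True)
time.sleep(60)
"""


def grandchild_survives():
    """As P, kill Q with SIGKILL and report whether Q's child R is still running."""
    q = subprocess.Popen([sys.executable, "-c", Q_SOURCE],
                         stdout=subprocess.PIPE, text=True)
    r_pid = int(q.stdout.readline())  # Q tells us R's PID
    q.kill()  # sends SIGKILL to Q only (POSIX)
    q.wait()
    q.stdout.close()
    try:
        os.kill(r_pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    os.kill(r_pid, signal.SIGKILL)  # clean up R, which nobody else will kill
    return True
```

On the systems discussed here, this should report that R is still running after Q is killed, since the signal goes to Q alone and not to a process group.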
GitPython's own PID is really the PID of whatever application is using the GitPython library, so I think anything that could cause all of that program's subprocesses to be terminated, including subprocesses that are not related to GitPython, should be strongly avoided.
If you're talking about direct children of the process using GitPython (the This is in contrast to the handling of indirect subprocesses, where because the process in which GitPython runs did not create them, it is not the process that waits for them, and therefore it cannot ensure they have not died and been waited on, when sending signals to them by PID.
Yes, GitPython is using Lines 1025 to 1030 in f0e7e41
You're welcome! In case you're interested, the repository for that C program I mentioned has the tmate debugger set up for optional use when using the |
Thanks so much for the clarification! Indeed, I believe I was confused by process groups which naturally do the right thing. And even if that wouldn't happen, sub-processes would naturally shut down once they fail to write their output as the parent process closed the pipe. And yes, it's quite foolish to assume a library should send signals to its parent process under the assumption it owns it. However, it seems like process-groups, assuming these are inherited automatically even for processes that aren't spawned by GitPython, could be a good way to send signals to everyone concerned in a race-free manner. Probably there are a lot of intricacies to figure out, but at least in theory, it should be one way to avoid having to 'list-and-kill' anything. Even if that makes sub-processes unkillable by others, having it as an option seems like a step forward. Maybe there are other combinations of features that I am not seeing that will work, and will even be portable to Windows in some shape or form. As a personal note, it's amazing how I keep forgetting about process groups and all the 'magic' shells are doing to make process control seem so natural. It's easy to conflate this with the much more naive process control that one then applies in applications or libraries. Thanks as well for the test-program and the elaborate CI setup, it was a pleasure to take a look. |
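To make the process-group idea concrete, here is a minimal sketch, assuming a POSIX system. It is not GitPython code: `run_in_own_group` is a hypothetical helper, and it uses `start_new_session=True` (which calls `setsid()` in the child) as one way to give the subprocess its own process group.

```python
import os
import signal
import subprocess


def run_in_own_group(argv, timeout):
    """Run argv in a new process group; on timeout, kill the whole group.

    Hypothetical helper for illustration only (POSIX only).
    Returns the direct child's return code.
    """
    # The child becomes the leader of a new session and process group,
    # so its PID doubles as the process group ID.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # proc has not been reaped yet, so its PID (and thus the PGID)
            # cannot have been reused by an unrelated process.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # nothing left in the group to signal
        return proc.wait()
```

Indirect subprocesses inherit the group unless they deliberately leave it, so no listing of PIDs is needed. Whether detaching from the caller's session is acceptable for existing GitPython users is a separate question that this sketch does not settle.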
You're welcome! Double-checking the details of this was also a good review for me.
This may be the best available option. I expect that, in addition to avoiding a race condition on PIDs, it would also allow the code to be simpler than other approaches. They will be inherited automatically by indirect subprocesses, such as when one
My guess is that the main intricacy to figure out might be whether GitPython should install handlers for signals like
Windows works differently in some significant ways, but it looks like this approach of putting the subprocess in its own process group is already being used on Windows: Lines 231 to 236 in 2b69bac
Based on the note in the linked documentation, however, I am unsure exactly what is going on there, because it doesn't seem to say
You're welcome! Thanks for taking a look.
This may not be what you're referring to, and I don't know if you had looked at it during the time I had some really overcomplicated and bad YAML code for trying to customize a step title based on the event trigger, but I have fortunately fixed that. :) |
Indeed, signal handling support is opt-in in For GitPython, it would probably be the same, but if so, the alternative code-path would have to remain the same. Then it's the question how many people opt-in to the new behaviour, even if it's better they might simply not try. It's also still a bit strange to imagine what would happen if the parent-process is terminated, even though it should have a child-process group that it manages. I would expect something to happen with that one, too, automatically, to honor the parent-child relationship.
😁 I think I'd need to write my own cross-platform shell to finally understand how all that is really working, and that's not going to happen anytime soon 😅.
Yes, I agree that this topic here is sufficiently complex to better find lower-hanging and maybe even more valuable topics to work on. I will surely be learning a lot once you do tackle this issue, so I am looking forward to when that happens. |
I wouldn't expect anything to happen automatically. Unless you mean you would expect the parent process to install a signal handler to do something on a best-effort basis. Although this is a contrived example, this sort of thing can be useful:
Whether or not a child process is in a new process group, I don't think it would be a generally desirable default for the system to send a signal to it automatically when its parent terminates. This at least seems to me to be at odds with the Unix design decisions made in how signals work with parent and child processes. The way the parent-child relationship is reflected in signal handling is instead the other way around: when a child process exits (including due to receiving a signal), is stopped, or is resumed, its parent is sent
I think that, in practice, if by cross-platform you mean "not just Unix," then cross-platform shells abstract away from the way processes and signals work, for at least some of the operating systems they target. For Unix-style shells, the high-level abstractions shells provide, like pipes and jobs, tend to correspond roughly to the abstractions provided by the kernel. Native ports of such shells to Windows work around the gaps one way or another. Often the ports are not native, but instead rely on a separate translation layer, like It is true, though, that writing a shell should entail engaging with any of those details that are not handled by a translation layer, on whatever platforms are targeted. Writing a fully functional shell is, I believe, quite difficult. Sometimes people write very basic shells. A less time-consuming option, if one is interested, might be to examine the code of a shell that is production-quality but far simpler than most shells, such as the Almquist Shell implementations
I think the greatest complexity pertains to avoiding a breaking change, predicting the effect on existing use, and dealing with unanticipated breakages (that might be a result of an unintended breaking change and thus a bug, or a result of ill-founded assumptions baked into some code that uses GitPython, or a combination of the two). It might be worthwhile to insert an additional caveat into the part of the In addition, since really improving this area of the code may wait a while, I may try to figure out a way to make clearer that the code in Running This would still not be a problem when actually running in a Cygwin build of the Python interpreter, because facilities such as
I don't promise definitely to do so. For one thing, perhaps someone else will come along and contribute the improvement first! If not, however, then I hope to do it eventually. |
Thanks so much for sharing your insights on signals and shells. As always, way ahead of me :)!
This feels like a good first step towards enabling changes to that machinery, and to fill in documentation that better explains the current implementation along with its shortcomings that one way or another people might rely upon. Of course, it's only good if people read it and check its applicability to their own usage, and even then it will be unclear if they unknowingly rely on side-effects. No matter how I think about it, avoiding accidental breakage or surprises when touching this topic seems like a gamble for which one would wish to have a beta-track of sorts, or opt-in to possible future features, in the hopes that people use such a track and provide feedback. I let that topic rest here :D.
This sounds like it's definitely valuable! |
This changes the code in Git.execute's local kill_process function, which it uses as the timed callback for kill_after_timeout, to remove code that is unnecessary because kill_process doesn't support Windows, and to avoid giving the false impression that its code could be used unmodified on Windows without serious problems. - Raise AssertionError explicitly if it is called on Windows. This is done with "raise" rather than "assert" so its behavior doesn't vary depending on "-O". - Don't pass process creation flags, because they were 0 except on Windows. - Don't fall back to SIGTERM if Python's signal module doesn't know about SIGKILL. This was specifically for Windows which has no SIGKILL. See gitpython-developers#1756 for discussion.
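The first point of that change can be illustrated with a small sketch. This is not the actual `kill_process` from `git/cmd.py`, which is a local function that also handles child processes; the `platform` parameter is an assumption added here so the guard can be exercised on any system.

```python
import os
import signal
import sys


def kill_process(pid, platform=sys.platform):
    """Sketch of a timer callback that refuses to run on native Windows."""
    if platform == "win32":
        # An explicit raise, unlike an `assert` statement, is not stripped
        # when Python runs with -O.
        raise AssertionError("kill_process is not supported on Windows")
    # No fallback to SIGTERM: every supported platform has SIGKILL.
    os.kill(pid, signal.SIGKILL)
```

The attribute `signal.SIGKILL` is only looked up after the guard, so the function can still be defined on a platform that lacks it.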
In #1756 (comment) I had mentioned:
It turns out that's an understatement. |
Background
Calling `Git.execute` (whether directly, or indirectly by calling the dynamic attributes of a `Git` instance) and passing `kill_after_timeout` with a non-`None` value creates a timer on a separate thread that calls the local `kill_process` function. This callback function uses `os.kill` to kill the process. Before killing the process, it enumerates the process's direct children. If sending the first signal succeeds (basically, if the parent process still existed), it also attempts to kill the child processes.

The children are enumerated with `ps --ppid`:

GitPython/git/cmd.py
Lines 1010 to 1014 in fe082ad
The problem
The `--ppid` option is not POSIX. Most GNU/Linux systems have procps, whose `ps` implementation supports `--ppid`. I am unsure if any other implementations of `ps` support it. The procps tools generally run only on Linux-based systems, because they use the `/proc` filesystem (and assume it is laid out as in Linux). Although they can run on any such system, Alpine Linux and some minimal GNU/Linux environments do not ship them, defaulting to `ps` from busybox instead.

As demonstrated below, macOS `ps` does not support `--ppid`. Nor do FreeBSD, NetBSD, OpenBSD, or DragonFly. AIX does not have `--ppid`. illumos does not have `--ppid`; nor does Solaris, though `-ppid` (with one `-`) can be used in 11.4.27 or higher. Although Cygwin mimics Linux where feasible, its `/proc` filesystem is different, and its `ps` does not support `--ppid` either (nor even some important POSIX options like `-o`).

The callback parses stdout from that `ps` command, but does not examine the exit status or stderr. The effect on a system without procps (or another `ps` supporting `--ppid`, if there is one) is that an error message is printed and the parent process is still sent `SIGKILL`, but its children are never found or sent signals.

As detailed below, although the `fetch`, `pull`, and `push` methods of the `Remote` class accept a `kill_after_timeout` argument, they do not use `Git.execute`, so they are unaffected by this bug.

Steps to reproduce
On macOS 13 (on a GitHub Actions CI runner with tmate), I created this script in a directory in `$PATH`, named it `git-sleep`, and marked it executable:

Then I called `sleep` on a `Git` instance with a `kill_after_timeout` argument specifying a shorter duration than the sleep:

Impact
1. The other `kill_after_timeout` is unaffected

There are two callables defined in `git/cmd.py` that accept an optional `kill_after_timeout` argument: the "internal" top-level `handle_process_output` function that is not listed in `__all__` but is used throughout GitPython, and the public `Git.execute` method (also used when dynamic `Git` methods are called). The meaning of this argument is subtly different, and the associated implementations completely different.

This bug affects only the one in the `Git` class. Thus it does not affect common uses of timeouts in interacting with remotes: the `Remote.fetch`, `Remote.push`, and `Remote.pull` methods accept `kill_after_timeout` arguments, but they forward them to `handle_process_output`.

2. But this one should work on all Unix-like systems
From context, I think it is unintended not to support common Unix-like systems such as macOS. The `Git.execute` docstring says "This feature is not supported on Windows" and makes no other claims about compatibility, from which I think readers will reasonably infer that other platforms are believed supported. When called on a native Windows system (not Cygwin) with a non-`None` value for `kill_after_timeout`, it raises a `GitCommandError`. Other systems, including Cygwin, raise no exception and register the `kill_process` callback. `kill_after_timeout` is thus in effect documented to work on all systems except native Windows.

3. What happens if the child processes aren't sent `SIGKILL`?

I don't know how much of a problem it is for `SIGKILL` to be sent only to the parent and not to its direct children. I am not confident I know why that is being done, as opposed to killing only the parent process, or attempting to kill its entire process tree. My guess is that this is because many `git` commands use a subprocess to do their work. If so, then it may in practice be important (in situations where people pass `kill_after_timeout`) that the child processes are killed as well.

However, `git` subprocesses do sometimes use their own subprocesses:

In that example, the `git-remote-http` process may not receive `SIGKILL`. I am unsure how much this matters, but if it is a problem, then the more severe it is, the less severe this bug is, because the intended behavior wouldn't help anyway. Likewise, in situations where killing the parent process is sufficient, this bug also does not cause a problem.

That lower descendants are not killed has been reported as #895. That was observed in GitPython 2.0.2, which had the current approach of killing just the direct child processes.
A minor race condition…
One thing I'm a little worried about is a race condition that is currently present, and that I think may not be possible to fix, but that I worry finding child processes in a more portable way may exacerbate. Unless it can be solved or mitigated more deeply, it is a reason, unrelated to performance, to prefer that a portable substitute for the existing use of `ps --ppid` not be too much slower than the current way. (I likewise worry that if the approach were changed to kill all descendants, then the added time to traverse the whole subtree might exacerbate this race condition.)

Suppose we plan to kill a process P and all its direct child processes including Q, and we find the PID of Q, but before killing Q, all the following happen:

- `init`--causing its entry in the process table to be removed and its PID to be available for use by a future process.

Then when we try to kill Q, we kill R.
This situation is rare, because in practice the time between when a process is reaped and when a new process is given its PID is only short when the process table is nearly full so the kernel has no less recently relinquished PIDs to give out. But I think it would be best to avoid increasing the risk of it.
There may be other related race conditions, but this is the one that seems it could be worsened by replacing the existing unportable use of `ps --ppid` with some other technique, if that other technique is markedly slower.

Finding/killing the subprocesses portably
I am unsure if this should be done, because it is not clear to me that killing the parent process and its direct child processes, as is currently attempted (generally successfully on GNU/Linux and unsuccessfully elsewhere), is necessarily what should happen. Doing anything else might risk incompatibility for some existing use cases on some systems, so I would want to be cautious about doing something altogether different, but I think it should still be considered before proceeding.
However, assuming the current approach of killing the child processes should be preserved, I think there are three cases:

- `pgrep`/`pkill`.
- `ps` is POSIX-compliant, or at least supports `-A` and `-o`.
- `sys.platform == "cygwin"` still covers that.)

Case 1 could be folded into case 2 if a speed regression is acceptable (but see above on the race condition), or if testing reveals using `pgrep` or `pkill` is not significantly faster. Case 3 could be dropped in favor of modifying the docstring to document that `kill_after_timeout` is less effective on Cygwin, if reducing complexity is regarded as more important than covering it.

Whether to cover case 3 or not is more a matter of code complexity than time to write and review the code. With or without it, I think most of the time and effort would be on the tests. Currently none cover passing `kill_after_timeout` to `Git.execute` or to a dynamic method of a `Git` object. Only the other `kill_after_timeout` (of `handle_process_output`) has test coverage. Because this project has CI on Cygwin, I don't think the tests have to do much to accommodate it; its challenges are ready-made.

(An alternative to dealing with these details is to use psutil, but I'm unsure if the impact of this issue is sufficient to justify adding it as a dependency. It doesn't support all systems, but systems it doesn't support are rare. I think it could be made conditional on the systems it is installable on, and the features that use it be documented as unavailable on other systems. I think this is probably not worth doing just for this, but if it turns out it would help in various other places, and it would increase rather than decrease OS compatibility (as it would here), then it might make sense to consider it. On the other hand, one benefit of GitPython is that it has very few dependencies.)
1. If we have `pgrep`/`pkill`

`pgrep` and `pkill` are not POSIX, but they are available on many more systems than `ps --ppid`. Furthermore, it is likely that all systems that support `ps --ppid` also have `pgrep` and `pkill`, because not only are they very common, but procps (which provides the only `ps` with `--ppid` I can find, as discussed above) includes an implementation of them. Of course, it's possible (odd, but possible) for a distribution to use procps for `ps` but not include `pgrep` and `pkill`. Whether `pkill` can be used to consolidate the steps, or `pgrep` must be used together with something that is already there, is a design decision that should be influenced by a decision about the best order for sending `SIGKILL`.

If it's acceptable to send `SIGKILL` to the child processes first, then instead of running `["ps", "--ppid", str(pid)]` and most of what comes after it, one option is to run `["pkill", "-P", str(pid)]` and then:

- `os.kill(pid, signal.SIGKILL)` and `kill_check.set()`.
- `pkill` not existing (or if it was checked first and found absent), proceed to case 2 (using `ps`).

If it's not acceptable to send `SIGKILL` to the child processes first, or if either order is acceptable but it is desirable to share more code with the fallback case 2, then instead of running `["ps", "--ppid", str(pid)]`, run `["pgrep", "-P", str(pid)]`, then:

- `pgrep` not existing (or if it was checked first and found absent), proceed to case 2.
- `child_pids`.
- `kill_process` function as it already exists.

2. If `ps` supports `-A` and `-o`
This is almost every Unix-like system used today; POSIX requires these options.
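A minimal sketch of this case, assuming `ps` prints the default `PID` and `PPID` headers for `-o pid,ppid`. The function names `parse_child_pids` and `find_child_pids` are hypothetical and not part of GitPython; unlike the existing callback, this version checks the exit status.

```python
import subprocess


def parse_child_pids(ps_output, parent_pid):
    """Return the PIDs whose PPID is parent_pid, from `ps -A -o pid,ppid` output."""
    lines = ps_output.splitlines()
    # Check the header to guard against an unexpectedly nonstandard ps.
    if not lines or lines[0].split() != ["PID", "PPID"]:
        raise ValueError("unexpected ps header")
    child_pids = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        pid, ppid = fields
        if int(ppid) == parent_pid:
            child_pids.append(int(pid))
    return child_pids


def find_child_pids(parent_pid):
    """Run ps and return the direct children of parent_pid."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,ppid"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a nonzero exit status
    )
    return parse_child_pids(result.stdout, parent_pid)
```

Keeping the parsing separate from the `ps` call makes it possible to test the parsing with canned output, which matters given that this code path currently has no test coverage.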
Instead of running `["ps", "--ppid", str(pid)]`, run `["ps", "-A", "-o", "pid,ppid"]`, then:

- `PID` and `PPID` to safeguard against unexpectedly nonstandard `ps`.
- `child_pids` with the PIDs from the first column.
- `kill_process` function as it already exists.

If using this is as fast as `pkill`/`pgrep`, or slower but not by a lot, or code simplicity is considered more important than the small worsening of the rare race condition, then this could be used on all systems except Cygwin. The truth is that it is only out of fear of worsening things in weird situations on GNU/Linux systems with procps that I have even proposed case 1. This is the portable way to do it (except Cygwin).

It may be possible to optimize this with `-U` to filter the real user ID to `os.getuid()`, or `-u` to filter the effective user ID to `os.geteuid()`, though `-u` seems to be an XSI extension. I don't know if this would actually make things faster. I don't know if the added complexity, though modest, would be worthwhile even if it does. When doing this, `-A` would not also be passed.

The reason not to simply omit `-A` without replacing it, which gets processes that share the caller's EUID, is that it also only shows processes with the same controlling terminal. The reason not to use `-a` instead is that it doesn't show processes not associated with any terminal. The reason I prefer `-A` to its synonym `-e` is that `-e` seems to be an XSI extension.

3. Cygwin
Running `ps` on Cygwin gives output that looks like:

This can be modified by various options, but `-o` is not supported. (`-A` is not supported either, but it is not needed.)

Instead of running `["ps", "--ppid", str(pid)]`, we can run `["ps"]`, then:

- `PID` and `PPID` headers that we are going to use, to safeguard against unexpectedly non-Cygwin `ps` or future changes to Cygwin `ps`.

Perspective
I think the what is more important than the how, because:
Test coverage
Unlike the other `kill_after_timeout` (in `handle_process_output`), the code path where `Git.execute` is passed `kill_after_timeout` has no test coverage. It would be good to test it even if this bug is not fixed. But at the root of both is figuring out if killing the parent process and its direct child processes is what is wanted.

Maintainability
The `kill_process` callback is never called on (native) Windows, where calling `Git.execute` with a non-`None` value for `kill_after_timeout` raises `GitCommandError`. But it contains what seem to be the remains of an attempt to support Windows: it passes `PROC_CREATIONFLAGS` (which is 0 except on Windows) when running `ps`, and it falls back to `signal.SIGTERM` when `signal.SIGKILL` is absent (which it is on Windows).

I discovered this whole issue because I want to remove that code, which I think could lead to future bugs, and I was looking into whether there is any reason not to. A possible reason not to is if `kill_process` can be easily modified to support Windows (which it could, if it is acceptable to kill either only the parent process, or the whole process tree, though whether it should is another question). Because figuring out what to do about this issue entails figuring that out too, it would open `kill_process` up to that improvement (dropping its vestigial Windows code if it is not going to support Windows) and possibly others.