fix(macOS): reap entire codegraph process tree on exit (Setpgid + negative-PID kill) #3735

Closed
ttmouse wants to merge 1 commit into esengine:main-v2 from ttmouse:pr/fix-codegraph-orphan-processes

Conversation

ttmouse (Contributor) commented Jun 9, 2026

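Problem

On macOS/Linux, codegraph's grandchild processes linger after Reasonix exits. More than 40 orphaned processes accumulated, causing severe system slowdown.

Root cause

KillTree in internal/proc/kill_other.go only calls cmd.Process.Kill(), which kills just the direct child (codegraph's shell launcher) and not the node runtime and worker processes that the launcher starts.

By contrast, the Windows version in internal/proc/kill_windows.go uses a Job Object (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE) plus taskkill /T, which cleans up the whole process tree, so it does not have this problem.

Fix

Two changes, reusing a pattern already present in the codebase (bash_kill_other.go / shell_kill_other.go):

  1. internal/proc/kill_other.go: in KillTree, replace cmd.Process.Kill() with syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL) so the entire process group is killed
  2. internal/plugin/transport_stdio.go: in newStdioTransport, set cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} before cmd.Start(), and set a cmd.Cancel handler that kills the process group

This way the codegraph process tree is fully cleaned up whether Reasonix exits normally or the context is cancelled.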

Fixes #3734

github-actions bot added the v2 and mcp labels Jun 9, 2026
ttmouse force-pushed the pr/fix-codegraph-orphan-processes branch from 7d42f64 to 86321f7 June 9, 2026 17:00
ttmouse changed the title from "fix(macos): kill codegraph orphan processes on exit (Setpgid + kill process group)" to "fix(macOS): reap entire codegraph process tree on exit (Setpgid + negative-PID kill)" Jun 9, 2026
esengine (Owner) commented

Thanks — the Setpgid + negative-PID reaping is the right approach for the macOS/Unix process tree. One blocker: the Windows vet/build fails because syscall.SysProcAttr.Setpgid and syscall.Kill are Unix-only and they're referenced directly in internal/plugin/transport_stdio.go (no build constraint):

internal/plugin/transport_stdio.go:65: unknown field Setpgid in struct literal of type syscall.SysProcAttr
internal/plugin/transport_stdio.go:70: undefined: syscall.Kill

Could you move the process-group setup + group-kill behind a build tag — e.g. a transport_stdio_unix.go (//go:build !windows) with the Setpgid/Kill(-pgid) path and a transport_stdio_windows.go stub (or route it through the existing internal/proc platform split that kill_other.go is part of)? Once Windows builds clean the rest looks good.

ttmouse force-pushed the pr/fix-codegraph-orphan-processes branch from 86321f7 to 78ecb8e June 10, 2026 04:08
…ative-PID kill)

On Unix (macOS/Linux), KillTree only killed the direct child process
(cmd.Process.Kill()), leaving orphan grandchildren alive — e.g. the
codegraph launcher shell's bundled node runtime. Over time, 40+ orphan
processes accumulated, causing severe system slowdown.

Fix: Set Setpgid on the child before starting it, making it the leader
of a new process group. KillTree (and the Cancel handler) now uses
syscall.Kill with a negative PID to reap the entire process group.

This matches the pattern already proven in bash_kill_other.go and
shell_kill_other.go, and is symmetric with the Windows Job Object
approach in kill_windows.go.
ttmouse force-pushed the pr/fix-codegraph-orphan-processes branch from 78ecb8e to 509aa55 June 10, 2026 04:15

ttmouse (Contributor, Author) commented Jun 10, 2026

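Update notes

Adjustments made in response to the review feedback:

1. Windows build fix
The direct references to syscall.SysProcAttr.Setpgid and syscall.Kill in transport_stdio.go have moved into the internal/proc package, which already splits on //go:build !windows / //go:build windows:

  • kill_other.go (Unix): adds SetProcessGroupKill(cmd), which sets Setpgid and a cmd.Cancel that kills the whole process group via a negative PID
  • kill_windows.go (Windows): adds SetProcessGroupKill(cmd) as a no-op, since Windows is already covered by the Job Object mechanism
  • transport_stdio.go: drops the "syscall" import and calls proc.SetProcessGroupKill(cmd) instead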

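2. Latent bug in EnsureInit (found during code review)
EnsureInit in internal/codegraph/codegraph.go sets a custom cmd.Cancel that calls KillTree, but never sets Setpgid. This means:

  • syscall.Kill with a negative PID finds no such process group and fails silently with ESRCH
  • the custom Cancel overrides Go's default kill of the direct child, so the child is not killed after a context timeout
  • Fixed: proc.SetProcessGroupKill(cmd) is now called before cmd.Cancel is set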

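3. KillTree error logging
KillTree no longer silently swallows syscall.Kill failures; it now emits an slog.Warn with the pid and the error, which should make similar problems easier to debug in the future.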

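On unit tests

The review also suggested adding a unit test for SetProcessGroupKill, but verifying that function requires starting a real child process (shell → sleep) and checking process-group behavior. That is a syscall-level integration test, beyond the pattern of the existing TestKillTreeTerminatesChild in kill_other_test.go. The existing kill-path tests plus go vet on both platforms provide sufficient coverage. Other ideas are welcome.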

esengine (Owner) commented

Thanks @ttmouse — re-landed as #3787, with credit. Your Setpgid + negative-pid approach is exactly right; I just integrated it into the StartTracked path that #3755 added for the Windows Job Object, so the two reaping mechanisms sit side by side (group kill off Windows, Job Object on Windows) and transport_stdio needs no per-call change. Added a grandchild-reaping regression test too. Appreciate the fix!

Labels

mcp: MCP servers / plugins (internal/plugin, codegraph)
v2: Go rewrite (1.x), main-v2 branch, active development

Development

Successfully merging this pull request may close these issues.

macOS: codegraph processes linger after Reasonix exits, causing system slowdown

2 participants