fix: add timeout to ACP handshake to prevent gemini e2e hangs#109
fix: add timeout to ACP handshake to prevent gemini e2e hangs#109
Conversation
waitForResponse had no timeout, causing concurrent gemini sessions to hang indefinitely when the second process stalled acquiring proper-lockfile on ~/.gemini/projects.json during config.initialize(). - Add optional timeoutMs to waitForResponse; fires after configured delay and rejects with a clear error message - Kill the child process (SIGTERM then SIGKILL) when the handshake fails or times out, preventing zombie subprocesses - Thread initializeTimeoutMs from ResolvedConfig into adapterOptions at both connectBackend call sites in SessionCoordinator (createSession and the relaunch/reconnect path) - Update behavior test assertion to use objectContaining for adapterOptions since it now carries initializeTimeoutMs alongside cwd
Summary of ChangesHello @teng-lin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where concurrent Gemini sessions could hang indefinitely due to a missing timeout in the ACP handshake process. By introducing a configurable timeout and robust child process termination, it prevents deadlocks caused by file locking contention during initialization, significantly improving the stability and reliability of session management, especially in concurrent environments. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request effectively addresses the hang issue by introducing a timeout to the ACP handshake and ensuring child processes are cleaned up on failure. The implementation is solid, and the suggestions to use the configured killGracePeriodMs and properly clear the SIGKILL timer further improve maintainability and robustness, aligning with best practices for process management.
| const args = (options.adapterOptions?.args as string[]) ?? []; | ||
| const cwd = options.adapterOptions?.cwd as string | undefined; | ||
| const tracer = options.adapterOptions?.tracer as MessageTracer | undefined; | ||
| const initializeTimeoutMs = options.adapterOptions?.initializeTimeoutMs as number | undefined; |
There was a problem hiding this comment.
Consider extracting killGracePeriodMs from adapterOptions as well. This allows the adapter to respect the global configuration for process termination grace periods, rather than relying on a hardcoded value in the error handler.
| const initializeTimeoutMs = options.adapterOptions?.initializeTimeoutMs as number | undefined; | |
| const initializeTimeoutMs = options.adapterOptions?.initializeTimeoutMs as number | undefined; | |
| const killGracePeriodMs = (options.adapterOptions?.killGracePeriodMs as number) ?? 5000; |
There was a problem hiding this comment.
Fixed — is now extracted from with a fallback of 5000.
| child.kill("SIGTERM"); | ||
| const killTimer = setTimeout(() => child.kill("SIGKILL"), 5000); | ||
| killTimer.unref(); |
There was a problem hiding this comment.
It is recommended to clear the killTimer if the child process exits before the timeout. This prevents sending a SIGKILL to a potentially reused PID. Additionally, using the extracted killGracePeriodMs ensures consistency with the system configuration.
| child.kill("SIGTERM"); | |
| const killTimer = setTimeout(() => child.kill("SIGKILL"), 5000); | |
| killTimer.unref(); | |
| child.kill("SIGTERM"); | |
| const killTimer = setTimeout(() => child.kill("SIGKILL"), killGracePeriodMs); | |
| child.once("exit", () => clearTimeout(killTimer)); | |
| killTimer.unref(); |
There was a problem hiding this comment.
Fixed — added child.once('exit', () => clearTimeout(killTimer)) and switched to killGracePeriodMs.
| try { | ||
| await this.bridge.connectBackend(sessionId, { | ||
| adapterOptions: { cwd }, | ||
| adapterOptions: { cwd, initializeTimeoutMs: this.config.initializeTimeoutMs }, |
There was a problem hiding this comment.
Pass killGracePeriodMs to the adapter options to ensure the backend adapter respects the configured grace period during cleanup on connection failure.
| adapterOptions: { cwd, initializeTimeoutMs: this.config.initializeTimeoutMs }, | |
| adapterOptions: { cwd, initializeTimeoutMs: this.config.initializeTimeoutMs, killGracePeriodMs: this.config.killGracePeriodMs }, |
There was a problem hiding this comment.
Fixed in both createSession and the reconnect path.
| try { | ||
| await this.bridge.connectBackend(sessionId, { | ||
| adapterOptions: { cwd: info.cwd }, | ||
| adapterOptions: { cwd: info.cwd, initializeTimeoutMs: this.config.initializeTimeoutMs }, |
There was a problem hiding this comment.
Pass killGracePeriodMs here as well to maintain consistency across all connection paths.
| adapterOptions: { cwd: info.cwd, initializeTimeoutMs: this.config.initializeTimeoutMs }, | |
| adapterOptions: { cwd: info.cwd, initializeTimeoutMs: this.config.initializeTimeoutMs, killGracePeriodMs: this.config.killGracePeriodMs }, |
There was a problem hiding this comment.
Fixed — both connectBackend call sites now carry killGracePeriodMs.
…r timer on early exit - Extract killGracePeriodMs from adapterOptions (defaulting to 5000) so the adapter respects the system-configured grace period rather than a hardcoded value - Clear the SIGKILL timer if the child exits before it fires, preventing a potential SIGKILL to a reused PID - Thread killGracePeriodMs alongside initializeTimeoutMs in both connectBackend call sites in SessionCoordinator
Summary
waitForResponseinAcpAdapter.connect()had no timeout, causing concurrent Gemini sessions to hang indefinitely when the second process stalled acquiringproper-lockfileon~/.gemini/projects.jsonduringconfig.initialize()timeoutMstowaitForResponse; fires after the configured delay and rejects with a clear errorinitializeTimeoutMsfromResolvedConfigintoadapterOptionsat bothconnectBackendcall sites inSessionCoordinator(createSessionand the relaunch/reconnect path)Root Cause
The Gemini CLI uses
proper-lockfileon~/.gemini/projects.jsonduringconfig.initialize(), which runs synchronously as part of the ACPinitializehandshake. When two Gemini processes start concurrently, the second one stalls waiting for the lock (up toLOCK_TIMEOUT_MS = 10000). SincewaitForResponsehad no timeout, beamcode waited indefinitely — all the way to vitest's 120s global test timeout.Test Plan