fix: stagger vLLM engine startup to avoid EADDRINUSE by penfever · Pull Request #1356 · NovaSky-AI/SkyRL

penfever · 2026-03-20T13:52:46Z

Summary

When multiple inference engines start on the same node simultaneously, vLLM's get_open_port() can return the same port to different engines (a TOCTOU race), causing EADDRINUSE failures during engine init.
Adds a random 1.5-3.0s delay before AsyncLLMEngine.from_engine_args() to desynchronise port allocation across engines on the same node.
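
The failure mode in the summary can be illustrated with a small sketch. It assumes a simplified probe-then-release helper (probe_free_port is a hypothetical stand-in written for this example, not vLLM's actual get_open_port() implementation). It shows the outcome once two listeners have been handed the same port number: the second bind fails with EADDRINUSE.

```python
import errno
import socket


def probe_free_port() -> int:
    """Ask the OS for a free port, then release it immediately.

    Hypothetical stand-in for a probe-then-release helper. Nothing
    reserves the port after this function returns, which is the
    time-of-check/time-of-use gap.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))  # port 0: let the OS choose
        return probe.getsockname()[1]


def second_bind_fails() -> bool:
    """Return True if a second bind to the same port raises EADDRINUSE."""
    port = probe_free_port()

    # "Engine A" binds the port it was told was free.
    engine_a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    engine_b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        engine_a.bind(("127.0.0.1", port))
        engine_a.listen()
        # "Engine B" was handed the same number and tries to bind it too.
        try:
            engine_b.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        engine_a.close()
        engine_b.close()


if __name__ == "__main__":
    print("second bind failed with EADDRINUSE:", second_bind_fails())
```

The sketch reproduces only the consequence of the race, not the race itself. In the real scenario the two engines are separate processes that both probe before either one binds.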

Test plan

Verify multi-engine startup on a single node no longer hits EADDRINUSE
Verify single-engine startup is unaffected (just a brief delay)

🤖 Generated with Claude Code

When multiple inference engines start on the same node simultaneously, vLLM's get_open_port() can return the same port to different engines (TOCTOU race). This causes EADDRINUSE failures during engine init. Add a random 1.5-3.0s delay before engine creation to desynchronise the port allocation calls across engines on the same node.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

gemini-code-assist

Code Review

This pull request addresses a race condition during vLLM engine startup by introducing a random delay. While this is a valid approach to reduce the likelihood of port collisions, it introduces a performance overhead for all startups and doesn't completely eliminate the race condition. I've suggested a more robust solution using a file lock to properly serialize the engine initialization, which would solve the problem deterministically without unnecessary delays.

gemini-code-assist · 2026-03-20T13:54:54Z

skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py

+        # Stagger engine startup to avoid TOCTOU port collisions (EADDRINUSE).
+        # vLLM's get_open_port() queries a free port then releases the socket;
+        # if multiple engines on the same node call it simultaneously, they can
+        # get the same port. A random delay desynchronises the calls.
+        import random
+
+        _stagger = random.uniform(1.5, 3.0)
+        logger.info(f"Engine startup stagger: sleeping {_stagger:.2f}s to avoid port collisions")
+        time.sleep(_stagger)
+
        engine = vllm.AsyncLLMEngine.from_engine_args(engine_args, stat_loggers=stat_loggers)


Using time.sleep with a random delay is a good first step to mitigate the race condition, but it has a few drawbacks:

It adds a significant startup delay (1.5-3.0s) even for single-engine scenarios, which is a performance regression.

It reduces the probability of a port collision but doesn't eliminate it. The race condition can still occur, albeit less frequently.
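
To put a rough number on the second drawback, here is a back-of-the-envelope sketch. It assumes two engines begin their stagger sleep at the same instant, that each draws its delay independently from Uniform(1.5, 3.0) as in the diff, and that a collision is only possible when the two wake-up times fall within some window of each other. The window length is a hypothetical parameter chosen for illustration, not a measured property of vLLM.

```python
def overlap_probability(window_s: float, low: float = 1.5, high: float = 3.0) -> float:
    """P(|X - Y| < window_s) for independent X, Y ~ Uniform(low, high).

    The difference of two independent uniform variables has a triangular
    distribution, which gives the closed form 1 - (1 - w / span) ** 2.
    """
    span = high - low
    if window_s >= span:
        return 1.0
    return 1.0 - (1.0 - window_s / span) ** 2


if __name__ == "__main__":
    # Hypothetical windows during which two port probes could still race.
    for window in (0.01, 0.05, 0.25):
        p = overlap_probability(window)
        print(f"window={window:.2f}s -> overlap probability {p:.3f}")
```

Under these assumptions the probability shrinks as the window shrinks but is never zero, and it grows again as more engines are started on the same node.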

A more robust and efficient solution would be to use a file-based lock to serialize the engine initialization across different processes on the same node. This ensures that only one engine attempts to allocate a port at a time, completely avoiding the race condition without introducing an unnecessary delay when only one engine is starting.

You can use the filelock library for this, which would mean adding filelock as a new dependency. For better code style, the import line (from filelock import FileLock) should also be moved to the top of the file.

Suggested change

-        # Stagger engine startup to avoid TOCTOU port collisions (EADDRINUSE).
-        # vLLM's get_open_port() queries a free port then releases the socket;
-        # if multiple engines on the same node call it simultaneously, they can
-        # get the same port. A random delay desynchronises the calls.
-        import random
-
-        _stagger = random.uniform(1.5, 3.0)
-        logger.info(f"Engine startup stagger: sleeping {_stagger:.2f}s to avoid port collisions")
-        time.sleep(_stagger)
-
-        engine = vllm.AsyncLLMEngine.from_engine_args(engine_args, stat_loggers=stat_loggers)
+        # Stagger engine startup to avoid TOCTOU port collisions (EADDRINUSE).
+        # vLLM's get_open_port() queries a free port then releases the socket;
+        # if multiple engines on the same node call it simultaneously, they can
+        # get the same port. A file lock serialises port allocation.
+        from filelock import FileLock
+
+        # A timeout is added to prevent indefinite waiting.
+        lock_path = "/tmp/vllm_engine_init.lock"
+        logger.info(f"Attempting to acquire lock {lock_path} for vLLM engine startup to avoid port collisions.")
+        with FileLock(lock_path, timeout=60):
+            logger.info(f"Acquired lock {lock_path}, proceeding with engine startup.")
+            engine = vllm.AsyncLLMEngine.from_engine_args(engine_args, stat_loggers=stat_loggers)
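
Since the review notes that filelock would be a new dependency, the same serialisation idea can be sketched with only the standard library. This is a minimal, POSIX-only illustration built on fcntl.flock. The node_lock helper and the lock path are hypothetical names for this example, and unlike the suggestion above the sketch has no timeout.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def node_lock(path: str):
    """Hold an exclusive advisory lock on `path` for the duration of the block.

    Hypothetical helper: processes on the same node that use the same
    path run the protected block one at a time.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical lock path; the engine creation call would go inside the block.
    with node_lock("/tmp/example_engine_init.lock"):
        print("lock held; engine creation would run here")
```

One design point applies to both variants. Holding the lock for the whole engine creation call makes engines on a node start one after another, so the total startup time grows with the number of engines.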

devin-ai-integration bot reviewed Mar 20, 2026

penfever wants to merge 1 commit into NovaSky-AI:main from penfever:penfever/vllm-engine-startup-stagger
