fix: resolve multi-node training hanging in Kubernetes environments #6377
Description
Addresses issue #6349, where multi-node training hangs during distributed initialization when launched with torchrun in Kubernetes.
Root Cause
Solution
Enhanced Initialization (colossalai/initialize.py)
Kubernetes Utilities (colossalai/utils/k8s_distributed.py)
Documentation & Examples
Usage
Replace the basic torchrun invocation with the enhanced configuration:
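The original description cuts off here, so the command below is an illustrative configuration rather than the PR's verbatim example; the node count, service name, port, and NCCL settings are assumptions to adapt to your cluster:

```bash
# Assumed values: 2 nodes, 8 GPUs each, a headless Service named "trainer-master"
export NCCL_SOCKET_IFNAME=eth0        # bind NCCL to the pod network interface
export NCCL_DEBUG=INFO                # surface handshake failures instead of hanging silently

torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank="${NODE_RANK}" \
  --rdzv_backend=c10d \
  --rdzv_endpoint=trainer-master:29500 \
  --rdzv_id=colossalai-job \
  train.py
```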