R2CCL: Reliable and Resilient Collective Communication

Overview

R2CCL is a fault-tolerant communication library that provides lossless, low-overhead failover by exploiting multi-NIC hardware. It is designed as a drop-in replacement for NCCL to minimize full job terminations caused by network failures.

📢 Update 02/23/2025: Added a CloudLab image and public profile for quick reproduction — see Demo.


Figure: Megatron Training Performance Evaluation

Features

🔥 Zero-Downtime Hot Repair: Automatically detects and mitigates network failures mid-collective. Using multi-NIC GPU buffer registration and DMA-buffer rollback, R2CCL live-migrates failed connections to backup links without losing in-flight data.

⚖️ Topology-Aware Load Balancing (R2CCL-Balance): After a failure, R2CCL dynamically redistributes traffic across the remaining healthy NICs. It accounts for PCIe, NUMA, and NVLink (PXN) topology to maximize the remaining bandwidth.

🚀 Failure-Optimized AllReduce (R2CCL-AllReduce): Introduces a novel schedule that combines global and partial AllReduce operations so that degraded servers do not bottleneck the cluster.


Demo

R2CC_Demo_with_caption.mov

We provide a pre-built CloudLab image and a public profile to quickly reproduce the demo experiment. See the CloudLab setup guide for details.

Todo List

  1. Live Migration: Seamless failover via multi-NIC registration and DMA rollback. ✔️
  2. R2CCL-Balance: Load-balancing for remaining healthy interfaces. ✔️
  3. Simulated R2CCL-AllReduce: Performance-equivalent implementation via AllReduce + Broadcast ✔️
  4. Clean up legacy code and add examples / test scripts for common platforms.
  5. Native implementation of R2CCL-AllReduce with customized kernel.
  6. Optimization: Further performance tuning.

How to use R2CCL

Build

git clone https://github.com/r2cc-project/R-2CCL.git
cd R-2CCL
make -j

Test

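Like NCCL, R2CCL can be benchmarked with nccl-tests. The commands below build nccl-tests and run an example AllReduce benchmark. In the run command, -b and -e set the minimum and maximum message sizes, -f is the multiplication factor between successive sizes, -t is the number of threads per process, and -g is the number of GPUs per thread.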

Build nccl-tests

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
mpirun -np 4 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1 
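According to the nccl-tests README, multi-process runs under mpirun require building with MPI=1, and NCCL_HOME selects which NCCL-compatible library to compile against. The sketch below prints such a make invocation for a local R2CCL build. The helper name nccl_tests_make_cmd and every default path are assumptions for illustration, in particular that R2CCL places its build output in R-2CCL/build the way upstream NCCL does.

```shell
#!/usr/bin/env bash
# Sketch: compose a make command that builds nccl-tests with MPI support
# against a local R2CCL build. All default paths are assumptions.
R2CCL_HOME="${R2CCL_HOME:-$HOME/R-2CCL/build}"   # assumed R2CCL build output
MPI_HOME="${MPI_HOME:-/usr/local/mpi}"           # assumed MPI install prefix
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"        # assumed CUDA install prefix

# Print the make invocation so it can be inspected before it is run.
nccl_tests_make_cmd() {
  printf 'make MPI=1 MPI_HOME=%s CUDA_HOME=%s NCCL_HOME=%s\n' \
    "$MPI_HOME" "$CUDA_HOME" "$R2CCL_HOME"
}

nccl_tests_make_cmd
```

To use it, run `eval "$(nccl_tests_make_cmd)"` inside the nccl-tests directory. At run time the loader must also find the R2CCL library, for example by prepending `$R2CCL_HOME/lib` to `LD_LIBRARY_PATH`, again assuming the upstream NCCL directory layout.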

Testing with Environment Variables

Triggering real failures is cumbersome (e.g., it may require SmartNICs to disable specific routes at runtime). To simplify performance testing, we provide environment variables that directly simulate specific scenarios.

R2CC_MODE:

  • 0: NCCL baseline
  • 1: Live Migration
  • 2: R2CCL-Balance
  • 3: R2CCL-AllReduce
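Benchmark scripts can check the mode value before launching a job. A minimal sketch of such a check, using only the four values listed above; the function name r2cc_mode_name is hypothetical and not part of R2CCL.

```shell
# Map an R2CC_MODE value to the label used in this README.
# Returns non-zero for any value outside the documented range 0-3.
r2cc_mode_name() {
  case "$1" in
    0) echo "NCCL baseline" ;;
    1) echo "Live Migration" ;;
    2) echo "R2CCL-Balance" ;;
    3) echo "R2CCL-AllReduce" ;;
    *) echo "unknown R2CC_MODE: $1" >&2; return 1 ;;
  esac
}

# Example: label the mode a run is about to use (defaults to 0 if unset).
echo "R2CC_MODE=${R2CC_MODE:-0} ($(r2cc_mode_name "${R2CC_MODE:-0}"))"
```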

Example 1: Live Migration

Tests live-migration performance. Requirements: 2 nodes with at least 2 NICs each.

# no failure
mpirun -np 4 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1

# 1 failure (simulated)
mpirun -x R2CC_MODE=1 -np 4 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1

# or run the first command and disable the route at runtime to trigger a real failure
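To compare the two runs, it helps to pull one number out of each log. nccl-tests ends its output with a summary line containing "Avg bus bandwidth"; the sketch below extracts that value from a saved log. The function name avg_busbw and the log file names are hypothetical.

```shell
# Extract the "Avg bus bandwidth" summary value (GB/s) from an nccl-tests log.
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Usage (log file names are placeholders):
#   mpirun ... ./build/all_reduce_perf ... | tee no_failure.log
#   mpirun -x R2CC_MODE=1 ... ./build/all_reduce_perf ... | tee one_failure.log
#   echo "no failure: $(avg_busbw no_failure.log) GB/s"
#   echo "1 failure:  $(avg_busbw one_failure.log) GB/s"
```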

Example 2: Failure-Aware Scheduling Performance

When only one NIC remains on each node, there is nothing left to balance, so all strategies perform identically. We therefore recommend testing on machines with 8 NICs and 8 GPUs. The commands below benchmark R2CCL-Balance and R2CCL-AllReduce, respectively.

# R2CCL-Balance
mpirun -x R2CC_MODE=2 -np 16 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1

# R2CCL-AllReduce
mpirun -x R2CC_MODE=3 -np 16 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1
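The two runs differ only in the R2CC_MODE value, so a small loop can sweep several modes and keep one log per mode. This is a sketch under assumptions: sweep_modes, HOSTS, NP, DRY_RUN, and the log file names are all hypothetical. DRY_RUN defaults to 1, so the script only prints the commands unless told otherwise.

```shell
#!/usr/bin/env bash
# Sketch: run the same AllReduce benchmark under several R2CC_MODE values.
HOSTS="${HOSTS:-A,B}"      # placeholder host list, as in the examples above
NP="${NP:-16}"             # total MPI processes
DRY_RUN="${DRY_RUN:-1}"    # 1 = print commands only, anything else = run them

sweep_modes() {
  for mode in "$@"; do
    cmd="mpirun -x R2CC_MODE=${mode} -np ${NP} -host ${HOSTS} ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd"
    else
      # Relies on word splitting of $cmd; save one log per mode.
      $cmd | tee "r2cc_mode_${mode}.log"
    fi
  done
}

sweep_modes 2 3
```

Running it with `DRY_RUN=0` from the nccl-tests directory executes each benchmark in turn. Passing `0 2 3` to sweep_modes would add the NCCL baseline to the comparison.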

Citation

@article{wang2025reliable,
  title={Reliable and Resilient Collective Communication Library for LLM Training and Serving},
  author={Wang, Wei and Yu, Nengneng and Xiong, Sixian and Liu, Zaoxing},
  journal={arXiv preprint arXiv:2512.25059},
  year={2025}
}
