R2CCL is a fault-tolerant communication library that provides lossless, low-overhead failover by exploiting multi-NIC hardware. It is designed as a drop-in replacement for NCCL, so that network failures no longer force full job terminations.
📢 Update 02/23/2025: Added a CloudLab image and public profile for quick reproduction — see Demo.

Megatron Training Performance Evaluation
🔥 Zero-Downtime Hot Repair: Automatically detects and mitigates network failures mid-collective. Using multi-NIC GPU buffer registration and DMA-buffer rollback, R2CCL live-migrates failed connections to backup links without losing in-flight data.
⚖️ Topology-Aware Load Balancing (R2CCL-Balance): After a failure, R2CCL dynamically redistributes traffic across the remaining healthy NICs. It accounts for PCIe, NUMA, and NVLink (PXN) topology to maximize the remaining bandwidth.
🚀 Failure-Optimized AllReduce (R2CCL-AllReduce): Introduces a schedule that combines global and partial AllReduce operations so that degraded servers do not bottleneck the rest of the cluster.
R2CC_Demo_with_caption.mov
We provide a pre-built CloudLab image and a public profile to quickly reproduce the demo experiment. See the CloudLab setup guide for details.
- Live Migration: Seamless failover via multi-NIC registration and DMA rollback. ✔️
- R2CCL-Balance: Load-balancing for remaining healthy interfaces. ✔️
- Simulated R2CCL-AllReduce: Performance-equivalent implementation via AllReduce + Broadcast ✔️
- Clean up legacy code and add examples / test scripts for common platforms.
- Native implementation of R2CCL-AllReduce with customized kernel.
- Optimization: Further performance tuning.
git clone https://github.com/r2cc-project/R-2CCL.git
cd R-2CCL
make -j

Like NCCL, R2CCL can be benchmarked using nccl-tests. Below we provide compilation commands for nccl-tests and an example performance test using allreduce.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
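The plain `make` above builds nccl-tests without MPI and against whatever NCCL is found on the system. Because the examples here launch ranks with `mpirun`, a build with MPI support that points at the R2CCL build output may be needed. The sketch below only prints candidate commands. `MPI=1`, `MPI_HOME`, `CUDA_HOME`, and `NCCL_HOME` are standard nccl-tests Makefile variables. The paths are placeholders, and the assumption that R2CCL keeps NCCL's default `build/` output layout is not confirmed by this README.

```shell
#!/usr/bin/env bash
# Placeholder paths; adjust to your system.
# ASSUMPTION: R2CCL, like NCCL, places headers and libraries under build/.
R2CCL_HOME="${R2CCL_HOME:-$HOME/R-2CCL/build}"
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
MPI_HOME="${MPI_HOME:-/usr/lib/x86_64-linux-gnu/openmpi}"

# Print (do not run) an nccl-tests build command with MPI enabled and
# NCCL_HOME pointing at the R2CCL build directory.
nccl_tests_build_cmd() {
  echo "make MPI=1 MPI_HOME=${MPI_HOME} CUDA_HOME=${CUDA_HOME} NCCL_HOME=${R2CCL_HOME}"
}

# Print the export that lets the test binaries load the R2CCL-built
# library at runtime instead of a system-wide NCCL.
nccl_tests_env_cmd() {
  echo "export LD_LIBRARY_PATH=${R2CCL_HOME}/lib:\$LD_LIBRARY_PATH"
}

nccl_tests_build_cmd
nccl_tests_env_cmd
```

Review the printed commands, then run them from the nccl-tests directory. With Open MPI, adding `-x LD_LIBRARY_PATH` to `mpirun` forwards the exported library path to the remote ranks.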
mpirun -np 4 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1

Triggering real failures is complex (e.g., it may require SmartNICs to disable specific routing at runtime). To simplify performance testing, we provide environment variables that directly simulate specific scenarios.
R2CC_MODE:
- 0: NCCL baseline
- 1: Live Migration
- 2: R2CCL-Balance
- 3: R2CCL-AllReduce
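To compare all scenarios with identical benchmark arguments, the mode values above can be wrapped in a small helper. This is a minimal sketch, not part of R2CCL: the function names `r2cc_mode_name` and `r2cc_bench_cmd` are hypothetical, and the host list `A,B` and rank counts are the same placeholders used elsewhere in this README. The helper only prints commands, so nothing is launched until you run the output yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an R2CC_MODE value to the scenario name
# listed in this README.
r2cc_mode_name() {
  case "$1" in
    0) echo "NCCL baseline" ;;
    1) echo "Live Migration" ;;
    2) echo "R2CCL-Balance" ;;
    3) echo "R2CCL-AllReduce" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the all_reduce_perf command for a given mode.
# Arguments: mode, number of ranks (default 4), host list (default A,B).
r2cc_bench_cmd() {
  local mode="$1" np="${2:-4}" hosts="${3:-A,B}"
  r2cc_mode_name "$mode" >/dev/null || return 1
  echo "mpirun -x R2CC_MODE=${mode} -np ${np} -host ${hosts} ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1"
}

# Emit one labeled command per scenario, using 16 ranks.
for mode in 0 1 2 3; do
  echo "# $(r2cc_mode_name "$mode")"
  r2cc_bench_cmd "$mode" 16
done
```

Redirecting the loop output to a file gives a reproducible script of runs whose only difference is the `R2CC_MODE` value.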
Test Migration Performance. Requirements: 2 nodes, >=2 NICs per node.
# no failure
mpirun -np 4 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1
# 1 failure
mpirun -x R2CC_MODE=1 -np 4 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1
# or run the first command and disable the routing

When only one NIC remains on each node, the performance of different strategies is identical. Therefore, we recommend using machines with 8 NICs and 8 GPUs for testing. Below are the performance tests for R2CCL-Balance and R2CCL-AllReduce, respectively.
mpirun -x R2CC_MODE=2 -np 16 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1
mpirun -x R2CC_MODE=3 -np 16 -host A,B ./build/all_reduce_perf -b 8K -e 8G -f 2 -t 1 -g 1

@article{wang2025reliable,
title={Reliable and Resilient Collective Communication Library for LLM Training and Serving},
author={Wang, Wei and Yu, Nengneng and Xiong, Sixian and Liu, Zaoxing},
journal={arXiv preprint arXiv:2512.25059},
year={2025}
}
