Description
We are running an application on a single node using CUDA 12.8. We are using Open MPI 5.0.7 with UCX 1.8.0, gdrcopy 1.5.0, and nv_peer_mem v1.3.
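In case it is relevant, the versions above can be cross-checked with something like the following (a minimal sketch; we are assuming ucx_info was installed under the same prefix as Open MPI, as the library paths in the backtrace suggest):

# Report the Open MPI and UCX builds actually being picked up
/shared_data/third_party/openmpi-5.0.7/bin/mpirun --version
/shared_data/third_party/openmpi-5.0.7/bin/ucx_info -v
# Confirm the GPUDirect-related kernel modules are loaded
lsmod | grep -E 'gdrdrv|nv_peer_mem'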
We are getting the error below:
ib_mlx5_log.c:179 Remote operation error on mlx5_4:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
ib_mlx5_log.c:179 RC QP 0x53 wqe[0]: RDMA_READ s-- [rva 0x7fe7d1a00000 rkey 0x180ac9] [va 0x7f15bfe00000 len 10020 lkey 0x182fee] [rqpn 0x5f dlid=65 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 6594) ====
0 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f18ebadf934]
1 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_fatal_error_message+0xc2) [0x7f18ebadc9c2]
2 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_log_default_handler+0xf7e) [0x7f18ebae15ee]
3 /shared_data/third_party/openmpi-5.0.7/lib/libucs.so.0(ucs_log_dispatch+0xe4) [0x7f18ebae1a14]
4 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(uct_ib_mlx5_completion_with_err+0x60d) [0x7f160210036d]
5 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(uct_rc_mlx5_iface_handle_failure+0x134) [0x7f1602115f24]
6 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(uct_ib_mlx5_check_completion+0x3d) [0x7f160210153d]
7 /shared_data/third_party/openmpi-5.0.7/lib/ucx/libuct_ib_mlx5.so.0(+0x2b2f7) [0x7f16021172f7]
8 /shared_data/third_party/openmpi-5.0.7/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7f18ebb6c05a]
9 /shared_data/third_party/openmpi-5.0.7/lib/libopen-pal.so.80(opal_progress+0x34) [0x7f18ebc2e734]
10 /shared_data/third_party/openmpi-5.0.7/lib/libmpi.so.40(ompi_request_default_wait+0x140) [0x7f1904112020]
11 /shared_data/third_party/openmpi-5.0.7/lib/libmpi.so.40(MPI_Wait+0x54) [0x7f190415de84]
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
5.0.7, UCX 1.8.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was compiled from source.
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04.4 LTS
- Computer hardware: NVIDIA DGX B200
- Network type: InfiniBand NDR
Details of the problem
The command we are issuing is:
/third_party/openmpi-5.0.7/bin/mpirun -np 8 -hostfile ./hostfile --report-bindings --bind-to core --map-by ppr:8:node:PE=14 --mca pml ucx --mca btl ^openib -x UCX_TLS=self,sm,cma,cuda_copy,gdr_copy,rc_v -x UCX_IB_GPU_DIRECT_RDMA=1 ./mpi_rail_mapping_b200.sh /install_path/openmpi507-25.6.0/bin/OurExecutable
and the contents of mpi_rail_mapping_b200.sh are:
#!/bin/bash
# Map each local rank to its CUDA device and to the nearest IB HCA
# (see the nvidia-smi topo -m output below).
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
IB_DEVS=(4 7 8 9 10 13 14 15)
CUDA_DEV=$LOCAL_RANK
IB_DEV=${IB_DEVS[$LOCAL_RANK]}
export UCX_NET_DEVICES=mlx5_${IB_DEV}:1
echo "local rank $CUDA_DEV: using hca $IB_DEV"
# Forward the wrapped command and its arguments intact.
exec "$@"
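For clarity, this is the rank-to-HCA pairing the wrapper is intended to produce, as a standalone sketch (the loop just stands in for OMPI_COMM_WORLD_LOCAL_RANK across the 8 local ranks; it is not part of the actual job):

#!/bin/bash
# Print the GPU -> HCA choice the wrapper would make for each local rank.
IB_DEVS=(4 7 8 9 10 13 14 15)
for LOCAL_RANK in 0 1 2 3 4 5 6 7; do
    echo "local rank $LOCAL_RANK: CUDA device $LOCAL_RANK -> UCX_NET_DEVICES=mlx5_${IB_DEVS[$LOCAL_RANK]}:1"
done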
The IB_DEVS array is set to map each GPU to its nearest IB HCA, according to the output of nvidia-smi topo -m shown below (a short sketch deriving this mapping from the PXB entries follows the NIC legend):
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PXB NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS 0-55 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS 0-55 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS 0-55 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS 0-55 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB NODE NODE NODE NODE NODE 56-111 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE PXB NODE NODE 56-111 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE PXB NODE 56-111 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE PXB 56-111 1 N/A
NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X PIX PIX PIX NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS PIX X PIX PIX NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE NODE NODE NODE SYS SYS SYS SYS PIX PIX X PIX NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC3 NODE NODE NODE NODE SYS SYS SYS SYS PIX PIX PIX X NODE NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC4 PXB NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC5 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE X PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC6 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE PIX X NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC7 NODE PXB NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE NODE X NODE NODE SYS SYS SYS SYS SYS SYS
NIC8 NODE NODE PXB NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE NODE NODE X NODE SYS SYS SYS SYS SYS SYS
NIC9 NODE NODE NODE PXB SYS SYS SYS SYS NODE NODE NODE NODE NODE NODE NODE NODE NODE X SYS SYS SYS SYS SYS SYS
NIC10 SYS SYS SYS SYS PXB NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X NODE NODE NODE NODE NODE
NIC11 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE X PIX NODE NODE NODE
NIC12 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE PIX X NODE NODE NODE
NIC13 SYS SYS SYS SYS NODE PXB NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE X NODE NODE
NIC14 SYS SYS SYS SYS NODE NODE PXB NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE X NODE
NIC15 SYS SYS SYS SYS NODE NODE NODE PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS NODE NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NIC12: mlx5_12
NIC13: mlx5_13
NIC14: mlx5_14
NIC15: mlx5_15
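For reference, the IB_DEVS array in the wrapper was taken from the PXB entries of this matrix, i.e. the one HCA that shares a PCIe switch with each GPU. A minimal sketch of extracting that mapping (assuming the 28-field GPU rows shown above, with NIC0-NIC15 in fields 10-25, and the one-to-one NIC<n> -> mlx5_<n> legend):

nvidia-smi topo -m | awk '$1 ~ /^GPU[0-7]$/ {
    for (i = 10; i <= 25; i++)          # NIC0..NIC15 columns
        if ($i == "PXB")                # HCA behind the same PCIe switch as this GPU
            printf "%s -> NIC%d (mlx5_%d)\n", $1, i-10, i-10
}'

On this node that yields GPU0->mlx5_4, GPU1->mlx5_7, GPU2->mlx5_8, GPU3->mlx5_9, GPU4->mlx5_10, GPU5->mlx5_13, GPU6->mlx5_14, GPU7->mlx5_15, which is exactly the IB_DEVS=(4 7 8 9 10 13 14 15) used in the wrapper.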
If we permute the IB_DEVS mapping, e.g. to (4 8 9 10 13 14 15 7), the job runs, but I assume that is only because the messages then travel via the CPU/NUMA path and GPUDirect RDMA is no longer actually being used.
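That is, the wrapper works if its array is changed to the permutation above:

IB_DEVS=(4 8 9 10 13 14 15 7)   # only GPU0 keeps its PXB-local HCA with this ordering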
Also, if I add cuda_ipc to UCX_TLS it works, but that is because intra-node GPU traffic then goes over NVLink instead of the IB cards.
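In other words, the run succeeds with the transport list from the mpirun command above changed to:

-x UCX_TLS=self,sm,cma,cuda_copy,cuda_ipc,gdr_copy,rc_v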
Any advice would be appreciated.