Description
I'm seeing a segfault when doing Device-to-Device (D-D) transfers (OSU benchmarks) between NVIDIA GPUs using the OFI MTL and the LinkX provider. Here is what I do:
- compile the newest libfabric master (or the 2.0.0 release) with CUDA support, and with either the CXI provider (for a Slingshot system) or the verbs provider (for an InfiniBand-based system); a rough sketch of the build steps is below
- compile OpenMPI 5.0.7 with PR #12290 ("mtl: ofi change to allow cxi anywhere in provname")
As discussed previously in #13048, this PR is needed to make OpenMPI + LinkX work correctly on HIP GPUs (without it, OpenMPI fails with a `Required key not available` error). However, on a system with NVIDIA GPUs + CXI the same test fails with a segfault, as reported to OFI in ofiwg/libfabric#10865.
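For reference, here is a rough sketch of how I build both pieces; the install prefixes and the CUDA path are illustrative, and the exact configure invocations may differ slightly from what I used:

```sh
# libfabric master (or v2.0.0) with CUDA support and the verbs provider
# (use --enable-cxi instead of --enable-verbs on a Slingshot system)
cd libfabric
./autogen.sh
./configure --prefix=$HOME/software/libfabric \
            --with-cuda=/usr/local/cuda \
            --enable-verbs
make -j && make install

# OpenMPI 5.0.7 with the #12290 patch applied, built against that libfabric
cd openmpi-5.0.7
./configure --prefix=$HOME/software/openmpi \
            --with-libfabric=$HOME/software/libfabric \
            --with-cuda=/usr/local/cuda
make -j && make install
```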
I have now reproduced the same segfault on an InfiniBand-based system, with verbs instead of CXI. To reproduce it I had to extend the patch from #12290 as follows:
```c
// mtl_ofi_component.c
if ((NULL != strstr(prov->fabric_attr->prov_name, "cxi")) ||
    (NULL != strstr(prov->fabric_attr->prov_name, "CXI")) ||
    (NULL != strstr(prov->fabric_attr->prov_name, "verbs"))) {
    ompi_mtl_ofi.hmem_needs_reg = false;
}
```
Otherwise, OpenMPI attempts to register the CUDA buffer itself and I get the familiar memory registration error:
```
--------------------------------------------------------------------------
Open MPI failed to register your buffer.
This error is fatal, your job will abort
Buffer Type: cuda
Buffer Address: 0x148b43600000
Buffer Length: 1
Error: Required key not available (4294967030)
--------------------------------------------------------------------------
```
With the patch extended to also set `ompi_mtl_ofi.hmem_needs_reg = false` for verbs, I run the OSU benchmark as follows:
```sh
export FI_LNX_PROV_LINKS=shm+verbs
mpirun -np 2 -mca pml cm -mca mtl ofi -mca opal_common_ofi_provider_include "shm" -map-by numa ./osu_bibw D D
```
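(On the Slingshot system the invocation is analogous; presumably only the link spec changes, along the lines of the following, though I have not copied it verbatim from that setup:)

```sh
export FI_LNX_PROV_LINKS=shm+cxi
mpirun -np 2 -mca pml cm -mca mtl ofi -mca opal_common_ofi_provider_include "shm" -map-by numa ./osu_bibw D D
```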
I get the exact same segfault as with CUDA+CXI (ofiwg/libfabric#10865):
```
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       0.15
[gpu-12-6:1952376:0:1952376] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x18)
==== backtrace (tid:1952376) ====
 0 0x0000000000054d90 __GI___sigaction()  :0
 1 0x000000000002d18a cuda_gdrcopy_dev_unregister()  :0
 2 0x000000000005318a ofi_mr_cache_flush()  :0
 3 0x000000000005336b ofi_mr_cache_search()  :0
 4 0x0000000000106b0a lnx_trecv()  :0
 5 0x00000000001cbefb fi_trecv()  /cluster/home/marcink/software/libfabric/include/rdma/fi_tagged.h:101
 6 0x00000000001cbefb ompi_mtl_ofi_irecv_true()  /tmp/openmpi-5.0.7/ompi/mca/mtl/ofi/mtl_ofi_irecv_opt.c:38
 7 0x000000000026f2e7 mca_pml_cm_irecv()  /tmp/openmpi-5.0.7/ompi/mca/pml/cm/pml_cm.h:121
 8 0x00000000000ba78b PMPI_Irecv()  /tmp/openmpi-5.0.7/ompi/mpi/c/irecv.c:89
 9 0x0000000000403d56 main()  /cluster/home/marcink/src/osu-micro-benchmarks-7.1-1/c/mpi/pt2pt/standard/osu_bibw.c:286
10 0x000000000003feb0 __libc_start_call_main()  ???:0
11 0x000000000003ff60 __libc_start_main_alias_2()  :0
12 0x00000000004047a5 _start()  ???:0
=================================
```
I understand that my change to `mtl_ofi_component.c` is an experiment, probably wrong, and that libfabric is not the main execution path on this system. The point is that it now produces the same failure on two systems, CUDA+CXI and CUDA+IB, and CUDA+CXI should work. So is it possible that something in PR #12290 is not entirely correct on NVIDIA GPUs? Could it be that the buffers should in fact be registered in this case, because the LinkX/CXI providers handle them differently? And if so, should this be fixed in OpenMPI, or in libfabric?