IPC handles cannot be copied by value between processes #448

Open
@paboyle

Source:
reproducer_IPC_bandwidth.cpp

Problem:
The handle produced by zeMemGetIpcHandle and consumed by zeMemOpenIpcHandle cannot be copied by value to
a distinct process, using MPI or any other means of inter-process communication, and then used there.

Level Zero example codes such as zello_ipc_copy_dma_buf.cpp reinterpret the first four bytes of the opaque handle as a file
descriptor and pass this descriptor between processes through Unix domain sockets. The descriptor value seen by the receiving
process generally differs from the value sent. The receiver inserts it into the first four bytes of an (uninitialized) handle
and uses that handle as the key to open the IPC memory window.
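
The byte-level handling described above can be sketched as follows. This is a minimal illustration, not the sample's actual code: `opaque_handle_t` is a hypothetical stand-in for the Level Zero handle type, its 64-byte size is an assumption, and the helper names are invented here.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the opaque Level Zero IPC handle.
// The 64-byte size is an assumption made for this sketch.
struct opaque_handle_t { char data[64]; };

// Read the leading bytes of the handle as a file descriptor number.
int handle_get_fd(const opaque_handle_t &h) {
    int fd;
    std::memcpy(&fd, h.data, sizeof fd);
    return fd;
}

// Overwrite the leading bytes with the descriptor number that is
// valid in the receiving process.
void handle_set_fd(opaque_handle_t &h, int fd) {
    assert(fd >= 0);
    std::memcpy(h.data, &fd, sizeof fd);
}
```

The point of the sketch is that the handle bytes alone are not portable: the integer stored in them only has meaning inside the process that owns the descriptor.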

MPI has no facility for passing file descriptors, and MPI process creation offers no opportunity to call socketpair before fork,
which makes this mechanism problematic in an HPC or MPI environment.

See lines 24 to 76 of the Level Zero example “zello_ipc_copy_dma_buf.cpp” and the routines:
static int sendmsg_fd(int socket, int fd)
and
static int recvmsg_fd(int socket)

for the cumbersome implementation.
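
For readers without the sample at hand, descriptor passing over a Unix domain socket generally looks like the sketch below. It is not the sample's code: the names `send_fd` and `recv_fd` are hypothetical, and error handling is reduced to returning -1.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Send one descriptor over a connected Unix domain socket.
// Returns 0 on success, -1 on error.
int send_fd(int sock, int fd) {
    assert(fd >= 0);
    char payload = 0;  // one byte of ordinary data accompanies the control message
    iovec iov{&payload, sizeof payload};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof ctrl;
    cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;  // ask the kernel to transfer the descriptor
    c->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(c), &fd, sizeof fd);
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

// Receive one descriptor. Returns the new local descriptor, or -1 on error.
// The returned number is generally different from the sender's number.
int recv_fd(int sock) {
    char payload;
    iovec iov{&payload, sizeof payload};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof ctrl;
    if (recvmsg(sock, &msg, 0) <= 0) return -1;
    cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    std::memcpy(&fd, CMSG_DATA(c), sizeof fd);
    return fd;
}
```

Both ends need a connected Unix domain socket before any of this can run, which is exactly the piece an MPI job does not provide out of the box.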

Proposed solution:

Lines 284 through 358 of the reproducer.

Sufficient data to obtain the IPC file descriptor can be copied by value between
MPI processes, without a Unix domain socketpair or file descriptor passing.

The data structure

typedef struct { int fd; pid_t pid ; } clone_mem_t;

It contains both the source process PID and the FD number within that process. Combined with
the Linux system call “pidfd_getfd”, this pair is enough to obtain the same file descriptor in another process
with the same UID or sufficient permissions.
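
A minimal sketch of how such a pair might be exported and imported is shown below. It is not the reproducer's code: the helper names `clone_mem_export` and `clone_mem_import` are hypothetical, the system calls are issued through `syscall` in case the C library has no wrappers, and the fallback syscall numbers are taken from the generic Linux syscall table. It assumes a kernel that provides both pidfd_open and pidfd_getfd.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Fallback numbers for older headers (generic Linux syscall table).
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

typedef struct { int fd; pid_t pid; } clone_mem_t;

// Exporting side: record the descriptor together with the PID that owns it.
clone_mem_t clone_mem_export(int fd) {
    assert(fd >= 0);
    clone_mem_t c;
    c.fd = fd;
    c.pid = getpid();
    return c;
}

// Importing side: duplicate the remote descriptor into the calling process.
// Returns a new local descriptor, or -1 with errno set.
int clone_mem_import(clone_mem_t c) {
    int pidfd = (int)syscall(SYS_pidfd_open, c.pid, 0);
    if (pidfd < 0) return -1;
    int fd = (int)syscall(SYS_pidfd_getfd, pidfd, c.fd, 0);
    int saved = errno;  // keep the pidfd_getfd error across close()
    close(pidfd);
    errno = saved;
    return fd;
}
```

The exporting process must keep the descriptor open until the importer has duplicated it, and the importer needs ptrace-level access to the exporter, which is where the same-UID condition comes from.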

This code demonstrates transmitting a clone_mem_t between MPI ranks, opening the
IPC file descriptor in the receiving process (using pidfd_getfd), and then passing it to the Level Zero IPC
open routine.

This is a proof of principle: if the opaque IPC handle used by Level Zero instead contained an encoding
of “PID” and “FD”, then Level Zero IPC handles could be copied simply by value, through MPI or otherwise,
and passed to the IPC open routine without any additional and unnecessary complexity.
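
One way such an encoding could look is sketched below. This is purely illustrative and says nothing about how Level Zero lays out its handles: `opaque_handle_t`, its 64-byte size, and the helpers `handle_encode` and `handle_decode` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstring>
#include <sys/types.h>

// Hypothetical stand-in for the opaque IPC handle (size is an assumption).
struct opaque_handle_t { char data[64]; };

typedef struct { int fd; pid_t pid; } clone_mem_t;

static_assert(sizeof(clone_mem_t) <= sizeof(opaque_handle_t::data),
              "the PID/FD pair must fit inside the opaque handle");

// Pack the PID/FD pair into the front of a zero-filled handle.
opaque_handle_t handle_encode(clone_mem_t c) {
    assert(c.fd >= 0);
    opaque_handle_t h{};
    std::memcpy(h.data, &c, sizeof c);
    return h;
}

// Recover the PID/FD pair from a handle received by value.
clone_mem_t handle_decode(const opaque_handle_t &h) {
    clone_mem_t c;
    std::memcpy(&c, h.data, sizeof c);
    return c;
}
```

A handle built this way is plain data, so it could be sent as sizeof(opaque_handle_t) raw bytes over MPI or any other channel, and the receiver could decode it and call pidfd_getfd itself.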

Code available via Camilo Moreno or Patrick Steinbrecher @ intel.

Realistically, I suspect this is enough to go in. See man pidfd_getfd under Linux.

https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html
