-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ch4: separate multi-vci/nic initialization #7255
base: main
Are you sure you want to change the base?
Conversation
f80dcfc
to
0383144
Compare
93ec02c
to
a2a8b95
Compare
Replace the rather ambiguous field is_tainted with vcis_enabled, thus allowing vci activation on a per-comm basis. It is inherited if the new comm is created from within an parent comm that has vcis_enabled. If the parent comm vcis_enabled=false, then all its descendents will have vcis_enabled until they are turned via separate APIs (to be added in the future). Intercomm and intercomm_merge may include processes outside originating comm, thus vcis_enabled=false by default. MPI_Comm_create may create an intercomm that inherits vcis_enabled=true. This is an exception because both local processes and remote processes are from within originating comm that has vcis_enabled. For now, we switch on vcis_enabled in comm_wrold after post_init. With future extension, it is possible to allow user explicitly set up multi-vics on a smaller comm than comm world.
Rather than initialize per-vci mutexes in ch4 and register with request pools, directly use MPIR-layer request pool mutexes.
Move multiple vci related init/finalize code into ch4_vci.c. Wrap per-vci code into a function and only deal with vci 0 in ch4_init.c and additional vcis in ch4_vci.c.
Gather all multiple vci init code in MPIDI_Comm_set_vcis. So - 1. The rest of the init code only deal with root vci. 2. Prepare for future dynamic and per-comm vci.
We may relax the av insertion order which may require the full (vci_local, nic_local, vci_remote, nic_remote) to look up an actual destination address. Add MPIDI_OFI_av_to_phys_root for convenience and quick survey on where we restrict in only root vci (such as the init and spawn paths). Remove MPIDI_OFI_comm_to_phys and prefer an explicit MPIDIU_comm_rank_to_av and then MPIDI_OFI_av_to_phys. Refactor MPIDI_OFI_SET_AM_HDR_COMMON in ofi_am_impl.h to directly use dst_addr (as remote_id) rather than to recalculate it.
Let MPIDI_OFI_addr_t only contain field for root vci address, and only allocate more space for additional addresses when multiple vci and nic is enabled -- potentially at runtime. This avoids wasting memory for multiple vcis unelss it is actually needed.
Move code that are related to multiple-vci setup to ofi_vci.c.
Set up and enable multiple vcis in MPIDI_Comm_set_vcis instead of during post_init. TODO: separate the refactoring in ofi_vci.c
We exchange non-root endpoints in comm_set_vcis. Because we can't use multiple nics before the address exchange, we only can activate multiple nics in comm_set_vcis.
MPIDI_OFI_global.num_nics affects runtime paths such as ofi progress and large message striping. Only set it in MPIDI_OFI_init_vcis so we won't have complications when multi-nics is not ready.
This has been superseded by MPIDI_NM_comm_set_vcis.
Consolidate the shmem allocations in iqueue to 2 slabs. One root slab that is initialized at world_init. The other all_slab for per-vci transport, initialized at the time of init vcis. The goal is to eventually allow more flexible shm creation, potentially allow init within a non-world communicator.
Transition from world init to per-comm vci init.
MPIDU_shm_seg_t was used by mpidu_shm_alloc.c, mpidu_init_shm.c, and mpidu_init_shm_alloc.c. However, the usages are all slightly different and some fields are only used in one but not the other. It is simpler to locally define it or, in the case of mpidu_init_shm.c, just use static globals.
Add routine to support allocating a shared memory by a comm, which allows - * create shared memory by a smaller comm than a comm_world * attach the shared memory by later processes * potentially allowing shm communication with dynamic processes - we need a way to discover and attach to Init_shm (via intercomm) and the initial shared memory need pre-allocate to account for new processes. For now, we need this to support MPIDI_POSIX_comm_set_vcis.
We need ensure the extra fields, such as MPIDI_OFI_AV(av, all_dest), are initialized to NULL.
Because we insert all remote endpoints to all local endpoints at the same time, thus follow the exact same insertion order, they will share the same av table index except for the local root endpoint because it has inserted other remote root endpoints at init time. The local root to remote non-root endpoints will have a fixed offset from that of local non-root.
test:mpich/ch4/ofi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made some comments. I'm not sure if there are any natural PR split points such as nm, then shm, but it might help if there are.
if (!comm->vcis_enabled) { | ||
/* address exchange may not have performed on this comm */ | ||
nic_idx = 0; | ||
} else if (MPIDI_OFI_COMM(comm).pref_nic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q for clarification: can a process still use multiple nics when a comm is using a single vci?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we call MPIDI_Comm_set_vcis
in MPID_Comm_commit_post_hook
for MPI_COMM_WORLD
, so it (multiple vcis and multiple nics) is always enabled for comm world. The future plan is to make comm_set_vcis
optional -- so users need explicitly set vcis in order to activate multiple nics -- calling com_set_vcis
with a single vci is allowed, so multiple nics can be used with a single vci.
The design is to add some flexibility so that user or in the future we can enable vcis (and multiple nics) in a smaller comm than comm world and also when user need it.
@@ -32,4 +28,13 @@ int MPIDU_Init_shm_alloc(size_t len, void **ptr); | |||
int MPIDU_Init_shm_free(void *ptr); | |||
int MPIDU_Init_shm_is_symm(void *ptr); | |||
|
|||
/* support routines for MPIDU_Init_shm_root_alloc */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be MPIDU_Init_shm_comm_alloc
?
int MPIDU_Init_shm_atomic_count(void) | ||
{ | ||
return MPL_atomic_load_int(&barrier->alloc_count); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not grokking the naming for this function. This is returning a boolean value of whether or not an allocation exists? Maybe MPIDU_Init_shm_allocated
or MPIDI_Init_shm_initialized
? Or could the value of alloc_count be more than 1 in the future?
@@ -101,7 +101,7 @@ int MPIDI_POSIX_iqueue_init(int rank, int size) | |||
int nprocs = MPIR_Process.local_size; | |||
|
|||
int pool_size = MPIDU_genq_shmem_pool_size(cell_size, num_cells, nprocs, num_free_queue); | |||
int terminal_size = num_proc * sizeof(MPIDU_genq_shmem_queue_u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a leftover from some other commit, not related to [9577a84]
if (comm->node_comm) { | ||
mpi_errno = MPIR_Barrier_impl(comm->node_comm, MPIR_ERR_NONE); | ||
MPIR_ERR_CHECK(mpi_errno); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to change from MPIDU_Init_shm_barrier
here? For the single process case maybe?
#define MPIDI_OFI_AV_ADDR(av, local_vci, local_nic, vci, nic) \ | ||
((local_vci==0 && local_nic==0) ? \ | ||
((vci == 0 && nic == 0) ? MPIDI_OFI_AV_ADDR_ROOT(av) : MPIDI_OFI_AV_ADDR_OFFSET(av, vci, nic)) : \ | ||
MPIDI_OFI_AV_ADDR_NO_OFFSET(av, vci, nic)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[f990be2] needs a commit message
Let me try... |
Pull Request Description
There are two goals of this PR:
Separate setup of multiple vci/nic from the regular initialization. When multiple vci/nic are not enabled by the environment CVARs, the multi-vci/nic components should be skipped completely. This means, we should not incur any cost of multiple vci/nic unless it is opted in by user/application.
Allow per-comm setup of multiple vci/nic. This is necessary to support true session model where
comm_world
may never get created nor initialized.[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.