Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch4: separate multi-vci/nic initialization #7255

Draft
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Jan 3, 2025

Pull Request Description

There are two goals of this PR:

  1. Separate setup of multiple vci/nic from the regular initialization. When multiple vci/nic are not enabled by the environment CVARs, the multi-vci/nic components should be skipped completely. This means, we should not incur any cost of multiple vci/nic unless it is opted in by user/application.

  2. Allow per-comm setup of multiple vci/nic. This is necessary to support true session model where comm_world may never get created nor initialized.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@hzhou hzhou force-pushed the 2412_comm_init branch 9 times, most recently from f80dcfc to 0383144 Compare January 9, 2025 03:33
@hzhou hzhou force-pushed the 2412_comm_init branch 3 times, most recently from 93ec02c to a2a8b95 Compare February 13, 2025 16:14
@hzhou hzhou requested a review from raffenet February 13, 2025 19:48
hzhou added 17 commits March 5, 2025 10:18
Replace the rather ambiguous field is_tainted with vcis_enabled, thus
allowing vci activation on a per-comm basis.

It is inherited if the new comm is created from within an parent comm
that has vcis_enabled. If the parent comm vcis_enabled=false, then all
its descendents will have vcis_enabled until they are turned via separate
APIs (to be added in the future).

Intercomm and intercomm_merge may include processes outside originating
comm, thus vcis_enabled=false by default. MPI_Comm_create may create an
intercomm that inherits vcis_enabled=true. This is an exception because
both local processes and remote processes are from within originating
comm that has vcis_enabled.

For now, we switch on vcis_enabled in comm_wrold after post_init. With
future extension, it is possible to allow user explicitly set up
multi-vics on a smaller comm than comm world.
Rather than initialize per-vci mutexes in ch4 and register with request
pools, directly use MPIR-layer request pool mutexes.
Move multiple vci related init/finalize code into ch4_vci.c. Wrap
per-vci code into a function and only deal with vci 0 in ch4_init.c and
additional vcis in ch4_vci.c.
Gather all multiple vci init code in MPIDI_Comm_set_vcis. So -

1. The rest of the init code only deal with root vci.

2. Prepare for future dynamic and per-comm vci.
We may relax the av insertion order which may require the full
(vci_local, nic_local, vci_remote, nic_remote) to look up an actual
destination address.

Add MPIDI_OFI_av_to_phys_root for convenience and quick survey on where
we restrict in only root vci (such as the init and spawn paths).

Remove MPIDI_OFI_comm_to_phys and prefer an explicit
MPIDIU_comm_rank_to_av and then MPIDI_OFI_av_to_phys.

Refactor MPIDI_OFI_SET_AM_HDR_COMMON in ofi_am_impl.h to directly use
dst_addr (as remote_id) rather than to recalculate it.
Let MPIDI_OFI_addr_t only contain field for root vci address, and only
allocate more space for additional addresses when multiple vci and nic is
enabled -- potentially at runtime. This avoids wasting memory for
multiple vcis unelss it is actually needed.
Move code that are related to multiple-vci setup to ofi_vci.c.
Set up and enable multiple vcis in MPIDI_Comm_set_vcis instead of during
post_init.

TODO: separate the refactoring in ofi_vci.c
We exchange non-root endpoints in comm_set_vcis. Because we can't use
multiple nics before the address exchange, we only can activate multiple
nics in comm_set_vcis.
MPIDI_OFI_global.num_nics affects runtime paths such as ofi progress and
large message striping. Only set it in MPIDI_OFI_init_vcis so we won't
have complications when multi-nics is not ready.
This has been superseded by MPIDI_NM_comm_set_vcis.
Consolidate the shmem allocations in iqueue to 2 slabs. One root slab
that is initialized at world_init. The other all_slab for per-vci
transport, initialized at the time of init vcis.

The goal is to eventually allow more flexible shm creation, potentially
allow init within a non-world communicator.
hzhou added 2 commits March 5, 2025 11:44
Transition from world init to per-comm vci init.
MPIDU_shm_seg_t was used by mpidu_shm_alloc.c, mpidu_init_shm.c, and
mpidu_init_shm_alloc.c. However, the usages are all slightly different
and some fields are only used in one but not the other. It is simpler to
locally define it or, in the case of mpidu_init_shm.c, just use static
globals.
hzhou added 5 commits March 5, 2025 15:30
Add routine to support allocating a shared memory by a comm, which allows -

* create shared memory by a smaller comm than a comm_world
* attach the shared memory by later processes
* potentially allowing shm communication with dynamic processes - we
need a way to discover and attach to Init_shm (via intercomm) and the
initial shared memory need pre-allocate to account for new processes.

For now, we need this to support MPIDI_POSIX_comm_set_vcis.
We need ensure the extra fields, such as MPIDI_OFI_AV(av, all_dest), are
initialized to NULL.
Because we insert all remote endpoints to all local endpoints at the
same time, thus follow the exact same insertion order, they will share
the same av table index except for the local root endpoint because it
has inserted other remote root endpoints at init time. The local root to
remote non-root endpoints will have a fixed offset from that of local
non-root.
@hzhou
Copy link
Contributor Author

hzhou commented Mar 6, 2025

test:mpich/ch4/ofi

Copy link
Contributor

@raffenet raffenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some comments. I'm not sure if there are any natural PR split points such as nm, then shm, but it might help if there are.

Comment on lines +583 to +586
if (!comm->vcis_enabled) {
/* address exchange may not have performed on this comm */
nic_idx = 0;
} else if (MPIDI_OFI_COMM(comm).pref_nic) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q for clarification: can a process still use multiple nics when a comm is using a single vci?

Copy link
Contributor Author

@hzhou hzhou Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we call MPIDI_Comm_set_vcis in MPID_Comm_commit_post_hook for MPI_COMM_WORLD, so it (multiple vcis and multiple nics) is always enabled for comm world. The future plan is to make comm_set_vcis optional -- so users need explicitly set vcis in order to activate multiple nics -- calling com_set_vcis with a single vci is allowed, so multiple nics can be used with a single vci.

The design is to add some flexibility so that user or in the future we can enable vcis (and multiple nics) in a smaller comm than comm world and also when user need it.

@@ -32,4 +28,13 @@ int MPIDU_Init_shm_alloc(size_t len, void **ptr);
int MPIDU_Init_shm_free(void *ptr);
int MPIDU_Init_shm_is_symm(void *ptr);

/* support routines for MPIDU_Init_shm_root_alloc */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be MPIDU_Init_shm_comm_alloc?

Comment on lines +300 to +303
int MPIDU_Init_shm_atomic_count(void)
{
return MPL_atomic_load_int(&barrier->alloc_count);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not grokking the naming for this function. This is returning a boolean value of whether or not an allocation exists? Maybe MPIDU_Init_shm_allocated or MPIDI_Init_shm_initialized? Or could the value of alloc_count be more than 1 in the future?

@@ -101,7 +101,7 @@ int MPIDI_POSIX_iqueue_init(int rank, int size)
int nprocs = MPIR_Process.local_size;

int pool_size = MPIDU_genq_shmem_pool_size(cell_size, num_cells, nprocs, num_free_queue);
int terminal_size = num_proc * sizeof(MPIDU_genq_shmem_queue_u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a leftover from some other commit, not related to [9577a84]

Comment on lines +168 to +171
if (comm->node_comm) {
mpi_errno = MPIR_Barrier_impl(comm->node_comm, MPIR_ERR_NONE);
MPIR_ERR_CHECK(mpi_errno);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to change from MPIDU_Init_shm_barrier here? For the single process case maybe?

Comment on lines +54 to +57
#define MPIDI_OFI_AV_ADDR(av, local_vci, local_nic, vci, nic) \
((local_vci==0 && local_nic==0) ? \
((vci == 0 && nic == 0) ? MPIDI_OFI_AV_ADDR_ROOT(av) : MPIDI_OFI_AV_ADDR_OFFSET(av, vci, nic)) : \
MPIDI_OFI_AV_ADDR_NO_OFFSET(av, vci, nic))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[f990be2] needs a commit message

@hzhou
Copy link
Contributor Author

hzhou commented Mar 7, 2025

I've made some comments. I'm not sure if there are any natural PR split points such as nm, then shm, but it might help if there are.

Let me try...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants