[DO NOT MERGE] Initial NCCL Allreduce Backend Prototype #7298
base: main
Conversation
Force-pushed from 4de713b to debe5b4
Looks interesting! I made some comments. It would be great to also see the integration point of the CCL collectives into the allreduce path for evaluation.
src/include/mpir_cclcomm.h
Outdated
```c
#include <nccl.h>

typedef struct MPIR_CCLcomm {
    MPIR_OBJECT_HEADER;
    MPIR_Comm *comm;
    ncclUniqueId id;
    ncclComm_t ncclcomm;
    cudaStream_t stream;
} MPIR_CCLcomm;
```
We will need the right autoconf/automake defines to only include this stuff when it's available.
I agree. I added temporary macros to show where the ifdefs will go in the new files, but I am not confident about how to modify the autotools files to properly detect CCLs. I would appreciate your help with this!
In configure.ac, somewhere after we configure MPL:

```
if test "$GPU_SUPPORT" = "CUDA" ; then
    PAC_CHECK_HEADER_LIB_OPTIONAL([nccl], [nccl.h], [nccl], [ncclCommInitAll])
    if test "$pac_have_nccl" = "yes" ; then
        AC_DEFINE([HAVE_NCCL], 1, [define if have libnccl])
    fi
fi
```
I think another question down the road is whether there is enough commonality in the CCL implementations that it can be abstracted in
@raffenet Thanks for the comments! I will take a look at the individual feedback and make changes.
For my testing, I added it to
Ah, right 😄. In the case of
src/mpi/ccl/Makefile.mk
Outdated
```make
mpi_core_sources += \
    src/mpi/ccl/cclcomm.c
```
Place it under src/util or src/mpid/common.
What is the logic we should use to decide where to store these new files? Note there are now a few new source files in this directory.
Files under src/util are device-independent, and their target use is in the MPIR layer. Files under src/mpid/common target the device layer; the MPIR layer should not use them. This is not exactly true for sched and thread in src/mpid/common, but I believe those are legacy: they likely started in the device layer and were later refactored to the MPIR layer.
BTW, src/mpi corresponds to MPI's API categories, so anything related to collectives should be under src/mpi/coll. Utilities, on the other hand, go into src/util.
Got it, thanks for the explanation! I moved the ccl directory to src/util.
src/mpi/ccl/cclcomm.c
Outdated
```c
CUDA_ERR_CHECK(ret);
ret = cudaStreamCreate(&(cclcomm->stream));
CUDA_ERR_CHECK(ret);
ret = ncclCommInitRank(&(cclcomm->ncclcomm), comm_size, cclcomm->id, rank);
```
Does NCCL work with duplicate/shared devices? Does it work across multiple nodes?
I know NVIDIA has recently released features that let multiple processes share a GPU, but I think that usage is rare. My expectation is that this feature (and GPU-enabled MPI in general) is one process per GPU.
NCCL does work across multiple nodes. It natively supports InfiniBand and some other transports, and there are plugins for others such as OFI.
I guess the question is whether NCCL's internode support has an advantage. I think it is simpler if we limit MPICH's usage of NCCL to intranode, and rely on or implement a hierarchical algorithm for internode collectives.
I plan to support intranode and internode using CCLs. NCCL has performance advantages for internode, and other CCLs do too.
Lazy initialization should work.
After #7264, you only need to deal with internal datatypes, which can be placed into a switch statement.
Something similar to how we wrap the PMIs.
Force-pushed from debe5b4 to ee76028
Force-pushed from b237081 to 645aef1
I have updated this prototype and am re-requesting reviews. The implementation is now much more fleshed out. I separated the CCL framework and the NCCL calls, and I have integrated them in the allreduce path via a new "ccl" algorithm. The main thing missing at this point is the autoconf/automake detection of the NCCL path. If someone can point me in the right direction on this, I would greatly appreciate it! Regarding:
It is currently implemented as the second option, but long-term I would like to do the first. Let me know what you think!
This PR is a proof-of-concept of how we can use NCCL as a backend of MPI collectives. This PR is missing many necessary features and is not meant to be merged. I am interested in gathering feedback before continuing development. Some open questions/points for discussion I have:

* When/where to properly init/free the `CCLcomm` structure
* How to handle operations and datatypes portably (copy-pasting them into a new switch statement feels frail, and I am not 100% sure I covered all of the relevant datatypes)
* How to design this extensibly so we can add RCCL, oneCCL, etc. (my intuition here is a base abstract class + derived classes, but this is not OOP)
* How to modify `src/mpi/coll/mpir_coll.c` to consider `CCL_Allreduce`
* How to pull in NCCL from the environment and/or a configure argument `--with-nccl=`
* Function/variable/file names and locations

And anything else you can think of. Let me know what you think!
Author Checklist
* Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
* Commits are self-contained and do not do two things at once.
* Commit message is of the form: `module: short description`
* Commit message explains what's in the commit.
* Whitespace checker. Warnings test. Additional tests via comments.
* For non-Argonne authors, check contribution agreement.
* If necessary, request an explicit comment from your company's PR approval manager.