Updated the documentation about shared memory. #13218
base: main
Hello! The Git Commit Checker CI bot found a few problems with this PR:

aaf5f27: Updated the documentation about shared memory.

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Changes to be committed:
  modified: docs/launching-apps/localhost.rst
  modified: docs/tuning-apps/networking/shared-memory.rst

Signed-off-by: xbw <[email protected]>
aaf5f27 to 0861121
@bosilca Hello, this is my first PR, and I need approval for workflow execution.
lowest performance of the single-copy mechanisms. However, CMA
is likely the most widely available because it is enabled by
default in several modern Linux distributions.
Question: do we need to add in here something about the smsc/accelerator component that will be available in 6.0?
Might be worth explaining a tad more about the session directory and this shmem backing file. The reason we created a session directory (way back in the beginning of OMPI, predating PMIx by more than 10 years) was to provide a location where we could collect all files OMPI creates, thus providing a simple way to clean them all up upon termination. The daemon just whacks the entire session directory tree after the job completes (normally or abnormally) before it exits.

We changed that for the shmem backing file (to use /dev, if it exists) due to size limitations (IIRC). The problem with moving that file is that we now lose the ability to ensure cleanup. If the process itself abnormally terminates, the daemon that shepherds it has no idea that the shmem backing file was created somewhere outside the session directory, and therefore has no way to clean it up.

This has generated some user complaints, but I'm not sure of the best solution. Perhaps we should use the PMIx_Job_ctrl API to register the backing file for cleanup upon termination when it isn't in the session directory? Or maybe that has already been done?

Regardless, might be worth noting that in the docs somewhere. If we don't register the cleanup location and your app abnormally terminates, then you may need to ensure (somehow) that you go back and remove those shmem backing files. Otherwise, they will just build up over time (as other jobs abnormally die), consuming disk space.
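To illustrate the manual cleanup concern raised above, here is a minimal sketch that lists candidate leftover backing files. It assumes the files live in /dev/shm and follow the sm_segment.* naming given in the PR description; the helper name find_backing_files is hypothetical and not part of Open MPI. The sketch only lists files owned by the given user. It does not delete anything, and it cannot tell whether a running job still uses a file.

```python
import os
from pathlib import Path


def find_backing_files(directory="/dev/shm", uid=None):
    """List sm_segment.* files in `directory` owned by `uid`.

    Hypothetical helper: the directory and the name pattern are
    assumptions taken from the PR description, not an Open MPI API.
    """
    if uid is None:
        uid = os.getuid()  # default to the current user
    root = Path(directory)
    if not root.is_dir():
        return []
    matches = []
    for path in root.glob("sm_segment.*"):
        try:
            if path.is_file() and path.stat().st_uid == uid:
                matches.append(path)
        except OSError:
            # The file vanished or is unreadable; skip it.
            continue
    return sorted(matches)


if __name__ == "__main__":
    # Print candidates only; deciding what is safe to remove is left
    # to the user, since a live job may still map these files.
    for path in find_backing_files():
        print(path)
```

Running the script after an abnormally terminated job prints the candidate files, which can then be inspected before removing anything by hand.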
We can certainly do work to improve this documentation, but as a first contribution this is already awesome as it is.
@@ -71,21 +69,107 @@ performance for memory.
  to resource congestion, but you can increase this parameter to
  pre-reserve space for more fragments.

* ``btl_sm_backing_directory``: Directory to place backing files for
  shared memory communication. This directory should be on a local
  filesystem such as /tmp or /dev/shm (default: (linux) /dev/shm,
- filesystem such as /tmp or /dev/shm (default: (linux) /dev/shm,
+ filesystem such as ``/tmp`` or ``/dev/shm`` (default: (linux) ``/dev/shm``,
To place the session directory in a non-default location, use the MCA parameter
``orte_tmpdir_base``.
.. note:: The session directory is defined in PMIx. You can
   use ``--pmixmca orte_tmpdir_base "/path/to/somewhere"`` to place the session
Is this the right MCA param name? I didn't think we had any orte_ primary names any more -- is that an alias?
This PR fixes errors regarding the shared memory file path in the document, and also updates two other parts of the shared memory documentation:

docs/launching-apps/localhost.rst to docs/tuning-apps/networking/shared-memory.rst.
Shared Memory section.

Additional Notes on Modification 1:

1. The default output path is /dev/shm with filename sm_segment.nodename.user_id.job_id.my_node_rank, not the session directory.

2. About the session directory:
The "session directory" is a concept from OpenPMIx. It can be set manually via --pmixmca, or automatically by mpirun/prun or Open MPI (in singleton mode). While not central to this doc section, it's relevant because fallback paths may use the session dir. So, the MCA param is mentioned for completeness.

3. About single-copy methods and the backing file:
After enabling single-copy methods, it might seem that a shared memory mapping file is no longer needed.
However, from reading the source code, I confirmed that it is still required.
This information is relevant for performance tuning, so I added it to the documentation.
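
As an illustration of the backing file name given in note 1 (sm_segment.nodename.user_id.job_id.my_node_rank), the following minimal sketch splits such a name into its fields. The names BackingFileName and parse_backing_file_name are hypothetical, not part of Open MPI. The sketch assumes that only the node name may contain dots, so it splits from the right; if the job ID can also contain dots, this parsing would be wrong.

```python
from typing import NamedTuple

PREFIX = "sm_segment."


class BackingFileName(NamedTuple):
    """Fields of a backing file name (hypothetical container)."""
    nodename: str
    user_id: str
    job_id: str
    node_rank: str


def parse_backing_file_name(name):
    """Split sm_segment.<nodename>.<user_id>.<job_id>.<node_rank>.

    Assumption: the last three fields contain no dots, so a host name
    such as node1.cluster.local stays intact when splitting from the
    right.
    """
    if not name.startswith(PREFIX):
        raise ValueError(f"not a backing file name: {name!r}")
    parts = name[len(PREFIX):].rsplit(".", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"unexpected field count in: {name!r}")
    return BackingFileName(*parts)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(parse_backing_file_name("sm_segment.node1.cluster.local.1000.12345.0"))
```

Fields are kept as strings because the source does not say which of them are guaranteed to be numeric.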