Always populate failed procs in comms' groups #13501
The HDF5 failure does not seem to be from me.

Yep, I agree with respect to the HDF5 problem. It is being investigated.
c52849c to f2dbf1e
ompi/communicator/ft/comm_ft.c
    ompi_proc_t __opal_attribute_unused__ *ompi_proc = ompi_group_get_proc_ptr(
        (remote ? comm->c_remote_group : comm->c_local_group), peer_id, true
    );
This makes the whole thing pretty convoluted. Maybe separate the declaration from the assignment?
-   ompi_proc_t __opal_attribute_unused__ *ompi_proc = ompi_group_get_proc_ptr(
-       (remote ? comm->c_remote_group : comm->c_local_group), peer_id, true
-   );
+   ompi_proc_t *ompi_proc __opal_attribute_unused__;
+   ompi_proc = ompi_group_get_proc_ptr((remote ? comm->c_remote_group : comm->c_local_group),
+                                       peer_id, true);
Or just do (void)ompi_proc;
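For reference, the `(void)` cast idiom can be sketched in isolation. This is only an illustration, not code from this PR: `lookup_peer`, `touch_peer`, and `lookups` are hypothetical names. It shows a call that is kept for its side effect while its result is only inspected by `assert()`, which expands to nothing when `NDEBUG` is defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter standing in for the side effect of the lookup. */
static int lookups = 0;

/* Hypothetical lookup: the call matters even if the result is ignored. */
static int *lookup_peer(int *table, int peer_id)
{
    lookups++;
    return &table[peer_id];
}

static void touch_peer(int *table, int peer_id)
{
    /* The call always runs, in debug and release builds alike. */
    int *entry = lookup_peer(table, peer_id);

    /* Counts as a use of the variable, so no unused-variable warning
     * is raised when assert() is compiled out with NDEBUG. */
    (void)entry;

    assert(NULL != entry);
}
```

The cast needs no compiler-specific attribute and keeps the declaration and the assignment on one statement.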
Signed-off-by: Matthew Whitlock <[email protected]>
f2dbf1e to 3571f8c
Good catch, the proc struct must exist in order to mark the proc as failed. While the proposed fix is correct, it has a cost: we will fully create and populate a proc structure when the only thing we will ever use it for is to mark the proc as failed. Unfortunately, I don't see a better solution that would be easy to implement.
If it's a concern, you could consider replacing (or supplementing) the |
This behavior was changed to avoid an unused variable during release compilation, but calling the function itself is important.
Earlier in this file, ompi_comm_is_proc_active assumes that any processes not populated in the group have not failed. This can lead to deadlocks in agree operations, which will mistakenly expect that some ranks are alive.
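The failure mode can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the Open MPI implementation: `toy_proc_t`, `toy_group_t`, and the `toy_*` functions are hypothetical, and the only behavior borrowed from the description above is that an unpopulated entry is treated as alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_GROUP_SIZE 4

/* Hypothetical stand-ins for a proc and a lazily populated group. */
typedef struct { bool active; } toy_proc_t;
typedef struct { toy_proc_t *procs[TOY_GROUP_SIZE]; } toy_group_t; /* NULL = not populated */

/* Return the proc for a rank, creating it only when allocate is true. */
static toy_proc_t *toy_get_proc(toy_group_t *g, int rank, bool allocate)
{
    assert(rank >= 0 && rank < TOY_GROUP_SIZE);
    if (NULL == g->procs[rank] && allocate) {
        g->procs[rank] = malloc(sizeof(toy_proc_t));
        if (NULL != g->procs[rank]) {
            g->procs[rank]->active = true;
        }
    }
    return g->procs[rank];
}

/* Mirrors the assumption described above: unpopulated means "not failed". */
static bool toy_is_proc_active(toy_group_t *g, int rank)
{
    toy_proc_t *p = toy_get_proc(g, rank, false);
    return (NULL == p) ? true : p->active;
}

/* Without allocate, a failure on an unpopulated rank is silently lost. */
static void toy_mark_failed(toy_group_t *g, int rank, bool allocate)
{
    toy_proc_t *p = toy_get_proc(g, rank, allocate);
    if (NULL != p) {
        p->active = false;
    }
}
```

In this model, marking a rank as failed without populating it leaves `toy_is_proc_active` reporting the rank as alive, which is the kind of stale view that could make an agreement wait on a dead peer.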