Skip to content

Conversation

@Matthew-Whitlock
Copy link
Contributor

This behavior was changed to avoid an unused variable during release compilation, but calling the function itself is important.

Earlier in this file (here), ompi_comm_is_proc_active assumes that any processes not populated in the group have not failed. This can lead to deadlocks in agree operations, which will mistakenly expect that some ranks are alive.

@Matthew-Whitlock
Copy link
Contributor Author

The HDF5 failure does not seem to be from me

@hppritcha
Copy link
Member

yep i agree wrt hdf5 problem. being investigated.

Comment on lines 797 to 799
ompi_proc_t __opal_attribute_unused__ *ompi_proc = ompi_group_get_proc_ptr(
(remote ? comm->c_remote_group : comm->c_local_group), peer_id, true
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the whole thing pretty convoluted. Maybe separate the declaration from the assignment?

Suggested change
ompi_proc_t __opal_attribute_unused__ *ompi_proc = ompi_group_get_proc_ptr(
(remote ? comm->c_remote_group : comm->c_local_group), peer_id, true
);
ompi_proc_t *ompi_proc __opal_attribute_unused__;
ompi_proc = ompi_group_get_proc_ptr((remote ? comm->c_remote_group : comm->c_local_group),
peer_id, true);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just do (void)ompi_proc;

@bosilca
Copy link
Member

bosilca commented Nov 12, 2025

Good catch, the proc struct must exists in order to mark them as failed. While the proposed fix is correct, it has an impact as we will completely create and populate a proc structure when the only thing we will ever used it for is to mark the proc as failed. Unfortunately, I don't see a better solution that would be easy to implement.

@bosilca bosilca merged commit 21db9e9 into open-mpi:main Nov 12, 2025
16 checks passed
@Matthew-Whitlock
Copy link
Contributor Author

If it's a concern, you could consider replacing (or supplementing) the proc_active field inside ompi_proc_t with a static opal_hash_table_t ompi_proc_failed_hash. You'd decide if a process is failed by checking for its key in the hash table, using a sentinel for the key just like the ompi_proc_hash table does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants