group: refactor MPIR_Group #7235

Open
wants to merge 10 commits into
base: main

hzhou (Contributor) commented Dec 10, 2024

Pull Request Description

Hide the internal fields of MPIR_Group from unnecessary access.

Outside grouputil.c and group_impl.c, code only needs to assume the
MPIR_Lpid integer type, the creation routines based on an lpid map or an
lpid stride description, and the access routine that looks up an lpid
from a group rank.

Add feature to use stride to describe group composition

Remove the linked list design in MPIR_Group_pmap_t
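
To make the map-versus-stride idea concrete, here is a minimal sketch of a process map that is either an explicit table or a stride description. The field names `use_map` and `u.map` mirror the snippet quoted in the review for this PR; everything else (the `toy_` names, the `offset`/`stride` fields, the `size` field) is an assumption made for illustration and is not taken from the MPICH source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t toy_lpid_t;     /* stand-in for MPIR_Lpid (assumed signed 64-bit) */

/* Hypothetical process map: an explicit table or a stride description. */
struct toy_pmap {
    int size;                   /* number of ranks in the group */
    bool use_map;               /* true: use u.map, false: use u.stride */
    union {
        toy_lpid_t *map;        /* explicit table: map[rank] == lpid */
        struct {
            toy_lpid_t offset;  /* lpid of rank 0 */
            toy_lpid_t stride;  /* lpid step between consecutive ranks */
        } stride;
    } u;
};

/* Look up the lpid of a group rank under either representation. */
static toy_lpid_t toy_rank_to_lpid(const struct toy_pmap *pmap, int rank)
{
    assert(rank >= 0 && rank < pmap->size);
    if (pmap->use_map) {
        return pmap->u.map[rank];
    }
    return pmap->u.stride.offset + (toy_lpid_t) rank * pmap->u.stride.stride;
}
```

In this sketch the strided form needs constant memory regardless of group size, while the table form needs one entry per rank; callers see the same lookup routine either way.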

[skip warnings]

Plan

  1. Refactor MPIR_Group so it can be memory-efficient (strided rank map) -- this PR
  2. Always create a communicator from MPIR_Group rather than the other way around
    • This requires lpid to be device independent, and the device layer to perform address exchange upon communicator creation.
    • Initially, we can require the first non-trivial communicator always to be MPI_COMM_WORLD.
  3. Extend MPIR_Group to represent MPIR_Pset

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou force-pushed the 2412_group branch 4 times, most recently from 576d5c7 to 0794b2f (December 11, 2024 14:23)

hzhou (Contributor, Author) commented Dec 11, 2024

test:mpich/ch3/most
test:mpich/ch4/most
All ✔️ other than 2 node crashes

hzhou requested a review from yfguo on December 11, 2024 19:56

hzhou (Contributor, Author) commented Dec 12, 2024

test:mpich/ch3/most

yfguo (Contributor) left a comment

Looking good overall. Needs some attention w.r.t. reusing the map across groups.

src/mpi/group/grouputil.c
newgrp->rank = rank;
MPIR_Group_set_session_ptr(newgrp, session_ptr);
newgrp->pmap.use_map = true;
newgrp->pmap.u.map = map;

yfguo (Contributor):

Is this shallow copy of map safe? Can MPICH code free a map before all groups using it are freed? We probably should refcount the map to safely reuse it across multiple MPIR_Group.

hzhou (Contributor, Author):

The map is owned by the group. The map is never shared. On the other hand, the MPIR_Group can be shared and it is reference counted.
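
The ownership model described in this reply can be sketched as follows: the group object carries the reference count, and the map is freed only when the last reference to the group is dropped. This is a simplified illustration, not the MPICH implementation; all `toy_` names and the struct layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_group {
    int ref_count;      /* number of holders of this group object */
    int size;
    int64_t *map;       /* owned exclusively by this group, never shared */
};

/* Takes ownership of `map` on success; on failure (NULL) the caller keeps it. */
static struct toy_group *toy_group_create(int size, int64_t *map)
{
    struct toy_group *g = malloc(sizeof(*g));
    if (g == NULL) {
        return NULL;
    }
    g->ref_count = 1;
    g->size = size;
    g->map = map;
    return g;
}

/* Sharing a group means taking another reference, not copying the map. */
static void toy_group_add_ref(struct toy_group *g)
{
    g->ref_count++;
}

/* Drops one reference; returns 1 if the group and its map were freed. */
static int toy_group_release(struct toy_group *g)
{
    assert(g->ref_count > 0);
    if (--g->ref_count > 0) {
        return 0;
    }
    free(g->map);
    free(g);
    return 1;
}
```

Under this scheme a second user of the same process set holds a reference to the group rather than a pointer to the map, so the map cannot be freed while any user remains.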

hzhou force-pushed the 2412_group branch 2 times, most recently from 1535d80 to 351cd7f (February 7, 2025 22:44)

hzhou (Contributor, Author) commented Feb 7, 2025

test:mpich/ch3/most
test:mpich/ch4/most ✔️

hzhou (Contributor, Author) commented Feb 10, 2025

The last commit is likely missing some dependencies from the next PRs. Please review without the last commit. I'll drop the commit if I cannot resolve the issue in time.

raffenet (Contributor) left a comment

LGTM up until last commit

This test requires access to MPICH internals, and thus won't be used with
the current design.

We no longer use this file.

Hide the internal fields of MPIR_Group from unnecessary access.

Outside grouputil.c and group_impl.c, code only needs to assume the
MPIR_Lpid integer type, the creation routines based on an lpid map or an
lpid stride description, and the access routine that looks up an lpid
from a group rank.

For most external usages, we only need MPIR_Group_rank_to_lpid.
Avoid accessing group internal fields.

Group similar functions together to facilitate refactoring.
There are no changes in this commit other than moving functions around.

The 4 incl/excl functions are very similar.

The 3 difference/intersection/union functions are very similar.

Use MPIR_Group_{rank_to_lpid,lpid_to_rank} to avoid directly accessing
MPIR_Group internal fields.

For most group creation routines, just populate an lpid lookup map and
call MPIR_Group_create_map to create the group.
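
The "populate a map, then create" pattern might look roughly like the sketch below for an incl-style routine. It only reads the parent group through a rank-to-lpid accessor and hands the finished map to a single creation routine. The `toy_` functions are hypothetical stand-ins; the real signatures of MPIR_Group_rank_to_lpid and MPIR_Group_create_map are not shown in this PR page and are not assumed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t toy_lpid_t;     /* stand-in for MPIR_Lpid */

struct toy_group {
    int size;
    toy_lpid_t *map;            /* owned by the group */
};

/* Accessor: the only way callers read group composition in this sketch. */
static toy_lpid_t toy_rank_to_lpid(const struct toy_group *g, int rank)
{
    assert(rank >= 0 && rank < g->size);
    return g->map[rank];
}

/* Creation routine: takes ownership of `map` on success. */
static struct toy_group *toy_group_create_map(int size, toy_lpid_t *map)
{
    struct toy_group *g = malloc(sizeof(*g));
    if (g == NULL) {
        return NULL;
    }
    g->size = size;
    g->map = map;
    return g;
}

static void toy_group_free(struct toy_group *g)
{
    free(g->map);
    free(g);
}

/* incl-style creation: the new group holds the listed parent ranks, in order. */
static struct toy_group *toy_group_incl(const struct toy_group *parent,
                                        int n, const int *ranks)
{
    toy_lpid_t *map = malloc((size_t) (n > 0 ? n : 1) * sizeof(*map));
    if (map == NULL) {
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        map[i] = toy_rank_to_lpid(parent, ranks[i]);
    }
    struct toy_group *g = toy_group_create_map(n, map);
    if (g == NULL) {
        free(map);      /* creation failed, so ownership stayed with us */
    }
    return g;
}
```

Because every creation routine funnels through one constructor, a change to the internal representation only has to be made in that constructor and the accessor.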
* add option to use stride to describe group composition
* remove the linked list design

The unsigned type uint64_t is dangerous as we perform pmap stride math.
A signed integer should always work and we can use assertions to check its
range.
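
One way unsigned arithmetic can go wrong in stride math is the inverse lookup, which subtracts and divides. The sketch below is an illustration under assumed names and an assumed `offset + rank * stride` layout; it is not necessarily the exact failure seen in this PR. With a descending stride, the signed version recovers the rank, while the unsigned version wraps on subtraction and fails to find it.

```c
#include <assert.h>
#include <stdint.h>

/* Signed inverse lookup: returns the rank of `lpid`, or -1 if absent.
 * Assumes lpids are non-negative and far from the int64_t limits. */
static int lpid_to_rank_signed(int64_t offset, int64_t stride, int size,
                               int64_t lpid)
{
    assert(stride != 0);
    assert(offset >= 0 && lpid >= 0);   /* range check via assertion */
    int64_t diff = lpid - offset;       /* may legitimately be negative */
    if (diff % stride != 0) {
        return -1;
    }
    int64_t rank = diff / stride;
    if (rank < 0 || rank >= size) {
        return -1;
    }
    return (int) rank;
}

/* Same logic with unsigned types: the subtraction wraps modulo 2^64,
 * and a negative stride is stored as a huge positive value. */
static int lpid_to_rank_unsigned(uint64_t offset, uint64_t stride, int size,
                                 uint64_t lpid)
{
    assert(stride != 0);
    uint64_t diff = lpid - offset;      /* wraps when lpid < offset */
    if (diff % stride != 0) {
        return -1;
    }
    uint64_t rank = diff / stride;
    if (rank >= (uint64_t) size) {
        return -1;
    }
    return (int) rank;
}
```

For a group laid out as lpids 3, 2, 1, 0 (offset 3, stride -1), looking up lpid 1 gives rank 2 in the signed version, whereas the unsigned version reports the lpid as not present.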
* Add check_map_is_strided to detect a strided pattern and convert a map
  into a strided pmap.

* Move internal static routines to the bottom of grouputil.c.
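
Detecting a strided pattern in an explicit map can be done in one pass, as in the hedged sketch below. The function name, signature, and conventions (stride 1 for a single-entry map, rejection of a zero stride) are assumptions for illustration and may differ from the actual check_map_is_strided in grouputil.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if map[i] == offset + i * stride for every i.
 * The outputs are written only when the function returns true. */
static bool toy_map_is_strided(const int64_t *map, int size,
                               int64_t *offset_out, int64_t *stride_out)
{
    assert(size >= 0);
    if (size == 0) {
        return false;
    }
    /* A single entry is trivially strided; use stride 1 by convention. */
    int64_t stride = (size > 1) ? map[1] - map[0] : 1;
    if (stride == 0) {
        return false;   /* repeated lpid: the inverse lookup would be ill-defined */
    }
    for (int i = 2; i < size; i++) {
        if (map[i] - map[i - 1] != stride) {
            return false;
        }
    }
    *offset_out = map[0];
    *stride_out = stride;
    return true;
}
```

When the check succeeds, the explicit table can be discarded and replaced by the two integers, which is where the memory saving of the strided form comes from.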

hzhou (Contributor, Author) commented Feb 11, 2025

test:mpich/ch3/most
test:mpich/ch4/most

hzhou (Contributor, Author) commented Feb 11, 2025

Fixed the last commit. The culprit was MPIR_Lpid typed as uint64_t. Changing it to int64_t made the math (the stride pmap) work.

hzhou requested a review from raffenet on February 11, 2025 05:06
Labels
None yet
Projects
None yet
3 participants