[Profiler] Add group_info output #206

phoenixdong · 2024-09-04T09:35:00Z

This PR adds functionality to output group information for large model execution, helping to track and manage task distribution during runtime.

Three JSON files are written:

parallelism_to_groups.json: Defines how tasks are grouped across the parallelism strategies (data, tensor, pipeline, etc.) used for large model execution.
rank_to_parallelism_to_group_id.json: Maps each device rank to its group ID under each parallelism strategy.
rank_to_host_and_device.json: Maps each device rank to its hardware (host IP, device ID, and GPU name).
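
To illustrate how the first two files relate, here is a minimal sketch that inverts a strategy-to-groups mapping into a rank-to-group-ID mapping. The schema is an assumption made for illustration only: the keys `tp` and `dp`, the list-of-rank-lists layout, and the helper name `invert_groups` are hypothetical and are not taken from this PR, so the actual JSON layout may differ.

```python
from collections import defaultdict


def invert_groups(parallelism_to_groups):
    """Map each rank to the index of its group, per parallelism strategy.

    Assumes (hypothetically) that the input looks like
    {"strategy": [[rank, ...], [rank, ...]], ...}.
    """
    rank_to_ids = defaultdict(dict)
    for strategy, groups in parallelism_to_groups.items():
        for group_id, ranks in enumerate(groups):
            for rank in ranks:
                rank_to_ids[rank][strategy] = group_id
    return dict(rank_to_ids)


# Hypothetical 4-device layout: tensor-parallel size 2, data-parallel size 2.
example = {"tp": [[0, 1], [2, 3]], "dp": [[0, 2], [1, 3]]}
mapping = invert_groups(example)
print(mapping[2])  # {'tp': 1, 'dp': 0}
```

Under this assumed layout, rank 2 sits in the second tensor-parallel group and the first data-parallel group, which is the kind of lookup rank_to_parallelism_to_group_id.json is described as providing.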

This PR enables the output of parallel group information for both decoder and encoder modes.

To enable the output of parallel group information during model training, add the following to your training configuration file:

system:
  ...
  analyze:
    analyze_save_dir: group_info_output_path

analyze_save_dir: Specifies the directory where the group information will be saved. Replace group_info_output_path with your desired output path for storing the parallelism group details.
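
Once a run has written its output, the files can be inspected with a few lines of Python. The sketch below assumes only what the description states: that the three JSON files named above land in the `analyze_save_dir` directory. The helper name `load_group_info` is hypothetical, and the code makes no assumption about the structure inside each file.

```python
import json
from pathlib import Path

# File names as listed in the PR description.
GROUP_INFO_FILES = (
    "parallelism_to_groups.json",
    "rank_to_parallelism_to_group_id.json",
    "rank_to_host_and_device.json",
)


def load_group_info(save_dir):
    """Return {file name: parsed JSON} for each group-info file found in save_dir."""
    save_dir = Path(save_dir)
    loaded = {}
    for name in GROUP_INFO_FILES:
        path = save_dir / name
        if path.is_file():  # skip files that were not produced
            with path.open(encoding="utf-8") as f:
                loaded[name] = json.load(f)
    return loaded
```

For example, `load_group_info("group_info_output_path")` would return a dictionary keyed by file name, with missing files simply omitted.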

Add group_info output

e59d2e5

aoyulong requested a review from a team as a code owner February 19, 2025 06:43

aoyulong closed this Feb 28, 2025
