resolve contentious AMD components #548

Open
dbarry9 wants to merge 2 commits into icl-utk-edu:master from dbarry9:2026.01.29_resolve-rocm-and-rocp_sdk

dbarry9 (Contributor) commented Jan 30, 2026

Pull Request Description

  • framework: PAPI_DISABLE_COMPONENTS env var
  • amd comps: contentious components in same config

This pull request resolves issues #416 and #478.

ROCm 6.3.2 was chosen as the cutoff: for ROCm versions >= 6.3.2, rocp_sdk is active by default instead of rocm, because earlier releases of the ROCProfiler SDK have known bugs.
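The cutoff amounts to a version comparison. The following is a minimal C sketch of that comparison, not the check that PAPI's configure actually performs. The function names `rocm_version_at_least` and `defaults_to_rocp_sdk` are hypothetical, and the version string is assumed to have the form major.minor.patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if `version` ("major.minor.patch") is at
 * least major.minor.patch. Missing trailing fields count as 0. */
static bool rocm_version_at_least(const char *version,
                                  int major, int minor, int patch)
{
    assert(version != NULL);

    int v[3] = {0, 0, 0};
    if (sscanf(version, "%d.%d.%d", &v[0], &v[1], &v[2]) < 1)
        return false;               /* not a parsable version string */

    if (v[0] != major)
        return v[0] > major;
    if (v[1] != minor)
        return v[1] > minor;
    return v[2] >= patch;
}

/* Hypothetical: applies the 6.3.2 cutoff stated in the PR description. */
static bool defaults_to_rocp_sdk(const char *rocm_version)
{
    return rocm_version_at_least(rocm_version, 6, 3, 2);
}
```

Under these assumptions, `defaults_to_rocp_sdk("7.0.2")` is true and `defaults_to_rocp_sdk("6.3.1")` is false.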

These changes have been tested using ROCm 7.0.2 on the Frontier supercomputer, which uses AMD MI250X GPUs.

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc.
  • Commits
    Commits are self-contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

dbarry9 added the component-rocm, component-rocm_smi, component-rocp_sdk, and component-amd_smi labels Jan 30, 2026
dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch 2 times, most recently from 2e74c2c to e3d7a5d January 30, 2026 17:52
dbarry9 changed the title from "2026.01.29 resolve rocm and rocp sdk" to "2026.01.29 resolve contentious AMD components" Jan 31, 2026

Treece-Burgess (Contributor) commented

I am reviewing this PR.

dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch 5 times, most recently from de23255 to 7e93f50 February 3, 2026 21:27

sbkathpe (Contributor) left a comment

Yes, I believe this follows the spirit and intent of this env variable.

I would call free(penv), though, since this allocated space is no longer needed.
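A minimal sketch of the pattern this comment refers to, assuming `penv` is a heap-allocated copy of the PAPI_DISABLE_COMPONENTS value that is tokenized once and then no longer needed. The helper name `count_listed_components` is hypothetical, and this is not the code from the PR.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the names in a comma-separated list such
 * as the value of PAPI_DISABLE_COMPONENTS. Returns -1 if out of memory. */
static int count_listed_components(const char *value)
{
    if (value == NULL || value[0] == '\0')
        return 0;

    /* strtok modifies its argument, so tokenize a private copy. */
    size_t len = strlen(value);
    char *penv = malloc(len + 1);
    if (penv == NULL)
        return -1;
    memcpy(penv, value, len + 1);

    int count = 0;
    for (char *tok = strtok(penv, ","); tok != NULL; tok = strtok(NULL, ","))
        count++;    /* real code would disable the component named tok */

    /* The copy is not needed after tokenizing, so release it. */
    free(penv);
    return count;
}
```

Freeing the copy on every exit path after the allocation keeps the helper leak-free even when it is called more than once.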

dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch 4 times, most recently from cf0616f to b56a689 February 4, 2026 16:13

sbkathpe (Contributor) left a comment

Just to make sure, the "PAPI_DISABLE_COMPONENTS" code should NOT be compile-time dependent on "#if defined(DEFAULT_TO_ROCP_SDK)" but be unconditionally supported.
It is hard for me to tell if this condition is part of the final code or not.

dbarry9 (Contributor, Author) commented Feb 4, 2026

> Just to make sure, the "PAPI_DISABLE_COMPONENTS" code should NOT be compile-time dependent on "#if defined(DEFAULT_TO_ROCP_SDK)" but be unconditionally supported. It is hard for me to tell if this condition is part of the final code or not.

The code that parses PAPI_DISABLE_COMPONENTS and disables the components unconditionally has been added to PAPI_library_init in papi.c.

However, there is other code in the rocm, rocp_sdk, rocm_smi, and amd_smi components that is guarded by #if defined directives such as #if defined(DEFAULT_TO_ROCP_SDK). This code is compile-time dependent because of the following conflicts:

  • rocm conflicts with rocp_sdk (if both active)
  • rocm_smi conflicts with amd_smi (if both active)

Example: Suppose the rocm and rocp_sdk components are both configured with a ROCm 7.0.1 module loaded. Because of the ROCm version, PAPI's configure makes rocp_sdk active by default and, at the same time, disables rocm by default through the #if defined(DEFAULT_TO_ROCP_SDK) block inside the rocm component code. If the user really wants the rocm component to be active, they can set PAPI_DISABLE_COMPONENTS=rocp_sdk. For the rocm component to become active once this variable is set, it needs to know that the rocp_sdk component is disabled (PAPI_library_init disables it unconditionally). The rocm component learns this by parsing PAPI_DISABLE_COMPONENTS itself, much as PAPI_library_init does, except that it only checks whether rocp_sdk is listed.

Summary: Setting PAPI_DISABLE_COMPONENTS=x,y,z does disable the components x, y, and z, because PAPI_library_init parses the variable. In addition, certain components parse PAPI_DISABLE_COMPONENTS independently to resolve conflicts between specific components, but the actual disabling of components always takes place unconditionally in PAPI_library_init.
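The component-side check described above can be sketched in C as follows. This is an illustration under stated assumptions, not the code from the PR: `name_in_list` and `rocm_is_active` are hypothetical names, and the compile-time macro DEFAULT_TO_ROCP_SDK is modeled as a runtime flag so that the logic can be exercised directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `name` is a whole entry of the
 * comma-separated `list`. Entries are compared exactly, so surrounding
 * whitespace is not tolerated. */
static bool name_in_list(const char *list, const char *name)
{
    assert(name != NULL);
    if (list == NULL)
        return false;

    size_t n = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);
        if (len == n && strncmp(p, name, n) == 0)
            return true;
        if (end == NULL)
            break;
        p = end + 1;                /* move past the comma */
    }
    return false;
}

/* Hypothetical model of the rocm component's decision.
 * default_to_rocp_sdk stands in for #if defined(DEFAULT_TO_ROCP_SDK);
 * disable_list stands in for getenv("PAPI_DISABLE_COMPONENTS"). */
static bool rocm_is_active(bool default_to_rocp_sdk, const char *disable_list)
{
    if (!default_to_rocp_sdk)
        return true;                /* rocm is the default in this build */
    /* rocp_sdk is the default, so rocm only becomes active when the
     * user has explicitly disabled rocp_sdk. */
    return name_in_list(disable_list, "rocp_sdk");
}
```

Matching whole entries matters in this sketch because the component names overlap: `rocm` is a prefix of `rocm_smi`, so a plain substring search would wrongly treat PAPI_DISABLE_COMPONENTS=rocm_smi as listing rocm.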

dbarry9 changed the title from "2026.01.29 resolve contentious AMD components" to "resolve contentious AMD components" Feb 5, 2026
dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch 7 times, most recently from b3ba6a4 to b6c7d7f February 5, 2026 16:31
dbarry9 requested a review from sbkathpe February 5, 2026 18:28
dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch from b6c7d7f to 8e769c5 February 6, 2026 01:55
dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch 2 times, most recently from af2ee48 to 0e9389f February 10, 2026 22:38
sbkathpe and others added 2 commits February 10, 2026 14:40

framework: PAPI_DISABLE_COMPONENTS env var

This change introduces an environment variable to allow the user to
disable components at runtime.

Example Usage: export PAPI_DISABLE_COMPONENTS=rocm,rocm_smi

These changes have been tested using ROCm 7.0.2 on the Frontier
supercomputer, which contains the AMD MI250X architecture.

Signed-off-by: Daniel Barry <dbarry@vols.utk.edu>

amd comps: contentious components in same config

Allow the user to configure contentious component pairs (e.g., rocm &
rocp_sdk, rocm_smi & amd_smi), but only allow one from each pair to be
active at runtime. The ROCm version determines which components are
active by default. This can be overridden by the PAPI_DISABLE_COMPONENTS
environment variable.

These changes have been tested using ROCm 7.0.2 on the Frontier
supercomputer, which contains the AMD MI250X architecture.

dbarry9 force-pushed the 2026.01.29_resolve-rocm-and-rocp_sdk branch from 0e9389f to fb14952 February 10, 2026 22:40
Labels

  • component-amd_smi: PRs and Issues related to the amd_smi component
  • component-rocm_smi: PRs and Issues related to the rocm_smi component
  • component-rocm: PRs and Issues related to the rocm component
  • component-rocp_sdk: PRs and Issues related to the rocp_sdk component

Development

Successfully merging this pull request may close these issues.

  • Support a PAPI_DISABLE_COMPONENTS environment variable
  • rocp_sdk: provide a way to disable component in rocprofiler_configure

4 participants