Closed
166 commits
4dde756
initial
May 6, 2025
88fc124
configure
May 6, 2025
3d33234
changes
May 6, 2025
a940429
update
May 6, 2025
33891fa
update
May 6, 2025
23e26c1
add simple testing
May 6, 2025
48fdc91
add updates
May 8, 2025
f0a90ca
fix power related events
May 9, 2025
2a3cfff
updating tests
May 9, 2025
b876053
updating tests
May 9, 2025
c643253
supporting more events
May 12, 2025
78c8857
esmi funcitons
May 12, 2025
896b962
update tests
May 12, 2025
9ce0ece
disable virtualization function
May 12, 2025
7f82e53
update tests
May 12, 2025
ecc1027
esmi error & bound check for functions
May 13, 2025
45da117
test
Jul 13, 2025
9a55093
test2
Jul 13, 2025
9fd3545
test3
Jul 13, 2025
74520f1
test4
Jul 14, 2025
475a8a7
test5
Jul 14, 2025
69db5f7
test6
Jul 14, 2025
ca36d93
test7
Jul 14, 2025
749426e
test8
Jul 14, 2025
f16c31f
test8
Jul 14, 2025
f5ad454
test8
Jul 14, 2025
41c203a
test9
Jul 14, 2025
40f6ab8
test
Jul 14, 2025
de6718f
test
Jul 14, 2025
1b49e90
update
Jul 14, 2025
643f8c1
update
Jul 14, 2025
a6f417c
change rules and define variable
Aug 13, 2025
f73229d
fix # define
Aug 13, 2025
417be18
add gitignore
Aug 13, 2025
a277a52
remove git ignore files
Aug 13, 2025
743cbf9
modify tests
Aug 18, 2025
275248c
fix enviormental variables
Aug 19, 2025
0284375
use papi_strdup
Aug 19, 2025
4cf48bc
adding more functions
Aug 26, 2025
659d720
events more
Aug 26, 2025
855177f
take bios hash
Aug 26, 2025
9190986
more events
Aug 26, 2025
755b115
Fix internal AMD SMI refs and include paths
djwoun Aug 27, 2025
7d79984
Fix spacing in AMD SMI build rules
djwoun Aug 27, 2025
43ceebf
Use tabs for AMD SMI build recipes
djwoun Aug 27, 2025
26d26d1
amdsmi.h
Aug 27, 2025
41c1806
refactor
djwoun Aug 27, 2025
37ce455
amd_smi: add virtualization and isolation metrics
djwoun Aug 28, 2025
2fe3bf1
Add AMD SMI event, RAS, and power telemetry
djwoun Aug 28, 2025
0454b40
Expose additional AMD SMI GPU telemetry
djwoun Aug 28, 2025
dc331e3
Add RAS block and VRAM usage metrics
djwoun Aug 28, 2025
15ad8f9
Probe all GPU power sensors and expose detailed metrics
djwoun Aug 28, 2025
702d122
Expose XGMI PLPD policy count
djwoun Aug 28, 2025
dd7fbd3
Limit power sensor probing to two indices
djwoun Aug 28, 2025
22ef66c
Expose retired GPU page record details
djwoun Aug 28, 2025
41e4e95
Expose default and DPM power cap metrics
djwoun Aug 28, 2025
7d1265e
Support additional AMD SMI clock domains
djwoun Aug 28, 2025
05cd5fe
Expose throttle status and PCIe metrics via GPU metrics table
djwoun Aug 28, 2025
d3481fa
Expose GPU firmware versions via amdsmi_get_fw_info
djwoun Aug 28, 2025
a89734d
Expose GPU UUID length via new event
djwoun Aug 28, 2025
87ab705
Expand link metrics events
djwoun Aug 28, 2025
4b28ad1
Expose GPU process list details
djwoun Aug 28, 2025
972d832
Expose per-block GPU ECC error counts
djwoun Aug 28, 2025
e95a9d3
Add per-block ECC status events
djwoun Aug 28, 2025
1c42529
Expose energy counter resolution and timestamp
djwoun Aug 28, 2025
1ce97c2
Expose overdrive voltage curve ranges
djwoun Aug 28, 2025
c23569e
Expose overdrive voltage info metrics
djwoun Aug 28, 2025
3bc258a
Enumerate all GPU voltage sensors and metrics
djwoun Aug 28, 2025
ad02d4a
Expose PCIe information metrics
djwoun Aug 28, 2025
e51c700
Expose SoC P-state policy count
djwoun Aug 28, 2025
e91d702
Format AMD SMI sources with 140-column limit
djwoun Aug 28, 2025
1d20feb
centralize amdsmi function pointer declarations
djwoun Aug 28, 2025
c8842bb
refactor: reposition AMD SMI table helpers
djwoun Aug 28, 2025
12ff615
Merge pull request #18 from djwoun/codex/review-amdsmi.h-for-unsuppor…
djwoun Aug 28, 2025
8615570
refactor amd_smi event helpers
djwoun Aug 28, 2025
439d534
Limit AMD SMI globals to internal scope
djwoun Aug 28, 2025
c08ced0
Merge pull request #20 from djwoun/codex/review-style-for-abnormaliti…
djwoun Aug 28, 2025
5b0e36a
Fix AMD SMI accessor signatures and link order
djwoun Aug 28, 2025
878a86d
refactor init itable
Aug 28, 2025
f6c9cef
fix-function-pointer-type-mismatches
djwoun Aug 28, 2025
c770de7
Move AMD SMI helper functions earlier
djwoun Aug 28, 2025
c53fa92
fix-compiler-warnings-during-build
djwoun Aug 28, 2025
c1a0bd7
update amdsmi.h reference
Aug 29, 2025
b5c70b2
Add AMD SMI version gating
djwoun Aug 31, 2025
4b0f0ac
Gate additional AMD SMI features by version
djwoun Aug 31, 2025
4995820
add-version-check-to-functions
djwoun Aug 31, 2025
3ad83e0
Add AMD SMI XGMI info, power management, and RAS validation events
djwoun Sep 3, 2025
5a94204
Declare missing accessors and fix partition config
djwoun Sep 3, 2025
dabd89a
add-support-for-amdsmi_get-functions-in-papi-6v2zxo
djwoun Sep 3, 2025
b61b4d1
test update
Sep 3, 2025
eaee92d
tests amd_smi
Sep 8, 2025
5a0322b
remove executable
Sep 8, 2025
6ae86d4
harden AMD SMI memory partition queries
djwoun Sep 8, 2025
fc7310e
Avoid probing AMD SMI partition getters at init
djwoun Sep 8, 2025
159d76d
Gracefully handle AMD SMI read errors
djwoun Sep 8, 2025
378549c
Propagate AMD SMI read errors
djwoun Sep 8, 2025
b4875f9
Handle TESTS_QUIET token and fix AMD SMI read
djwoun Sep 8, 2025
84fc315
troubleshoot-segmentation-fault-during-tests
djwoun Sep 8, 2025
4204a2a
Probe memory partition events at init
djwoun Sep 9, 2025
591dd94
Guard newer AMD SMI APIs with version checks
djwoun Sep 9, 2025
16e1db7
Ignore unexpected TESTS_QUIET tokens
djwoun Sep 9, 2025
ef313c6
Sanitize TESTS_QUIET safely and drop runtest executable bit
djwoun Sep 9, 2025
07a1e9c
investigate-tests_quiet-impact-on-tests
djwoun Sep 9, 2025
586209b
probing-memory-partition-during-event-addition
djwoun Sep 9, 2025
1e15247
extend test cycle
Sep 9, 2025
c4f1b07
use common event
Sep 9, 2025
1b28071
initial changes to some bugs
Sep 10, 2025
21874e7
amd smi root path
Sep 10, 2025
e1116d6
change update_native_events
Sep 10, 2025
89fc8ed
duplicate loop:
Sep 10, 2025
e31a1ef
context control stop clean stop and shutdown
Sep 10, 2025
3eccf70
Null terminate
Sep 11, 2025
c9a3910
Limit htable fix to AMD SMI and drop stderr silencing
djwoun Sep 11, 2025
f0499db
style: drop braces on single-line if statements
djwoun Sep 11, 2025
037c396
fix-pointer-check-bug-in-htable_init
djwoun Sep 11, 2025
ae78805
Use global lock for AMD SMI device mask
djwoun Sep 11, 2025
98a7f90
Check device handles in AMD SMI CPU metrics
djwoun Sep 11, 2025
68c7ed4
ensure-input-validation-consistency-in-amds_accessors.c
djwoun Sep 11, 2025
0c39fef
compare-thread-safety-in-amd_smi-and-rocm_smi-qq05t9
djwoun Sep 11, 2025
7354fc8
memory amd smi gemm
Sep 14, 2025
db809af
docs: clarify stderr redirection comment
djwoun Sep 14, 2025
12acdbc
reverse-stderr-suppression-change
djwoun Sep 14, 2025
4dd472e
Align amd_smi runtest script with PAPI test verbosity
djwoun Sep 14, 2025
08e6b9b
Simplify AMD SMI runtest helper
djwoun Sep 14, 2025
c47ecdc
Remove executable bit from AMD SMI runtest helper
djwoun Sep 14, 2025
6017d2e
integrate-verbose-output-in-amd_smi-tests
djwoun Sep 14, 2025
9bd462b
Free AMD SMI allocated buffers with system free
djwoun Sep 14, 2025
5989acd
avoid-strict-aliasing-in-amd_smi-component
djwoun Sep 14, 2025
893792e
context running
Sep 14, 2025
aa92d77
init uniit values in struct
Sep 14, 2025
83b6b30
Zero output structs before AMD SMI queries
djwoun Sep 14, 2025
c8e0b2b
zero-output-structs-before-get_-api-calls
djwoun Sep 14, 2025
c545e34
take out p2p event
Sep 14, 2025
6616f83
zero init
Sep 14, 2025
c0560b8
init
Sep 14, 2025
af1c301
read all
Sep 14, 2025
d234eac
access_amdsmi_vram_usage
Sep 14, 2025
ef10f35
return first error
Sep 15, 2025
8499179
fix leaks htable_insert create_table(2), bitmasking and minor changes
Sep 15, 2025
03f9c52
format refactor
Sep 15, 2025
6b370f7
fail on memeory deletion
Sep 15, 2025
501d2ff
fail on rehash fail
Sep 15, 2025
0707eb1
check state during read
Sep 15, 2025
d752446
papi_errno amds ctx
Sep 15, 2025
26b4740
typo
Sep 15, 2025
7adbcf4
author
Sep 16, 2025
6742bed
readme
Sep 16, 2025
b4fdd2d
readme
Sep 16, 2025
d643958
readme
Sep 16, 2025
4d7e8fd
remove git ignore
Sep 16, 2025
8ddd646
update comments on test
Sep 16, 2025
edeb962
Use HIPFLAGS for AMD SMI Makefile
djwoun Sep 16, 2025
9e23e94
Fix AMD SMI C harness feature usage
djwoun Sep 16, 2025
fab23bb
minor makefile change
Sep 16, 2025
bc3fe4a
makefile
Sep 16, 2025
e1eab8c
adjust read me
Sep 16, 2025
eaf78e7
read me edit
Sep 16, 2025
5fb2ef0
comments
Sep 16, 2025
31d012f
Standardize AMD SMI return error variables
djwoun Sep 16, 2025
d537317
updated comments
Sep 16, 2025
04b22aa
fix-return-variable-names-in-amd-smi
djwoun Sep 16, 2025
61718af
read me update
Sep 16, 2025
5ec461b
Normalize AMD SMI return code variable names
djwoun Sep 16, 2025
0a7b7db
fix-return-checks-for-papi_errno
djwoun Sep 16, 2025
ff032fa
device mask int size
Sep 16, 2025
e0164da
gilgamesh 0 1
Sep 17, 2025
94 changes: 94 additions & 0 deletions src/components/amd_smi/README.md
@@ -0,0 +1,94 @@
# AMD_SMI Component

The **AMD_SMI** (AMD System Management Interface) component exposes hardware
management counters (and selected controls) for AMD GPUs — e.g., power usage,
temperatures, clocks, PCIe link metrics, VRAM information, and RAS/ECC status —
by querying the AMD SMI library at runtime (ROCm ≥ 6.3.4).

- [Environment Variables](#environment-variables)
- [Enabling the AMD_SMI Component](#enabling-the-amd_smi-component)
- [File-by-file Summary](#file-by-file-summary)

---

## Environment Variables

For AMD_SMI, PAPI requires the environment variable `PAPI_AMDSMI_ROOT` to be set
so that the AMD SMI shared library and headers can be found. This variable is
required at both **compile** and **run** time.

AMD SMI is available on ROCm 6.0 and newer, so there is a single case to
consider: set `PAPI_AMDSMI_ROOT` to the top-level ROCm directory. For example:

```bash
export PAPI_AMDSMI_ROOT=/opt/rocm-6.4.0
# or
export PAPI_AMDSMI_ROOT=/opt/rocm
```

The directory specified by `PAPI_AMDSMI_ROOT` **must contain** the following
subdirectories:

- `$PAPI_AMDSMI_ROOT/lib` (which should include the dynamic library `libamd_smi.so`)
- `$PAPI_AMDSMI_ROOT/include/amd_smi` (AMD SMI headers)

If the library is not found or is not functional at runtime, the component will
appear as "disabled" in `papi_component_avail`, with a message describing the
problem (e.g., library not found).
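
The comments in `Rules.amd_smi` describe a three-step lookup for the library: an explicit override, then the normal system search paths, then the ROCm tree under `PAPI_AMDSMI_ROOT`. The following is a minimal sketch of that ordering, not the component's actual code. The function name `amdsmi_lib_candidates` and the constants are hypothetical, and it assumes the override variable is `PAPI_AMD_SMI_LIB` as named in those comments. In a real loader, each candidate would be passed to `dlopen()` in turn.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 2
#define PATH_CAP 4096

/* Hypothetical helper: fill `out` with the library names to try, in order,
 * and return how many were written.
 *   override: value of PAPI_AMD_SMI_LIB (may be NULL or empty)
 *   root:     value of PAPI_AMDSMI_ROOT (may be NULL or empty) */
int amdsmi_lib_candidates(const char *override, const char *root,
                          char out[MAX_CANDIDATES][PATH_CAP])
{
    int n = 0;
    assert(out != NULL);

    /* 1) An explicit override is used alone: no fallback if it fails. */
    if (override != NULL && override[0] != '\0') {
        snprintf(out[n++], PATH_CAP, "%s", override);
        return n;
    }

    /* 2) Bare library name: the dynamic loader searches LD_LIBRARY_PATH,
     *    the ld.so cache, and the default system directories. */
    snprintf(out[n++], PATH_CAP, "%s", "libamd_smi.so");

    /* 3) Fall back to the ROCm tree named by PAPI_AMDSMI_ROOT. */
    if (root != NULL && root[0] != '\0')
        snprintf(out[n++], PATH_CAP, "%s/lib/libamd_smi.so", root);

    return n;
}
```

A caller would typically read both variables with `getenv()` and stop at the first candidate that loads; if none loads, the component reports itself as disabled.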

---

## Enabling the AMD_SMI Component

To enable reading (and where supported, writing) of AMD_SMI counters, build
PAPI with this component enabled. For example:

```bash
./configure --with-components="amd_smi"
make
```

You can verify availability with the utilities in `papi/src/utils/`:

```bash
papi_component_avail # shows enabled/disabled components
papi_native_avail -i amd_smi # lists native events for this component
```

---

## File-by-file Summary

- **`linux-amd-smi.c`**
Declares the `papi_vector_t` for this component; initializes on first use; hands off work to `amds_*` for device/event management; implements PAPI hooks (`init_component`, `update_control_state`, `start`, `read`, `stop`, `reset`, `shutdown`, and native-event queries).

- **`amds.c`**
Dynamically loads `libamd_smi.so`, resolves AMD SMI symbols, discovers sockets/devices, and **builds the native event table**. Defines helpers to add simple and counter-based events. Manages global teardown (destroy event table, close library).

- **`amds_accessors.c`**
Implements the **accessors** that read/write individual metrics (e.g., temperatures, fans, PCIe, energy, power caps, RAS/ECC, clocks, VRAM, link topology, XGMI/PCIe metrics, firmware/board info, etc.). Each accessor maps an event’s `(variant, subvariant)` to the right SMI call and returns the value.

- **`amds_ctx.c`**
  Provides the **per-eventset context**:
  - `amds_ctx_open/close`: acquire/release devices, run per-event open/close hooks.
  - `amds_ctx_start/stop`: start/stop counters where needed.
  - `amds_ctx_read/write/reset`: read current values, optionally write supported controls (e.g., power cap), and zero the software view.

- **`amds_evtapi.c`**
Implements native-event enumeration for PAPI (`enum`, `code_to_name`, `name_to_code`, `code_to_descr`) using the in-memory event table and a small hash map for fast lookups.

- **`amds_priv.h`**
Internal definitions: `native_event_t` (name/descr/device/mode/value + open/close/start/stop/access callbacks), global getters, and the AMD SMI function-pointer declarations (via `amds_funcs.h`).

- **`amds_funcs.h`**
Centralized macro list of **AMD SMI APIs** used by the component; generates function-pointer declarations/definitions so `amds.c` can `dlsym()` them at runtime. Conditional entries handle newer SMI features.

- **`htable.h`**
Minimal chained hash table for **name→event** mapping; used by `amds_evtapi.c` to resolve native event names quickly.

- **`amds.h`**
Public, component-internal API across files: init/shutdown, native-event queries, context ops, and error-string retrieval.

- **`Rules.amd_smi`**
Build integration for PAPI’s make system; compiles this component and sets include/library paths for AMD SMI.
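
The macro-list technique mentioned for `amds_funcs.h` can be sketched as follows. This is an illustration under stated assumptions, not the component's code: the `demo_smi_*` names and signatures are hypothetical (they are not AMD SMI APIs), and a stub resolver stands in for `dlsym()` on the loaded library handle so the example runs without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical API list: X(return type, name, parameter list). */
#define DEMO_SMI_FUNCS(X)                           \
    X(int, demo_smi_init, (unsigned flags))         \
    X(int, demo_smi_shut_down, (void))              \
    X(int, demo_smi_get_temp, (int dev, long *out))

/* Expansion 1: declare one function pointer per entry, named <name>_p. */
#define DECLARE_PTR(ret, name, params) static ret (*name##_p) params;
DEMO_SMI_FUNCS(DECLARE_PTR)
#undef DECLARE_PTR

/* A resolver maps a symbol name to a function address (NULL if absent). */
typedef void (*generic_fn)(void);
typedef generic_fn (*resolver_fn)(const char *name);

/* Expansion 2: resolve every entry; return how many symbols are missing. */
int resolve_all(resolver_fn resolve)
{
    int missing = 0;
    assert(resolve != NULL);
#define RESOLVE(ret, name, params)               \
    name##_p = (ret (*) params) resolve(#name);  \
    if (name##_p == NULL)                        \
        missing++;
    DEMO_SMI_FUNCS(RESOLVE)
#undef RESOLVE
    return missing;
}

/* Stub "library" for the demo; demo_smi_shut_down is deliberately absent. */
int stub_init(unsigned flags) { (void) flags; return 0; }
int stub_get_temp(int dev, long *out) { (void) dev; *out = 42; return 0; }

generic_fn stub_resolver(const char *name)
{
    if (strcmp(name, "demo_smi_init") == 0)
        return (generic_fn) stub_init;
    if (strcmp(name, "demo_smi_get_temp") == 0)
        return (generic_fn) stub_get_temp;
    return NULL;
}
```

Calling `resolve_all(stub_resolver)` fills the three pointers and reports one missing symbol. Keeping the list in one macro means adding an API touches a single line, and a missing optional symbol can be detected at load time rather than at the first call.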
111 changes: 111 additions & 0 deletions src/components/amd_smi/Rules.amd_smi
@@ -0,0 +1,111 @@
# Set a default if the root environment variable is not already set.
# Note: PAPI_AMDSMI_ROOT should point to the top-level ROCm directory at
# both compile time and run time; see the README file for details.

PAPI_AMDSMI_ROOT ?= /opt/rocm

# There is one library used by the AMD_SMI component: libamd_smi.so
# By default, the software tries to find this in system paths, including
# those listed in the environment variable LD_LIBRARY_PATH. If not found
# there, it looks in $(PAPI_AMDSMI_ROOT)/lib/libamd_smi.so

# However, this can be overridden by exporting PAPI_AMD_SMI_LIB as
# something else. It must still be a full path and library name.
# If it is exported, it must work or the component will be disabled. e.g.
# export PAPI_AMD_SMI_LIB=$(PAPI_AMDSMI_ROOT)/lib/libamd_smi.so
# This allows users to work around non-standard ROCm installs or to select
# a specific version of the libamd_smi.so library.

# PAPI_AMDSMI_ROOT is used at both compile time and run time.

# There are many ways to cause this path to be known. Spack is a package
# manager used on supercomputers, Linux and MacOS. If Spack is aware of ROCM,
# it encodes the paths to the necessary libraries.

# The environment variable LD_LIBRARY_PATH encodes a list of paths to
# search for libraries; separated by a colon (:). New paths can be
# added to LD_LIBRARY_PATH.
#
# Warning: LD_LIBRARY_PATH often contains directories that apply to other
# installed packages you may be using. Always extend the existing value of
# LD_LIBRARY_PATH rather than overwriting it; for example:

# >export LD_LIBRARY_PATH=someNewLibraryDirectory:$LD_LIBRARY_PATH
# prepends the new directory, so it is searched before the existing ones.
# Alternatively, you can append it:
# >export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:someNewLibraryDirectory
# which searches the existing directories first, then your new directory.

# You can check on the value of LD_LIBRARY_PATH with
# >echo $LD_LIBRARY_PATH

# There may be other package managers or utilities, for example on a system
# with modules; the command 'module load rocm' may modify LD_LIBRARY_PATH.

# A Linux system will also search for libraries by default in the directories
# listed by /etc/ld.so.conf, and /usr/lib64, /lib64, /usr/lib, /lib.

# Note: If you change the exports, PAPI should be rebuilt from scratch; see
# note below.

# Note: AMD_SMI is typically provided with the ROCM libraries, but in PAPI
# ROCM and AMD_SMI are treated as separate components, and must be given
# separately on the configure option --with-components. e.g.

# From within the papi/src/ directory:
# make clobber
# ./configure --with-components="amd_smi"
# make

# An alternative, for both rocm and amd_smi components:
# ./configure --with-components="rocm amd_smi"

# OPERATION, per library:
# 1) If an override is not empty, we will use it explicitly and fail if it
# does not work. This means disabling the component; a reason for disabling
# is shown using the papi utility, papi/src/utils/papi_component_avail

# 2) We will attempt to open the library using the normal system library search
# paths; if Spack is present and configured correctly it should deliver the
# proper library. A failure here will be silent; we will proceed to (3).

# 3) If that fails, we will try to find the library in the standard installed
# locations listed above. If this fails, we disable the component, the reason
# for disabling is shown using the papi utility,
# papi/src/utils/papi_component_avail.

COMPSRCS += components/amd_smi/amds.c \
components/amd_smi/linux-amd-smi.c \
components/amd_smi/amds_accessors.c \
components/amd_smi/amds_evtapi.c \
components/amd_smi/amds_ctx.c
COMPOBJS += amds.o \
linux-amd-smi.o \
amds_accessors.o \
amds_evtapi.o \
amds_ctx.o

# CFLAGS specifies compile flags; need include files here, and macro defines.
# Where to find amdsmi.h varied in early ROCM releases. If it changes again,
# for backward compatibility add *more* -I paths, do not just replace this one.

CFLAGS += -I$(PAPI_AMDSMI_ROOT)/include/amd_smi
CFLAGS += -I$(PAPI_AMDSMI_ROOT)/include
CFLAGS += -g
LDFLAGS += $(LDL) -g

linux-amd-smi.o: components/amd_smi/linux-amd-smi.c $(HEADERS)
	$(CC) $(LIBCFLAGS) $(OPTFLAGS) -c components/amd_smi/linux-amd-smi.c -o linux-amd-smi.o

amds.o: components/amd_smi/amds.c $(HEADERS)
	$(CC) $(LIBCFLAGS) $(OPTFLAGS) -c components/amd_smi/amds.c -o amds.o

amds_accessors.o: components/amd_smi/amds_accessors.c $(HEADERS)
	$(CC) $(LIBCFLAGS) $(OPTFLAGS) -c components/amd_smi/amds_accessors.c -o amds_accessors.o

amds_evtapi.o: components/amd_smi/amds_evtapi.c $(HEADERS)
	$(CC) $(LIBCFLAGS) $(OPTFLAGS) -c components/amd_smi/amds_evtapi.c -o amds_evtapi.o

amds_ctx.o: components/amd_smi/amds_ctx.c $(HEADERS)
	$(CC) $(LIBCFLAGS) $(OPTFLAGS) -c components/amd_smi/amds_ctx.c -o amds_ctx.o