Releases: ROCm/TransferBench
TransferBench v1.37
Changes
- USE_SINGLE_STREAM is now enabled by default (disable with USE_SINGLE_STREAM=0)
Fixes
- Fix unrecognized token error when XCC_PREF_TABLE is unspecified
TransferBench v1.35
Added
- USE_FINE_GRAIN now also applies to the a2a preset
TransferBench v1.34
Added
- GPU_KERNEL now defaults to 3 for gfx942
TransferBench v1.33
Added
- ALWAYS_VALIDATE environment variable, which enables validation after every iteration instead of only once after all iterations have finished
TransferBench v1.32
Modified
- Increased line limit from 2048 to 32768
TransferBench v1.31
Modified
- SHOW_ITERATIONS now shows XCC:CU instead of just the CU ID
- SHOW_ITERATIONS output is now also printed when USE_SINGLE_STREAM=1
TransferBench v1.30
Added
- BLOCK_SIZE added to control threadblock size (must be a multiple of 64, up to 512)
- BLOCK_ORDER added to control how work is ordered for GFX executors running with USE_SINGLE_STREAM=1
  - 0 - Threadblocks for Transfers are ordered sequentially (default)
  - 1 - Threadblocks for Transfers are interleaved
  - 2 - Threadblocks for Transfers are ordered randomly
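The BLOCK_SIZE constraint and the three BLOCK_ORDER modes above can be pictured with a small Python sketch. It is illustrative only and not TransferBench's implementation: the function names are hypothetical, and reading "interleaved" as a round-robin over Transfers is an assumption.

```python
import random

def validate_block_size(block_size):
    """Check the documented constraint: a multiple of 64, no larger than 512."""
    return block_size > 0 and block_size % 64 == 0 and block_size <= 512

def order_blocks(blocks_per_transfer, block_order, seed=0):
    """Return (transfer, block) pairs in a hypothetical launch order.

    blocks_per_transfer: number of threadblocks for each Transfer.
    block_order: 0 = sequential, 1 = interleaved, 2 = random.
    """
    # Sequential: all blocks of Transfer 0, then all blocks of Transfer 1, ...
    sequential = [(t, b)
                  for t, count in enumerate(blocks_per_transfer)
                  for b in range(count)]
    if block_order == 0:
        return sequential
    if block_order == 1:
        # Interleaved (assumed meaning): round-robin, one block per Transfer.
        most = max(blocks_per_transfer, default=0)
        return [(t, b)
                for b in range(most)
                for t, count in enumerate(blocks_per_transfer)
                if b < count]
    if block_order == 2:
        # Random: shuffle the sequential order with a seeded generator.
        shuffled = list(sequential)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unsupported block order: {block_order}")
```

For two Transfers with 2 and 3 threadblocks, mode 0 gives (0,0), (0,1), (1,0), (1,1), (1,2), while mode 1 under this assumption gives (0,0), (1,0), (0,1), (1,1), (1,2).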
TransferBench v1.29
Added
- a2a preset config now responds to USE_REMOTE_READ
Fixed
- Race condition during wall-clock initialization that caused "inf" to be reported during single-stream runs
- CU numbering output after CU masking
Modified
- Default number of warmups reverted to 3
- Default unroll factor for gfx940/941 set to 6
TransferBench v1.28
Added
- Added A2A_DIRECT, which executes all-to-all only between directly connected GPUs (now on by default)
- Added average statistics for p2p and a2a benchmarks
- Added USE_FINE_GRAIN for p2p benchmark.
  - On older devices, p2p timing with the default coarse-grained device memory stops as soon as the request is sent to the data fabric, not when the data actually arrives remotely. This may artificially inflate bandwidth numbers, especially when sending small amounts of data
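Why small transfers are affected most can be seen with a toy model. The sketch below is not a measurement and not part of TransferBench: it assumes a fixed tail of untimed seconds at the end of each copy, and every name and number in it is hypothetical.

```python
def inflation_factor(num_bytes, bytes_per_second, untimed_tail_seconds):
    """Ratio of reported bandwidth to actual bandwidth in a toy model.

    Assumes the timer stops `untimed_tail_seconds` before the data
    really lands at the destination (a hypothetical fixed delay).
    """
    timed_seconds = num_bytes / bytes_per_second           # what the timer sees
    actual_seconds = timed_seconds + untimed_tail_seconds  # when data arrives
    # reported = num_bytes / timed_seconds, actual = num_bytes / actual_seconds
    return actual_seconds / timed_seconds

# Hypothetical link rate and tail, chosen only to show the trend.
small = inflation_factor(4 * 1024, 50e9, 2e-6)   # 4 KiB transfer
large = inflation_factor(1024 ** 3, 50e9, 2e-6)  # 1 GiB transfer
```

In this model the factor is 1 + tail * rate / bytes, so the same fixed tail distorts a small transfer far more than a large one.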
Modified
- Modified P2P output to help distinguish between CPU / GPU devices
Fixed
- Fixed Makefile target to prevent unnecessary re-compilation
TransferBench v1.27
Added
- Added cmdline preset, which allows simple tests to be specified on the command line
  - E.g. ./TransferBench cmdline 64M "1 4 G0->G0->G1"
- Added environment variable HIDE_ENV, which skips printing of environment variable values
- Added environment variable CU_MASK, which allows selection of which CUs to execute on
  - CU_MASK is specified as CU indices (0 to #CUs-1); '-' can be used to denote ranges of values
  - E.g.: CU_MASK=3-8,16 requests that the Transfer be executed only on CUs 3,4,5,6,7,8,16
  - NOTE: This is somewhat experimental and may not work on all hardware
- SHOW_ITERATIONS now shows CU usage for that iteration (experimental)
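The CU_MASK syntax described above (comma-separated indices, with '-' for ranges) can be expanded as in the following Python sketch. This is not TransferBench's own parser: the function name and the num_cus parameter are hypothetical and only illustrate the format.

```python
def parse_cu_mask(spec, num_cus):
    """Expand a CU_MASK-style string such as "3-8,16" into sorted CU indices.

    num_cus is a hypothetical parameter giving the number of CUs on the
    device; valid indices are 0 to num_cus - 1.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range such as "3-8" covers both endpoints.
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
        else:
            low = high = int(part)
        if low > high or low < 0 or high >= num_cus:
            raise ValueError(f"CU range {part!r} is outside 0..{num_cus - 1}")
        selected.update(range(low, high + 1))
    return sorted(selected)

print(parse_cu_mask("3-8,16", 32))  # prints [3, 4, 5, 6, 7, 8, 16]
```

Rejecting out-of-range indices up front is a design choice of this sketch. The changelog does not say how TransferBench itself handles them.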
Modified
- Added extra comments on commonly missing includes, with details on how to install them
Fixed
- CUDA compilation should work again (wall_clock64 CUDA alias was not defined)