Releases: ROCm/TransferBench

TransferBench v1.37

24 Nov 13:52
9e3a04c

Changes

  • USE_SINGLE_STREAM is now enabled by default (disable with USE_SINGLE_STREAM=0)

Fixes

  • Fix unrecognized token error when XCC_PREF_TABLE is unspecified

TransferBench v1.35

22 Nov 23:38
e047656

Additions

  • USE_FINE_GRAIN now also applies to the a2a preset

TransferBench v1.34

07 Nov 23:37
004710f

Added

  • GPU_KERNEL now defaults to 3 on gfx942

TransferBench v1.33

30 Oct 17:42
1d34a19

Added the ALWAYS_VALIDATE environment variable, which runs validation after every iteration instead of only once after all iterations have finished
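
The difference between the two validation modes can be sketched as follows. This is an illustrative model only, not TransferBench's implementation: `env_flag`, `run_iterations`, `do_transfer`, and `validate` are hypothetical names, and treating any non-zero integer as "enabled" is an assumption.

```python
import os

def env_flag(name, default=0):
    # Hypothetical helper: treat any non-zero integer value as "enabled".
    return int(os.environ.get(name, default)) != 0

def run_iterations(num_iterations, do_transfer, validate, always_validate):
    """Run the transfers and return how many times validation was performed."""
    validations = 0
    for _ in range(num_iterations):
        do_transfer()
        if always_validate:
            # Validate after every iteration.
            validate()
            validations += 1
    if not always_validate:
        # Validate only once, after all iterations have finished.
        validate()
        validations += 1
    return validations

src = list(range(8))
dst = [0] * 8

def copy():
    dst[:] = src

def check():
    assert dst == src, "destination does not match source"

print(run_iterations(10, copy, check, always_validate=True))   # 10
print(run_iterations(10, copy, check, always_validate=False))  # 1
print(env_flag("ALWAYS_VALIDATE"))  # depends on the environment
```

Validating every iteration costs extra time per run but pinpoints the first iteration at which data goes wrong.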

TransferBench v1.32

19 Oct 22:20
9c2ecae

Modified

  • Increased line limit from 2048 to 32768

TransferBench v1.31

17 Oct 19:40
79a3a00

Modified

  • SHOW_ITERATIONS now shows XCC:CU instead of just the CU ID
  • SHOW_ITERATIONS output is now also printed when USE_SINGLE_STREAM=1

TransferBench v1.30

16 Oct 14:22
e7cfab7

Added

  • BLOCK_SIZE added to control threadblock size (Must be multiple of 64, up to 512)
  • BLOCK_ORDER added to control how work is ordered for GFX-executors running USE_SINGLE_STREAM=1
    • 0 - Threadblocks for Transfers are ordered sequentially (Default)
    • 1 - Threadblocks for Transfers are interleaved
    • 2 - Threadblocks for Transfers are ordered randomly
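
The BLOCK_SIZE constraint and the three BLOCK_ORDER modes can be modeled with a short sketch. It is not taken from the TransferBench source: `is_valid_block_size` and `order_threadblocks` are hypothetical names, and reading "interleaved" as a round-robin across Transfers is an assumption about what the option does.

```python
import random

def is_valid_block_size(block_size):
    # BLOCK_SIZE must be a multiple of 64, up to 512.
    return 0 < block_size <= 512 and block_size % 64 == 0

def order_threadblocks(blocks_per_transfer, block_order, seed=0):
    """Return (transfer_index, block_index) pairs in launch order."""
    sequential = [(t, b)
                  for t, n in enumerate(blocks_per_transfer)
                  for b in range(n)]
    if block_order == 0:
        # All threadblocks of Transfer 0, then all of Transfer 1, and so on.
        return sequential
    if block_order == 1:
        # Assumed meaning of "interleaved": one block per Transfer in turn.
        interleaved = []
        for b in range(max(blocks_per_transfer, default=0)):
            for t, n in enumerate(blocks_per_transfer):
                if b < n:
                    interleaved.append((t, b))
        return interleaved
    if block_order == 2:
        # Random order; the seed only makes this sketch reproducible.
        shuffled = sequential[:]
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unsupported BLOCK_ORDER value: {block_order}")

print(is_valid_block_size(256))         # True
print(is_valid_block_size(96))          # False
print(order_threadblocks([2, 3], 0))    # [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2)]
print(order_threadblocks([2, 3], 1))    # [(0, 0), (1, 0), (0, 1), (1, 1), (1, 2)]
```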

TransferBench v1.29

16 Oct 14:18
0b29707

Added

  • a2a preset config now responds to USE_REMOTE_READ

Fixed

  • Race-condition during wall-clock initialization caused "inf" during single stream runs
  • CU numbering output after CU masking

Modified

  • Default number of warmups reverted to 3
  • Default unroll factor for gfx940/941 set to 6

TransferBench v1.28

16 Oct 14:17
0b7b979

Added

  • Added A2A_DIRECT, which executes all-to-all only between directly connected GPUs (now on by default)
  • Added average statistics for p2p and a2a benchmarks
  • Added USE_FINE_GRAIN for p2p benchmark
    • On older devices, with the default coarse-grained device memory, p2p timing stops as soon as the request is sent to the data fabric
      rather than when the data actually arrives remotely, which may artificially inflate bandwidth numbers, especially when sending small amounts of data

Modified

  • Modified P2P output to help distinguish between CPU / GPU devices

Fixed

  • Fixed Makefile target to prevent unnecessary re-compilation

TransferBench v1.27

16 Oct 14:16
a9cb3a2

Added

  • Added cmdline preset, which allows simple tests to be specified on the command line
    • E.g. ./TransferBench cmdline 64M "1 4 G0->G0->G1"
  • Added environment variable HIDE_ENV, which skips printing of environment variable values
  • Added environment variable CU_MASK, which selects which CUs to execute on
    • CU_MASK is specified in CU indices (0 to #CUs-1); '-' can be used to denote ranges of values
    • E.g.: CU_MASK=3-8,16 requests that Transfers be executed only on CUs 3,4,5,6,7,8,16
    • NOTE: This is somewhat experimental and may not work on all hardware
  • SHOW_ITERATIONS now shows CU usage for that iteration (experimental)
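
To make the CU_MASK syntax above concrete, here is a minimal sketch of how such a string expands into CU indices. It is not TransferBench's parser: `parse_cu_mask` is a hypothetical name, and the bounds check and the CU count of 104 used in the example are assumptions for illustration.

```python
def parse_cu_mask(spec, num_cus):
    """Expand a CU_MASK-style string such as "3-8,16" into sorted CU indices."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # "3-8" denotes the inclusive range 3..8.
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
        else:
            lo = hi = int(part)
        if lo > hi or lo < 0 or hi >= num_cus:
            raise ValueError(f"CU range {part!r} is outside 0..{num_cus - 1}")
        selected.update(range(lo, hi + 1))
    return sorted(selected)

# 104 is an arbitrary CU count chosen for this example.
print(parse_cu_mask("3-8,16", num_cus=104))  # [3, 4, 5, 6, 7, 8, 16]
```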

Modified

  • Added comments on commonly missing includes, with details on how to install them

Fixed

  • CUDA compilation should work again (wall_clock64 CUDA alias was not defined)