Releases: ROCm/TransferBench

TransferBench v1.61.00

28 Feb 23:57
cd80b3a

v1.61.00

Added

  • Added a2a_n preset, which conducts all-to-all GPU-to-GPU transfers over nearest NIC executors
  • Re-implemented GFX_BLOCK_ORDER, which controls how threadblocks of multiple transfers are ordered
    • 0 = sequential, 1 = interleaved, 2 = random
  • Added a2asweep preset which tries various CU/unroll options for GFX-executed all-to-all
  • Rewrote main GID index detection logic
  • Show the GID index and description in the topology table, which helps with debugging
  • Added GFX_WORD_SIZE to select the packed float size used by the GFX kernel. Must be 4 (default), 2, or 1
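
The three GFX_BLOCK_ORDER modes can be pictured with a small Python sketch. This is not TransferBench's implementation: the function name order_blocks, the (transfer, block) tuples, and the round-robin reading of "interleaved" are assumptions made for illustration only.

```python
import random


def order_blocks(blocks_per_transfer, mode, seed=0):
    """Return (transfer_index, block_index) pairs in an illustrative launch order.

    Hypothetical helper: mode mirrors the GFX_BLOCK_ORDER values
    (0 = sequential, 1 = interleaved, 2 = random).
    """
    groups = [[(t, b) for b in range(n)] for t, n in enumerate(blocks_per_transfer)]
    if mode == 0:
        # Sequential: all blocks of transfer 0, then all blocks of transfer 1, ...
        return [pair for group in groups for pair in group]
    if mode == 1:
        # Interleaved (assumed round-robin): one block per transfer per pass.
        ordered = []
        for b in range(max(blocks_per_transfer, default=0)):
            for group in groups:
                if b < len(group):
                    ordered.append(group[b])
        return ordered
    if mode == 2:
        # Random: shuffle the sequential order with a fixed seed for repeatability.
        ordered = [pair for group in groups for pair in group]
        random.Random(seed).shuffle(ordered)
        return ordered
    raise ValueError("mode must be 0, 1, or 2")


if __name__ == "__main__":
    for mode in (0, 1, 2):
        print(mode, order_blocks([2, 3], mode))
```

With two transfers of 2 and 3 threadblocks, mode 0 keeps each transfer's blocks together, while mode 1 alternates between the transfers until the shorter one runs out.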

Fixed

  • Avoid CMake and Makefile build errors when the infiniband/verbs.h header is not present; the NIC executor is disabled in that case
  • Use a priority list to choose the GID entry instead of hardcoding choices based on underdocumented user input (such as RoCE version and IP address family)
  • Use link-local when it is the only choice (i.e. when routing information is not available beyond local link)

rocm-6.3.3

19 Feb 17:46
56a2d6f

ROCm release v6.3.3

TransferBench v1.60.00

30 Jan 19:24
bedd2a2

v1.60.00

Modified

  • Reverted GFX_SINGLE_TEAM default back to 1

Fixed

  • Fixed bug where peer memory access was not enabled for DMA transfers, which would break specific DMA engine transfers

rocm-6.3.2

28 Jan 15:43
56a2d6f

ROCm release v6.3.2

TransferBench v1.59.01

24 Jan 20:15
b311f02

v1.59.01

Added

  • The a2a preset's A2A_MODE variable now allows customizing the number of srcs/dsts to use
    This is specified by setting A2A_MODE to numSrcs:numDsts. Extra destinations past 1 will be "local" writes. For example, if one sets A2A_MODE=1:3, then transfers will follow this pattern: Fx Gx FyFxFx. This simulates conditions similar to those in collective algorithms such as ring-based AllReduce
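
As a rough illustration of the numSrcs:numDsts form, the Python sketch below parses the value and rebuilds the pattern quoted above. It is not taken from TransferBench: the function names are hypothetical, and treating every source as local to GPU x is an assumption that the notes only confirm for the 1:3 example.

```python
def parse_a2a_mode(value):
    """Split a 'numSrcs:numDsts' string into two positive integers."""
    srcs_text, sep, dsts_text = value.partition(":")
    if not sep:
        raise ValueError("expected numSrcs:numDsts")
    num_srcs, num_dsts = int(srcs_text), int(dsts_text)
    if num_srcs < 1 or num_dsts < 1:
        # This sketch only covers the case with at least one source and destination.
        raise ValueError("sketch requires numSrcs >= 1 and numDsts >= 1")
    return num_srcs, num_dsts


def a2a_transfer(value, x, y):
    """Build an illustrative transfer string for GPU x sending to GPU y."""
    num_srcs, num_dsts = parse_a2a_mode(value)
    srcs = f"F{x}" * num_srcs                    # assumption: all sources local to x
    dsts = f"F{y}" + f"F{x}" * (num_dsts - 1)    # first dst remote, extras local
    return f"{srcs} G{x} {dsts}"


if __name__ == "__main__":
    print(a2a_transfer("1:3", 0, 1))
```

For A2A_MODE=1:3 with x=0 and y=1, the sketch yields one remote destination followed by two local ones, matching the Fx Gx FyFxFx shape.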

TransferBench v1.59.00

21 Jan 19:40
5984f49

v1.59.00

Added

  • Added support for the NIC executor, which allows for RDMA copies on NICs that support IBVerbs
    By default, the NIC executor is enabled if IBVerbs is found in the dynamic linker cache
  • The NIC executor can be indexed in two ways
    • "I" Ix.y will use NIC x as the source and NIC y as the destination.
      E.g. (G0 I0.5 G4)
    • "N" Nx.y will use NIC closest to GPU x as source, and NIC closest to GPU y as destination
      E.g. (G0 N0.4 G4)
  • The closest NIC can be overridden by the environment variable CLOSEST_NIC, which should be a comma-separated
    list of NIC indices to use for the corresponding GPU
  • This feature can be explicitly disabled at compile time by specifying DISABLE_NIC_EXEC=1
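
The relationship between the "N" and "I" forms can be sketched in Python as follows. This is an illustration under assumptions, not TransferBench code: the helper names are hypothetical, and the sketch assumes the CLOSEST_NIC list is indexed by GPU so that entry i names the NIC used for GPU i.

```python
def parse_closest_nic(value):
    """Parse a comma-separated CLOSEST_NIC string into a list of NIC indices.

    Assumption: position i in the list gives the NIC to use for GPU i.
    """
    return [int(token) for token in value.split(",") if token.strip()]


def resolve_nearest(executor, closest_nic):
    """Rewrite an 'Nx.y' executor as the 'Ia.b' form with explicit NIC indices."""
    if not executor.startswith("N"):
        raise ValueError("expected an executor of the form Nx.y")
    src_gpu, sep, dst_gpu = executor[1:].partition(".")
    if not sep:
        raise ValueError("expected an executor of the form Nx.y")
    return f"I{closest_nic[int(src_gpu)]}.{closest_nic[int(dst_gpu)]}"


if __name__ == "__main__":
    # Hypothetical mapping for an 8-GPU node; real values depend on the topology.
    mapping = parse_closest_nic("0,1,2,3,5,6,7,8")
    print(resolve_nearest("N0.4", mapping))
```

With the hypothetical mapping above, N0.4 resolves to I0.5, since GPU 0 maps to NIC 0 and GPU 4 maps to NIC 5.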

Modified

  • Changed default data size from 64M to 256M
  • Added NUM_QUEUE_PAIRS, which enables NIC traffic in A2A. Each GPU will talk to the next GPU via the closest NIC
  • Sweep preset now saves the last sweep run configuration to /tmp/lastSweep.cfg; the path can be changed via SWEEP_FILE

Fixed

  • Fixed bug with reporting when using subiterations
  • Fixed bug with per-Transfer data size specification
  • Fixed bug when using XCC preferred table

rocm-6.3.1

20 Dec 16:12
56a2d6f

ROCm release v6.3.1

TransferBench v1.58.00

05 Dec 20:46
fb713d0

v1.58.00

Fixed

  • Fixed broken specific DMA-engine copies

rocm-6.3.0

03 Dec 19:49
f6cc992

ROCm release v6.3.0

TransferBench v1.57.01

02 Dec 23:22
2c921db

v1.57.01

Added

  • Re-added "scaling" GPU GFX preset benchmark, which tests copies from GPU to other devices using varying
    number of CUs.