RDMA Executor support #147

Closed
wants to merge 70 commits into from
70 commits
208c708
Merge IbVerbsUtils.hpp functions
mustafabar Nov 22, 2024
31b6e3c
Add RDMA configs
mustafabar Nov 26, 2024
acf6d71
Add RDMA resources to TransferResources
mustafabar Nov 26, 2024
a3550e3
Add RDMA resource init functions
mustafabar Nov 26, 2024
11f16c4
Add RDMA transfer and mem reg calls
mustafabar Nov 26, 2024
2866588
Add teardown
mustafabar Nov 26, 2024
2e82f8b
Add minor changes
mustafabar Nov 26, 2024
2656ce2
Add first working version
mustafabar Nov 27, 2024
ac1a036
Add topo detection code
mustafabar Nov 27, 2024
8ecd926
Reformat all files with clang-format
mustafabar Nov 27, 2024
1773313
Revert "Reformat all files with clang-format"
mustafabar Nov 27, 2024
3eb880d
Add device topo printing
mustafabar Nov 27, 2024
56fa289
Support nearest IBV and topo printing
mustafabar Nov 27, 2024
c6455cc
Minor change
mustafabar Nov 27, 2024
be33f84
Add GetClosestNicToGpu API function
mustafabar Nov 27, 2024
6bc254b
Add minor changes
mustafabar Nov 27, 2024
319ca40
Move topo printing to client side
mustafabar Dec 2, 2024
438353e
use NO_IBV_EXEC flag
mustafabar Dec 2, 2024
e66b57b
Fix output formatting
mustafabar Dec 2, 2024
a4af4ba
Minor reformatting
mustafabar Dec 2, 2024
3204df7
Init once changes
mustafabar Dec 2, 2024
aa6e3e8
Init once changes
mustafabar Dec 2, 2024
08938c0
Add better input validation error text
mustafabar Dec 2, 2024
45609ff
Remove unneeded var
mustafabar Dec 2, 2024
1bc7d22
Obtain CLOSEST_NIC in TB envs
mustafabar Dec 2, 2024
4b47a10
Fix spacing of environment
mustafabar Dec 2, 2024
ea1493d
Merge branch 'ROCm:develop' into rdma_exec_integration
mustafabar Dec 3, 2024
2e50a21
Minor formatting and notes on API changes
mustafabar Dec 3, 2024
08afd10
Restore comment
mustafabar Dec 3, 2024
fba8c5b
Unify executor results str
mustafabar Dec 4, 2024
d9a32cd
Fix spacing
mustafabar Dec 4, 2024
2731423
Fix IB_GID_INDEX usage comment
mustafabar Dec 4, 2024
6a5366c
Fix spelling
mustafabar Dec 4, 2024
e1a27cb
Check NIC index out-of-range
mustafabar Dec 4, 2024
3ec3797
Fix brackets
mustafabar Dec 4, 2024
481b0ba
Trim trailing spaces
mustafabar Dec 4, 2024
5d9d33c
Fix formatting
mustafabar Dec 4, 2024
8d4b55e
Simplify GetBusIdDistance function
mustafabar Dec 4, 2024
9edc1fb
Add more defensive check in GetBusIdDistance function
mustafabar Dec 4, 2024
2bf2d5e
Simplify NIC topo printing function
mustafabar Dec 4, 2024
0a0735f
Minor reformat
mustafabar Dec 4, 2024
80eb15a
IBV->NIC rename
mustafabar Dec 4, 2024
139ac8a
Indicate workspace
mustafabar Dec 4, 2024
b002f99
Add neater condition for skipping comma
mustafabar Dec 4, 2024
47c0d49
Redesign closest NIC capturing design
mustafabar Dec 5, 2024
72695c9
Fix spacs
mustafabar Dec 5, 2024
a862d78
Remove SetClosestNics API
mustafabar Dec 5, 2024
df0ff69
Compress error message
mustafabar Dec 5, 2024
767a204
Modify error message
mustafabar Dec 5, 2024
ce80e42
Separate debug and release ibv macros
mustafabar Dec 5, 2024
4496aa8
Separate debug and release ibv macros
mustafabar Dec 5, 2024
e2a7ec0
Reformat function calls
mustafabar Dec 5, 2024
772b5bd
Reorder args
mustafabar Dec 5, 2024
888e764
Minor edits
mustafabar Dec 5, 2024
1c37f3e
Resuse device list
mustafabar Dec 5, 2024
994f679
Remove gpuCount global
mustafabar Dec 5, 2024
aca8755
use underscore for globals
mustafabar Dec 5, 2024
0e368f5
Remove unneeded comment
mustafabar Dec 5, 2024
f9489c3
Apply minor reformatting
mustafabar Dec 6, 2024
7789220
Apply minor edits
mustafabar Dec 6, 2024
0cc628a
Relocate NIC exec code
mustafabar Dec 6, 2024
e415823
Merge branch 'ROCm:develop' into rdma_exec_integration
mustafabar Dec 6, 2024
741eaa5
Use NIC_EXEC_ENABLED macro
mustafabar Dec 6, 2024
cd2d33d
Remove whitespaces
mustafabar Dec 6, 2024
28d633d
Remove unneeded functions
mustafabar Dec 9, 2024
d51916a
Use ErrResult for error propagation to API callers
mustafabar Dec 9, 2024
a62b012
Revert "Use ErrResult for error propagation to API callers"
mustafabar Dec 13, 2024
0810906
Revert "Remove unneeded functions"
mustafabar Dec 13, 2024
607387c
Refactoring parts of the NIC executor code, detection (#4)
gilbertlee-amd Dec 16, 2024
0088f74
V1.59 candidate (#6)
mustafabar Jan 14, 2025
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,23 @@
Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).

## v1.59.00
### Added
- Adding in support for NIC executor, which allows for RDMA copies on NICs that support IBVerbs
  By default, NIC executor will be enabled if IBVerbs is found in the dynamic linker cache
- NIC executor can be indexed in two methods
  - "I" Ix.y will use NIC x as the source and NIC y as the destination.
    E.g. (G0 I0.5 G4)
  - "N" Nx.y will use NIC closest to GPU x as source, and NIC closest to GPU y as destination
    E.g. (G0 N0.4 N4)
  - The closest NIC can be overridden by the environment variable CLOSEST_NIC, which should be a comma-separated
    list of NIC indices to use for the corresponding GPU
### Modified
- Changing default data size to 256M from 64M
- Adding NUM_QUEUE_PAIRS which enables NIC traffic in A2A. Each GPU will talk to the next GPU via the closest NIC
### Fixed
- Fixed bug with reporting when using subiterations

## v1.58.00
### Fixed
- Fixed broken specific DMA-engine copies
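
The two NIC indexing forms described in the v1.59.00 entry (`Ix.y` and `Nx.y`) can be illustrated with a small parsing sketch. `NicExecutorSpec`, `ReadIndex`, and `ParseNicExecutor` are hypothetical names introduced here for illustration, not TransferBench functions, and the sketch assumes both indices are always present; the PR's actual parser may accept other forms.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical type for illustration; not TransferBench's own data structure.
struct NicExecutorSpec {
  char kind;      // 'I' = explicit NIC indices, 'N' = NICs closest to the given GPUs
  int  srcIndex;  // x in Ix.y / Nx.y
  int  dstIndex;  // y in Ix.y / Nx.y
};

// Reads a run of 1 to 9 decimal digits starting at pos (9 digits cannot overflow int).
static bool ReadIndex(std::string const& s, std::size_t& pos, int& value)
{
  std::size_t const start = pos;
  value = 0;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    if (pos - start == 9) return false;  // too many digits
    value = value * 10 + (s[pos] - '0');
    ++pos;
  }
  return pos > start;
}

// Parses tokens such as "I0.5" or "N0.4"; returns false on any other shape.
bool ParseNicExecutor(std::string const& token, NicExecutorSpec& spec)
{
  if (token.empty() || (token[0] != 'I' && token[0] != 'N')) return false;
  NicExecutorSpec parsed{token[0], 0, 0};
  std::size_t pos = 1;
  if (!ReadIndex(token, pos, parsed.srcIndex)) return false;
  if (pos >= token.size() || token[pos] != '.') return false;
  ++pos;
  if (!ReadIndex(token, pos, parsed.dstIndex)) return false;
  if (pos != token.size()) return false;  // reject trailing characters
  spec = parsed;
  return true;
}
```

For `I0.5` the two numbers are NIC indices; for `N0.4` they are GPU indices that still have to be mapped to NICs, either by topology detection or by the `CLOSEST_NIC` override.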
8 changes: 8 additions & 0 deletions CMakeLists.txt
@@ -56,6 +56,14 @@ set( CMAKE_CXX_FLAGS "${flags_str} ${CMAKE_CXX_FLAGS}")

set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -L${ROCM_PATH}/lib")
include_directories(${ROCM_PATH}/include)
find_library(IBVERBS_LIBRARY ibverbs)
if (IBVERBS_LIBRARY)
message(STATUS "Found ibverbs: ${IBVERBS_LIBRARY}")
add_definitions(-DNIC_EXEC_ENABLED)
link_libraries(ibverbs)
else()
message(WARNING "ibverbs not found")
endif()
link_libraries(numa hsa-runtime64 pthread)
set (CMAKE_RUNTIME_OUTPUT_DIRECTORY .)
add_executable(TransferBench src/client/Client.cpp)
6 changes: 6 additions & 0 deletions Makefile
@@ -21,6 +21,12 @@ NVFLAGS = -x cu -lnuma -arch=native
COMMON_FLAGS = -O3 -I./src/header -I./src/client -I./src/client/Presets
LDFLAGS += -lpthread

# Compile RDMA executor if IBVerbs is found in the Dynamic Linker cache
ifneq ("$(shell ldconfig -p | grep -c ibverbs)", "0")
LDFLAGS += -libverbs -DNIC_EXEC_ENABLED
NVFLAGS += -libverbs -DNIC_EXEC_ENABLED
endif

all: $(EXE)

TransferBench: ./src/client/Client.cpp $(shell find -regex ".*\.\hpp")
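
Both build files pass `-DNIC_EXEC_ENABLED` when ibverbs is found, and the sources test it with `#if NIC_EXEC_ENABLED`. A minimal sketch of how that guard behaves is shown here; `kNicExecEnabled` and `NicSupportSuffix` are hypothetical names used only for illustration.

```cpp
#include <string>

// -DNIC_EXEC_ENABLED on the compiler command line defines the macro as 1,
// so `#if NIC_EXEC_ENABLED` is true. When the flag is absent, the undefined
// identifier evaluates to 0 inside #if and the #else branch is taken.
#if NIC_EXEC_ENABLED
constexpr bool kNicExecEnabled = true;
#else
constexpr bool kNicExecEnabled = false;
#endif

// Hypothetical helper mirroring the banner suffix printed by the client in this PR.
std::string NicSupportSuffix()
{
  return kNicExecEnabled ? " (with NIC support)" : "";
}
```

One consequence of using `#if` rather than `#ifdef`: a bare `#define NIC_EXEC_ENABLED` with no value inside a source file would make `#if NIC_EXEC_ENABLED` ill-formed, so the macro should only be set through the `-D` flag or defined with an explicit value.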
23 changes: 14 additions & 9 deletions src/client/Client.cpp
@@ -172,7 +172,11 @@ int main(int argc, char **argv) {

void DisplayUsage(char const* cmdName)
{
- printf("TransferBench v%s.%s\n", TransferBench::VERSION, CLIENT_VERSION);
+ std::string nicSupport = "";
+ #if NIC_EXEC_ENABLED
+ nicSupport = " (with NIC support)";
+ #endif
+ printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("========================================\n");

if (numa_available() == -1) {
@@ -218,7 +222,7 @@ void PrintResults(EnvVars const& ev, int const testNum,
ExeType const exeType = exeDevice.exeType;
int32_t const exeIndex = exeDevice.exeIndex;

- printf(" Executor: %3s %02d %c %7.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
+ printf(" Executor: %3s %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
ExeTypeName[exeType], exeIndex, sep, exeResult.avgBandwidthGbPerSec, sep,
exeResult.avgDurationMsec, sep, exeResult.numBytes, sep, exeResult.sumBandwidthGbPerSec);

@@ -230,14 +234,15 @@ void PrintResults(EnvVars const& ev, int const testNum,
char exeSubIndexStr[32] = "";
if (t.exeSubIndex != -1)
sprintf(exeSubIndexStr, ".%d", t.exeSubIndex);

- printf(" Transfer %02d %c %7.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %s%02d%s:%03d -> %s\n",
+ printf(" Transfer %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %c%03d%s:%03d -> %s\n",
idx, sep,
r.avgBandwidthGbPerSec, sep,
r.avgDurationMsec, sep,
r.numBytes, sep,
- MemDevicesToStr(t.srcs).c_str(), ExeTypeName[exeType], exeIndex,
- exeSubIndexStr, t.numSubExecs, MemDevicesToStr(t.dsts).c_str());
+ MemDevicesToStr(t.srcs).c_str(),
+ TransferBench::ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
+ exeSubIndexStr, t.numSubExecs,
+ MemDevicesToStr(t.dsts).c_str());

// Show per-iteration timing information
if (ev.showIterations) {
@@ -269,7 +274,7 @@ void PrintResults(EnvVars const& ev, int const testNum,
for (auto& time : times) {
double iterDurationMsec = time.first;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / iterDurationMsec * 1000.0f;
- printf(" Iter %03d %c %7.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);
+ printf(" Iter %03d %c %8.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);

std::set<int> usedXccs;
if (time.second - 1 < r.perIterCUs.size()) {
@@ -285,11 +290,11 @@ void PrintResults(EnvVars const& ev, int const testNum,
printf(" %02d", x);
printf("\n");
}
- printf(" StandardDev %c %7.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
+ printf(" StandardDev %c %8.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
}
}
}
- printf(" Aggregate (CPU) %c %7.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
+ printf(" Aggregate (CPU) %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
sep, results.avgTotalBandwidthGbPerSec,
sep, results.avgTotalDurationMsec,
sep, results.totalBytesTransferred,
Expand Down
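
The bandwidth fields in `PrintResults` change from `%7.3f` to `%8.3f`. The diff does not say why; one plausible reading is column alignment once a value reaches four integer digits. The sketch below, using a hypothetical `FormatBandwidth` helper, shows how the two widths differ.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a bandwidth with three decimals in a field of
// the given minimum width, as printf's "%<width>.3f" would.
std::string FormatBandwidth(int width, double gbPerSec)
{
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%*.3f", width, gbPerSec);
  return std::string(buf);
}
```

A value such as 123.456 fills a 7-wide field exactly and is padded by one space in an 8-wide field. A value such as 1234.567 needs 8 characters, so it overflows the 7-wide field and shifts every later column, while it fits the 8-wide field.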
4 changes: 2 additions & 2 deletions src/client/Client.hpp
@@ -28,9 +28,9 @@ THE SOFTWARE.
#include "TransferBench.hpp"
#include "EnvVars.hpp"

- size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<26);
+ size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<28);

- char const ExeTypeName[4][4] = {"CPU", "GPU", "DMA", "IBV"};
+ char const ExeTypeName[5][4] = {"CPU", "GPU", "DMA", "NIC", "NIC"};

// Display detected hardware
void DisplayTopology(bool outputToCsv);
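
The new `DEFAULT_BYTES_PER_TRANSFER` matches the changelog line "Changing default data size to 256M from 64M". The arithmetic can be checked at compile time; the constant names below are illustrative only.

```cpp
#include <cstddef>

constexpr std::size_t kMiB             = 1024 * 1024;
constexpr std::size_t kOldDefaultBytes = (1 << 26);  // value before this PR
constexpr std::size_t kNewDefaultBytes = (1 << 28);  // value after this PR

static_assert(kOldDefaultBytes == 64 * kMiB,  "1<<26 bytes is 64 MiB");
static_assert(kNewDefaultBytes == 256 * kMiB, "1<<28 bytes is 256 MiB");
static_assert(kNewDefaultBytes == 4 * kOldDefaultBytes, "the default grows by a factor of 4");
```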
81 changes: 78 additions & 3 deletions src/client/EnvVars.hpp
@@ -100,6 +100,14 @@ class EnvVars
int outputToCsv; // Output in CSV format
int samplingFactor; // Affects how many different values of N are generated (when N set to 0)

// NIC options
int ibGidIndex; // GID Index for RoCE NICs
int roceVersion; // RoCE version number
int ipAddressFamily; // IP Address Family
uint8_t ibPort; // NIC port number to be used
int nicRelaxedOrder; // Use relaxed ordering for RDMA
std::string closestNicStr; // Holds the user-specified list of closest NICs

// Developer features
int gpuMaxHwQueues; // Tracks GPU_MAX_HW_QUEUES environment variable

@@ -147,8 +155,16 @@ class EnvVars
validateDirect = GetEnvVar("VALIDATE_DIRECT" , 0);
validateSource = GetEnvVar("VALIDATE_SOURCE" , 0);

ibGidIndex = GetEnvVar("IB_GID_INDEX" ,-1);
ibPort = GetEnvVar("IB_PORT_NUMBER" , 1);
roceVersion = GetEnvVar("ROCE_VERSION" , 2);
ipAddressFamily = GetEnvVar("IP_ADDRESS_FAMILY" , 4);
nicRelaxedOrder = GetEnvVar("NIC_RELAX_ORDER" , 1);
closestNicStr = GetEnvVar("CLOSEST_NIC" , "");

gpuMaxHwQueues = GetEnvVar("GPU_MAX_HW_QUEUES" , 4);


// Check for fill pattern
char* pattern = getenv("FILL_PATTERN");
if (pattern != NULL) {
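
`ibPort` is declared as `uint8_t`, while the other NIC options are `int`. The definition of `GetEnvVar` is not shown in this diff; assuming it returns an `int` for `IB_PORT_NUMBER`, a value such as 256 would wrap to 0 on assignment. A minimal sketch of reading and range-checking such a value follows. `GetEnvInt` and `ToPortNumber` are hypothetical helpers, not TransferBench functions.

```cpp
#include <cerrno>
#include <climits>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns the integer value of an environment variable,
// or defaultValue when it is unset, empty, malformed, or out of int range.
int GetEnvInt(char const* name, int defaultValue)
{
  char const* text = std::getenv(name);
  if (text == nullptr || *text == '\0') return defaultValue;
  errno = 0;
  char* end = nullptr;
  long const parsed = std::strtol(text, &end, 10);
  if (errno != 0 || *end != '\0' || parsed < INT_MIN || parsed > INT_MAX)
    return defaultValue;
  return static_cast<int>(parsed);
}

// Hypothetical helper: narrows to uint8_t only when the value is a usable
// port number. ibverbs port numbers start at 1, so 0 is rejected as well.
bool ToPortNumber(int value, std::uint8_t& port)
{
  if (value < 1 || value > 255) return false;
  port = static_cast<std::uint8_t>(value);
  return true;
}
```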
@@ -279,18 +295,32 @@ class EnvVars
printf(" BLOCK_SIZE - # of threads per threadblock (Must be multiple of 64)\n");
printf(" BLOCK_BYTES - Controls granularity of how work is divided across subExecutors\n");
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4\n");
#if NIC_EXEC_ENABLED
printf(" CLOSEST_NIC - Comma-separated list of per-GPU closest NIC (default=auto)\n");
#endif
printf(" CU_MASK - CU mask for streams. Can specify ranges e.g '5,10-12,14'\n");
printf(" FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits\n");
printf(" GFX_UNROLL - Unroll factor for GFX kernel (0=auto), must be less than %d\n", TransferBench::GetIntAttribute(ATR_GFX_MAX_UNROLL));
printf(" GFX_SINGLE_TEAM - Have subexecutors work together on full array instead of working on disjoint subarrays\n");
printf(" GFX_WAVE_ORDER - Stride pattern for GFX kernel (0=UWC,1=UCW,2=WUC,3=WCU,4=CUW,5=CWU)\n");
printf(" HIDE_ENV - Hide environment variable value listing\n");
#if NIC_EXEC_ENABLED
printf(" IB_GID_INDEX - Required for RoCE NICs (default=-1/auto)\n");
printf(" IB_PORT_NUMBER - Port number to use on the RDMA NIC (default=1)\n");
printf(" IP_ADDRESS_FAMILY - IP address family (4=v4, 6=v6, default=v4)\n");
#endif
printf(" MIN_VAR_SUBEXEC - Minimum # of subexecutors to use for variable subExec Transfers\n");
printf(" MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)\n");
#if NIC_EXEC_ENABLED
printf(" NIC_RELAX_ORDER - Set to non-zero to use relaxed ordering\n");
#endif
printf(" NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead\n");
printf(" NUM_SUBITERATIONS - # of sub-iterations to run per iteration. Must be non-negative\n");
printf(" NUM_WARMUPS - # of untimed warmup iterations per test\n");
printf(" OUTPUT_TO_CSV - Outputs to CSV format if set\n");
#if NIC_EXEC_ENABLED
printf(" ROCE_VERSION - RoCE version (default=2)\n");
#endif
printf(" SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHOW_ITERATIONS - Show per-iteration timing info\n");
printf(" USE_HIP_EVENTS - Use HIP events for GFX executor timing\n");
@@ -301,6 +331,7 @@ class EnvVars
printf(" VALIDATE_SOURCE - Validate GPU src memory immediately after preparation\n");
}


void Print(std::string const& name, int32_t const value, const char* format, ...) const
{
printf("%-20s%s%12d%s", name.c_str(), outputToCsv ? "," : " = ", value, outputToCsv ? "," : " : ");
@@ -325,9 +356,12 @@ class EnvVars
void DisplayEnvVars() const
{
int numGpuDevices = TransferBench::GetNumExecutors(EXE_GPU_GFX);

std::string nicSupport = "";
#if NIC_EXEC_ENABLED
nicSupport = " (with NIC support)";
#endif
if (!outputToCsv) {
- printf("TransferBench v%s.%s\n", TransferBench::VERSION, CLIENT_VERSION);
+ printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("===============================================================\n");
if (!hideEnv) printf("[Common] (Suppress by setting HIDE_ENV=1)\n");
}
@@ -341,6 +375,10 @@ class EnvVars
"Each CU gets a multiple of %d bytes to copy", blockBytes);
Print("BYTE_OFFSET", byteOffset,
"Using byte offset of %d", byteOffset);
#if NIC_EXEC_ENABLED
Print("CLOSEST_NIC", (closestNicStr == "" ? "auto" : "user-input"),
"Per-GPU closest NIC is set as %s", (closestNicStr == "" ? "auto" : closestNicStr.c_str()));
#endif
Print("CU_MASK", getenv("CU_MASK") ? 1 : 0,
"%s", (cuMask.size() ? GetCuMaskDesc().c_str() : "All"));
Print("FILL_PATTERN", getenv("FILL_PATTERN") ? 1 : 0,
@@ -359,18 +397,35 @@ class EnvVars
gfxWaveOrder == 3 ? "Wavefront,CU,Unroll" :
gfxWaveOrder == 4 ? "CU,Unroll,Wavefront" :
"CU,Wavefront,Unroll"));
#if NIC_EXEC_ENABLED
Print("IP_ADDRESS_FAMILY", ipAddressFamily,
"IP address family is set to IPv%d", ipAddressFamily);

Print("IB_GID_INDEX", ibGidIndex,
"RoCE GID index is set to %s", (ibGidIndex < 0 ? "auto" : std::to_string(ibGidIndex).c_str()));
Print("IB_PORT_NUMBER", ibPort,
"IB port number is set to %d", ibPort);
#endif
Print("MIN_VAR_SUBEXEC", minNumVarSubExec,
"Using at least %d subexecutor(s) for variable subExec transfers", minNumVarSubExec);
Print("MAX_VAR_SUBEXEC", maxNumVarSubExec,
"Using up to %s subexecutors for variable subExec transfers",
maxNumVarSubExec ? std::to_string(maxNumVarSubExec).c_str() : "all available");
#if NIC_EXEC_ENABLED
Print("NIC_RELAX_ORDER", nicRelaxedOrder,
"Using %s ordering for NIC RDMA", nicRelaxedOrder ? "relaxed" : "strict");
#endif
Print("NUM_ITERATIONS", numIterations,
(numIterations == 0) ? "Running infinitely" :
"Running %d %s", abs(numIterations), (numIterations > 0 ? "timed iteration(s)" : "second(s) per Test"));
Print("NUM_SUBITERATIONS", numSubIterations,
"Running %s subiterations", (numSubIterations == 0 ? "infinite" : std::to_string(numSubIterations)).c_str());
Print("NUM_WARMUPS", numWarmups,
"Running %d warmup iteration(s) per Test", numWarmups);
#if NIC_EXEC_ENABLED
Print("ROCE_VERSION", roceVersion,
"RoCE version is set to %d", roceVersion);
#endif
Print("SHOW_ITERATIONS", showIterations,
"%s per-iteration timing", showIterations ? "Showing" : "Hiding");
Print("USE_HIP_EVENTS", useHipEvents,
@@ -381,7 +436,6 @@ class EnvVars
"Running in %s mode", useInteractive ? "interactive" : "non-interactive");
Print("USE_SINGLE_STREAM", useSingleStream,
"Using single stream per GFX %s", useSingleStream ? "device" : "Transfer");

if (getenv("XCC_PREF_TABLE")) {
printf("%36s: Preferred XCC Table (XCC_PREF_TABLE)\n", "");
printf("%36s: ", "");
@@ -479,6 +533,27 @@ class EnvVars
cfg.gfx.useSingleTeam = gfxSingleTeam;
cfg.gfx.waveOrder = gfxWaveOrder;

cfg.nic.ibGidIndex = ibGidIndex;
cfg.nic.ibPort = ibPort;
cfg.nic.ipAddressFamily = ipAddressFamily;
cfg.nic.useRelaxedOrder = nicRelaxedOrder;
cfg.nic.roceVersion = roceVersion;

std::vector<int> closestNics;
if(closestNicStr != "") {
std::stringstream ss(closestNicStr);
std::string item;
while (std::getline(ss, item, ',')) {
try {
int nic = std::stoi(item);
closestNics.push_back(nic);
// std::stoi can throw std::invalid_argument or std::out_of_range; catch both
} catch (const std::exception& e) {
printf("[ERROR] Invalid NIC index (%s) by user in %s\n", item.c_str(), closestNicStr.c_str());
exit(1);
}
}
cfg.nic.closestNics = closestNics;
}
return cfg;
}
};
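
The `CLOSEST_NIC` parsing in the last hunk relies on `std::stoi`, which also accepts trailing characters (for example `2abc` parses as 2) and throws `std::out_of_range` for very large values. A stricter variant is sketched below. `ParseClosestNics` is a hypothetical helper, and rejecting negative indices is an assumption made here, not something the PR states.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses a list such as "1,0,3" into {1, 0, 3}.
// Returns false and fills `error` on the first entry that is not a
// non-negative integer; an empty list yields an empty vector.
bool ParseClosestNics(std::string const& list, std::vector<int>& nics, std::string& error)
{
  nics.clear();
  std::stringstream ss(list);
  std::string item;
  while (std::getline(ss, item, ',')) {
    try {
      std::size_t used = 0;
      int const nic = std::stoi(item, &used);
      // `used` is the number of characters consumed; anything left over is garbage.
      if (used != item.size() || nic < 0) {
        error = "Invalid NIC index (" + item + ") in " + list;
        return false;
      }
      nics.push_back(nic);
    } catch (std::exception const&) {
      // Covers both std::invalid_argument and std::out_of_range.
      error = "Invalid NIC index (" + item + ") in " + list;
      return false;
    }
  }
  return true;
}
```

Returning an error string instead of calling `exit(1)` leaves the decision about how to report the problem to the caller.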