This repository contains an NS3-based network simulator that acts as a network backend for SimAI.
We have released a new dev branch dev/qp featuring the following enhancements (From maintainer @MXtremist):
- QP Logic Support: Enables creation and destruction of QPs based on actual RDMA logic, allowing multiple messages to be carried by a pair of QPs.
- NIC CC Configuration: Supports perIP or perQP settings for enhanced flexibility.
- Optimized Scheduling Logic: Adheres to the Max-Min principle, resolving issues of underutilization and unfairness in network resource allocation.
- Decoupling of the CC Module: For improved modularity and efficiency.
Compared to the original ns-3, this repo extends the point-to-point module with a datacenter / RDMA-oriented end-to-end model. The main additions live in simulation/src/point-to-point/model/ and include:
- QBB/PFC + multi-priority queues: 8 priority queues, PAUSE/RESUME (PFC-like) handling, and priority-aware scheduling on each port/NIC.
- ECN + CNP (QCN-style) feedback: switch-side ECN marking based on queue occupancy and receiver-side ECN accounting; congestion feedback is carried via CNP packets.
- RDMA host stack (QP-level): QP/RxQP modeling, window/on-the-fly control, ACK/NACK handling, and multiple NIC congestion-control (CC) modes (e.g., DCQCN/HPCC/TIMELY/DCTCP/HPCC-PINT).
- Switch and NVSwitch modeling: ECMP forwarding, buffer/MMU admission control, PFC trigger/resume logic, and (optional) INT/PINT-style metadata injection for HPCC(-PINT).
-
qbb-net-device.{h,cc}(QbbNetDevice,RdmaEgressQueue)- What it does: A QBB-capable net device on top of
PointToPointNetDevicewith 8 priorities. It intercepts receive to honor PFC, schedules transmissions from either:- host/NIC:
RdmaEgressQueue(high-priority ACK/NACK queue + round-robin across QPs), or - switch port:
BEgressQueueround-robin across priority queues. It also supports an NVSwitch “switch acts as host” send path when NVLS is enabled.
- host/NIC:
- Key attributes:
QbbEnabled,QcnEnabled,DynamicThreshold,PauseTime,NVLS_enable. - Key integration callbacks:
m_rdmaReceiveCb(deliver non-PFC packets toRdmaHw),m_rdmaSentCb(per-packet send completion),m_rdmaPktSent(update QP pacing/next-available),m_rdmaLinkDownCb. - Where to extend:
- Scheduling / priority rules:
DequeueAndTransmit()andRdmaEgressQueue::GetNextQindex() - PFC behavior:
Receive()andSendPfc()
- Scheduling / priority rules:
- What it does: A QBB-capable net device on top of
-
qbb-channel.{h,cc}/qbb-remote-channel.{h,cc}- What it does: point-to-point channel for
QbbNetDevice;QbbRemoteChanneluses MPI (MpiInterface::SendPacket) for distributed simulations. - Where to extend: link behavior/delivery path in
TransmitStart().
- What it does: point-to-point channel for
-
switch-node.{h,cc}(SwitchNode)- What it does: switch pipeline (
nodeType = 1): ECMP forwarding (5-tuple hash), admission control via MMU, PFC pause/resume generation, optional ECN marking, and INT/PINT injection on dequeue (used by HPCC / HPCC-PINT). - Key attributes:
EcnEnabled,CcMode,AckHighPrio,MaxRtt. - Where to extend:
- Forwarding / ECMP:
GetOutDev(),EcmpHash(),AddTableEntry() - ECN / PFC / INT-PINT injection:
SwitchNotifyDequeue()
- Forwarding / ECMP:
- What it does: switch pipeline (
-
switch-mmu.{h,cc}(SwitchMmu)- What it does: switch buffer/MMU model: ingress/egress accounting, shared buffer & headroom, pause/resume decisions, ECN marking probability curve (
kmin/kmax/pmax), and PFC threshold computation. - Key config APIs:
ConfigBufferSize(),ConfigHdrm(),ConfigNPort(),ConfigEcn(). - Where to extend: implement new buffer management / PFC threshold formula / ECN curve here.
- What it does: switch buffer/MMU model: ingress/egress accounting, shared buffer & headroom, pause/resume decisions, ECN marking probability curve (
-
nvswitch-node.{h,cc}(NVSwitchNode)- What it does: NVSwitch node model (
nodeType = 2) used for intra-server GPU communication (paired with the NVLS routing logic inRdmaHw/QbbNetDevice). - Where to extend: NVSwitch forwarding/admission/monitoring (similar entry points to
SwitchNode, but currently without ECN/INT injection).
- What it does: NVSwitch node model (
-
rdma-hw.{h,cc}(RdmaHw)- What it does: host RDMA core: QP create/delete, packet construction (PPP + IPv4 + UDP + SeqTs), ACK/NACK processing, CNP processing, per-QP CC algorithms, and routing to NIC (including NVSwitch routing tables).
- Key attributes:
CcMode,Mtu,MinRate,L2ChunkSize,L2AckInterval,L2BackToZero, plus CC-specific knobs for DCQCN/TIMELY/DCTCP/HPCC/PINT (seeGetTypeId()). - Protocol numbers used (IPv4 Protocol field):
- UDP data:
0x11 - CNP:
0xFF - PFC:
0xFE - ACK:
0xFC - NACK:
0xFD
- UDP data:
- Where to extend:
- Add a new CC algorithm: add
HandleAckX/UpdateRateX(and optionally CNP hooks) and dispatch bym_cc_modeinReceiveAck()/ReceiveCnp(). - Add/modify routing (including NVSwitch/NVLS):
GetNicIdxOfQp(),GetNicIdxOfRxQp(),AddTableEntry(),RedistributeQp().
- Add a new CC algorithm: add
-
rdma-driver.{h,cc}(RdmaDriver)- What it does: wiring layer between
Node/NICs andRdmaHw: builds NIC/QP groups and exposes QP lifecycle traces (QpComplete,SendComplete). - Where to extend: add higher-level observability or app-facing callbacks around QP lifecycle here.
- What it does: wiring layer between
-
rdma-queue-pair.{h,cc}(RdmaQueuePair,RdmaRxQueuePair,RdmaQueuePairGroup)- What it does: per-QP and per-RxQP state (window, rate, ACKed seq, plus per-CC algorithm state: DCQCN alpha/targetRate, HPCC hop state, TIMELY RTT tracking, DCTCP alpha/ecnCnt, HPCC-PINT state).
- Where to extend: if your new CC needs extra per-QP state, add it here.
-
Headers / utilities
qbb-header.{h,cc}: ACK/NACK header (PG/seq/CNP-flag + optional INT header).cn-header.{h,cc}: CNP header (feedback fields:fid/qIndex/ecnbits/qfb/total).pause-header.{h,cc}: PFC pause header (time/qlen/qindex).pint.{h,cc}: PINT encode/decode utilities.trace-format.h: binary trace record structureTraceFormatused by offline analyzers.
-
Add a new host-side congestion control (CC)
- Primary:
rdma-hw.{h,cc}(algorithm + dispatch byCcMode) - Often needed:
rdma-queue-pair.h(new per-QP state) - If switch feedback is required:
switch-node.cc(INT/PINT or new markings)
- Primary:
-
Change switch behavior (buffer/ECN/PFC)
- Primary:
switch-mmu.{h,cc}(thresholds/curves/formulas) - Where marking/injection happens:
switch-node.cc::SwitchNotifyDequeue() - Where admission/priority is applied:
switch-node.cc::SendToDev()
- Primary:
-
Introduce a new control packet/header
- Primary: add a new
*Headerinmodel/(followCnHeader/PauseHeader) - Parsing/dispatch: usually in
QbbNetDevice::Receive()(device-level) orRdmaHw::Receive()(host stack) - Note: if you need it parsed by
CustomHeader, you’ll also need to extend thecustom-headerimplementation (outside this folder).
- Primary: add a new
Please email Gang Lu (yunding.lg@alibaba-inc.com), Feiyang Xue (xuefeiyang.xfy@alibaba-inc.com) or Qingxu Li (qingxu.lqx@alibaba-inc.com) if you have any questions.
Welcome to join the SimAI community chat groups, with the DingTalk group on the left and the WeChat group on the right.

