Fix simulation hang when QP allocation in SendFlow produces 0-byte or mismatched-size QPs#225
Fix simulation hang when QP allocation in SendFlow produces 0-byte or mismatched-size QPs#225
Conversation
…y_sender_sending_finished Co-authored-by: tianhao909 <48342395+tianhao909@users.noreply.github.com>
|
|
There was a problem hiding this comment.
Pull request overview
Fixes a simulation deadlock in the ns-3 network frontend when per-connection QP splitting can produce 0-byte QPs or when the total sent size doesn’t exactly match the requested message size, preventing send-complete callbacks from firing.
Changes:
- Skip creation/allocation of QPs that would carry 0 bytes in
SendFlow. - Relax sender completion condition in
notify_sender_sending_finishedfrom strict equality to>=to ensure callbacks fire on size mismatches.
Comments suppressed due to low confidence (1)
astra-sim-alibabacloud/astra-sim/network_frontend/ns3/entry.h:269
- The mismatch path here is now specifically
message_size < t2.count, but the error log still says "msg size !=" and it doesn’t include the expectedt2.count(or any flow/tag identifier). Also, the format string ends withflow_idbut no value is actually printed. Updating the log to report both actual/expected counts plusflowTag.current_flow_id/tagwill make stalls/mismatches diagnosable.
if (message_size >= t2.count) {
sentHash.erase(make_pair(tag, make_pair(sender_node, receiver_node)));
if (nodeHash.find(make_pair(sender_node, 0)) == nodeHash.end()) {
nodeHash[make_pair(sender_node, 0)] = message_size;
} else {
nodeHash[make_pair(sender_node, 0)] += message_size;
}
#ifdef NS3_MTP
cs.ExitSection();
#endif
t2.msg_handler(t2.fun_arg);
goto sender_end_1st_section;
}else{
NcclLog->writeLog(NcclLogLevel::ERROR,"sentHash msg size != sender_node %d receiver_node %d message_size %lu flow_id ",sender_node,receiver_node,message_size);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| MockNcclLog*NcclLog = MockNcclLog::getInstance(); | ||
| uint64_t PacketCount=((maxPacketCount+_QPS_PER_CONNECTION_-1)/_QPS_PER_CONNECTION_); | ||
| uint64_t leftPacketCount = maxPacketCount; | ||
| for(int index = 0 ;index<_QPS_PER_CONNECTION_;index++){ | ||
| uint64_t real_PacketCount = min(PacketCount,leftPacketCount); | ||
| leftPacketCount-=real_PacketCount; | ||
| if(real_PacketCount == 0) { | ||
| NcclLog->writeLog(NcclLogLevel::DEBUG," [SendFlow] skipping 0-byte QP index %d src %d dst %d flow_id %d",index,src,dst,request->flowTag.current_flow_id); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Skipping real_PacketCount == 0 QPs prevents the counter/map inflation, but it also means a 0-byte send (maxPacketCount==0) will schedule no QPs and therefore never call send_finish() / notify_sender_sending_finished(). Since AstraSimNetwork::sim_send() always inserts an entry into sentHash, a 0-byte send can now deadlock waiting for a completion that will never happen (e.g., Ring sets msg_size = data_size / nodes_in_ring, which can be 0 when data_size < nodes_in_ring). Consider handling maxPacketCount==0 (or “no QPs scheduled” after the loop) by immediately completing the send: erase the sentHash entry for (tag,src,dst) and invoke the upper-layer handler (or call notify_sender_sending_finished(src,dst,0,request->flowTag)).
When
_QPS_PER_CONNECTION_exceeds the available data or when custom per-QP allocation is used, two bugs inentry.hcombine to permanently stall the simulation: counters inwaiting_to_sent_callbacknever reach zero, and the upper-layer send-complete callback is never fired.Root causes & fixes
SendFlow: 0-byte QPs inflate the actual sent totalPreviously, when
maxPacketCount < _QPS_PER_CONNECTION_, surplus loop iterations computedreal_PacketCount = 0. Port allocation,sender_src_port_mapinsertion, and both waiting-counter increments had already happened before the check — which then bumped the size to 1 rather than skipping. This madeall_sent_chunksize > maxPacketCount, breaking the equality check below.Fix: Move the zero check before any resource allocation and use
continue:notify_sender_sending_finished: strict equality blocks the callbackThe condition
t2.count == message_sizemeant any size mismatch (from the above bug, or from integer rounding in custom allocation) silently dropped the callback, leaving the simulation deadlocked. The receiver-side counterpart (notify_receiver_receive_data) already handlesmessage_size > t2.countcorrectly.Fix: Relax to
>=, consistent with receiver-side logic:🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.