Skip to content

Fix simulation hang when QP allocation in SendFlow produces 0-byte or mismatched-size QPs#225

Draft
Copilot wants to merge 2 commits intomasterfrom
copilot/analyze-issue-224
Draft

Fix simulation hang when QP allocation in SendFlow produces 0-byte or mismatched-size QPs#225
Copilot wants to merge 2 commits intomasterfrom
copilot/analyze-issue-224

Conversation

Copy link

Copilot AI commented Mar 4, 2026

When _QPS_PER_CONNECTION_ exceeds the available data or when custom per-QP allocation is used, two bugs in entry.h combine to permanently stall the simulation: counters in waiting_to_sent_callback never reach zero, and the upper-layer send-complete callback is never fired.

Root causes & fixes

SendFlow: 0-byte QPs inflate the actual sent total

Previously, when maxPacketCount < _QPS_PER_CONNECTION_, surplus loop iterations computed real_PacketCount = 0. Port allocation, sender_src_port_map insertion, and both waiting-counter increments had already happened before the check — which then bumped the size to 1 rather than skipping. This made all_sent_chunksize > maxPacketCount, breaking the equality check below.

Fix: Move the zero check before any resource allocation and use continue:

// Before: silently inflated total bytes sent
if(real_PacketCount == 0) real_PacketCount = 1;

// After: skip the QP entirely — no port, no map entry, no counter bump
if(real_PacketCount == 0) {
    NcclLog->writeLog(NcclLogLevel::DEBUG, " [SendFlow] skipping 0-byte QP ...");
    continue;
}
uint32_t port = portNumber[src][dst]++;

notify_sender_sending_finished: strict equality blocks the callback

The condition t2.count == message_size meant any size mismatch (from the above bug, or from integer rounding in custom allocation) silently dropped the callback, leaving the simulation deadlocked. The receiver-side counterpart (notify_receiver_receive_data) already handles message_size > t2.count correctly.

Fix: Relax to >=, consistent with receiver-side logic:

// Before
if (t2.count == message_size) { ... }

// After
if (message_size >= t2.count) { ... }

🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

…y_sender_sending_finished

Co-authored-by: tianhao909 <48342395+tianhao909@users.noreply.github.com>
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI changed the title [WIP] Analyze issue 224 and propose solution Fix simulation hang when QP allocation in SendFlow produces 0-byte or mismatched-size QPs Mar 4, 2026
@tianhao909 tianhao909 requested a review from Copilot March 4, 2026 13:40
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a simulation deadlock in the ns-3 network frontend when per-connection QP splitting can produce 0-byte QPs or when the total sent size doesn’t exactly match the requested message size, preventing send-complete callbacks from firing.

Changes:

  • Skip creation/allocation of QPs that would carry 0 bytes in SendFlow.
  • Relax sender completion condition in notify_sender_sending_finished from strict equality to >= to ensure callbacks fire on size mismatches.
Comments suppressed due to low confidence (1)

astra-sim-alibabacloud/astra-sim/network_frontend/ns3/entry.h:269

  • The mismatch path here is now specifically message_size < t2.count, but the error log still says "msg size !=" and it doesn’t include the expected t2.count (or any flow/tag identifier). Also, the format string ends with flow_id but no value is actually printed. Updating the log to report both actual/expected counts plus flowTag.current_flow_id/tag will make stalls/mismatches diagnosable.
      if (message_size >= t2.count) {
        sentHash.erase(make_pair(tag, make_pair(sender_node, receiver_node)));
        if (nodeHash.find(make_pair(sender_node, 0)) == nodeHash.end()) {
          nodeHash[make_pair(sender_node, 0)] = message_size;
        } else {
          nodeHash[make_pair(sender_node, 0)] += message_size;
        }
        #ifdef NS3_MTP
        cs.ExitSection();
        #endif
        t2.msg_handler(t2.fun_arg);
        goto sender_end_1st_section;
      }else{
        NcclLog->writeLog(NcclLogLevel::ERROR,"sentHash msg size != sender_node %d receiver_node %d message_size %lu flow_id ",sender_node,receiver_node,message_size);
      }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines 109 to +118
MockNcclLog*NcclLog = MockNcclLog::getInstance();
uint64_t PacketCount=((maxPacketCount+_QPS_PER_CONNECTION_-1)/_QPS_PER_CONNECTION_);
uint64_t leftPacketCount = maxPacketCount;
for(int index = 0 ;index<_QPS_PER_CONNECTION_;index++){
uint64_t real_PacketCount = min(PacketCount,leftPacketCount);
leftPacketCount-=real_PacketCount;
if(real_PacketCount == 0) {
NcclLog->writeLog(NcclLogLevel::DEBUG," [SendFlow] skipping 0-byte QP index %d src %d dst %d flow_id %d",index,src,dst,request->flowTag.current_flow_id);
continue;
}
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping real_PacketCount == 0 QPs prevents the counter/map inflation, but it also means a 0-byte send (maxPacketCount==0) will schedule no QPs and therefore never call send_finish() / notify_sender_sending_finished(). Since AstraSimNetwork::sim_send() always inserts an entry into sentHash, a 0-byte send can now deadlock waiting for a completion that will never happen (e.g., Ring sets msg_size = data_size / nodes_in_ring, which can be 0 when data_size < nodes_in_ring). Consider handling maxPacketCount==0 (or “no QPs scheduled” after the loop) by immediately completing the send: erase the sentHash entry for (tag,src,dst) and invoke the upper-layer handler (or call notify_sender_sending_finished(src,dst,0,request->flowTag)).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants