packet latency surrogate director api
As of Feb 6th 2024, the director and surrogate functionality is stored in the directories
codes/surrogate and src/surrogate. Git branch: kronos-develop, commit: 94ae872.
Each header file has its own corresponding .c or .C file following the same
structure. For example, src/surrogate/init.c implements codes/surrogate/init.h. This
is in contrast with the rest of CODES, where most headers are stored in the base directory
codes/ and the .c|.C files follow a separate structure.
This file documents what the director and the surrogate are, and how they are implemented.
The following definitions are meant as a snapshot of what each part of the implementation is and are subject to change.
- Director: A function/method that decides when to switch from high-fidelity to surrogate mode. It informs the network when to switch. It is triggered at every GVT.
- Predictor: An object (in OOP) which can predict the delay of delivering a packet from source to destination terminal. Additionally, it predicts the delay for the next packet to be processed by the source terminal.
- Surrogate: A coarse network model where packet routing is NOT simulated. Instead, packets are scheduled to arrive at their destination from the source terminal using the prediction given by the predictor.
- High-fidelity: Running the simulation "fully", i.e., packets are routed from hop to hop.
- Surrogate mode: Running the simulation with packet routing replaced by the (network) surrogate.
- Hybrid execution: A simulation that runs in one mode and switches to another at some point.
- Freeze event: Rescheduling an event (changing its timestamp) so that it occurs farther into the future. The event is rescheduled for a time past the next switching timestamp. If the event were to happen at t > switch_1, it will be rescheduled at t' = t - switch_1 + switch_2, where switch_1 and switch_2 are timestamps, with switch_1 < switch_2.
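The freeze rule above is a plain shift of the timestamp by the length of the surrogate period. A minimal sketch in C, where frozen_timestamp is a hypothetical helper written for illustration and not a function from CODES:

```c
#include <assert.h>

/* Hypothetical helper (not part of CODES): new timestamp of an event that is
 * frozen across the period between switch_1 and switch_2. */
double frozen_timestamp(double t, double switch_1, double switch_2)
{
    assert(switch_1 < switch_2); /* the two switches must be ordered */
    assert(t > switch_1);        /* only events past the first switch are frozen */
    return t - switch_1 + switch_2;
}
```

With the switching timestamps of the example configuration (1000e4 and 8900e4), an event originally scheduled at 1200e4 would land at 9100e4, keeping its offset of 200e4 from the switch.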
codes/surrogate/init.h (with src/surrogate/init.c) could be called config.h|c instead. It
initializes the surrogate machinery. The function surrogate_configure has to be called
before the simulation starts. It reads everything it needs from the .config file, from a
section named SURROGATE. The structure of that section is the following:
SURROGATE {
    director_mode="at-fixed-virtual-times";
    fixed_switch_timestamps=( "1000e4", "8900e4" );
    packet_latency_predictor="average"; # options: average, torch-jit
    ignore_until="200e4";
    torch_jit_mode="single-static-model-for-all-terminals";
    torch_jit_model_path="";
    network_treatment_on_switch="freeze"; # options: freeze, nothing
}
There is only one director_mode currently implemented: at-fixed-virtual-times.
It switches the simulation from high-fidelity to surrogate and back at fixed virtual times.
fixed_switch_timestamps only applies to the mode at-fixed-virtual-times. Its values are
otherwise unrestricted, i.e., we can switch back and forth at any positive floating point
number, as long as the list is strictly increasing.
In this example, the simulation starts in high-fidelity, then switches to surrogate at
1000e4 and back to high-fidelity at 8900e4.
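The timestamps are written as strings in the configuration, so they have to be turned into floating point numbers before the director can use them. The sketch below shows one way to do that with strtod while checking that the values are positive and strictly increasing. parse_switch_timestamps is a hypothetical function and does not reflect how surrogate_configure actually reads the section:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of CODES): converts the strings of
 * fixed_switch_timestamps into doubles.
 * Returns 0 on success, -1 if a value is not a number, is not positive,
 * or breaks the strictly increasing order. */
int parse_switch_timestamps(const char *const *strs, size_t n, double *out)
{
    assert(strs != NULL && out != NULL);
    double prev = 0.0; /* timestamps must be positive, so 0 is the lower bound */
    for (size_t i = 0; i < n; i++) {
        char *end = NULL;
        double t = strtod(strs[i], &end);
        if (end == strs[i] || *end != '\0') {
            return -1; /* empty string or trailing garbage */
        }
        if (!(t > prev)) {
            return -1; /* non-positive, NaN, or not strictly increasing */
        }
        out[i] = t;
        prev = t;
    }
    return 0;
}
```

For the example section, "1000e4" and "8900e4" parse to 1e7 and 8.9e7. Swapping the two strings makes the function return -1.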
packet_latency_predictor indicates which predictor to use for the surrogate. The current
options are:
- average: it is fed data from the simulation as the simulation runs. It ignores all data that comes before ignore_until (we use it to ignore all data from an uncongested network, because we only want packet latency data from stable states).
- torch-jit: it requires a path, torch_jit_model_path, from which to read a compliant model written in PyTorch and saved as TorchScript. Currently, there is only one option for torch_jit_mode: single-static-model-for-all-terminals. In this mode, the PyTorch model has to be trained already and it will not be modified with new data (i.e., all packet-latency data produced by the simulation will be disregarded).
network_treatment_on_switch indicates what to do with the network when switching from
one simulation mode to another (surrogate to high-fidelity, or vice versa). The two
options are:
- nothing: The network is left untouched when we switch to surrogate mode. For some time, as the in-flight packets are routed to their destination, two strategies for advancing the simulation will co-exist: packet routing simulation and direct source-to-destination terminal packet simulation. After that period of time, the simulation will run exclusively on the second strategy.
- freeze: All in-flight packets are suspended, so the network is "frozen" in place. This allows for a clear distinction between the two strategies (full network simulation or terminal-to-terminal packet scheduling). All in-flight packets at the switch are rescheduled to arrive at the destination immediately.
The director header contains all struct definitions the director needs to work, as well as
the director function. director_switch is called at every GVT, or, in the case of
sequential execution, whenever it decides it should be awakened again. In sequential mode,
the director is called at those fixed time stamps at which we switch from one mode to another.
The director has access to the following:
- director_data: a struct with two pointers to functions, both defined by the network model:
    - director_data.switch_surrogate: function that, when called, asks the network model to switch from high-fidelity to surrogate, and vice versa.
    - director_data.is_surrogate_on: function that, when called, tells us whether the network is in high-fidelity or surrogate mode. Currently, there is no real reason for this function to exist, because the director can easily keep track of the state: the network never switches to surrogate on its own. Thus, this function is meant as a failsafe, i.e., it is to be called to assert the director's belief against the network mode.
- A list of lp_types_switch, one per LP type in the simulation. lp_types_switch is a struct that contains information on whether to freeze an event. To determine whether to freeze an event, the director can use the following:
    - lp_types_switch.lpname: This is the name of the LP type. We can only keep track of LP types via their name. Names are only visible to CODES (not ROSS). Modelnet LP types have a name starting with modelnet_.
    - lp_types_switch.trigger_idle_modelnet: Modelnet LPs use a special event type (MN_BASE_SCHED_NEXT) to loop and process the next packet from their queues. If this event is frozen, the LP won't process any new packets until the network is awakened. This is expected behavior for the routers, but not for the terminals. Terminal LPs don't want their MN_BASE_SCHED_NEXT events to be frozen, thus they should have this variable set to true. Setting it to true forces the LP to reschedule all MN_BASE_SCHED_NEXT events that it needs to operate appropriately.
    - lp_types_switch.highdef_to_surrogate: it is in charge of switching this specific LP type from high-fidelity to surrogate.
    - lp_types_switch.surrogate_to_highdef: it is in charge of switching this specific LP type from surrogate to high-fidelity.
    - lp_types_switch.should_event_be_frozen: The director is in charge of freezing the network. It does so by checking every event in the simulation. If this function is NULL, no event associated with the LP type will be frozen. If it is a function, we use it to determine whether each event should be frozen or not.
- switch_at: a struct containing a list of time stamps at which the simulation should switch. We traverse the list one element at a time (we assume that the list is strictly monotonic, i.e., sorted and with unique values). To keep track of where in the list we are, we use current_i. total just tells us the total number of switching timestamps.
We implement director_switch, the director function. This is what it does when called:
1. Did we hit a trigger? (One of the switching-at time stamps.) YES, continue. NO, return.
2. Roll back and cancel all events that have to be rolled back (if in Optimistic mode).
3. Ask the network model to switch. It calls director_data.switch_surrogate.
4. Freeze the network, if the user set it up (otherwise, the network is not frozen).
5. Schedule another trigger (time stamp in the future), if one exists.
Currently step 2 does little (possibly nothing). We know that no event has to be rolled back because ROSS guarantees that it will stop and issue a trigger only when all unprocessed events that remain are past the trigger timestamp. Some event cancellations might still occur, which is why we run the function, but as of now it is neither needed nor useful.
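The control flow of the director can be sketched as follows. This is a simplified, hypothetical stand-in and not the code of director_switch: the struct director_sketch, its field names, and the toy network callbacks are invented for illustration, the rollback step is left out because it depends on ROSS, and the freeze step is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy network model: a single flag flipped by the switch callback. */
static bool toy_network_in_surrogate = false;
void toy_switch_surrogate(void) { toy_network_in_surrogate = !toy_network_in_surrogate; }
bool toy_is_surrogate_on(void) { return toy_network_in_surrogate; }

/* Hypothetical, simplified director state (names are not from CODES). */
struct director_sketch {
    void (*switch_surrogate)(void); /* asks the network model to change mode */
    bool (*is_surrogate_on)(void);  /* failsafe query of the network mode */
    bool freeze_on_switch;          /* true when the user asked for "freeze" */
    const double *switch_times;     /* strictly increasing timestamps */
    size_t current_i;               /* index of the next switching timestamp */
    size_t total;                   /* number of switching timestamps */
    bool in_surrogate;              /* the director's own record of the mode */
    unsigned freezes_done;          /* stand-in for the actual freeze pass */
};

/* Returns true if a switch was performed at time `now`. */
bool director_switch_sketch(struct director_sketch *d, double now)
{
    /* Step 1: did we hit a trigger? */
    if (d->current_i >= d->total || now < d->switch_times[d->current_i]) {
        return false;
    }
    /* Step 2 (rollback and cancel events) is omitted in this sketch. */
    /* Step 3: ask the network model to switch, then cross-check its mode. */
    d->switch_surrogate();
    d->in_surrogate = !d->in_surrogate;
    assert(d->is_surrogate_on() == d->in_surrogate);
    /* Step 4: freeze the network only if the user configured it. */
    if (d->freeze_on_switch) {
        d->freezes_done++;
    }
    /* Step 5: move to the next trigger; none is left once current_i == total. */
    d->current_i++;
    return true;
}
```

The assertion in step 3 is where is_surrogate_on acts as the failsafe described earlier: the director trusts its own record and only uses the query to detect disagreement.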
To freeze the network, we have to go through each LP and every event in the simulation.
Each LP will be frozen as it deems appropriate (we use the info stored in its
lp_types_switch). Each event will be checked individually and frozen according to
the lp_types_switch configuration of its LP type (if no configuration exists for an LP
type, no event from that LP type will be frozen).
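A minimal sketch of that per-event pass, under several assumptions: events are held in a flat array, each event carries an index into the list of LP type configurations, and a frozen event is shifted with t' = t - switch_1 + switch_2. The structs, the toy predicate, and freeze_events_sketch are hypothetical and much simpler than the ROSS and CODES data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of an event (not a ROSS struct). */
struct event_sketch {
    size_t lp_type; /* index into the list of LP type configurations */
    int kind;       /* event kind, meaningful only to the LP type */
    double ts;      /* timestamp */
};

/* Hypothetical subset of the per-LP-type configuration. */
struct lp_type_switch_sketch {
    const char *lpname;
    bool (*should_event_be_frozen)(const struct event_sketch *ev); /* NULL: never freeze */
};

/* Toy predicate: kind 0 stands for an event that must stay live. */
bool toy_terminal_should_freeze(const struct event_sketch *ev)
{
    return ev->kind != 0;
}

/* Shifts every event that its LP type wants frozen; returns how many were shifted. */
size_t freeze_events_sketch(struct event_sketch *events, size_t n_events,
                            const struct lp_type_switch_sketch *types, size_t n_types,
                            double switch_1, double switch_2)
{
    assert(switch_1 < switch_2);
    size_t frozen = 0;
    for (size_t i = 0; i < n_events; i++) {
        struct event_sketch *ev = &events[i];
        if (ev->lp_type >= n_types) {
            continue; /* no configuration for this LP type */
        }
        const struct lp_type_switch_sketch *cfg = &types[ev->lp_type];
        if (cfg->should_event_be_frozen == NULL || !cfg->should_event_be_frozen(ev)) {
            continue; /* this event keeps its timestamp */
        }
        ev->ts = ev->ts - switch_1 + switch_2;
        frozen++;
    }
    return frozen;
}
```

An LP type whose should_event_be_frozen is NULL is skipped entirely, which matches the rule that such types never have events frozen.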
common.h defines the predictor's interface. Namely:
- Four "methods":
    - init: initialization, called once per predictor (one predictor is attached to one terminal)
    - feed: individual packet start and finish "timestamps" are fed to the predictor one at a time with this function. We assume that they are committed data, thus no rollback will ever occur
    - predict: given a starting "timestamp", we predict when the packet will be delivered
    - predict_rc: because of optimistic execution, we might need to roll back the last predict call
- Starting "timestamp". We need more than a starting timestamp to determine the delay of packet delivery. In fact, we ask for:
    - packet id
    - destination terminal
    - start time
    - injection into terminal by workload
    - delay to process since last one
    - packet size
    - whether there's another packet in the terminal's queue
- End "timestamp". As per above, we require a bit more than a timestamp:
    - travel end time
    - delay to process next event in queue
We have to predict two values, travel end time and delay to process next event in queue, because those are the two values that emerge from the simulation and greatly influence the simulation accuracy. Other emergent values might be important for the surrogate network to keep track and use, but these are the values that the surrogate simulation cannot work properly without.
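The shape of such an interface can be sketched in C as a struct of four function pointers plus two plain structs for the start and end "timestamps". Every type name, field name, and function signature below is an assumption made for illustration and may differ from common.h. The fixed-delay predictor only exists to show how the four functions fit together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical start "timestamp": what predict and feed receive. */
struct packet_start_sketch {
    uint64_t packet_id;
    uint64_t dest_terminal;
    double travel_start;       /* start time */
    double workload_injection; /* injection into terminal by workload */
    double processing_delay;   /* delay to process since the last packet */
    uint32_t packet_size;
    bool another_packet_in_queue;
};

/* Hypothetical end "timestamp": the two predicted values. */
struct packet_end_sketch {
    double travel_end;
    double next_packet_delay;
};

/* Hypothetical interface: four "methods" acting on an opaque state. */
struct predictor_sketch {
    void (*init)(void *state, size_t num_terminals);
    void (*feed)(void *state, const struct packet_start_sketch *start,
                 const struct packet_end_sketch *end);
    struct packet_end_sketch (*predict)(void *state, const struct packet_start_sketch *start);
    void (*predict_rc)(void *state);
};

/* Toy implementation: always predicts the same latency and delay. */
struct fixed_state { double latency; double next_delay; };

static void fixed_init(void *state, size_t num_terminals)
{
    (void)num_terminals;
    struct fixed_state *s = state;
    s->latency = 100.0;
    s->next_delay = 10.0;
}

static void fixed_feed(void *state, const struct packet_start_sketch *start,
                       const struct packet_end_sketch *end)
{
    (void)state; (void)start; (void)end; /* a fixed predictor learns nothing */
}

static struct packet_end_sketch fixed_predict(void *state, const struct packet_start_sketch *start)
{
    assert(state != NULL && start != NULL);
    const struct fixed_state *s = state;
    struct packet_end_sketch end = { start->travel_start + s->latency, s->next_delay };
    return end;
}

static void fixed_predict_rc(void *state)
{
    (void)state; /* predict changed no state, so there is nothing to undo */
}

const struct predictor_sketch fixed_predictor = {
    fixed_init, fixed_feed, fixed_predict, fixed_predict_rc
};
```

A terminal would hold one state object, call init once, and then go through the function pointers, so that swapping the predictor only means pointing at a different struct.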
Note: common.c is an empty file. Its purpose is to make sure that common.h by itself
includes all the dependencies it needs to work properly. As mentioned before, the
"surrogate" header and implementation structures match one to one: every .h file has a
.c.
The average predictor is a simple predictor which keeps a count of packets and delays for each destination terminal, so that it can predict packet delays for each destination terminal independently. Because each prediction is independent of the others, and no variables are modified when predicting, there is no reverse predictor.
Its total size grows quadratically with the number of nodes in the simulated network.
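A minimal sketch of the idea: one accumulator per destination, with a sum of latencies and a packet count. The names are hypothetical. Comparing the packet's start time against ignore_until and returning a caller-supplied fallback when no data exists are assumptions of this sketch, not documented behavior.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-destination accumulator. */
struct avg_entry_sketch {
    double sum_latency;
    unsigned long count;
};

/* Hypothetical state of the predictor attached to one source terminal. */
struct avg_predictor_sketch {
    size_t num_terminals;
    double ignore_until;               /* data starting before this time is dropped */
    struct avg_entry_sketch *per_dest; /* one entry per destination terminal */
};

/* Returns 0 on success, -1 if memory could not be allocated. */
int avg_init(struct avg_predictor_sketch *p, size_t num_terminals, double ignore_until)
{
    p->per_dest = calloc(num_terminals, sizeof *p->per_dest);
    if (p->per_dest == NULL) {
        return -1;
    }
    p->num_terminals = num_terminals;
    p->ignore_until = ignore_until;
    return 0;
}

/* Records one delivered packet (committed data only). */
void avg_feed(struct avg_predictor_sketch *p, size_t dest, double start, double end)
{
    assert(dest < p->num_terminals);
    if (start < p->ignore_until) {
        return; /* assumed rule: warm-up data is ignored */
    }
    p->per_dest[dest].sum_latency += end - start;
    p->per_dest[dest].count++;
}

/* Mean latency seen so far for dest, or fallback if there is no data yet. */
double avg_predict_latency(const struct avg_predictor_sketch *p, size_t dest, double fallback)
{
    assert(dest < p->num_terminals);
    const struct avg_entry_sketch *e = &p->per_dest[dest];
    return e->count == 0 ? fallback : e->sum_latency / (double)e->count;
}

void avg_free(struct avg_predictor_sketch *p)
{
    free(p->per_dest);
    p->per_dest = NULL;
}
```

Since avg_predict_latency takes a const pointer and changes nothing, a rollback of a prediction has nothing to undo. With one such predictor per terminal and one entry per destination, the total number of entries is the square of the number of terminals.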
Instead of training an average model on the fly, we can load an already trained model. For this we use Torch, the C++ library for PyTorch.
The model must:
- Be "immutable": the same input gives the same output regardless of input history
- Have input shape of (-1, 4)
- Have output shape of (-1, 2)
- The four input values are floating point numbers representing: source terminal, destination terminal, packet size, and the flag is-there-another-packet-in-queue
- The two output values are: packet latency and next packet delay
Note that the model is loaded once on each rank at the start and never updated.
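To make the expected layout concrete, the sketch below packs one packet into a row of a row-major batch with four columns, in the order listed above. It is plain C and does not call Torch. The function name, the use of single-precision floats, and the encoding of the flag as 1.0 or 0.0 are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_INPUT_WIDTH = 4, MODEL_OUTPUT_WIDTH = 2 };

/* Hypothetical helper: writes row `row` of a row-major batch of shape (N, 4). */
void pack_model_input(float *batch, size_t row, unsigned src_terminal,
                      unsigned dest_terminal, unsigned packet_size,
                      bool another_packet_in_queue)
{
    assert(batch != NULL);
    float *x = batch + row * MODEL_INPUT_WIDTH;
    x[0] = (float)src_terminal;
    x[1] = (float)dest_terminal;
    x[2] = (float)packet_size;
    x[3] = another_packet_in_queue ? 1.0f : 0.0f; /* assumed flag encoding */
}
```

The matching output batch would have MODEL_OUTPUT_WIDTH columns per row: packet latency first, next packet delay second.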
The dragonfly network model is at the center of the whole surrogate implementation. These are some key reference points for the API:
- We configure the surrogate inside dragonfly_read_config once we have determined that we actually want to run in surrogate mode. For this, we fill in a struct surrogate_config (defined in codes/surrogate/init.h).
- We save all packet latency data into one text file per rank. The files follow the pattern %s/packets-delay-gid=%lu.txt, where %s is the path given by the user and %lu is the rank id. Only committed data is stored in memory.
- dragonfly_dally_terminal_highdef_to_surrogate is the function that switches a terminal LP from high-fidelity to surrogate. It browses through all in-flight packets and reschedules those that haven't yet been delivered. Additionally, it stores the current state of the LP in a different space in memory. We do this to control with precision what can be modified during the surrogate execution.
- When asked whether a terminal event should be frozen, the terminal replies that only MN_BASE_NEW_MSG events should NOT be frozen. Everything else is. See dragonfly_dally_terminal_should_event_be_frozen.
- We track packets sent to determine packet latency. Once the terminal is notified, it removes the entry.
- Zombies are also tracked in the terminal. We tell the destination terminals which in-flight packets are zombies. The zombie tracking list empties once there are no zombie packets in the network.
- Instead of overloading packet_generate, we created a new function, packet_generate_predicted, which is only called in surrogate mode. Its events are only processed by packet_arrive_predicted, instead of packet_arrive.
- We predict packet delay, not chunk delay. This allows for a larger speedup.
There are two significant changes in the model-net algorithm for the surrogate mode to work:
- MN_BASE_SCHED_NEXT events produced when the network was frozen are ignored. New MN_BASE_SCHED_NEXT events have to be rescheduled when switching from surrogate to high-fidelity (see second change below).
- model_net_method_switch_to_surrogate_lp and model_net_method_switch_to_highdef_lp are two functions used to coordinate the correct scheduling of MN_BASE_SCHED_NEXT when the network is frozen.
To be able to call the director at every GVT, we've modified ROSS' main loop. The original loop looks pretty much like this:
while not end of simulation:
    GVT operations (check for new events, compute GVT, rollback and cancel events)
    while event in batch events:
        process event
This is the loop after the changes:
while true:
    if fun_state == enabled:
        fun_state = triggered if all PEs reached trigger timestamp
        call fun(gvt)
        if fun_state == triggered:
            fun_state = disabled
    if end of simulation:
        break
    GVT operations (check for new events, compute GVT, rollback and cancel events)
    while event in batch events:
        if event > next trigger time:
            break
        process event
fun_state and fun are related to the director function. fun is the director
function. fun_state is a tri-state variable with possible values:
- disabled: the function fun won't be called
- enabled: the function fun will be called
- triggered: transitional state that only occurs inside the block of the main loop that calls fun. It is used to signal to the function fun that we have reached the timestamp we were set up to stop at. To be called again, the function fun has to schedule another triggering time in the future
In the implementation, the function fun is presented as "arbitrary function to be
executed at GVT". The pointer saving the function is g_tw_gvt_arbitrary_fun.
In order to schedule another trigger, a timestamp at which the simulation will STOP, we
use tw_trigger_arbitrary_fun_at. This function is to be used by the arbitrary function
(the director) to switch from and to surrogate.
The trigger_arbitrary_fun struct, in core/ross-gvt.h, stores fun_state (actually
named .active in the implementation) and the next triggering time stamp.
In the case of sequential execution, we only call the director function at trigger time stamps!