packet latency surrogate director api

Director and surrogate API

Current implementation

As of Feb 6th 2024, the director and surrogate functionality are stored in the directories codes/surrogate and src/surrogate. Git branch: kronos-develop, commit: 94ae872.

Each header file has its own corresponding .c or .C file following the same structure. For example, src/surrogate/init.c implements codes/surrogate/init.h. This is in contrast with the rest of CODES, where most headers are stored at the base directory codes/ and the .c|.C files follow their own directory structure.

This file documents what the director and surrogate are, and how they are implemented.

Definitions

The following definitions are meant as a snapshot of what each part of the implementation is and are subject to change.

Director: A function/method that decides when to switch from high-fidelity to surrogate mode. It informs the network when to switch. It is triggered at every GVT.
Predictor: An object (in OOP) which can predict the delay of delivering a packet from source to destination terminal. Additionally, it predicts the delay for the next packet to be processed by the source terminal.
Surrogate: A coarse network model where packet routing is NOT simulated. Instead, packets are scheduled to arrive at their destination from the source terminal using the prediction given by the predictor.
High-fidelity: Running the simulation "fully", i.e., packets are routed from hop to hop.
Surrogate mode: Running the simulation replacing packet routing with the (network) surrogate.
Hybrid execution: A simulation that runs in one mode and switches to another at some point.
Freeze event: Rescheduling an event (changing its timestamp) so that it occurs farther into the future. The event is rescheduled for a time past the next switching timestamp. If the event were to happen at t > switch_1, it will be rescheduled at t' = t - switch_1 + switch_2, where switch_1 and switch_2 are timestamps, with switch_1 < switch_2.
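
As a quick illustration of the freeze formula above, here is a minimal sketch in C. The function name frozen_timestamp is hypothetical and does not come from the CODES sources; it only restates t' = t - switch_1 + switch_2.

```c
#include <assert.h>

/* Hypothetical helper (not part of CODES): computes the new timestamp t'
 * of a frozen event originally scheduled at t, where switch_1 is the
 * current switch time and switch_2 is the next one. */
double frozen_timestamp(double t, double switch_1, double switch_2) {
    assert(switch_1 < switch_2); /* switches must be ordered */
    assert(t > switch_1);        /* only events past the switch are frozen */
    return t - switch_1 + switch_2;
}
```

The offset of the event relative to the switch is preserved: an event that was 200 time units past switch_1 lands 200 time units past switch_2.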

`init.h` and `init.c`

This could be called config.h|c instead. It initializes the surrogate machinery. The function surrogate_configure has to be called before the simulation starts. It reads everything it needs from the .config file, from a "section name" called SURROGATE.

The "section name" has the following structure:

SURROGATE {
   director_mode="at-fixed-virtual-times";
   fixed_switch_timestamps=( "1000e4", "8900e4" );

   packet_latency_predictor="average";  # options: average, torch-jit
   ignore_until="200e4";

   torch_jit_mode="single-static-model-for-all-terminals";
   torch_jit_model_path="";

   network_treatment_on_switch="freeze";  # options: freeze, nothing
}

There is only one director_mode currently implemented: at-fixed-virtual-times. It switches the simulation from high-fidelity to surrogate and back at fixed virtual times.

fixed_switch_timestamps is only used by the mode at-fixed-virtual-times. It places no restrictions on the values, i.e., we can switch back and forth at any positive floating point number. In this example, the simulation starts in high-fidelity, then switches to surrogate at 1000e4 and back to high-fidelity at 8900e4.
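
To make the toggling behaviour concrete, the following sketch derives the mode from a list of switch timestamps. Both function names are hypothetical, and two details are assumptions of this sketch rather than documented behaviour: the list is expected to be strictly increasing, and a switch is treated as already applied when the current time equals its timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical check: all timestamps positive and strictly increasing. */
bool switch_times_valid(const double *times, size_t n) {
    assert(times != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (times[i] <= 0.0) return false;
        if (i > 0 && times[i] <= times[i - 1]) return false;
    }
    return true;
}

/* The simulation starts in high-fidelity and every switch timestamp
 * toggles the mode, so an odd number of passed switches means surrogate. */
bool surrogate_on_at(const double *times, size_t n, double now) {
    assert(times != NULL || n == 0);
    size_t passed = 0;
    while (passed < n && times[passed] <= now) passed++;
    return passed % 2 == 1;
}
```

With the timestamps from the example configuration (1000e4 and 8900e4), surrogate_on_at reports high-fidelity before 1000e4, surrogate between the two values, and high-fidelity again afterwards.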

packet_latency_predictor indicates which predictor to use for the surrogate. The current options are:

average: it is fed data from the simulation as the simulation runs. It will ignore all data that comes before ignore_until (we use it to ignore all data of an uncongested network, because we want packet latency data of only stable states).
torch-jit: it requires a path torch_jit_model_path to read a compliant model written in PyTorch and saved as a TorchScript. Currently, there is only one option for torch_jit_mode: single-static-model-for-all-terminals. In this mode, the PyTorch model has to be trained already and it will not be modified with new data (i.e., all packet-latency data produced by the simulation will be disregarded).

network_treatment_on_switch indicates what to do with the network when switching from one simulation mode to another (surrogate to high-fidelity, or vice versa). The two options are:

nothing: The network will be untouched when we switch to surrogate mode. For some time, as the in-flight packets are routed to their destination, two strategies for advancing the simulation will co-exist: packet routing simulation and source-to-destination terminals direct packet simulation. After that period of time, the simulation will run exclusively on the second strategy.
freeze: All in-flight packets are suspended, so the network is "frozen" in place. This allows for a clear distinction between the two strategies on surrogate mode (full network simulation or terminal-to-terminal packet scheduling). All in-flight packets at the switch are rescheduled to arrive at the destination immediately.

`switch.h`

Contains all struct definitions for the director to work and the director function.

director_switch is called at every GVT, or, in the case of sequential execution, whenever it decides it should be awakened again. In sequential mode, the director is called at those fixed time stamps at which we switch from one mode to another.

The director has access to the following:

director_data: a struct with two pointers to functions, both defined by the network model
- director_data.switch_surrogate: function that, when called, asks the network model to switch from high-fidelity to surrogate, and vice versa
- director_data.is_surrogate_on: function that, when called, informs us of whether the network is in high-fidelity mode or surrogate. Currently, there is no real reason for this function to exist because the director can keep track of the state with ease. It is not the case that the network will switch on its own to surrogate. Thus, this function is thought of more as a failsafe, i.e., it is to be called to assert the director's belief against the network mode.
A list of lp_types_switch, one per LP type in the simulation.

lp_types_switch is a struct that contains information on whether to freeze an event. To determine whether to freeze an event the director can use the following:
- lp_types_switch.lpname: This is the name of the LP type. We can only keep track of LP types via their name. Names are only visible to CODES (not ROSS). Modelnet LP types have a name starting with modelnet_.
- lp_types_switch.trigger_idle_modelnet: Modelnet LPs use a special event type (MN_BASE_SCHED_NEXT) to loop and process the next packet from their queues. If this event is frozen, the LP won't process any new packets until the network is awakened. This is expected behavior for the routers, but it is not for the terminals. Terminal LPs don't want their MN_BASE_SCHED_NEXT events to be frozen, thus they should have this variable set to true. Setting it to true will force the LP to reschedule all MN_BASE_SCHED_NEXT events that it needs to operate appropriately.
- lp_types_switch.highdef_to_surrogate: it is in charge of switching this specific LP type from high-fidelity to surrogate.
- lp_types_switch.surrogate_to_highdef: it is in charge of switching this specific LP type from surrogate to high-fidelity.
- lp_types_switch.should_event_be_frozen: The director is in charge of freezing the network. It does so by checking every event in the simulation. If this function is NULL, it will not freeze the specific event (associated with the LP type). If it's a function, we will use it to determine whether the event should be frozen or not.
switch_at: a struct containing a list of time stamps at which the simulation should switch. We traverse the list one element at a time (we assume that the list is strictly monotonic, i.e., sorted and with unique values). To keep track of where in the list we are, we use current_i. total just tells us the total number of switching timestamps.
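
The structs described above could look roughly like the sketch below. This is not a copy of switch.h: the member types, the callback signatures, the opaque struct event, the time_stamps member name, and the helper functions are all assumptions made for illustration. Only the member names lpname, trigger_idle_modelnet, highdef_to_surrogate, surrogate_to_highdef, should_event_be_frozen, current_i and total come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct event; /* opaque stand-in for a ROSS event (assumption) */

struct lp_types_switch {
    const char *lpname;             /* LP type name, visible only to CODES */
    bool trigger_idle_modelnet;     /* true for terminals, false for routers */
    void (*highdef_to_surrogate)(void *lp_state);
    void (*surrogate_to_highdef)(void *lp_state);
    bool (*should_event_be_frozen)(const struct event *ev); /* NULL: never freeze */
};

struct switch_at {
    size_t current_i;          /* position in the list */
    size_t total;              /* number of switching timestamps */
    const double *time_stamps; /* strictly increasing (member name assumed) */
};

/* Hypothetical lookup: LP types can only be identified by name. */
const struct lp_types_switch *find_lp_type(const struct lp_types_switch *types,
                                           size_t n, const char *lpname) {
    assert(lpname != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(types[i].lpname, lpname) == 0) return &types[i];
    return NULL;
}

/* An event is frozen only if its LP type is configured and provides a predicate. */
bool director_should_freeze(const struct lp_types_switch *type,
                            const struct event *ev) {
    if (type == NULL || type->should_event_be_frozen == NULL) return false;
    return type->should_event_be_frozen(ev);
}

/* Sample predicate used only to exercise the sketch. */
bool freeze_everything(const struct event *ev) {
    (void)ev;
    return true;
}
```

The NULL check mirrors the rule stated for should_event_be_frozen: an LP type without a predicate never has its events frozen.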

`switch.c`

We implement director_switch, the director function. This is what it does when called:

1. Did we hit a trigger? (One of the switching-at time stamps.) YES, continue. NO, return.
2. Roll back and cancel all events that have to be rolled back (if in Optimistic mode).
3. Ask the network model to switch. It calls director_data.switch_surrogate.
4. Freeze the network, if the user set it up (otherwise, the network is not frozen).
5. Schedule another trigger (time stamp in the future), if one exists.

Currently step 2 does little (possibly nothing). We know that no event has to be rolled back because ROSS guarantees that it will stop and issue a trigger only when all unprocessed events that remain are past the trigger timestamp. Some event cancellations might still occur, which is why we run the function, but it is not needed or useful as of now.

To freeze the network, we have to go through each LP and every event in the simulation. Each LP is frozen as it deems appropriate (we use the info stored in its lp_types_switch). Each event is checked individually and frozen according to the lp_types_switch configuration of its LP type (if no configuration exists, no event from that LP type is frozen).
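
The control flow of director_switch can be sketched as a small state update. Everything in this block is a simplified stand-in: struct director and director_switch_sketch are hypothetical names, the rollback step is a no-op, and the network is modelled as two booleans. The sketch also assumes that freezing happens only when entering surrogate mode and that the network is awakened when returning to high-fidelity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct director {
    const double *switch_times; /* strictly increasing */
    size_t total;
    size_t current_i;
    bool surrogate_on;
    bool freeze_network;  /* network_treatment_on_switch == "freeze" */
    bool network_frozen;
    double next_trigger;  /* negative when no trigger is scheduled */
};

/* Returns true if a switch happened at this call. */
bool director_switch_sketch(struct director *d, double gvt) {
    assert(d != NULL);
    /* Step 1: did we hit a trigger? */
    if (d->current_i >= d->total || gvt < d->switch_times[d->current_i])
        return false;
    /* Step 2: roll back and cancel events (nothing to do in this sketch). */
    /* Step 3: ask the network model to switch. */
    d->surrogate_on = !d->surrogate_on;
    /* Step 4: freeze the network if configured (assumption: only while in surrogate). */
    d->network_frozen = d->freeze_network && d->surrogate_on;
    /* Step 5: schedule the next trigger, if one exists. */
    d->current_i++;
    d->next_trigger = d->current_i < d->total ? d->switch_times[d->current_i] : -1.0;
    return true;
}
```

Calling the function with a GVT below the next switch timestamp leaves the state untouched, which corresponds to the early return in the first step.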

`packet-latency-predictor/common.h`

common.h defines the predictor's interface. Namely:

Four "methods":
- init: initialization, called once per predictor (one predictor is attached to one terminal)
- feed: individual packet start and finish "timestamps" are fed to the predictor one at a time with this function. We assume that they are committed data, thus no rollback will ever occur
- predict: given a starting "timestamp", we predict when the packet will be delivered
- predict_rc: because of optimistic execution, we might need to roll back the last predict timestamp
Starting "timestamp". We need more than a starting timestamp to determine the delay of packet delivery. In fact, we ask for:
- packet id
- destination terminal
- start time
- injection into terminal by workload
- delay to process since last one
- packet size
- whether there's another packet in the terminal's queue
End "timestamp". As per above, we require a bit more than a timestamp:
- travel end time
- delay to process next event in queue

We have to predict two values, travel end time and delay to process next event in queue, because those are the two values that emerge from the simulation and greatly influence the simulation accuracy. Other emergent values might be important for the surrogate network to keep track and use, but these are the values that the surrogate simulation cannot work properly without.

Note: common.c is an empty file. Its purpose is to make sure that common.h by itself includes all the dependencies it needs to work properly. As mentioned before, the "surrogate" header and implementation structures match one to one: every .h file has a .c.

`packet-latency-predictor/average.h|c`

Simple predictor which keeps a count of packets and delays for each destination terminal, so that it can predict packet delays for each destination terminal independently. Because each prediction is independent of the others, and no variables are modified when predicting, there is no reverse predictor.

Its total size grows quadratically with the number of nodes in the simulated network.
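
The bookkeeping behind such an averaging predictor can be sketched as one running sum and one counter per destination terminal. The names below are hypothetical, only the latency is averaged (the real predictor also produces the next-packet delay), and returning zero latency for a destination without data is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>

struct avg_predictor {
    size_t num_terminals;
    double *sum_latency;   /* indexed by destination terminal */
    unsigned long *count;  /* packets seen per destination */
    double ignore_until;   /* data starting before this time is dropped */
};

int avg_init(struct avg_predictor *p, size_t num_terminals, double ignore_until) {
    p->sum_latency = calloc(num_terminals, sizeof *p->sum_latency);
    p->count = calloc(num_terminals, sizeof *p->count);
    if (p->sum_latency == NULL || p->count == NULL) {
        free(p->sum_latency);
        free(p->count);
        return -1;
    }
    p->num_terminals = num_terminals;
    p->ignore_until = ignore_until;
    return 0;
}

/* Fed with committed data only, so no rollback handling is needed. */
void avg_feed(struct avg_predictor *p, size_t dest, double start, double end) {
    assert(dest < p->num_terminals);
    if (start < p->ignore_until) return;
    p->sum_latency[dest] += end - start;
    p->count[dest]++;
}

/* Read-only: predicting modifies nothing, hence no reverse handler. */
double avg_predict(const struct avg_predictor *p, size_t dest, double start) {
    assert(dest < p->num_terminals);
    double latency = p->count[dest] ? p->sum_latency[dest] / (double)p->count[dest] : 0.0;
    return start + latency;
}

void avg_free(struct avg_predictor *p) {
    free(p->sum_latency);
    free(p->count);
}
```

Since every terminal owns one predictor and every predictor holds one slot per destination, N terminals need on the order of N x N slots, which is where the quadratic growth comes from.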

`packet-latency-predictor/torch-jit.h|C`

Instead of training an average model on the fly, we can load an already trained model. For this we use Torch, the C++ library for PyTorch.

The model must:

Be "immutable", same input equals same output regardless of input history
Have input shape of (-1, 4)
Have output shape of (-1, 2)
The four input values are floating point numbers representing: source terminal, destination terminal, packet size, and flag is-there-another-packet-in-queue
The two output values are: packet latency and next packet delay

Note that the model is loaded once on each rank at the start and never updated.
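
To show what a single input row of shape (1, 4) contains, here is a small sketch that packs the four features into a float array. The function name is hypothetical, the feature order is assumed to follow the listing above, and the actual model call through the Torch C++ API is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NUM_FEATURES = 4, NUM_OUTPUTS = 2 };

/* Fills one input row: source terminal, destination terminal,
 * packet size, and the is-there-another-packet-in-queue flag. */
void pack_model_input(float row[NUM_FEATURES], unsigned src_terminal,
                      unsigned dest_terminal, unsigned packet_size,
                      bool another_packet_in_queue) {
    assert(row != NULL);
    row[0] = (float)src_terminal;
    row[1] = (float)dest_terminal;
    row[2] = (float)packet_size;
    row[3] = another_packet_in_queue ? 1.0f : 0.0f;
}
```

The model would answer with NUM_OUTPUTS values per row: the packet latency and the next packet delay.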

`src/networks/model-net/dragonfly-dally.C`

The dragonfly network model is at the center of the whole surrogate implementation. These are some key reference points for the API:

We configure the surrogate inside of dragonfly_read_config once we have determined that we actually want to run in surrogate mode. For this, we fill up a struct surrogate_config (defined in codes/surrogate/init.h)
We save all packet latency data into one text file per rank. The files follow the pattern %s/packets-delay-gid=%lu.txt, where %s is the path given by the user and %lu is the rank id. Only committed data is stored in memory.
dragonfly_dally_terminal_highdef_to_surrogate is the function that switches a terminal LP from high-fidelity to surrogate. It browses through all in-flight packets and reschedules those that haven't yet been delivered. Additionally, it stores the current state of the LP into a different space in memory. We do this to control with precision what can be modified during the surrogate execution.
When asked whether a terminal event should be frozen, the terminal replies that only MN_BASE_NEW_MSG events should NOT be frozen. Everything else is. See dragonfly_dally_terminal_should_event_be_frozen
We track packets sent to determine packet latency. Once the terminal is notified, it removes the entry.
Zombies are also tracked in the terminal. We tell the destination terminals which in-flight packets are zombies. The zombie tracking list empties once there are no zombie packets in the network.
Instead of overloading packet_generate, we created a new function that is only called in surrogate mode: packet_generate_predicted. The events it produces are only processed by packet_arrive_predicted, instead of packet_arrive
We predict packet delay not chunk delay. This allows for larger speedup.
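
The per-rank file name pattern mentioned above can be reproduced with snprintf. The wrapper function below is hypothetical; only the format string %s/packets-delay-gid=%lu.txt comes from the text.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the per-rank packet latency file path into buf.
 * Returns the number of characters the full path needs (snprintf semantics),
 * so a result >= len means the path was truncated. */
int packet_delay_path(char *buf, size_t len, const char *dir, unsigned long rank) {
    assert(buf != NULL && dir != NULL);
    return snprintf(buf, len, "%s/packets-delay-gid=%lu.txt", dir, rank);
}
```

For a user-given path "out" and rank 3 this produces out/packets-delay-gid=3.txt.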

`src/networks/model-net/core/model-net-lp.c`

There are two significant changes in the model-net algorithm for the surrogate mode to work:

MN_BASE_SCHED_NEXT events produced when the network was frozen are ignored. New MN_BASE_SCHED_NEXT have to be rescheduled when switching from surrogate to high-fidelity (see second change below).
model_net_method_switch_to_surrogate_lp and model_net_method_switch_to_highdef_lp are two functions used to coordinate the correct scheduling of MN_BASE_SCHED_NEXT when the network is frozen.

ROSS-side of the implementation

To be able to call the director at every GVT, we've modified ROSS' main loop. The original loop looks pretty much like this:

while not end of simulation:
    GVT operations (check for new events, compute GVT, rollback and cancel events)

    while event in batch events:
        process event

This is the loop after the changes:

while true:
    if fun_state == active:
        fun_state = triggered if all PEs reached trigger timestamp
        call fun(gvt)
        if fun_state == triggered:
            fun_state = disabled

    if end of simulation:
        break

    GVT operations (check for new events, compute GVT, rollback and cancel events)

    while event in batch events:
        if event > next trigger time:
            break
        process event

fun_state and fun are related to the director function. fun is the director function. fun_state is a tri-state variable with possible values:

disabled: the function fun won't be called
active: the function fun will be called
triggered: transitional state that will only occur within the if fun_state == active portion of the code. It is used to signal to the function fun that we have reached the timestamp we were set up to stop at. To be called again, the function fun has to schedule another triggering time in the future

In the implementation, the function fun is presented as "arbitrary function to be executed at GVT". The pointer saving the function is g_tw_gvt_arbitrary_fun.

In order to schedule another trigger, a timestamp at which the simulation will STOP, we use tw_trigger_arbitrary_fun_at. This function is to be used by the arbitrary function (the director) to switch from and to surrogate.

The trigger_arbitrary_fun struct, in core/ross-gvt.h, stores fun_state (actually named .active in the implementation) and the next triggering time stamp.

In the case of sequential execution, we only call the director function at trigger time stamps!
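
The modified loop and the tri-state variable can be exercised with a toy simulation. This is a single-process sketch under strong simplifications and is not ROSS code: the event queue is a sorted array of timestamps, GVT operations are omitted, and struct sim, trigger_fun_at (standing in for tw_trigger_arbitrary_fun_at), director and run_loop are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum fun_state { FUN_DISABLED, FUN_ACTIVE, FUN_TRIGGERED };

struct sim {
    const double *events;        /* sorted event timestamps */
    size_t n_events, next_event; /* next_event doubles as "events processed" */
    enum fun_state fun_state;
    double trigger_at;
    const double *switch_times;  /* consumed by the sample director */
    size_t n_switches, next_switch;
    size_t seen_at_trigger[8];   /* events processed when each trigger fired */
};

/* Re-arms the trigger, like the director does to be called again. */
void trigger_fun_at(struct sim *s, double when) {
    s->trigger_at = when;
    s->fun_state = FUN_ACTIVE;
}

/* Sample director: acts only when triggered, then schedules the next switch. */
void director(struct sim *s) {
    if (s->fun_state != FUN_TRIGGERED) return;
    assert(s->next_switch < 8);
    s->seen_at_trigger[s->next_switch++] = s->next_event;
    if (s->next_switch < s->n_switches)
        trigger_fun_at(s, s->switch_times[s->next_switch]);
}

void run_loop(struct sim *s, size_t batch) {
    assert(batch > 0);
    for (;;) {
        if (s->fun_state == FUN_ACTIVE) {
            /* triggered once every remaining event lies past the trigger time */
            if (s->next_event >= s->n_events ||
                s->events[s->next_event] > s->trigger_at)
                s->fun_state = FUN_TRIGGERED;
            director(s);
            if (s->fun_state == FUN_TRIGGERED) s->fun_state = FUN_DISABLED;
        }
        if (s->next_event >= s->n_events) break; /* end of simulation */
        /* GVT operations would go here */
        for (size_t i = 0; i < batch && s->next_event < s->n_events; i++) {
            if (s->fun_state == FUN_ACTIVE &&
                s->events[s->next_event] > s->trigger_at)
                break; /* stop at the trigger timestamp */
            s->next_event++; /* "process" the event */
        }
    }
}
```

With events at 1 through 6 and switch times 2.5 and 4.5, the director fires after two and after four processed events, and fun_state ends as disabled because no further trigger is scheduled.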
