

Neil McGlohon edited this page Mar 17, 2021 · 3 revisions

Example Simulation: 1D Dragonfly (Dragonfly Dally)

Let's create a simple 1D Dragonfly simulation. We're going to configure CODES to generate a 1D Dragonfly network (defined by dragonfly-dally.C).

We'll need a few main things in order to do this:

  1. A workload generator binary
  2. A CODES network configuration file
  3. Tertiary network configuration files

Workload Generator

For simple synthetic traffic patterns, a workload generator binary for all CODES Dragonfly networks exists: model-net-synthetic-dragonfly-all.c. This is compiled by CODES and placed into the bin directory of your install folder. The workload generator accepts command line arguments specifying the kind of traffic and how much of it it should try to generate.

Most importantly, there are the --traffic=, --num_messages=, --payload_sz=, and --arrival_time= arguments. These let us specify what type of pattern the generator should create, how many messages each workload LP should send, how big each of these messages should be (the model-net layer breaks them up into packets), and how much time in nanoseconds passes between the creation of each message. This gives us flexibility over the duration of the workload and the strength of the injected traffic: we can shorten the arrival time to make messages more frequent, increase the payload size so that each message creates more packets, or both, to scale the effective bandwidth of injected traffic. Increasing the number of messages scales the duration of workload generation.

Let's say we want each of our workload ranks to generate 10 messages of 8192 bytes every 100us and send them to other workload ranks with a uniform random distribution. We'd specify --traffic=1 --num_messages=10 --payload_sz=8192 --arrival_time=100000 when we launch the simulation.
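Given those values, the implied per-rank injection rate and workload duration can be sanity-checked with a little arithmetic (a quick sketch, not part of CODES):

```python
# Effective per-rank injection implied by the synthetic workload knobs.
payload_sz = 8192        # bytes per message (--payload_sz)
arrival_time = 100000    # ns between message creations (--arrival_time)
num_messages = 10        # messages per workload rank (--num_messages)

# 1 byte/ns == 1 GB/s, so 8192 bytes every 100,000 ns = 0.08192 GB/s
injection_gb_per_s = payload_sz / arrival_time
workload_duration_ns = num_messages * arrival_time

print(f"{injection_gb_per_s:.5f} GB/s")   # 0.08192 GB/s per rank
print(workload_duration_ns)               # 1000000 ns of message generation
```

Doubling the payload size or halving the arrival time would each double the injected bandwidth.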

Configuration File

The next thing we need is a CODES network configuration file. Each network topology in CODES has its own set of parameters configurable through this file, and the main source of knowledge for these is the source file of each network itself (look for the *_read_config() function!).

Example configuration files can be found in src/network-workloads/conf/. For this example we're going to be looking at src/network-workloads/conf/dragonfly-dally/dfdally_72.conf:

LPGROUPS
{
   MODELNET_GRP
   {
      repetitions="36";
      nw-lp="2";
# these lp names are specific to the dragonfly dally model
      modelnet_dragonfly_dally="2";
      modelnet_dragonfly_dally_router="1";
   }
}
PARAMS
{
# packet size in the network
   packet_size="4096";
   modelnet_order=( "dragonfly_dally","dragonfly_dally_router" );
   # scheduler options
   modelnet_scheduler="fcfs";
# chunk size in the network (safest to keep equal to packet size)
   chunk_size="4096";
# number of routers in group
   num_routers="4";
# number of groups in the network
   num_groups="9";
# buffer size in bytes for local virtual channels
   local_vc_size="16384";
#buffer size in bytes for global virtual channels
   global_vc_size="16384";
#buffer size in bytes for compute node virtual channels
   cn_vc_size="32768";
#bandwidth in GiB/s for local channels
   local_bandwidth="2.0";
# bandwidth in GiB/s for global channels
   global_bandwidth="2.0";
# bandwidth in GiB/s for compute node-router channels
   cn_bandwidth="2.0";
# ROSS message size
   message_size="736";
# number of compute nodes connected to router, dictated by dragonfly config
# file
   num_cns_per_router="2";
# number of global channels per router
   num_global_channels="2";
# network config file for intra-group connections
   intra-group-connections="../src/network-workloads/conf/dragonfly-dally/dfdally-72-intra";
# network config file for inter-group connections
   inter-group-connections="../src/network-workloads/conf/dragonfly-dally/dfdally-72-inter";
# routing protocol to be used
   routing="minimal";
}

In this configuration file, we first tell CODES exactly how many LPs we want to be simulated.

It may appear somewhat complicated with the concept of 'repetitions', but essentially this grants flexibility in how LPs are spread out across the simulation. nw-lp is the general name for most workload LPs; this must match the LP type name registered by the workload, and CODES will let you know (and refuse to run) if it is wrong. modelnet_dragonfly_dally is the LP name for terminals/compute nodes in the Dragonfly Dally network, and modelnet_dragonfly_dally_router is the LP name for the routers/switches in the Dragonfly Dally network.

This section (generally) does not affect how the simulated system behaves; it just gives CODES/ROSS the information needed to initialize the simulation. Remember: CODES is built on top of ROSS, which is a PDES engine. Effective PDES execution at large scale relies on how well the simulation load (from a PDES perspective) is balanced. Let's say we had three physical processors executing the simulation: if ALL of the router LPs were mapped to a single processor, then that one processor would do almost all of the computation necessary for processing events relating to routing packets around the network. It would do far more work than the other two processors, and simulation performance would suffer. The concept of repetitions allows us to specify a minimum "block" of LPs which is repeated for the set number of repetitions.

In other words, say you had an example network with 5 routers and 1 terminal per router. Here's a bad way to define this block:

LPGROUPS
{
   MODELNET_GRP
   {
      repetitions="1";
      nw-lp="5";
# these lp names will be the same for dragonfly-custom model
      modelnet_dragonfly_dally="5";
      modelnet_dragonfly_dally_router="5";
   }
}

This will result in the list of LPs being [workload, workload, workload, workload, workload, terminal, terminal, terminal, terminal, terminal, router, router, router, router, router]. With three physical processors executing this simulation, we'll end up with all workload LPs being on one processor, all terminal LPs being on another, and all router LPs being on the last. This may not be very balanced and could result in poor parallel performance.
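To make the imbalance concrete, here is a small sketch (outside of CODES, assuming a simple contiguous LP-to-PE mapping) of how that LP list would split across three processors:

```python
# Build the LP list produced by the "bad" block above and split it
# contiguously across 3 processors, mimicking a blocked LP-to-PE mapping.
lps = ["workload"] * 5 + ["terminal"] * 5 + ["router"] * 5

num_pes = 3
block = len(lps) // num_pes   # 5 LPs per PE
mapping = [lps[i * block:(i + 1) * block] for i in range(num_pes)]

for pe, owned in enumerate(mapping):
    print(pe, owned)
# PE 0 holds only workload LPs, PE 1 only terminals, PE 2 only routers.
```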

Perhaps a better choice would be the following:

LPGROUPS
{
   MODELNET_GRP
   {
      repetitions="5";
      nw-lp="1";
# these lp names will be the same for dragonfly-custom model
      modelnet_dragonfly_dally="1";
      modelnet_dragonfly_dally_router="1";
   }
}

The corresponding list of LP types would be [workload, terminal, router, workload, terminal, router, workload, terminal, router, workload, terminal, router, workload, terminal, router]. We now have a much more balanced physical load per processor. There are numerous factors at play here (like inter-processor traffic and synchronization) that affect parallel performance, and there is no single golden solution, but a good rule of thumb is: set the repetitions equal to the number of routers in the simulation (or to the gcd of the counts of all three LP types) and set the other values accordingly until you get the correct number of simulated LPs for your desired model.
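As a quick sketch of that rule of thumb applied to the 72-node example (72 workload LPs, 72 terminals, 36 routers), the gcd recovers exactly the block used in dfdally_72.conf:

```python
from math import gcd

num_workload, num_terminal, num_router = 72, 72, 36
repetitions = gcd(gcd(num_workload, num_terminal), num_router)  # 36

block = (num_workload // repetitions,   # nw-lp per block
         num_terminal // repetitions,   # terminals per block
         num_router // repetitions)     # routers per block

print(repetitions, block)  # 36 (2, 2, 1) -- matches dfdally_72.conf
```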

The PARAMS configuration block has a few modelnet simulation parameters; the rest configure the chosen network itself.

message_size - DO NOT CONFUSE THIS WITH SIMULATED NETWORK MESSAGE SIZES. This is necessary for CODES to tell ROSS the size of the message struct that it needs to transmit. Making this automatic is planned; in the meantime, CODES/ROSS will error and tell you if it is wrong.

packet_size is the size of packets simulated in the network. Smaller packets correspond to finer simulation granularity. For a simple simulation without huge workloads, 128 bytes might be a fair value. But if each workload rank generates 10000 messages with an average size of 128KB, you'll be simulating over ten million packets per simulated workload rank. Each time a packet is transmitted between LPs is an event in the simulation, so it is very easy to find yourself simulating billions of events, which takes a significant amount of time to process. Increasing the packet size results in far fewer packets (and far fewer events) needed to facilitate the same workload, so the simulation runs faster. But this performance comes at a loss of simulation granularity: larger packet sizes lead to more discrete behavior, which at extremes can produce unexpected results. So be aware!
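The event-count explosion described above is easy to estimate with a back-of-the-envelope sketch (not part of CODES):

```python
import math

messages_per_rank = 10000
avg_message_bytes = 128 * 1024   # 128 KB
packet_size = 128                # bytes

# Packets needed to carry the workload at 128-byte granularity:
packets_per_rank = messages_per_rank * math.ceil(avg_message_bytes / packet_size)
print(packets_per_rank)  # 10240000 packets per simulated rank

# Raising packet_size to 4096 cuts the packet (and event) count by 32x:
print(messages_per_rank * math.ceil(avg_message_bytes / 4096))  # 320000
```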

chunk_size lets many CODES models break packets up into flit-like pieces. The behavior is not exactly flit-like, though, and the rule of thumb is to leave this equal to the packet size.

num_routers is the number of routers in a dragonfly router group. This must match the supplied tertiary connection files.

num_groups is the number of router groups in the network. This must also match the supplied tertiary connection files.

num_cns_per_router is the number of terminals connected to each router in the network.

num_global_channels is the number of ports on each router dedicated to global connections.

local_vc_size is the buffer size in bytes for each VC on local router ports (intra-group ports).

global_vc_size is the buffer size in bytes for each VC on global router ports (inter-group ports).

cn_vc_size is the buffer size in bytes for each VC on terminal/compute node connections (on both the terminal and router side of the connection).

*_bandwidth is the bandwidth in GiB/s for links of the specified type.

routing is the routing algorithm to apply: "minimal", "non-minimal" (Valiant), or "prog-adaptive" (PAR).
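These topology parameters fix the size of the network. For the dfdally_72.conf values, the totals work out as follows (a quick arithmetic sketch, not part of CODES):

```python
num_routers = 4          # routers per group
num_groups = 9
num_cns_per_router = 2
num_global_channels = 2  # global ports per router

total_routers = num_routers * num_groups                    # 36
total_terminals = total_routers * num_cns_per_router        # 72
global_links_per_group = num_routers * num_global_channels  # 8

# A fully connected inter-group pattern needs num_groups - 1 = 8 links
# per group, which is exactly what 4 routers x 2 global channels provide.
print(total_routers, total_terminals, global_links_per_group)  # 36 72 8
```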

intra-group-connections is the path to the generated binary file instructing how to connect routers to each other within their local group.

inter-group-connections is the path to the generated binary file instructing how to connect routers to routers in other groups.

Tertiary Configuration Files

The files for the last two configuration parameters are generated using a Python script (the example script used to generate the supplied example networks is found in scripts/dragonfly-dally/dragonfly-dally-topo-gen.py).

Executing our Simulation

CODES simulations are launched from the command line. This can all be done in a single line, but it is often easier (particularly with more complicated CODES configurations and workloads) to put the command in a bash script and edit it with a text editor, as these single-line commands can become quite long. So let's create a simple bash script that executes the simulation we've been building:

#! /bin/bash

# name our experiment (this will specify the name of a directory where certain model output files will be created)
EXP_NAME="simple-dfdally-example"
# let's create a variable we can set so that we can add a unique number and not overwrite saved stdout files
count=0

CODES_LOC="<PATH-TO>/codes"
CODES_BUILD_LOC="$CODES_LOC/build"

BIN="$CODES_BUILD_LOC/src/network-workloads/model-net-synthetic-dragonfly-all"
CONFIG="$CODES_LOC/src/network-workloads/conf/dragonfly-dally/dfdally_72.conf"

RANKS=4
TRAFFIC=1
NUM_MESSAGES=10
PAYLOAD_SIZE=8192
ARRIVAL_TIME=100000

# Let's create a parent directory where we will put our EXP_NAME output file directory into if it doesn't already exist
LP_IO_PARENT_DIR="lpio"
if [ ! -d $LP_IO_PARENT_DIR ]
then
    mkdir $LP_IO_PARENT_DIR
fi

# Let's create a parent directory where we will put our simulation standard output file into if it doesn't already exist
OUT_DIR="stdout"
if [ ! -d $OUT_DIR ]
then
    mkdir $OUT_DIR
fi

LP_IO_DIR="$EXP_NAME-$count"
mpirun -np $RANKS $BIN --sync=3 --traffic=$TRAFFIC --num_messages=$NUM_MESSAGES --payload_sz=$PAYLOAD_SIZE --arrival_time=$ARRIVAL_TIME --lp-io-dir=$LP_IO_PARENT_DIR/$LP_IO_DIR --lp-io-use-suffix=1 -- $CONFIG | tee "$OUT_DIR/$LP_IO_DIR.txt"

If we save this bash script as example-dfdally.sh and chmod +x example-dfdally.sh to make it executable, we can now just directly call ./example-dfdally.sh and CODES should execute the simulation! We'll get some statistics output:

Average number of hops traversed 3.383333 average chunk latency 18.790354 us maximum chunk latency 30.917578 us avg message size 8192.000000 bytes finished messages 720 finished chunks 1440

ADAPTIVE ROUTING STATS: 1440 chunks routed minimally 0 chunks routed non-minimally completed packets 1440 

Total packets generated 1440 finished 1440 Locally routed- same router 10 different-router 112 Remote (inter-group) 1318 

Synthetic Workload LP Stats: Mean Message Latency: 21.373816 us,  Maximum Message Latency: 32.834938 us, Total Messages Received: 2160
	Maximum Workload End Time 1030927.59

We can verify that we generated the correct number of packets. Each of the 72 compute nodes generated 10 messages of 8192 bytes. At a packet size of 4096, each of the 720 total messages required two packets, giving 1440 total generated packets. On the network side we can see the average latency of each chunk (packet), and we can also see the latency of the messages, i.e. the time it takes for an entire 8192-byte message to be transmitted through the network.
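That verification is just the following arithmetic (sketched here for clarity):

```python
import math

num_terminals = 72
messages_per_rank = 10
payload_sz = 8192
packet_size = 4096

total_messages = num_terminals * messages_per_rank         # 720
packets_per_message = math.ceil(payload_sz / packet_size)  # 2
total_packets = total_messages * packets_per_message       # 1440

print(total_messages, total_packets)  # 720 1440 -- matches the stats output
```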

There are also additional output files that give more detailed information about specific link, node, and workload rank statistics. The locations of these files are given in the stdout:

LP-IO: writing output to lpio/simple-dfdally-example-0-11840-1616010420/
LP-IO: data files:
   lpio/simple-dfdally-example-0-11840-1616010420/dragonfly-cn-stats
   lpio/simple-dfdally-example-0-11840-1616010420/dragonfly-link-stats
   lpio/simple-dfdally-example-0-11840-1616010420/synthetic-stats
