
SimpleKernelTimer

Vivek Kale edited this page Jul 10, 2024 · 7 revisions

Tool Description

Simple Kernel Timer provides statistics about the execution time of kernels (parallel regions). For each kernel it records the number of calls and the total execution time. If the developer provides a string label for the parallel region, that label is used as the identifier. Otherwise, the C++ type name of the functor or lambda is used. Note that fencing is always turned on for the simple-kernel-timer's callbacks.
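The bookkeeping described above can be sketched in a few lines of standalone C++. This is only an illustration of the idea (one record per kernel name, holding a call count and an accumulated time), not the tool's actual source; the names KernelStats, SimpleTimerSketch, begin, end and record are hypothetical and do not come from Kokkos Tools.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-kernel record: number of calls and accumulated wall time.
struct KernelStats {
  std::uint64_t calls = 0;
  double total_seconds = 0.0;
};

// Hypothetical timer keyed by kernel name (a user label, or a type name
// when no label was given). Not the Kokkos Tools implementation.
class SimpleTimerSketch {
 public:
  // Marks the start of a kernel with the given name.
  void begin(const std::string& name) {
    current_ = name;
    start_ = std::chrono::steady_clock::now();
  }

  // Marks the end of the kernel started by begin() and records its duration.
  void end() {
    assert(!current_.empty() && "end() called without a matching begin()");
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start_;
    record(current_, elapsed.count());
    current_.clear();
  }

  // Adds one call of known duration to the record for this name.
  void record(const std::string& name, double seconds) {
    KernelStats& s = stats_[name];  // creates a zeroed record on first use
    s.calls += 1;
    s.total_seconds += seconds;
  }

  const std::map<std::string, KernelStats>& stats() const { return stats_; }

 private:
  std::map<std::string, KernelStats> stats_;
  std::string current_;
  std::chrono::steady_clock::time_point start_;
};
```

Because every call with the same name lands in the same record, giving each parallel region a distinct label keeps its statistics separate in the report.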

The tool is located at: https://github.com/kokkos/kokkos-tools/tree/develop/profiling/simple-kernel-timer

Compilation

Simply type "make" inside the source directory. When compiling for a specific platform, modify the Makefile to use the correct compiler and compiler flags.

Alternatively, use the CMake build system. The simple-kernel-timer is one of the tools that the Kokkos Tools CMake build system builds by default.

Usage

This is a standard tool which does not yet support tool chaining. In Bash, run:

export KOKKOS_TOOLS_LIBS={PATH_TO_TOOL_DIRECTORY}/kp_kernel_timer.so
./application COMMANDS

This tool uses on the order of 200 bytes per unique kernel.

Output

The SimpleKernelTimer tool generates one file per process containing the list of kernels. The files are named HOSTNAME-PROCESSID.dat. Each file is binary and requires the kp_reader tool from the tool directory to be read. The kp_reader tool can read multiple files at the same time and will combine the results. This is useful, for example, to combine the results of multiple MPI ranks.

Example Output

Consider the following code:

#include <Kokkos_Core.hpp>
#include <cstdio>  // printf

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc,argv);
  {
    int N = 100000000;
  
    Kokkos::View<double*> a("A",N);
    Kokkos::View<double*> b("B",N);
    Kokkos::View<double*> c("C",N);
  
    Kokkos::parallel_for(N, KOKKOS_LAMBDA (const int& i) {
      a(i) = 1.0*i;
      b(i) = 1.5*i;
      c(i) = 0.0;
    });
  
    double result = 0.0;
    for(int k = 0; k<50; k++) {
    
      Kokkos::parallel_for("AXPB", N, KOKKOS_LAMBDA (const int& i) {
        c(i) = 1.0*k*a(i) + b(i);
      });
    
      double dot;
      Kokkos::parallel_reduce("Dot", N, KOKKOS_LAMBDA (const int& i, double& lsum) {
        lsum += c(i)*c(i);
      },dot);
      result += dot;
  
    }
    printf("Result: %lf\n",result);
  }
  Kokkos::finalize();
}

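Using kp_reader to read the output file of a run produces the following output. The columns are: Name, Total Time, Calls, Time/call, % of Kokkos Time, % of Total Time. Times are in seconds. The row named Z4mainE3$_0 is the unlabeled initialization kernel, which is identified by the type name of its lambda.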

       AXPB         9.61008                   50         0.19220  63.870  59.312
        Dot         2.85686                   50         0.05714  18.987  17.632
Z4mainE3$_0         2.57932                    1         2.57932  17.143  15.919

-------------------------------------------------------------------------
Summary:

Total Execution Time (incl. Kokkos + Non-Kokkos:                   16.20268 seconds
Total Time in Kokkos kernels:                                      15.04626 seconds
   -> Time outside Kokkos kernels:                                  1.15642 seconds
   -> Percentage in Kokkos kernels:                                   92.86 %
Total Calls to Kokkos Kernels:                                          101

-------------------------------------------------------------------------
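The derived columns and the summary can be cross-checked from the raw totals. The sketch below assumes, based only on the numbers in the sample output above, that Time/call is total time divided by calls and that the two percentage columns are the kernel's total time relative to the summed kernel time and to the total execution time. KernelRow, Derived, kokkos_total and derive are hypothetical names for illustration and are not part of kp_reader.

```cpp
#include <cassert>
#include <vector>

// Hypothetical row holding the two measured values for one kernel.
struct KernelRow {
  double total_seconds;
  long calls;
};

// Hypothetical container for the three derived columns.
struct Derived {
  double time_per_call;
  double pct_of_kokkos;
  double pct_of_total;
};

// Sums the total time over all kernels ("Total Time in Kokkos kernels").
double kokkos_total(const std::vector<KernelRow>& rows) {
  double sum = 0.0;
  for (const KernelRow& r : rows) sum += r.total_seconds;
  return sum;
}

// Computes the derived columns for one row, under the assumptions stated
// in the lead-in (inferred from the sample output, not from the tool source).
Derived derive(const KernelRow& row, double kokkos_seconds,
               double total_seconds) {
  assert(row.calls > 0 && kokkos_seconds > 0.0 && total_seconds > 0.0);
  return {row.total_seconds / static_cast<double>(row.calls),
          100.0 * row.total_seconds / kokkos_seconds,
          100.0 * row.total_seconds / total_seconds};
}
```

Feeding in the three rows above (9.61008 s over 50 calls, 2.85686 s over 50 calls, 2.57932 s over 1 call) with a total execution time of 16.20268 s reproduces the kernel total of 15.04626 s, and the AXPB row comes out at roughly 0.1922 s per call, 63.87 % of Kokkos time and 59.31 % of total time.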