Skip to content
This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Fixes for building with LLVM / XL OpenMP offload #706

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

pramodk
Copy link
Collaborator

@pramodk pramodk commented Dec 10, 2021

Various changes (including temporary) to make XL OpenMP offload build working
 * todo: need to rebase this branch on latest hackathon_master
 * todo: temporary changes to OpenAccHelper.cmake, needs refinement
 * todo: see caliper linkling issue
 * todo: _OPENACC needs to be renamed CORENRN_ENABLE_GPU so that OpenMP
         based builds can use GPU offload.
 * todo: hardcoded CXX flags for quick build

How to test this?

Outstanding issues?

  • Building with XLC gives (57cecb5):
make[2]: *** [coreneuron/CMakeFiles/coreneuron.dir/build.make:114: coreneuron/CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o] Error 1
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/io/mech_report.cpp:12:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/coreneuron.hpp:24:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/utils/randoms/nrnran123.h:42:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/philox.h:37:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/compilerfeatures.h:218:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/clangfeatures.h:91:
/m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/gccfeatures.h:48:10: fatal error: 'ppu_intrinsics.h' file not found
#include <ppu_intrinsics.h>
         ^~~~~~~~~~~~~~~~~~
  • Running ringtest with XLC gives (57cecb5):
(venv) [pkumbhar@login01 build]$ ./bin/ppc64le/special-core  -d ../tests/integration/ring --gpu --cell-permute=2
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 57cecb59 (2022-01-03 17:00:52 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 390.8125, Min 390.8125, Avg 390.8125
 Memory (MBs) :            After MPI_Init : Max 390.8125, Min 390.8125, Avg 390.8125
 Memory (MBs) :          Before nrn_setup : Max 390.8125, Min 390.8125, Avg 390.8125
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
 Setup Done   : 0.13 seconds
 Model size   : 84.19 kB
 Memory (MBs) :          After nrn_setup  : Max 398.8750, Min 398.8750, Avg 398.8750
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=100

GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false

INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 398.8750, Min 398.8750, Avg 398.8750
 Memory (MBs) :     After nrn_finitialize : Max 398.6875, Min 398.6875, Avg 398.6875

1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-163 Error encountered while attempting to execute on the target device 0.  The program will stop.
1587-163 Error encountered while attempting to execute on the target device 0.  The program will stop.
free(): corrupted unsorted chunks

Issue seems to be with net_buf_receive_ExpSyn(). If I comment it out then it runs further

Running for 0 msec gives:

(venv) [pkumbhar@login01 build]$ gdb --args ./bin/ppc64le/special-core  -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8_0
...
(gdb) r
Starting program: /m100/home/userexternal/pkumbhar/CoreNeuron/build/bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-72.el8_1.1.ppc64le
Missing separate debuginfo for /cineca/prod/opt/compilers/cuda/11.2/none/compat/libcuda.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/e9e189d76f924564adf8c9d73eeb713d5b23d9.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power9/libthread_db.so.1".
[New Thread 0x7fffefcbd890 (LWP 128546)]
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 57cecb59 (2022-01-03 17:00:52 +0100)

[New Thread 0x7fffcfffd890 (LWP 128551)]
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) :            After MPI_Init : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) :          Before nrn_setup : Max 390.7500, Min 390.7500, Avg 390.7500
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
 Setup Done   : 0.00 seconds
 Model size   : 84.19 kB
 Memory (MBs) :          After nrn_setup  : Max 390.7500, Min 390.7500, Avg 390.7500
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=0

GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false

INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) :     After nrn_finitialize : Max 390.5625, Min 390.5625, Avg 390.5625

psolve |=========================================================| t: 0.00   ETA: 0h00m00s

Solver Time : 4.69685e-05

Thread 1 "special-core" received signal SIGSEGV, Segmentation fault.
IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
180	    int n = ml->nodecount;
Missing separate debuginfos, use: dnf debuginfo-install libgcc-8.3.1-4.5.el8.ppc64le libstdc++-8.3.1-4.5.el8.ppc64le
(gdb) bt
#0  IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
#1  0x000000001014c28c in coreneuron::update_nrnthreads_on_host () at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:831
#2  0x00000000100df048 in run_solve_core (argc=-14016, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/apps/main1.cpp:630
#3  0x000000001003a048 in solve_core (argc=-13968, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/enginemech.cpp:49
#4  0x0000000010039fd8 in main (argc=-13936, argv=0x7fff0000002e) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/coreneuron.cpp:14

if I comment out update_nrnthreads_on_host() then it runs further.

CI_BRANCHES:NMODL_BRANCH=hackathon_main,NEURON_BRANCH=master,

@pramodk pramodk marked this pull request as draft December 10, 2021 00:36
@bbpbuildbot
Copy link
Collaborator

@pramodk
Copy link
Collaborator Author

pramodk commented Dec 10, 2021

On Ascent @ ORNL:

module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
module load xl/16.1.1-10

Configure XL for newer GCC:

xlc_configure -gcc /sw/ascent/gcc/10.2.0 -o /ccsopen/proj/gen170/neuron/nersc-gpu-hackaxlc_gcc10.cfg -cuda /sw/ascent/cuda/11.0.2 -cudaVersion 11.0 -cuda_cc_major 7 -cuda_cc_minor 0

And build with:

cmake .. -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0"  -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCMAKE_CXX_FLAGS="" -DCORENRN_ENABLE_UNIT_TESTS=OFF -DCMAKE_CXX_COMPILER=xlc++_r -DCORENRN_NMODL_DIR=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/kumbhar/nmodl/build/install -DCORENRN_ENABLE_MPI=ON

@bbpbuildbot
Copy link
Collaborator

@olupton olupton force-pushed the olupton/llvm-gpu branch 2 times, most recently from b468870 to affbc25 Compare December 10, 2021 07:57
@bbpbuildbot
Copy link
Collaborator

@olupton
Copy link
Contributor

olupton commented Dec 17, 2021

I rebased this and tried to resolve the conflicts fairly blindly.

@olupton olupton changed the base branch from hackathon_main to master December 23, 2021 11:08
olupton and others added 7 commits December 31, 2021 14:02
… working

 * todo: temporary changes to OpenAccHelper.cmake, needs refinement
 * todo: see caliper linkling issue
 * todo: _OPENACC needs to be renamed CORENRN_ENABLE_GPU so that OpenMP
         based builds can use GPU offload.
 * todo: hardcoded CXX flags for quick build
 * Setting CMAKE_EXE_LINKER_FLAGS was used also for NMODL
   when NMODL is built from submodule
 * In case of LLVM OpenMP offload, if binary is being created
   from the object that is not compiled with OpenMP flags,
   it causes link error:

   echo "int main() { return 0; } " > foo.cpp
   clang++ -c foo.cpp                                           # adding openmp flags here doesn't cause below error
   clang++ foo.o -o foo -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

     nvlink fatal   : Could not open input file '/tmp/foo-0b9a1a.cubin'
     clang-12: error: nvlink command failed with exit code 1 (use -v to see invocation)
@pramodk
Copy link
Collaborator Author

pramodk commented Jan 4, 2022

Here are some notes from various experimentation / debugging attempts to get OpenMP offload working with LLVM v13.0.

  • On Slack thread, Olli reminded small reproducer from NERSC Hackathon showing issue with global variables and static library:
[olupton@r2i3n6 build_olli_gpu]$ cat test.sh
CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o
[olupton@r2i3n6 build_olli_gpu]$ cat test.cpp
#pragma omp declare target
int y;
#pragma omp end declare target

int test() {
  y = 24;
  #pragma omp target update to(y)
  y = 42;

  int x;
  #pragma omp target map(from:x)
  {
    x = y;
  }
  return x;
}
[olupton@r2i3n6 build_olli_gpu]$ cat main.cpp
extern int test();

int main() {
  return test();
}

small reproducer for the problem I am seeing now — going via the static library seems to cause problems 🤷

[olupton@r2i3n6 build_olli_gpu]$ ./test1
CUDA error: Loading global 'y' Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
[olupton@r2i3n6 build_olli_gpu]$ ./test2; echo $?
24
  • So I switched to shared library in coreneuron + nrnivmodl-core i.e. build shared library of libcorenrnmech. This was giving undefined symbol errors for global variables & function defined in libcoreneuron.a:
nvlink error   : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in

See also Olli's comment in NERSC GPU Hackathon Slack:

With the Clang/OpenMP build + shared libraries instead of static, I avoided some link errors by removing the annotations in the header

#pragma omp declare target 
extern double celsius;
#pragma omp end declare target 

and the remaining one (at_time function) by making its body available. But now I have a new segfault at startup:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
(gdb) bt
#0  0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#1  0x00007fffed92824e in __tgt_register_requires () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#2  0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3  0x00007fffed8e417a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
  still investigating..
  • Then I avoided building libcorenrnmech library and tried to use mechanisms object files directly to create special-core (by modifying nrnivmodl-core-makefile). This didn't go too far - I saw similar linking errors with global symbols in libcoreneuron library:
 => Binary creating x86_64/special-core
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-lvcrm6/bin/clang++ -fopenmp=libomp      -std=c++14 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include  -Wl,--as-needed -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -o x86_64/special-core /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/share/coreneuron/coreneuron.cpp \
  -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include  \
  x86_64/corenrn/build/_mod_func.o x86_64/corenrn/build/_dimplic.o x86_64/corenrn/build/exp2syn.o x86_64/corenrn/build/expsyn.o x86_64/corenrn/build/halfgap.o x86_64/corenrn/build/hh.o x86_64/corenrn/build/netstim.o x86_64/corenrn/build/passive.o x86_64/corenrn/build/pattern.o x86_64/corenrn/build/stim.o x86_64/corenrn/build/svclmp.o  x86_64/corenrn/build/enginemech.o
nvlink error   : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/svclmp-852313.cubin'
  • Next, I wanted to avoid building any intermediate libraries i.e. create special-core binary by taking all objects. For this, I took mechanism cpp files generated by nrnivmodl-core and added into coreneuron source tree (temporarily). Then, updated cmake to build special-core directly via cmake i.e. something like (partial diff):
+++ b/coreneuron/CMakeLists.txt
@@ -47,6 +47,13 @@ list(APPEND CORENEURON_CODE_FILES ${PROJECT_BINARY_DIR}/coreneuron/config/config
 set(DIMPLIC_CODE_FILE "mechanism/mech/dimplic.cpp")
 set(ENGINEMECH_CODE_FILE "mechanism/mech/enginemech.cpp")

+file(GLOB CORENEURON_SPECIAL_CORE_FILES "exe/*.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/apps/coreneuron.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${DIMPLIC_CODE_FILE}")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${ENGINEMECH_CODE_FILE}")
+
 # for external mod files we need to generate modl_ref function in mod_func.c
 set(MODFUNC_PERL_SCRIPT "mechanism/mech/mod_func.c.pl")

@@ -184,6 +191,8 @@ if(CORENRN_ENABLE_MPI AND NOT CORENRN_ENABLE_MPI_DYNAMIC)
   set(CORENRN_MPI_OBJ $<TARGET_OBJECTS:${CORENRN_MPI_LIB_NAME}>)
 endif()

+set(COMPILE_LIBRARY_TYPE OBJECT)
+
 # main coreneuron library
 add_library(
   coreneuron
@@ -319,7 +328,7 @@ add_custom_command(
           "${modfile_directory}"
   WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/bin
   COMMENT "Running nrnivmodl-core with halfgap.mod")
-add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
+#add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})

 include_directories(${CORENEURON_PROJECT_SOURCE_DIR})

@@ -358,6 +367,12 @@ configure_file("utils/profile/profiler_interface.h"
 # main program required for building special-core
 file(COPY apps/coreneuron.cpp DESTINATION ${CMAKE_BINARY_DIR}/share/coreneuron)

+add_executable(special-core-gpu ${CORENEURON_SPECIAL_CORE_FILES})
+target_compile_options(special-core-gpu BEFORE PRIVATE $<$<COMPILE_LANGUAGE:CXX>:${CORENRN_ACC_FLAGS}>)
+target_compile_definitions(special-core-gpu PUBLIC -DADDITIONAL_MECHS)
+target_link_libraries(special-core-gpu coreneuron ${CMAKE_DL_LIBS})

This created exe using all object files:

[ 87%] Linking CXX executable ../bin/special-core-gpu
cd /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron && /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cmake-3.21.4-cdyb7k/bin/cmake -E cmake_link_script CMakeFiles/special-core-gpu.dir/link.txt --verbose=1
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++  -fopenmp=libomp   -Wl,--as-needed -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include CMakeFiles/special-core-gpu.dir/exe/_mod_func.cpp.o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o CMakeFiles/special-core-gpu.dir/exe/expsyn.cpp.o CMakeFiles/special-core-gpu.dir/exe/halfgap.cpp.o CMakeFiles/special-core-gpu.dir/exe/hh.cpp.o CMakeFiles/special-core-gpu.dir/exe/netstim.cpp.o CMakeFiles/special-core-gpu.dir/exe/passive.cpp.o CMakeFiles/special-core-gpu.dir/exe/pattern.cpp.o CMakeFiles/special-core-gpu.dir/exe/stim.cpp.o CMakeFiles/special-core-gpu.dir/exe/svclmp.cpp.o CMakeFiles/special-core-gpu.dir/apps/coreneuron.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/dimplic.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/enginemech.cpp.o CMakeFiles/coreneuron.dir/apps/corenrn_parameters.cpp.o CMakeFiles/coreneuron.dir/apps/main1.cpp.o CMakeFiles/coreneuron.dir/gpu/nrn_acc_manager.cpp.o CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o CMakeFiles/coreneuron.dir/io/file_utils.cpp.o CMakeFiles/coreneuron.dir/io/global_vars.cpp.o CMakeFiles/coreneuron.dir/io/lfp.cpp.o CMakeFiles/coreneuron.dir/io/mech_report.cpp.o CMakeFiles/coreneuron.dir/io/mem_layout_util.cpp.o CMakeFiles/coreneuron.dir/io/mk_mech.cpp.o CMakeFiles/coreneuron.dir/io/nrn2core_data_init.cpp.o CMakeFiles/coreneuron.dir/io/nrn_checkpoint.cpp.o CMakeFiles/coreneuron.dir/io/nrn_filehandler.cpp.o CMakeFiles/coreneuron.dir/io/nrn_setup.cpp.o CMakeFiles/coreneuron.dir/io/output_spikes.cpp.o CMakeFiles/coreneuron.dir/io/phase1.cpp.o CMakeFiles/coreneuron.dir/io/phase2.cpp.o CMakeFiles/coreneuron.dir/io/prcellstate.cpp.o CMakeFiles/coreneuron.dir/io/reports/binary_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/nrnreport.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_configuration_parser.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_event.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/sonata_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/setup_fornetcon.cpp.o CMakeFiles/coreneuron.dir/mechanism/capac.cpp.o CMakeFiles/coreneuron.dir/mechanism/eion.cpp.o CMakeFiles/coreneuron.dir/mechanism/mech_mapping.cpp.o CMakeFiles/coreneuron.dir/mechanism/patternstim.cpp.o CMakeFiles/coreneuron.dir/mechanism/register_mech.cpp.o CMakeFiles/coreneuron.dir/network/cvodestb.cpp.o CMakeFiles/coreneuron.dir/network/multisend.cpp.o CMakeFiles/coreneuron.dir/network/multisend_setup.cpp.o CMakeFiles/coreneuron.dir/network/netcvode.cpp.o CMakeFiles/coreneuron.dir/network/netpar.cpp.o CMakeFiles/coreneuron.dir/network/partrans.cpp.o CMakeFiles/coreneuron.dir/network/partrans_setup.cpp.o CMakeFiles/coreneuron.dir/network/tqueue.cpp.o CMakeFiles/coreneuron.dir/permute/balance.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder1.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder2.cpp.o CMakeFiles/coreneuron.dir/permute/data_layout.cpp.o CMakeFiles/coreneuron.dir/permute/node_permute.cpp.o CMakeFiles/coreneuron.dir/sim/fadvance_core.cpp.o CMakeFiles/coreneuron.dir/sim/fast_imem.cpp.o CMakeFiles/coreneuron.dir/sim/finitialize.cpp.o CMakeFiles/coreneuron.dir/sim/multicore.cpp.o CMakeFiles/coreneuron.dir/sim/solve_core.cpp.o CMakeFiles/coreneuron.dir/sim/treeset_core.cpp.o CMakeFiles/coreneuron.dir/utils/ispc/globals.cpp.o CMakeFiles/coreneuron.dir/utils/ivocvect.cpp.o CMakeFiles/coreneuron.dir/utils/lpt.cpp.o CMakeFiles/coreneuron.dir/utils/memory.cpp.o CMakeFiles/coreneuron.dir/utils/memory_utils.cpp.o CMakeFiles/coreneuron.dir/utils/nrn_stats.cpp.o CMakeFiles/coreneuron.dir/utils/nrnoc_aux.cpp.o CMakeFiles/coreneuron.dir/utils/nrntimeout.cpp.o CMakeFiles/coreneuron.dir/utils/progressbar/progressbar.cpp.o CMakeFiles/coreneuron.dir/utils/randoms/nrnran123.cpp.o CMakeFiles/coreneuron.dir/utils/string_utils.cpp.o CMakeFiles/coreneuron.dir/utils/utils.cpp.o CMakeFiles/coreneuron.dir/utils/vrecord.cpp.o CMakeFiles/coreneuron.dir/config/config.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi_def_cinc.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpidec.cpp.o -o ../bin/special-core-gpu  -Wl,-rpath,/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64 -ldl /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64/libcudart.so

Unfortunately, this still gives similar cryptic error at launch for GPU or CPU execution:

kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted

I saw one such error reported here but that doesn't seem relevant.

  • Next, I wanted to incrementally enable OpenMP offload part and find out what causes above error. So, first thing I did was to disable OpenMP offload parts from mechanisms cpp files i.e. simply something like:
cd coreneuron/exe         #  this directory contain exp2syn.cpp  expsyn.cpp  halfgap.cpp  hh.cpp  _kinderiv.h  _mod_func.cpp  netstim.cpp  passive.cpp  pattern.cpp  stim.cpp  svclmp.cpp
sed -i 's#nrn_pragma_omp#//nrn_pragma_omp#g' *.cpp

And by re-building I saw that the special-core-gpu binary that we build is running fine on CPU or GPU!
So the issue seems to be somehow related to mechanisms cpp files!

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu
==112850== NVPROF is profiling process 112850, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

Exp2Syn Reg
 Memory (MBs) :             After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) :            After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) :          Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
 Setup Done   : 0.00 seconds
 Model size   : 84.19 kB
 Memory (MBs) :          After nrn_setup  : Max 296.4258, Min 296.4258, Avg 296.4258
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1

GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false

INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 296.4258, Min 296.4258, Avg 296.4258
....
  Memory (MBs) :     After nrn_finitialize : Max 301.5273, Min 301.5273, Avg 301.5273

psolve |=========================================================| t: 1.00   ETA: 0h00m00s

Solver Time : 0.24223


 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==112850== Profiling application: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   11.68%  20.892ms       160  130.57us  126.98us  134.59us  __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
                    9.64%  17.250ms        80  215.63us  186.62us  222.72us  __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
                    8.98%  16.057ms       160  100.35us  99.007us  101.92us  __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
                    7.72%  13.816ms        80  172.70us  169.18us  174.59us  __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
                    6.87%  12.293ms        80  153.66us  152.13us  154.94us  __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
                    6.82%  12.200ms        80  152.50us  150.37us  154.30us  __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
                    6.07%  10.865ms        80  135.81us  135.23us  136.58us  __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
                    6.05%  10.826ms        80  135.33us  132.64us  136.80us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
                    6.04%  10.806ms        80  135.07us  133.41us  137.73us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
                    6.01%  10.746ms        80  134.32us  133.73us  135.33us  __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
                    5.94%  10.621ms        80  132.76us  132.35us  136.48us  __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
                    5.92%  10.589ms        80  132.36us  131.52us  133.31us  __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
                    5.84%  10.446ms        80  130.57us  127.07us  131.42us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
                    5.79%  10.358ms        80  129.48us  128.90us  132.19us  __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
                    0.33%  594.30us         6  99.050us  2.4320us  446.46us  [CUDA memset]
                    0.18%  318.05us       246  1.2920us  1.2470us  1.7920us  [CUDA memcpy HtoD]
                    0.10%  185.74us       125  1.4850us  1.3750us  2.9440us  [CUDA memcpy DtoH]
      API calls:   38.95%  214.73ms         1  214.73ms  214.73ms  214.73ms  cuDevicePrimaryCtxRetain
                   38.43%  211.84ms      1450  146.09us     756ns  275.66us  cuStreamSynchronize
                    9.61%  52.977ms         1  52.977ms  52.977ms  52.977ms  cuModuleLoadDataEx
                    4.59%  25.275ms         1  25.275ms  25.275ms  25.275ms  cuModuleUnload
                    3.69%  20.328ms         6  3.3880ms  8.9020us  20.278ms  cudaMallocManaged
                    2.10%  11.594ms        84  138.03us  11.220us  146.78us  cuMemcpyDtoHAsync
                    1.17%  6.4369ms      1280  5.0280us  4.2990us  17.611us  cuLaunchKernel
                    0.30%  1.6383ms       407  4.0250us     132ns  194.16us  cuDeviceGetAttribute
                    0.28%  1.5530ms         4  388.24us  385.38us  395.38us  cuDeviceTotalMem
                    0.26%  1.4583ms      1619     900ns     161ns  581.21us  cuCtxSetCurrent
                    0.19%  1.0351ms       246  4.2070us  3.6880us  13.495us  cuMemcpyHtoDAsync
                    0.15%  844.22us         6  140.70us  26.971us  516.61us  cudaMemset
                    0.11%  591.25us        41  14.420us  11.523us  44.503us  cuMemcpyDtoH
                    0.07%  407.33us        32  12.729us  1.7660us  190.68us  cuStreamCreate
                    0.03%  172.74us         4  43.183us  36.483us  57.844us  cuDeviceGetName
                    0.02%  125.94us        32  3.9350us  2.2320us  27.967us  cuStreamDestroy
                    0.02%  95.955us       373     257ns     144ns  4.1180us  cuGetProcAddress
                    0.01%  44.318us        54     820ns     292ns  5.4780us  cuModuleGetGlobal
                    0.01%  30.319us        41     739ns     377ns  4.3020us  cuModuleGetFunction
                    0.00%  16.994us         1  16.994us  16.994us  16.994us  cuMemAlloc
                    0.00%  13.340us         1  13.340us  13.340us  13.340us  cudaSetDevice
                    0.00%  12.246us         4  3.0610us     985ns  8.0850us  cuDeviceGetPCIBusId
                    0.00%  5.9690us         9     663ns     141ns  3.1970us  cuDeviceGet
                    0.00%  2.8740us         1  2.8740us  2.8740us  2.8740us  cuDevicePrimaryCtxGetState
                    0.00%  2.7800us         2  1.3900us     201ns  2.5790us  cuCtxGetLimit
                    0.00%  1.7460us         5     349ns     256ns     595ns  cuFuncGetAttribute
                    0.00%  1.1300us         4     282ns     202ns     443ns  cuDeviceGetCount
                    0.00%     946ns         4     236ns     161ns     322ns  cuDeviceGetUuid
                    0.00%     895ns         1     895ns     895ns     895ns  cuInit
                    0.00%     737ns         1     737ns     737ns     737ns  cuDevicePrimaryCtxSetFlags
                    0.00%     591ns         1     591ns     591ns     591ns  cuCtxGetDevice
                    0.00%     371ns         1     371ns     371ns     371ns  cuDevicePrimaryCtxRelease
                    0.00%     217ns         1     217ns     217ns     217ns  cuDriverGetVersion

==112850== Unified Memory profiling result:
Device "Tesla V100-SXM2-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       4  32.000KB  4.0000KB  60.000KB  128.0000KB  20.64000us  Host To Device
       6  32.000KB  4.0000KB  60.000KB  192.0000KB  21.34400us  Device To Host
       3         -         -         -           -  575.5490us  Gpu page fault groups
      16  4.0000KB  4.0000KB  4.0000KB  64.00000KB           -  Memory thrashes
Total CPU Page faults: 3
Total CPU thrashes: 16
  • Then, I was able to isolate the issue to single file exp2syn.cpp - if I comment out all OpenMP offload pragmas in exp2syn.cpp then special-core-gpu binary works fine!

But the funny part with exp2syn.cpp is that this mechanism is not used in the ringtest model at all ! :

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu --model-stats
==112982== NVPROF is profiling process 112982, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) :            After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) :          Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.

================ MECHANISMS COUNT BY TYPE ==================
  Id                 Name      Count
   0               (null)          0
   1               (null)          0
   2           morphology          0
   3          capacitance        392
   4                  pas        372
   5        extracellular          0
   6              fastpas          0
   7               IClamp          0
   8         AlphaSynapse          0
   9               ExpSyn         40
  10              Exp2Syn          0
  11              SEClamp          0
  12               VClamp          0
  13               OClamp          0
  14              APCount          0
  15               na_ion         20
  16                k_ion         20
  17                   hh         20
  18              NetStim          1
  19             IntFire1          0
  20             IntFire2          0
  21             IntFire4          0
  22     PointProcessMark          0
  23          PatternStim          0
  24              HalfGap          0

You can see that Exp2Syn count is 0!

  • In the exp2syn.cpp if I enable single OpenMP pragma e.g. in the functionnrn_state_Exp2Syn() (which is not executed!) then I still get below error:
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu --model-stats
==113465== NVPROF is profiling process 113465, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ rm ../bin/special-core-gpu
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -DADDITIONAL_MECHS -DCORENEURON_BUILD -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING -DLAYOUT=0 -DNRNMPI=0 -DNRN_MULTISEND=0 -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/targets/x86_64-linux/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron -isystem /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -fopenmp=libomp   -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -std=c++14 -MD -MT coreneuron/CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -MF CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o.d -o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -c /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/exe/exp2syn.cpp -O1
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ make
[  1%] Built target pyembed
[  2%] Built target fmt
[  3%] Built target pywrapper
[  3%] Built target pyastgen
[ 18%] Built target lexer_obj
[ 18%] Built target lexer
[ 21%] Built target util_obj
[ 22%] Built target util
[ 38%] Built target visitor_obj
[ 39%] Built target visitor
[ 43%] Built target codegen
[ 45%] Built target printer_obj
[ 45%] Built target printer
[ 46%] Built target symtab_obj
[ 47%] Built target symtab
[ 48%] Built target nmodl
[ 48%] Built target nrnivmodl-core
[ 50%] Built target kin_deriv_header
[ 83%] Built target coreneuron
[ 86%] Built target scopmath
Consolidate compiler generated dependencies of target special-core-gpu
[ 87%] Linking CXX executable ../bin/special-core-gpu
[ 93%] Built target special-core-gpu
[100%] Built target coreneuron-copy-nrnivmodl-core-dependencies
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ nvprof ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring  --gpu --model-stats
==115317== NVPROF is profiling process 115317, command: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

Exp2Syn Reg
 Memory (MBs) :             After mk_mech : Max 297.8047, Min 297.8047, Avg 297.8047
 Memory (MBs) :            After MPI_Init : Max 296.2422, Min 296.2422, Avg 296.2422
 Memory (MBs) :          Before nrn_setup : Max 296.2734, Min 296.2734, Avg 296.2734
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.

....
  Memory (MBs) :     After nrn_finitialize : Max 301.6562, Min 301.6562, Avg 301.6562

psolve |=========================================================| t: 1.00   ETA: 0h00m01s

Solver Time : 0.253621


 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==115317== Profiling application: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   11.69%  21.776ms       160  136.10us  127.74us  143.14us  __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
                    9.66%  17.998ms        80  224.98us  198.98us  236.25us  __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
                    8.98%  16.731ms       160  104.57us  99.968us  108.10us  __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
                    7.82%  14.565ms        80  182.06us  176.06us  187.84us  __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
                    6.86%  12.780ms        80  159.75us  153.70us  163.58us  __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
                    6.82%  12.697ms        80  158.71us  152.13us  163.20us  __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
                    6.04%  11.251ms        80  140.64us  135.58us  144.13us  __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
                    6.04%  11.250ms        80  140.62us  134.53us  146.59us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
                    6.03%  11.240ms        80  140.50us  135.33us  143.87us  __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
                    6.01%  11.203ms        80  140.04us  133.09us  145.22us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
                    5.95%  11.076ms        80  138.45us  133.34us  144.96us  __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
                    5.92%  11.035ms        80  137.94us  132.42us  141.38us  __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
                    5.80%  10.795ms        80  134.94us  130.05us  139.90us  __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
                    5.77%  10.751ms        80  134.39us  129.34us  137.79us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
                    0.32%  598.59us         6  99.765us  2.7200us  447.39us  [CUDA memset]
                    0.17%  323.17us       246  1.3130us  1.2470us  1.8240us  [CUDA memcpy HtoD]
                    0.11%  195.62us       126  1.5520us  1.3760us  3.0090us  [CUDA memcpy DtoH]
      API calls:   39.19%  222.11ms      1450  153.18us     811ns  3.8766ms  cuStreamSynchronize
                   38.77%  219.72ms         1  219.72ms  219.72ms  219.72ms  cuDevicePrimaryCtxRetain
                    9.24%  52.366ms         1  52.366ms  52.366ms  52.366ms  cuModuleLoadDataEx
                    4.44%  25.182ms         1  25.182ms  25.182ms  25.182ms  cuModuleUnload
                    3.59%  20.330ms         6  3.3883ms  9.0800us  20.277ms  cudaMallocManaged
                    2.13%  12.063ms        84  143.60us  11.176us  157.59us  cuMemcpyDtoHAsync
                    1.20%  6.7851ms      1280  5.3000us  4.5340us  19.189us  cuLaunchKernel
                    0.29%  1.6442ms       407  4.0390us     133ns  187.16us  cuDeviceGetAttribute
                    0.28%  1.5656ms         4  391.41us  387.70us  395.00us  cuDeviceTotalMem
                    0.25%  1.4421ms      1619     890ns     159ns  579.67us  cuCtxSetCurrent
                    0.19%  1.0798ms       246  4.3890us  3.7420us  17.665us  cuMemcpyHtoDAsync
                    0.15%  853.90us         6  142.32us  27.310us  519.82us  cudaMemset
                    0.11%  647.30us        42  15.411us  11.264us  49.087us  cuMemcpyDtoH
                    0.07%  412.45us        32  12.889us  1.7620us  191.92us  cuStreamCreate
                    0.03%  168.17us         4  42.041us  36.633us  56.245us  cuDeviceGetName
                    0.02%  112.19us        32  3.5050us  2.1360us  21.674us  cuStreamDestroy
                    0.02%  108.93us       373     292ns     138ns  4.2190us  cuGetProcAddress
                    0.01%  44.237us        55     804ns     401ns  3.3740us  cuModuleGetGlobal
                    0.01%  34.313us        42     816ns     354ns  6.8130us  cuModuleGetFunction
                    0.00%  14.817us         1  14.817us  14.817us  14.817us  cudaSetDevice
                    0.00%  13.742us         1  13.742us  13.742us  13.742us  cuMemAlloc
                    0.00%  11.865us         4  2.9660us     937ns  7.8230us  cuDeviceGetPCIBusId
                    0.00%  7.1150us         9     790ns     142ns  4.0720us  cuDeviceGet
                    0.00%  2.6710us         1  2.6710us  2.6710us  2.6710us  cuDevicePrimaryCtxGetState
                    0.00%  2.5190us         2  1.2590us     224ns  2.2950us  cuCtxGetLimit
                    0.00%  2.2450us         5     449ns     231ns     744ns  cuFuncGetAttribute
                    0.00%  1.2930us         4     323ns     180ns     528ns  cuDeviceGetCount
                    0.00%  1.1640us         1  1.1640us  1.1640us  1.1640us  cuInit
                    0.00%     984ns         4     246ns     162ns     333ns  cuDeviceGetUuid
                    0.00%     674ns         1     674ns     674ns     674ns  cuDevicePrimaryCtxSetFlags
                    0.00%     502ns         1     502ns     502ns     502ns  cuCtxGetDevice
                    0.00%     359ns         1     359ns     359ns     359ns  cuDevicePrimaryCtxRelease
                    0.00%     215ns         1     215ns     215ns     215ns  cuDriverGetVersion
..

So using #734 I am able to avoid the CUDA error: out of memory error.

TODO: Go back to nrnivmodl-core based build and see how this could be tested there. (tomorrow)

@pramodk
Copy link
Collaborator Author

pramodk commented Jan 5, 2022

Summary of the IBM XL Compiler

Issues reported on llvm openmp-dev mailing list: https://lists.llvm.org/pipermail/openmp-dev/2022-January/004276.html

  • Issue/Question # 1 : Shared library as well as Static library works
  • Issue/Question # 2 : Static library works but Shared library fails to link:
+ CXX=xlc++_r
+ CXXFLAGS='-fopenmp -fPIC -qsmp=omp -qoffload -g -O2'
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -c test.cpp
+ ar cq libtest.a test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test1 main.cpp -L. -ltest
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test2 main.cpp test.o
+ rm test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -fpic -shared test.cpp -o libtest.so
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-269 fpic is not supported on this Operating System platform.  Option fpic will be ignored.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-308 The device linker only supports static linking.  Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/minilink: warning: 1501-308 The device linker only supports static linking.  Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test3 main.cpp -L. -ltest -Wl,-rpath .
nvlink error   : Undefined reference to 'y' in '/tmp/24507_0.o'
...
...
$ nvprof ./test1
==29304== NVPROF is profiling process 29304, command: ./test1
--> 0
--> 1
--> 4
--> 2
--> 3
==29304== Profiling application: ./test1
==29304== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   79.20%  41.659us         1  41.659us  41.659us  41.659us  __xl_main_l11_OL_1
                   11.99%  6.3040us         3  2.1010us  1.8560us  2.5280us  [CUDA memcpy DtoH]
                    4.98%  2.6200us         1  2.6200us  2.6200us  2.6200us  __xl__Z4testv_l8_OL_1
  • Another issue that took quite some time to debug is following:

Historically famous one:

                int *_displ = nrb->_displ;
                int _displ_cnt = nrb->_displ_cnt;
                #pragma omp target update to(_displ[0:_displ_cnt])

vs.

                #pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
  • First update the contents of the array
  • Second is updating _displ pointer itself from host to device (and hence result in wrong pointer on device side).

In coreneuron/gpu/nrn_acc_manager.cpp under update_net_receive_buffer() I did:

#if 0
                #pragma omp target update to(_displ[0:nrb->_displ_cnt])
#else
                #pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
#endif
                #pragma omp target
                {
                        printf("nrb->_displ :%p \n", nrb->_displ);
                }
                abort();

And this produces following in first and second case:

Updating nrb now
nrb->_displ :0x7ffef000b100
Aborted (core dumped)
....
nrb->_displ :0x26a6b9c0
Aborted (core dumped)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants