Fixes for building with LLVM / XL OpenMP offload #706
base: master
Conversation
Logfiles from GitLab pipeline #28712 (:no_entry:) have been uploaded here! Status and direct links:
On Ascent @ ORNL:
Configure XL for newer GCC:
And build with:
(force-pushed from 7cb13a1 to a8958d1)
Logfiles from GitLab pipeline #28718 (:no_entry:) have been uploaded here! Status and direct links:
(force-pushed from b468870 to affbc25)
Logfiles from GitLab pipeline #28723 (:no_entry:) have been uploaded here! Status and direct links:
(force-pushed from affbc25 to 9cd7ace)
I rebased this and tried to resolve the conflicts fairly blindly.
(force-pushed from 9cd7ace to dc9632a)
… working

* todo: temporary changes to OpenAccHelper.cmake, needs refinement
* todo: see caliper linking issue
* todo: _OPENACC needs to be renamed CORENRN_ENABLE_GPU so that OpenMP-based builds can use GPU offload
* todo: hardcoded CXX flags for quick build
(force-pushed from dc9632a to 439c560)
* Setting CMAKE_EXE_LINKER_FLAGS was also used for NMODL when NMODL is built from the submodule.
* In the case of LLVM OpenMP offload, if a binary is created from an object file that was not compiled with the OpenMP flags, it causes a link error:

```
echo "int main() { return 0; }" > foo.cpp
clang++ -c foo.cpp   # adding the OpenMP flags here avoids the error below
clang++ foo.o -o foo -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
nvlink fatal : Could not open input file '/tmp/foo-0b9a1a.cubin'
clang-12: error: nvlink command failed with exit code 1 (use -v to see invocation)
```
Here are some notes from various experimentation / debugging attempts to get OpenMP offload working with LLVM v13.0.
```
[olupton@r2i3n6 build_olli_gpu]$ cat test.sh
CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o
```
```
[olupton@r2i3n6 build_olli_gpu]$ cat test.cpp
#pragma omp declare target
int y;
#pragma omp end declare target
int test() {
  y = 24;
#pragma omp target update to(y)
  y = 42;
  int x;
#pragma omp target map(from:x)
  {
    x = y;
  }
  return x;
}
```
```
[olupton@r2i3n6 build_olli_gpu]$ cat main.cpp
extern int test();
int main() {
  return test();
}
```
This is a small reproducer for the problem I am seeing now: going via the static library seems to cause problems 🤷
```
[olupton@r2i3n6 build_olli_gpu]$ ./test1
CUDA error: Loading global 'y' Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
[olupton@r2i3n6 build_olli_gpu]$ ./test2; echo $?
24
```
```
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in
```
See also Olli's comment in the NERSC GPU Hackathon Slack: with the Clang/OpenMP build + shared libraries instead of static, I avoided some link errors by removing the annotations in the header
```
#pragma omp declare target
extern double celsius;
#pragma omp end declare target
```
and the remaining one (the at_time function) by making its body available. But now I have a new segfault at startup:
```
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
(gdb) bt
#0 0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#1 0x00007fffed92824e in __tgt_register_requires () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#2 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3 0x00007fffed8e417a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
```
=> Binary creating x86_64/special-core
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-lvcrm6/bin/clang++ -fopenmp=libomp -std=c++14 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -Wl,--as-needed -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -o x86_64/special-core /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/share/coreneuron/coreneuron.cpp \
-I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include \
x86_64/corenrn/build/_mod_func.o x86_64/corenrn/build/_dimplic.o x86_64/corenrn/build/exp2syn.o x86_64/corenrn/build/expsyn.o x86_64/corenrn/build/halfgap.o x86_64/corenrn/build/hh.o x86_64/corenrn/build/netstim.o x86_64/corenrn/build/passive.o x86_64/corenrn/build/pattern.o x86_64/corenrn/build/stim.o x86_64/corenrn/build/svclmp.o x86_64/corenrn/build/enginemech.o
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/svclmp-852313.cubin'
+++ b/coreneuron/CMakeLists.txt
@@ -47,6 +47,13 @@ list(APPEND CORENEURON_CODE_FILES ${PROJECT_BINARY_DIR}/coreneuron/config/config
set(DIMPLIC_CODE_FILE "mechanism/mech/dimplic.cpp")
set(ENGINEMECH_CODE_FILE "mechanism/mech/enginemech.cpp")
+file(GLOB CORENEURON_SPECIAL_CORE_FILES "exe/*.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/apps/coreneuron.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${DIMPLIC_CODE_FILE}")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${ENGINEMECH_CODE_FILE}")
+
# for external mod files we need to generate modl_ref function in mod_func.c
set(MODFUNC_PERL_SCRIPT "mechanism/mech/mod_func.c.pl")
@@ -184,6 +191,8 @@ if(CORENRN_ENABLE_MPI AND NOT CORENRN_ENABLE_MPI_DYNAMIC)
set(CORENRN_MPI_OBJ $<TARGET_OBJECTS:${CORENRN_MPI_LIB_NAME}>)
endif()
+set(COMPILE_LIBRARY_TYPE OBJECT)
+
# main coreneuron library
add_library(
coreneuron
@@ -319,7 +328,7 @@ add_custom_command(
"${modfile_directory}"
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/bin
COMMENT "Running nrnivmodl-core with halfgap.mod")
-add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
+#add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
include_directories(${CORENEURON_PROJECT_SOURCE_DIR})
@@ -358,6 +367,12 @@ configure_file("utils/profile/profiler_interface.h"
# main program required for building special-core
file(COPY apps/coreneuron.cpp DESTINATION ${CMAKE_BINARY_DIR}/share/coreneuron)
+add_executable(special-core-gpu ${CORENEURON_SPECIAL_CORE_FILES})
+target_compile_options(special-core-gpu BEFORE PRIVATE $<$<COMPILE_LANGUAGE:CXX>:${CORENRN_ACC_FLAGS}>)
+target_compile_definitions(special-core-gpu PUBLIC -DADDITIONAL_MECHS)
+target_link_libraries(special-core-gpu coreneuron ${CMAKE_DL_LIBS})

This created the exe using all object files:

[ 87%] Linking CXX executable ../bin/special-core-gpu
cd /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron && /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cmake-3.21.4-cdyb7k/bin/cmake -E cmake_link_script CMakeFiles/special-core-gpu.dir/link.txt --verbose=1
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -fopenmp=libomp -Wl,--as-needed -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include CMakeFiles/special-core-gpu.dir/exe/_mod_func.cpp.o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o CMakeFiles/special-core-gpu.dir/exe/expsyn.cpp.o CMakeFiles/special-core-gpu.dir/exe/halfgap.cpp.o CMakeFiles/special-core-gpu.dir/exe/hh.cpp.o CMakeFiles/special-core-gpu.dir/exe/netstim.cpp.o CMakeFiles/special-core-gpu.dir/exe/passive.cpp.o CMakeFiles/special-core-gpu.dir/exe/pattern.cpp.o CMakeFiles/special-core-gpu.dir/exe/stim.cpp.o CMakeFiles/special-core-gpu.dir/exe/svclmp.cpp.o CMakeFiles/special-core-gpu.dir/apps/coreneuron.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/dimplic.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/enginemech.cpp.o CMakeFiles/coreneuron.dir/apps/corenrn_parameters.cpp.o CMakeFiles/coreneuron.dir/apps/main1.cpp.o CMakeFiles/coreneuron.dir/gpu/nrn_acc_manager.cpp.o CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o CMakeFiles/coreneuron.dir/io/file_utils.cpp.o CMakeFiles/coreneuron.dir/io/global_vars.cpp.o CMakeFiles/coreneuron.dir/io/lfp.cpp.o CMakeFiles/coreneuron.dir/io/mech_report.cpp.o CMakeFiles/coreneuron.dir/io/mem_layout_util.cpp.o CMakeFiles/coreneuron.dir/io/mk_mech.cpp.o CMakeFiles/coreneuron.dir/io/nrn2core_data_init.cpp.o CMakeFiles/coreneuron.dir/io/nrn_checkpoint.cpp.o CMakeFiles/coreneuron.dir/io/nrn_filehandler.cpp.o CMakeFiles/coreneuron.dir/io/nrn_setup.cpp.o CMakeFiles/coreneuron.dir/io/output_spikes.cpp.o CMakeFiles/coreneuron.dir/io/phase1.cpp.o CMakeFiles/coreneuron.dir/io/phase2.cpp.o CMakeFiles/coreneuron.dir/io/prcellstate.cpp.o CMakeFiles/coreneuron.dir/io/reports/binary_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/nrnreport.cpp.o 
CMakeFiles/coreneuron.dir/io/reports/report_configuration_parser.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_event.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/sonata_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/setup_fornetcon.cpp.o CMakeFiles/coreneuron.dir/mechanism/capac.cpp.o CMakeFiles/coreneuron.dir/mechanism/eion.cpp.o CMakeFiles/coreneuron.dir/mechanism/mech_mapping.cpp.o CMakeFiles/coreneuron.dir/mechanism/patternstim.cpp.o CMakeFiles/coreneuron.dir/mechanism/register_mech.cpp.o CMakeFiles/coreneuron.dir/network/cvodestb.cpp.o CMakeFiles/coreneuron.dir/network/multisend.cpp.o CMakeFiles/coreneuron.dir/network/multisend_setup.cpp.o CMakeFiles/coreneuron.dir/network/netcvode.cpp.o CMakeFiles/coreneuron.dir/network/netpar.cpp.o CMakeFiles/coreneuron.dir/network/partrans.cpp.o CMakeFiles/coreneuron.dir/network/partrans_setup.cpp.o CMakeFiles/coreneuron.dir/network/tqueue.cpp.o CMakeFiles/coreneuron.dir/permute/balance.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder1.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder2.cpp.o CMakeFiles/coreneuron.dir/permute/data_layout.cpp.o CMakeFiles/coreneuron.dir/permute/node_permute.cpp.o CMakeFiles/coreneuron.dir/sim/fadvance_core.cpp.o CMakeFiles/coreneuron.dir/sim/fast_imem.cpp.o CMakeFiles/coreneuron.dir/sim/finitialize.cpp.o CMakeFiles/coreneuron.dir/sim/multicore.cpp.o CMakeFiles/coreneuron.dir/sim/solve_core.cpp.o CMakeFiles/coreneuron.dir/sim/treeset_core.cpp.o CMakeFiles/coreneuron.dir/utils/ispc/globals.cpp.o CMakeFiles/coreneuron.dir/utils/ivocvect.cpp.o CMakeFiles/coreneuron.dir/utils/lpt.cpp.o CMakeFiles/coreneuron.dir/utils/memory.cpp.o CMakeFiles/coreneuron.dir/utils/memory_utils.cpp.o CMakeFiles/coreneuron.dir/utils/nrn_stats.cpp.o CMakeFiles/coreneuron.dir/utils/nrnoc_aux.cpp.o CMakeFiles/coreneuron.dir/utils/nrntimeout.cpp.o CMakeFiles/coreneuron.dir/utils/progressbar/progressbar.cpp.o 
CMakeFiles/coreneuron.dir/utils/randoms/nrnran123.cpp.o CMakeFiles/coreneuron.dir/utils/string_utils.cpp.o CMakeFiles/coreneuron.dir/utils/utils.cpp.o CMakeFiles/coreneuron.dir/utils/vrecord.cpp.o CMakeFiles/coreneuron.dir/config/config.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi_def_cinc.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpidec.cpp.o -o ../bin/special-core-gpu -Wl,-rpath,/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64 -ldl /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64/libcudart.so

Unfortunately, this still gives a similar cryptic error at launch for GPU or CPU execution:

kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
```
cd coreneuron/exe   # this directory contains exp2syn.cpp expsyn.cpp halfgap.cpp hh.cpp _kinderiv.h _mod_func.cpp netstim.cpp passive.cpp pattern.cpp stim.cpp svclmp.cpp
sed -i 's#nrn_pragma_omp#//nrn_pragma_omp#g' *.cpp
```
And after re-building I saw that the special-core-gpu binary that we build runs fine on CPU or GPU!

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== NVPROF is profiling process 112850, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Exp2Syn Reg
Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
Setup Done : 0.00 seconds
Model size : 84.19 kB
Memory (MBs) : After nrn_setup : Max 296.4258, Min 296.4258, Avg 296.4258
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1
GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 296.4258, Min 296.4258, Avg 296.4258
....
Memory (MBs) : After nrn_finitialize : Max 301.5273, Min 301.5273, Avg 301.5273
psolve |=========================================================| t: 1.00 ETA: 0h00m00s
Solver Time : 0.24223
Simulation Statistics
Number of cells: 20
Number of compartments: 804
Number of presyns: 21
Number of input presyns: 0
Number of synapses: 21
Number of point processes: 41
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 40
Number of spikes with non negative gid-s: 40
==112850== Profiling application: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 11.68% 20.892ms 160 130.57us 126.98us 134.59us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
9.64% 17.250ms 80 215.63us 186.62us 222.72us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
8.98% 16.057ms 160 100.35us 99.007us 101.92us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
7.72% 13.816ms 80 172.70us 169.18us 174.59us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
6.87% 12.293ms 80 153.66us 152.13us 154.94us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
6.82% 12.200ms 80 152.50us 150.37us 154.30us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
6.07% 10.865ms 80 135.81us 135.23us 136.58us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
6.05% 10.826ms 80 135.33us 132.64us 136.80us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
6.04% 10.806ms 80 135.07us 133.41us 137.73us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
6.01% 10.746ms 80 134.32us 133.73us 135.33us __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
5.94% 10.621ms 80 132.76us 132.35us 136.48us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
5.92% 10.589ms 80 132.36us 131.52us 133.31us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
5.84% 10.446ms 80 130.57us 127.07us 131.42us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
5.79% 10.358ms 80 129.48us 128.90us 132.19us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
0.33% 594.30us 6 99.050us 2.4320us 446.46us [CUDA memset]
0.18% 318.05us 246 1.2920us 1.2470us 1.7920us [CUDA memcpy HtoD]
0.10% 185.74us 125 1.4850us 1.3750us 2.9440us [CUDA memcpy DtoH]
API calls: 38.95% 214.73ms 1 214.73ms 214.73ms 214.73ms cuDevicePrimaryCtxRetain
38.43% 211.84ms 1450 146.09us 756ns 275.66us cuStreamSynchronize
9.61% 52.977ms 1 52.977ms 52.977ms 52.977ms cuModuleLoadDataEx
4.59% 25.275ms 1 25.275ms 25.275ms 25.275ms cuModuleUnload
3.69% 20.328ms 6 3.3880ms 8.9020us 20.278ms cudaMallocManaged
2.10% 11.594ms 84 138.03us 11.220us 146.78us cuMemcpyDtoHAsync
1.17% 6.4369ms 1280 5.0280us 4.2990us 17.611us cuLaunchKernel
0.30% 1.6383ms 407 4.0250us 132ns 194.16us cuDeviceGetAttribute
0.28% 1.5530ms 4 388.24us 385.38us 395.38us cuDeviceTotalMem
0.26% 1.4583ms 1619 900ns 161ns 581.21us cuCtxSetCurrent
0.19% 1.0351ms 246 4.2070us 3.6880us 13.495us cuMemcpyHtoDAsync
0.15% 844.22us 6 140.70us 26.971us 516.61us cudaMemset
0.11% 591.25us 41 14.420us 11.523us 44.503us cuMemcpyDtoH
0.07% 407.33us 32 12.729us 1.7660us 190.68us cuStreamCreate
0.03% 172.74us 4 43.183us 36.483us 57.844us cuDeviceGetName
0.02% 125.94us 32 3.9350us 2.2320us 27.967us cuStreamDestroy
0.02% 95.955us 373 257ns 144ns 4.1180us cuGetProcAddress
0.01% 44.318us 54 820ns 292ns 5.4780us cuModuleGetGlobal
0.01% 30.319us 41 739ns 377ns 4.3020us cuModuleGetFunction
0.00% 16.994us 1 16.994us 16.994us 16.994us cuMemAlloc
0.00% 13.340us 1 13.340us 13.340us 13.340us cudaSetDevice
0.00% 12.246us 4 3.0610us 985ns 8.0850us cuDeviceGetPCIBusId
0.00% 5.9690us 9 663ns 141ns 3.1970us cuDeviceGet
0.00% 2.8740us 1 2.8740us 2.8740us 2.8740us cuDevicePrimaryCtxGetState
0.00% 2.7800us 2 1.3900us 201ns 2.5790us cuCtxGetLimit
0.00% 1.7460us 5 349ns 256ns 595ns cuFuncGetAttribute
0.00% 1.1300us 4 282ns 202ns 443ns cuDeviceGetCount
0.00% 946ns 4 236ns 161ns 322ns cuDeviceGetUuid
0.00% 895ns 1 895ns 895ns 895ns cuInit
0.00% 737ns 1 737ns 737ns 737ns cuDevicePrimaryCtxSetFlags
0.00% 591ns 1 591ns 591ns 591ns cuCtxGetDevice
0.00% 371ns 1 371ns 371ns 371ns cuDevicePrimaryCtxRelease
0.00% 217ns 1 217ns 217ns 217ns cuDriverGetVersion
==112850== Unified Memory profiling result:
Device "Tesla V100-SXM2-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
4 32.000KB 4.0000KB 60.000KB 128.0000KB 20.64000us Host To Device
6 32.000KB 4.0000KB 60.000KB 192.0000KB 21.34400us Device To Host
3 - - - - 575.5490us Gpu page fault groups
16 4.0000KB 4.0000KB 4.0000KB 64.00000KB - Memory thrashes
Total CPU Page faults: 3
Total CPU thrashes: 16
But the funny part with exp2syn.cpp is that this mechanism is not used in the ringtest model at all!

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==112982== NVPROF is profiling process 112982, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
================ MECHANISMS COUNT BY TYPE ==================
Id Name Count
0 (null) 0
1 (null) 0
2 morphology 0
3 capacitance 392
4 pas 372
5 extracellular 0
6 fastpas 0
7 IClamp 0
8 AlphaSynapse 0
9 ExpSyn 40
10 Exp2Syn 0
11 SEClamp 0
12 VClamp 0
13 OClamp 0
14 APCount 0
15 na_ion 20
16 k_ion 20
17 hh 20
18 NetStim 1
19 IntFire1 0
20 IntFire2 0
21 IntFire4 0
22 PointProcessMark 0
23 PatternStim 0
24 HalfGap 0

You can see that
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==113465== NVPROF is profiling process 113465, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ rm ../bin/special-core-gpu
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -DADDITIONAL_MECHS -DCORENEURON_BUILD -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING -DLAYOUT=0 -DNRNMPI=0 -DNRN_MULTISEND=0 -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/targets/x86_64-linux/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron -isystem /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -fopenmp=libomp -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -std=c++14 -MD -MT coreneuron/CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -MF CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o.d -o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -c /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/exe/exp2syn.cpp -O1
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ make
[ 1%] Built target pyembed
[ 2%] Built target fmt
[ 3%] Built target pywrapper
[ 3%] Built target pyastgen
[ 18%] Built target lexer_obj
[ 18%] Built target lexer
[ 21%] Built target util_obj
[ 22%] Built target util
[ 38%] Built target visitor_obj
[ 39%] Built target visitor
[ 43%] Built target codegen
[ 45%] Built target printer_obj
[ 45%] Built target printer
[ 46%] Built target symtab_obj
[ 47%] Built target symtab
[ 48%] Built target nmodl
[ 48%] Built target nrnivmodl-core
[ 50%] Built target kin_deriv_header
[ 83%] Built target coreneuron
[ 86%] Built target scopmath
Consolidate compiler generated dependencies of target special-core-gpu
[ 87%] Linking CXX executable ../bin/special-core-gpu
[ 93%] Built target special-core-gpu
[100%] Built target coreneuron-copy-nrnivmodl-core-dependencies
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ nvprof ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== NVPROF is profiling process 115317, command: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Exp2Syn Reg
Memory (MBs) : After mk_mech : Max 297.8047, Min 297.8047, Avg 297.8047
Memory (MBs) : After MPI_Init : Max 296.2422, Min 296.2422, Avg 296.2422
Memory (MBs) : Before nrn_setup : Max 296.2734, Min 296.2734, Avg 296.2734
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
....
Memory (MBs) : After nrn_finitialize : Max 301.6562, Min 301.6562, Avg 301.6562
psolve |=========================================================| t: 1.00 ETA: 0h00m01s
Solver Time : 0.253621
Simulation Statistics
Number of cells: 20
Number of compartments: 804
Number of presyns: 21
Number of input presyns: 0
Number of synapses: 21
Number of point processes: 41
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 40
Number of spikes with non negative gid-s: 40
==115317== Profiling application: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 11.69% 21.776ms 160 136.10us 127.74us 143.14us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
9.66% 17.998ms 80 224.98us 198.98us 236.25us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
8.98% 16.731ms 160 104.57us 99.968us 108.10us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
7.82% 14.565ms 80 182.06us 176.06us 187.84us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
6.86% 12.780ms 80 159.75us 153.70us 163.58us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
6.82% 12.697ms 80 158.71us 152.13us 163.20us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
6.04% 11.251ms 80 140.64us 135.58us 144.13us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
6.04% 11.250ms 80 140.62us 134.53us 146.59us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
6.03% 11.240ms 80 140.50us 135.33us 143.87us __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
6.01% 11.203ms 80 140.04us 133.09us 145.22us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
5.95% 11.076ms 80 138.45us 133.34us 144.96us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
5.92% 11.035ms 80 137.94us 132.42us 141.38us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
5.80% 10.795ms 80 134.94us 130.05us 139.90us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
5.77% 10.751ms 80 134.39us 129.34us 137.79us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
0.32% 598.59us 6 99.765us 2.7200us 447.39us [CUDA memset]
0.17% 323.17us 246 1.3130us 1.2470us 1.8240us [CUDA memcpy HtoD]
0.11% 195.62us 126 1.5520us 1.3760us 3.0090us [CUDA memcpy DtoH]
API calls: 39.19% 222.11ms 1450 153.18us 811ns 3.8766ms cuStreamSynchronize
38.77% 219.72ms 1 219.72ms 219.72ms 219.72ms cuDevicePrimaryCtxRetain
9.24% 52.366ms 1 52.366ms 52.366ms 52.366ms cuModuleLoadDataEx
4.44% 25.182ms 1 25.182ms 25.182ms 25.182ms cuModuleUnload
3.59% 20.330ms 6 3.3883ms 9.0800us 20.277ms cudaMallocManaged
2.13% 12.063ms 84 143.60us 11.176us 157.59us cuMemcpyDtoHAsync
1.20% 6.7851ms 1280 5.3000us 4.5340us 19.189us cuLaunchKernel
0.29% 1.6442ms 407 4.0390us 133ns 187.16us cuDeviceGetAttribute
0.28% 1.5656ms 4 391.41us 387.70us 395.00us cuDeviceTotalMem
0.25% 1.4421ms 1619 890ns 159ns 579.67us cuCtxSetCurrent
0.19% 1.0798ms 246 4.3890us 3.7420us 17.665us cuMemcpyHtoDAsync
0.15% 853.90us 6 142.32us 27.310us 519.82us cudaMemset
0.11% 647.30us 42 15.411us 11.264us 49.087us cuMemcpyDtoH
0.07% 412.45us 32 12.889us 1.7620us 191.92us cuStreamCreate
0.03% 168.17us 4 42.041us 36.633us 56.245us cuDeviceGetName
0.02% 112.19us 32 3.5050us 2.1360us 21.674us cuStreamDestroy
0.02% 108.93us 373 292ns 138ns 4.2190us cuGetProcAddress
0.01% 44.237us 55 804ns 401ns 3.3740us cuModuleGetGlobal
0.01% 34.313us 42 816ns 354ns 6.8130us cuModuleGetFunction
0.00% 14.817us 1 14.817us 14.817us 14.817us cudaSetDevice
0.00% 13.742us 1 13.742us 13.742us 13.742us cuMemAlloc
0.00% 11.865us 4 2.9660us 937ns 7.8230us cuDeviceGetPCIBusId
0.00% 7.1150us 9 790ns 142ns 4.0720us cuDeviceGet
0.00% 2.6710us 1 2.6710us 2.6710us 2.6710us cuDevicePrimaryCtxGetState
0.00% 2.5190us 2 1.2590us 224ns 2.2950us cuCtxGetLimit
0.00% 2.2450us 5 449ns 231ns 744ns cuFuncGetAttribute
0.00% 1.2930us 4 323ns 180ns 528ns cuDeviceGetCount
0.00% 1.1640us 1 1.1640us 1.1640us 1.1640us cuInit
0.00% 984ns 4 246ns 162ns 333ns cuDeviceGetUuid
0.00% 674ns 1 674ns 674ns 674ns cuDevicePrimaryCtxSetFlags
0.00% 502ns 1 502ns 502ns 502ns cuCtxGetDevice
0.00% 359ns 1 359ns 359ns 359ns cuDevicePrimaryCtxRelease
0.00% 215ns 1 215ns 215ns 215ns cuDriverGetVersion
..

So using #734 I am able to avoid the

TODO: Go back to the nrnivmodl-core based build and see how this could be tested there. (tomorrow)
Summary of the IBM XL Compiler

Issues reported on the llvm openmp-dev mailing list: https://lists.llvm.org/pipermail/openmp-dev/2022-January/004276.html
+ CXX=xlc++_r
+ CXXFLAGS='-fopenmp -fPIC -qsmp=omp -qoffload -g -O2'
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -c test.cpp
+ ar cq libtest.a test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test1 main.cpp -L. -ltest
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test2 main.cpp test.o
+ rm test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -fpic -shared test.cpp -o libtest.so
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-269 fpic is not supported on this Operating System platform. Option fpic will be ignored.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-308 The device linker only supports static linking. Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/minilink: warning: 1501-308 The device linker only supports static linking. Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test3 main.cpp -L. -ltest -Wl,-rpath .
nvlink error : Undefined reference to 'y' in '/tmp/24507_0.o'
...
...
$ nvprof ./test1
==29304== NVPROF is profiling process 29304, command: ./test1
--> 0
--> 1
--> 4
--> 2
--> 3
==29304== Profiling application: ./test1
==29304== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 79.20% 41.659us 1 41.659us 41.659us 41.659us __xl_main_l11_OL_1
11.99% 6.3040us 3 2.1010us 1.8560us 2.5280us [CUDA memcpy DtoH]
4.98% 2.6200us 1 2.6200us 2.6200us 2.6200us __xl__Z4testv_l8_OL_1
Historically famous one:
vs.
In
```cpp
#if 0
#pragma omp target update to(_displ[0:nrb->_displ_cnt])
#else
#pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
#endif
#pragma omp target
{
    printf("nrb->_displ :%p \n", nrb->_displ);
}
abort();
```
And this produces the following in the first and second case:
```
Updating nrb now
nrb->_displ :0x7ffef000b100
Aborted (core dumped)
....
nrb->_displ :0x26a6b9c0
Aborted (core dumped)
```
How to test this?
Outstanding issues?
Running for 0 msec gives:

CI_BRANCHES:NMODL_BRANCH=hackathon_main,NEURON_BRANCH=master,