Added hybrid MPI+OpenMP test in CI #299

iomaganaris · 2020-04-29T17:43:41Z

Added the nrntraub test, which runs with 4 MPI ranks and 9 threads on BB5 using the SoA CoreNEURON build.
It uses https://github.com/iomaganaris/nrntraub/tree/icei, which creates the coredat by default in the NEURON run.
Closes "Enable multi-threading tests (MPI+OpenMP) under CI" #292

@pramodk I still haven't managed to run NEURON with the threading option on nrntraub. If you think this should be tested, maybe we can have a look together at some point.
Also, let me know if I should create a PR from my fork of nrntraub.

pramodk · 2020-04-30T05:46:40Z

I think we should do that. Can you put the instructions and the error message here, and tag Michael?

iomaganaris · 2020-04-30T08:06:24Z

Hello @nrnhines
We were trying to run the nrntraub test from https://github.com/pramodk/nrntraub/tree/icei with threading enabled in NEURON to launch CoreNEURON from NEURON and test OpenMP.
After cloning the repo I did the following:

nrnivmodl mod
srun -n 1 ./x86_64/special -c nthread=9 -mpi -c mytstop=100 -c use_coreneuron=0 init.hoc

Note that I am using 1 rank because pc.nthread is set only if pc.nhost == 1, and I am setting use_coreneuron=0 for debugging in this case. The same issue occurs with use_coreneuron=1.
I get the following error:

...
SetupTime: 4.8000002
mytstop  100
/gpfs/bbp.cscs.ch/project/proj16/magkanar/spack/software/install/linux-rhel7-x86_64/intel-19.0.4.243/neuron-develop-3csnze/x86_64/bin/nrniv: usable mindelay is 0 (or less than dt for fixed step method)
 in init.hoc near line 65
 prun()
       ^
        finitialize(-70)
      init()
    stdinit()
  prun()

I figured out that the issue comes from calling stdinit() from prun() in hoc/parlib.hoc.
I am using NEURON master and Intel compiler.
Could you help us with this issue?
Thank you very much in advance!

nrnhines · 2020-04-30T12:38:42Z

If you are using threads, you cannot have any NetCon.delay equal to 0 (or less than dt). Of the 109982 NetCons, 265 have a delay of 0. To check whether that is the problem, try again with

diff --git a/hoc/parlib2.hoc b/hoc/parlib2.hoc
index d9eb164..1fbdee3 100755
--- a/hoc/parlib2.hoc
+++ b/hoc/parlib2.hoc
@@ -50,7 +50,7 @@ proc par_netstim_create() {local gid  localobj cell, syn, nc, ns, r
                netstims.append(ns)
                nc = new NetCon(ns.pp, syn)
                netstim_netcons.append(nc)
-               nc.delay = 0
+               nc.delay = 1
                r = new Random()
                r.negexp(1)
 //             r.Isaac64(netstim_random_seedoffset + netstim_base_)

With MPI and nthread=1 it is generally OK to have NetCon.delay=0, but only for NetCons that are not interprocessor (i.e. source and target must be on the same process).
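
A minimal sketch of the constraint described above, written as plain Python so it runs without NEURON. The function name `find_short_delays` and the sample delay values are hypothetical, and `dt = 0.025` is an assumption inferred from the spacing of the spike times reported in this thread. It only flags delays that would trigger the "usable mindelay is 0 (or less than dt ...)" error when threads are enabled.

```python
from typing import Iterable, List, Tuple


def find_short_delays(delays: Iterable[float], dt: float) -> List[Tuple[int, float]]:
    """Return (index, delay) pairs for every delay smaller than dt.

    With threads enabled, a delay of 0 (or any value below dt for the
    fixed step method) is not usable, so these are the entries to fix.
    """
    return [(i, d) for i, d in enumerate(delays) if d < dt]


# Hypothetical delays; in a real model these would be read from the NetCons.
sample_delays = [0.0, 1.0, 0.5, 0.0, 2.0]
dt = 0.025  # assumed fixed time step

offenders = find_short_delays(sample_delays, dt)
print(f"{len(offenders)} of {len(sample_delays)} delays are below dt: {offenders}")
```

In a real run the list of delays would have to come from the model itself (for example by collecting nc.delay where the NetCons are created in hoc/parlib2.hoc); that collection step is not shown here.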

nrnhines · 2020-04-30T12:43:03Z

By the way, I noticed another problem when launching python from within the nrntraub repository.

hines@hines-T7500:~/models/nrntraub-icei$ python
Python 3.7.6 (default, Feb 17 2020, 15:09:28) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import neuron
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hines/neuron/nrncmake/build/install/lib/python/neuron/__init__.py", line 132, in <module>
    import nrn
ModuleNotFoundError: No module named 'nrn'
>>>

This seems to be an artifact of having a 'hoc' folder in the repository.

iomaganaris · 2020-05-19T15:21:47Z

I got some time to work on this test again. Thank you very much @nrnhines for your suggestion to set nc.delay = 1. NEURON and CoreNEURON with threading worked with this.
However, I get the following issues with threading enabled.
First, NEURON generates different spikes when the simulation runs with more than one thread and more than one MPI rank, compared with running it with 1 MPI rank and multiple threads, or with multiple MPI ranks and no threading.
For example:

bash-4.2$ srun -n 1 ./x86_64/special -mpi -c use_coreneuron=0 -c nthread=36 -c mytstop=100 init.hoc
bash-4.2$ srun -n 4 ./x86_64/special -mpi -c use_coreneuron=0 -c nthread=9 -c mytstop=100 init.hoc
bash-4.2$ sort -n -k'1,1' -k2 < out1.dat | awk 'NR==1 { print; next } { printf "%.3f\t%d\n", $1, $2 }' > out1.sorted
bash-4.2$ sort -n -k'1,1' -k2 < out4.dat | awk 'NR==1 { print; next } { printf "%.3f\t%d\n", $1, $2 }' > out4.sorted
bash-4.2$ sdiff -s out1.sorted out4.sorted
10.375  186                                                   <
                                                              > 10.400  186
                                                              > 11.125  199
11.150  199                                                   <
                                                              > 12.950  220
12.975  220                                                   <
                                                              > 13.000  188
13.025  188                                                   <
13.025  264                                                   | 13.050  264
                                                              > 13.525  102
13.550  102                                                   <
13.675  288                                                   <
                                                              > 13.700  288
                                                              > 13.925  323
13.950  323                                                   <
14.275  312                                                   <
                                                              > 14.300  312
                                                              > 14.300  318
14.325  318                                                   | 14.350  87
14.350  192                                                   <
14.375  87                                                    <
...

During the first timesteps the spikes are the same, but later the spike times differ, in most cases by 1 timestep. Running NEURON with 36 MPI ranks and 1 thread generates the same spikes as 1 MPI rank with 36 threads.
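
To quantify the "differ by 1 timestep" observation rather than reading it off sdiff, the shift can be counted per gid. The following is only a sketch: `parse_spikes` and `step_offsets` are hypothetical helper names, `dt = 0.025` is an assumption based on the spacing of the spike times above, and the sample lines are three of the mismatches copied from the sdiff output. It pairs the k-th spike of each gid in one run with the k-th spike of the same gid in the other run, which is only meaningful while both runs produce the same number of spikes per gid.

```python
from collections import Counter, defaultdict


def parse_spikes(lines):
    """Parse 'time gid' lines into {gid: sorted list of spike times}.

    Lines that do not hold exactly a float and an int (such as a header
    line) are skipped.
    """
    spikes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            t, gid = float(fields[0]), int(fields[1])
        except ValueError:
            continue
        spikes[gid].append(t)
    return {gid: sorted(times) for gid, times in spikes.items()}


def step_offsets(run_a, run_b, dt):
    """Count spike-time shifts (run_b minus run_a) in whole timesteps.

    Spikes are paired in order per gid; extra spikes in either run are
    ignored because zip stops at the shorter list.
    """
    counts = Counter()
    for gid in run_a.keys() & run_b.keys():
        for ta, tb in zip(run_a[gid], run_b[gid]):
            counts[round((tb - ta) / dt)] += 1
    return dict(counts)


# Three mismatching spikes taken from the sdiff output (1x36 vs 4x9 run).
run_1x36 = parse_spikes(["10.375\t186", "11.150\t199", "12.975\t220"])
run_4x9 = parse_spikes(["10.400\t186", "11.125\t199", "12.950\t220"])

print(step_offsets(run_1x36, run_4x9, dt=0.025))
```

For full output files, the same functions could be fed with `open("out1.dat")` and `open("out4.dat")`; a histogram concentrated at -1, 0 and +1 would confirm that the runs mostly disagree by a single timestep.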
The other issue is with the spikes generated by CoreNEURON. In all of the above cases, CoreNEURON generates the same spikes as NEURON at the beginning, but after a certain timestep the spikes start to shift in time. For example:

bash-4.2$ srun -n 4 ./x86_64/special -mpi -c use_coreneuron=1 -c nthread=9 -c mytstop=100 init.hoc
bash-4.2$ sort -n -k'1,1' -k2 < out.dat | awk 'NR==1 { print; next } { printf "%.3f\t%d\n", $1, $2 }' > out4.cn.sorted
bash-4.2$ sdiff -s out4.sorted out4.cn.sorted
bash-4.2$ sdiff -s out4.sorted out4.cn.sorted | more
                                                              > 5.900   160
                                                              > 6.050   176
                                                              > 6.050   180
6.750   160                                                   <
6.750   176                                                   <
6.825   180                                                   <
6.925   188                                                   <
                                                              > 6.950   188
6.975   168                                                   <
                                                              > 7.000   168
                                                              > 7.375   287
7.400   287                                                   <
                                                              > 7.550   290
7.575   290                                                   <
...

I am using the icei branch of my fork of nrntraub, which includes the change in the delay and allows selecting the number of threads when more than 1 MPI rank is used.
Are the issues mentioned above related to the thread implementation, or is there something wrong with the test?
Any help would be greatly appreciated.

Thank you very much,
Ioannis

tests/jenkins/Jenkinsfile

pramodk · 2020-08-16T18:04:40Z

@nrnhines: Similar to the olfactory bulb model, do you think the issue described above might be with the model itself? In that case I will go ahead and use whatever baseline the model provides with X MPI ranks and Y threads per MPI rank.

nrnhines · 2020-08-16T18:10:17Z

Discrepancies between NEURON and CoreNEURON in this situation are presumptively bugs. I assume there are no intra-NEURON or intra-CoreNEURON differences on this time scale across different nhost and nthread values.

iomaganaris added 9 commits April 23, 2020 18:26

Added nrntraub test

f382ce4

Fixed checking out nrntraub

51c79c5

Fixed nrnivmodl stage of nrntraub

a7098b6

Avoid doing an exclusive allocation

f9143ce

Run nrntraub with corenrn SoA and openmp and AoS pure MPI

e34f520

Added --voltage 1000 in corenrn to get same results

86343c7

Fix SoA nrnrtraub run

e591d01

Run neuron test only once

77f6357

Set proper number of threads and ranks to run SoA and AoS builds

8cedc1c

iomaganaris requested a review from pramodk April 29, 2020 17:43

pramodk reviewed May 19, 2020

iomaganaris added 2 commits May 20, 2020 09:55

Changed nrntraub repo and coreneuron command for nrntraub

cd8a0fa

Fixed execution with threads and added CORENEURONLIB

2c25e34

iomaganaris mentioned this pull request Jun 29, 2021

Different results with and without OpenMP threads #154

Open

olupton added the frontiers-paper-2021 label Nov 3, 2021

olupton removed the frontiers-paper-2021 label Feb 21, 2022
