
Improving GROMACS performance #89

Open
ppxasjsm opened this issue May 25, 2019 · 12 comments

@ppxasjsm (Collaborator)

With default settings, a 200 ps equilibration simulation of a ~50k-atom system for binding free energy calculations takes more than 24 hours to complete on a typical GPU cluster with an optimised GROMACS installation.

@lohedges (Member) commented Jun 3, 2019

Just to clarify: you mention binding free energy calculations, but are you seeing poor performance solely for an equilibration?

Thoughts / questions below:

  • Could you post a GROMACS log file for the equilibration? Hardware detection information will be at the top, timing statistics at the bottom. Note that GROMACS tries to optimise certain options depending on what hardware it detects, which might end up being sub-optimal if it gets things wrong (see below).
  • Was this simulation run on your cluster while the other nodes were active? If so, could you try running it as the only job on the cluster? During the workshop week it was apparent that gmx was detecting the resources of the whole cluster, rather than those of the individual Jupyter server. As such, GROMACS processes tried to grab too many resources and ended up running far more slowly than expected. (This was apparent when running top, which showed a CPU load in the 1000s of percent.) See the sketch after this list for the sort of explicit resource limits I mean.
  • If this is for an equilibration containing a perturbable molecule, could you possibly run an equilibration for a similar-sized system that only contains regular molecules? (Perhaps using the same protein and one of the two ligands, solvated in the same size box.) I wonder if something funny is going on with dummy atoms, or the additional properties for lambda = 1 (which should be redundant).
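
As a rough sketch (the thread counts and GPU id are placeholders and should match whatever the job is actually allocated, e.g. via --cpus-per-task and --gres), explicitly limiting mdrun's resources looks something like:

  # One thread-MPI rank with 8 OpenMP threads, pinned to cores, using only GPU 0.
  # This stops mdrun from auto-detecting (and grabbing) every core on the node.
  gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 8 -pin on -gpu_id 0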

I'll update this comment if I come up with any more ideas.

(I would try running on our cluster here, but it isn't optimised for GPU simulations since we can't enable certain kernel features that allow acceleration through overclocking. When I tweaked options and command-line parameters to improve the speed of the ethane-methanol simulations, I saw no improvement regardless of what I tried.)

@ppxasjsm (Collaborator, Author) commented Jun 3, 2019

The cluster is being used right now, but I'll see what I can post.

Equilibration run (attached as equib.tar.gz)

Command line:
  gmx mdrun -v -deffnm md -nb gpu -gpu_id 0 -nt 8

GROMACS version:    2019.1
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 7.3.0
C compiler flags:    -mavx2 -mfma     -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /usr/bin/c++ GNU 7.3.0
C++ compiler flags:  -mavx2 -mfma    -std=c++11   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /usr/local/cuda-9.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Tue_Jun_12_23:07:04_CDT_2018;Cuda compilation tools, release 9.2, V9.2.148
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        10.10
CUDA runtime:       9.20


Running on 1 node with total 16 cores, 32 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  16] [   1  17] [   2  18] [   3  19] [   4  20] [   5  21] [   6  22] [   7  23]
      Socket  1: [   8  24] [   9  25] [  10  26] [  11  27] [  12  28] [  13  29] [  14  30] [  15  31]
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
    #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
[...]

        M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check            5990.760160       53916.841     0.0
 NxN Ewald Elec. + LJ [F]           6007635.399744   396503936.383    97.9
 NxN Ewald Elec. + LJ [V&F]           60743.534464     6499558.188     1.6
 1,4 nonbonded interactions            1237.912379      111412.114     0.0
 Shift-X                                 51.029979         306.180     0.0
 Bonds                                  243.502435       14366.644     0.0
 Angles                                 859.408594      144380.644     0.0
 Propers                               1508.515085      345449.954     0.1
 Impropers                               98.200982       20425.804     0.0
 Virial                                  51.075024         919.350     0.0
 Update                                5097.950979      158036.480     0.0
 Stop-CM                                 51.080958         510.810     0.0
 Calc-Ekin                              102.059958        2755.619     0.0
 Lincs                                  465.409308       27924.558     0.0
 Lincs-Mat                             2390.447808        9561.791     0.0
 Constraint-V                         10186.552796       81492.422     0.0
 Constraint-Vir                          48.653605        1167.687     0.0
 Settle                                3085.261704      996539.530     0.2
-----------------------------------------------------------------------------
 Total                                               404972661.001   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    8       1001       4.651         78.142   2.7
 Launch GPU ops.        1    8     200002       9.354        157.143   5.5
 Force                  1    8     100001      14.120        237.211   8.3
 Wait PME GPU gather    1    8     100001      20.025        336.417  11.7
 Reduce GPU PME F       1    8     100001       2.262         38.009   1.3
 Wait GPU NB local      1    8     100001      39.169        658.038  22.9
 NB X/F buffer ops.     1    8     199001      18.270        306.940  10.7
 Write traj.            1    8        201       1.438         24.166   0.8
 Update                 1    8     200002      33.657        565.435  19.7
 Constraints            1    8     200004      21.894        367.823  12.8
 Rest                                           6.048        101.599   3.5
-----------------------------------------------------------------------------
 Total                                        170.888       2870.923 100.0
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:     1367.041      170.888      800.0
                 (ns/day)    (hour/ns)
Performance:       50.560        0.475
Finished mdrun on rank 0 Sat May 25 23:22:20 2019

lambda = 0.00 run of the same system, but with a perturbed molecule:

GROMACS:      gmx mdrun, version 2019.1
Executable:   /export/users/common/Gromacs19.1/bin/gmx
Data prefix:  /export/users/common/Gromacs19.1
Working dir:  /export/users/ppxasjsm/Projects/Tyk2/BSS/GROMACS/6340/TYK2_17_8/bound/lambda_0.0000
Process ID:   12141
Command line:
  gmx mdrun -v -deffnm gromacs -nb gpu -gpu_id 0 -nt 8

GROMACS version:    2019.1
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 7.3.0
C compiler flags:    -mavx2 -mfma     -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /usr/bin/c++ GNU 7.3.0
C++ compiler flags:  -mavx2 -mfma    -std=c++11   -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /usr/local/cuda-9.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2018 NVIDIA Corporation;Built on Tue_Jun_12_23:07:04_CDT_2018;Cuda compilation tools, release 9.2, V9.2.148
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        10.10
CUDA runtime:       9.20


Running on 1 node with total 16 cores, 32 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  16] [   1  17] [   2  18] [   3  19] [   4  20] [   5  21] [   6  22] [   7  23]
      Socket  1: [   8  24] [   9  25] [  10  26] [  11  27] [  12  28] [  13  29] [  14  30] [  15  31]
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible
    #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: compatible




     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 8 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    8      30001     342.311       5750.826   1.7
 Launch GPU ops.        1    8    3000001     288.609       4848.619   1.4
 Force                  1    8    3000001    7842.374     131751.752  38.9
 PME mesh               1    8    3000001    9103.332     152935.835  45.2
 Wait Bonded GPU        1    8      30001       0.149          2.505   0.0
 Wait GPU NB local      1    8    3000001      51.893        871.810   0.3
 NB X/F buffer ops.     1    8    5970001     484.335       8136.822   2.4
 Write traj.            1    8       6018      39.080        656.539   0.2
 Update                 1    8    6000002    1231.074      20682.030   6.1
 Constraints            1    8    6000004     650.709      10931.903   3.2
 Rest                                         123.452       2073.993   0.6
-----------------------------------------------------------------------------
 Total                                      20157.319     338642.633 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1    8    6000002    3330.029      55944.432  16.5
 PME gather             1    8    6000002    1925.679      32351.369   9.6
 PME 3D-FFT             1    8   12000004    3380.822      56797.759  16.8
 PME solve Elec         1    8    6000002     449.137       7545.490   2.2
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:   161258.471    20157.319      800.0
                         5h35:57
                 (ns/day)    (hour/ns)
Performance:       12.859        1.866
Finished mdrun on rank 0 Sun May 26 05:20:04 2019

Slurm submission file:

#!/bin/bash -login
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --job-name=NAME
#SBATCH --output=NAME.out
#SBATCH --time=7-00:00:00
#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding

# Disable Sire analytics.
export SIRE_DONT_PHONEHOME=1
export SIRE_SILENT_PHONEHOME=1

# Make sure nvcc is in the path.
export PATH=/usr/local/cuda-9.2/bin:$PATH

# Set path to local AmberTools installation.
#export AMBERHOME=/mnt/shared/software/amber18

# Source the GROMACS shell rc, making sure mount point exists.
#while [ ! -f /mnt/shared/software/gromacs/bin/GMXRC ]; do
#    sleep 1s
#done
#source /mnt/shared/software/gromacs/bin/GMXRC

# Set the OpenMM plugin directory.
export OPENMM_PLUGIN_DIR=/export/users/ppxasjsm/miniconda3/lib/plugins

# Make a unique directory for this job and move to it.
mkdir $SLURM_SUBMIT_DIR/$SLURM_JOB_ID
cd $SLURM_SUBMIT_DIR/$SLURM_JOB_ID

export JOB_DIR=$SLURM_SUBMIT_DIR

# Run the script using the BioSimSpace python interpreter.

# Make sure GPU ID 0 is first.

# Forwards.
time /export/users/ppxasjsm/miniconda3/bin/sire_python --ppn=8 $JOB_DIR/binding_freenrg_gmx.py LIG0 LIG1 0 &

# Make sure GPU ID 1 is first.
export CUDA_VISIBLE_DEVICES=1,0

# Backwards.
time /export/users/ppxasjsm/miniconda3/bin/sire_python --ppn=8 $JOB_DIR/binding_freenrg_gmx.py LIG1 LIG0 1
wait

@ppxasjsm (Collaborator, Author) commented Jun 3, 2019

Finished there slightly too early.

If I run the same script, but do not update the GROMACS command-line arguments, i.e.

gmx mdrun -v -deffnm

and not add:

from collections import OrderedDict  # needed for the ordered argument dict

d = OrderedDict([('mdrun', True), ('-v', True), ('-deffnm', 'gromacs'), ('-nb', 'gpu'), ('-gpu_id', num3), ('-nt', 8)])
freenrg._update_run_args(d)

I get the following performance:

Command line:
  gmx mdrun -v -deffnm md


Back Off! I just backed up md.log to ./#md.log.2#
Reading file md.tpr, VERSION 2019.1 (single precision)
Changing nstlist from 10 to 100, rlist from 1.2 to 1.294

Using 8 MPI threads
Using 4 OpenMP threads per tMPI thread

On host node01 4 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
  PP:0,PP:0,PP:1,PP:1,PP:2,PP:2,PP:3,PP:3
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU

Back Off! I just backed up md.trr to ./#md.trr.2#

Back Off! I just backed up md.edr to ./#md.edr.2#

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'BioSimSpace System'
100000 steps,    100.0 ps.
step  200: timed with pme grid 72 72 72, coulomb cutoff 1.200: 19205.2 M-cycles
step  400: timed with pme grid 60 60 60, coulomb cutoff 1.333: 21291.8 M-cycles
^C

Received the INT signal, stopping within 200 steps

step  600: timed with pme grid 52 52 52, coulomb cutoff 1.538: 20658.8 M-cycles


Dynamic load balancing report:
 DLB was locked at the end of the run due to unfinished PP-PME balancing.
 Average load imbalance: 3.1%.
 The balanceable part of the MD step is 42%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 1.3%.


               Core t (s)   Wall t (s)        (%)
       Time:     2308.998       72.164     3199.6
                 (ns/day)    (hour/ns)
Performance:        0.839       28.596

GROMACS reminds you: "A Pretty Village Burning Makes a Pretty Fire" (David Sandstrom)

equib.tar.gz

@lohedges (Member) commented Jun 3, 2019

Interesting. I tried explicitly adding -nb gpu on our cluster but it made no difference. (Perhaps it won't for a small system.) Looking at this page, it seems that there are a bunch of calculation types that you can set to auto, cpu, or gpu, such as nb. By default, each is set to auto and should use a compatible GPU if one is found. It seems stupid that you get better performance by setting this explicitly, since the log clearly states that it has found a compatible GPU, so it should use it for the non-bonded calculation! Do you know if you get even better performance by enabling the gpu option for other calculations, such as bonded and pme, or is the non-bonded calculation the real bottleneck?
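
For reference, forcing the offload targets explicitly would look something like the sketch below (untested on your hardware; flag availability depends on the GROMACS version):

  # -nb, -pme and -bonded each accept auto, cpu or gpu (-bonded needs GROMACS 2019+).
  # mdrun will refuse combinations it cannot support for a given system (e.g. some
  # perturbed setups), so treat this as a sketch rather than a guaranteed recipe.
  gmx mdrun -v -deffnm md -nb gpu -pme gpu -bonded gpu -ntmpi 1 -ntomp 8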

Looking at the output above...

PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU

... it looks like it has only chosen to do short-ranged and bonded interactions on the GPU.

I notice that you also explicitly set the gpu_id. Is this needed to get good performance, or does GROMACS not autodetect things correctly? I've not done this myself and GROMACS still found the correct GPU available on the node. (Perhaps you were doing this for the forward and reverse simulations.)

I also notice that you set the number of threads with -nt 8. When you don't do this, it looks like GROMACS still sets things correctly:

Using 8 MPI threads
Using 4 OpenMP threads per tMPI thread

I'm a little confused by the second output above (with the unmodified arguments) since it seems to be missing some info at the start of the log, e.g. the GROMACS version info and detected hardware.

@lohedges (Member) commented Jun 5, 2019

For reference, here are the relevant sections from a GROMACS log for one of the free legs of an ethane-methanol perturbation on BlueCrystal 4:

GROMACS:      gmx mdrun, version 2018
Executable:   /mnt/storage/software/apps/GROMACS-2018-MPI-GPU-Intel-2017/bin/gmx_mpi
Data prefix:  /mnt/storage/software/apps/GROMACS-2018-MPI-GPU-Intel-2017
Working dir:  /mnt/storage/scratch/lh17146/solvation_freenrgy/ethane_methanol/free/lambda_0.0000
Command line:
  gmx_mpi mdrun -v -deffnm gromacs

GROMACS version:    2018
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.5-fma-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
Built on:           2018-10-02 11:51:51
Built by:           [email protected] [CMAKE]
Build OS/arch:      Linux 3.10.0-514.10.2.el7.x86_64 x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /mnt/storage/apps/intel/impi/2017.1.132/bin64/mpiicc Intel 17.0.1.20161005
C compiler flags:    -march=core-avx2   -O3 -xHost -ip -no-prec-div -static-intel -std=gnu99  -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits  
C++ compiler:       /mnt/storage/apps/intel/impi/2017.1.132/bin64/mpiicpc Intel 17.0.1.20161005
C++ compiler flags:  -march=core-avx2   -O3 -xHost -ip -no-prec-div -static-intel -std=c++11   -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits  
CUDA compiler:      /mnt/storage/software/libraries/nvidia/cuda-9.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Sep__1_21:08:03_CDT_2017;Cuda compilation tools, release 9.0, V9.0.176
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-march=core-avx2;-O3;-xHost;-ip;-no-prec-div;-static-intel;-std=c++11;-O3;-DNDEBUG;-ip;-funroll-all-loops;-alias-const;-ansi-alias;-no-prec-div;-fimf-domain-exclusion=14;-qoverride-limits;
CUDA driver:        9.10
CUDA runtime:       9.0


Running on 1 node with total 28 cores, 28 logical cores, 1 compatible GPU
Hardware detected on host gpu22.bc4.acrc.priv (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0] [   1] [   2] [   3] [   4] [   5] [   6] [   7] [   8] [   9] [  10] [  11] [  12] [  13]
      Socket  1: [  14] [  15] [  16] [  17] [  18] [  19] [  20] [  21] [  22] [  23] [  24] [  25] [  26] [  27]
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible

...

Using 1 MPI process
Using 28 OpenMP threads 

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0

NOTE: GROMACS was configured without NVML support hence it can not exploit
      application clocks of the detected Tesla P100-PCIE-16GB GPU to improve performance.
      Recompile with the NVML library (compatible with the driver used) or set application clocks manually.

...

       P P   -   P M E   L O A D   B A L A N C I N G

 PP/PME load balancing changed the cut-off and PME settings:
           particle-particle                    PME
            rcoulomb  rlist            grid      spacing   1/beta
   initial  1.200 nm  1.205 nm      25  25  25   0.120 nm  0.384 nm
   final    1.200 nm  1.205 nm      25  25  25   0.120 nm  0.384 nm
 cost-ratio           1.00             1.00
 (note that these numbers concern only part of the total PP and PME load)


	M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 NB Free energy kernel               751541.952726      751541.953     0.6
 Pair Search distance check            1664.380960       14979.429     0.0
 NxN Ewald Elec. + LJ [F]           1763797.692096   116410647.678    93.5
 NxN Ewald Elec. + LJ [V&F]           17819.671104     1906704.808     1.5
 1,4 nonbonded interactions               5.040117         453.611     0.0
 Calc Weights                          3967.507935      142830.286     0.1
 Spread Q Bspline                    169280.338560      338560.677     0.3
 Gather F Bspline                    169280.338560     1015682.031     0.8
 3D-FFT                              435338.844320     3482710.755     2.8
 Solve PME                              624.981650       39998.826     0.0
 Shift-X                                 13.227645          79.366     0.0
 Bonds                                    0.560013          33.041     0.0
 Angles                                   6.420096        1078.576     0.0
 Propers                                  4.680045        1071.730     0.0
 Virial                                 134.502690        2421.048     0.0
 Update                                1322.502645       40997.582     0.0
 Stop-CM                                 13.230290         132.303     0.0
 P-Coupling                             132.252645         793.516     0.0
 Calc-Ekin                              264.505290        7141.643     0.0
 Lincs                                    6.000024         360.001     0.0
 Lincs-Mat                               72.000288         288.001     0.0
 Constraint-V                          2649.007947       21192.064     0.0
 Constraint-Vir                         132.152643        3171.663     0.0
 Settle                                 879.003516      283918.136     0.2
-----------------------------------------------------------------------------
 Total                                               124466788.723   100.0
-----------------------------------------------------------------------------


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 28 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   28       5001       9.126        613.290   0.7
 Launch GPU ops.        1   28     500001      25.700       1727.034   2.0
 Force                  1   28     500001     855.686      57502.194  67.6
 PME mesh               1   28     500001     208.135      13986.672  16.4
 Wait GPU NB local      1   28     500001       4.359        292.954   0.3
 NB X/F buffer ops.     1   28     995001      99.815       6707.581   7.9
 Write traj.            1   28       1002       1.018         68.387   0.1
 Update                 1   28    1000002      21.893       1471.195   1.7
 Constraints            1   28    1000002      25.935       1742.839   2.0
 Rest                                          14.493        973.919   1.1
-----------------------------------------------------------------------------
 Total                                       1266.160      85086.066 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread             1   28    1000002      77.285       5193.538   6.1
 PME gather             1   28    1000002      52.921       3556.271   4.2
 PME 3D-FFT             1   28    2000004      70.670       4749.053   5.6
 PME solve Elec         1   28    1000002       3.495        234.855   0.3
-----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)
       Time:    35452.480     1266.160     2800.0
                 (ns/day)    (hour/ns)
Performance:       68.238        0.352
Finished mdrun on rank 0 Wed Mar 13 09:43:37 2019

As you can see, GROMACS correctly detected the CPUs and GPU on the node without needing additional command-line arguments. The only concern that it raises is the lack of NVML support, but this can't be used on our cluster anyway. For this system, I see no improvement in performance if I set -nb gpu.

@lohedges (Member) commented Jun 5, 2019

I also noticed that our GROMACS version isn't compiled to use thread_mpi as its MPI library. According to this, I shouldn't expect to get as good single node performance as you.

@ppxasjsm (Collaborator, Author) commented Jun 6, 2019

So I see better performance if I only use one GPU rather than four. I'm not sure setting the gpu_id is necessary; I was just playing around with that option before I had figured out how to restrict the visible GPUs with a Slurm script.

You get 68 ns/day on BlueCrystal for non-perturbed equilibrations?

@lohedges (Member) commented Jun 6, 2019

No, my results are for the actual lambda = 0 stage of the free leg, so it's not a direct comparison. I was just showing how GROMACS auto hardware detection seemed to work for me. I've not got data for the equilibration part, since it was run in a temporary working directory.

I'd be happy to test performance here if you give me your input files, although BC4 seems to be totally unresponsive today.

@lohedges changed the title from "Gromacs default settings are very poor" to "Improving GROMACS performance" on Jun 10, 2019
@lohedges (Member)

Changed to a more appropriate issue title. From poking around, it seems like this isn't a BioSimSpace-specific problem. We should use this thread to debug and document reliable ways of getting good GROMACS performance.

@JenkeScheen (Collaborator)

For a bound FEP simulation with gromacs/20.4 (TYK2 in a 20 nm water box) I was seeing similar behaviour on our cluster (presumably the same one @ppxasjsm is referring to). What helped in my case was using mpirun with a single copy, i.e.
  mpirun -np 1 gmx mdrun -v -deffnm gromacs 1> gromacs.log 2> gromacs.err
while also supplying the Slurm job with a single GPU. Any other configuration ended up oversubscribing the CPUs on our nodes. Finished simulation output:

step 2000000, remaining wall clock time:   0 s      
        Core t (s)  Wall t (s)    (%)
    Time:  587682.841  18365.100   3200.0
                 5h06:05
              (ns/day)  (hour/ns)
Performance:    18.818    1.275
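
For completeness, the Slurm setup I'm describing looks roughly like this (a sketch only; the core count and file names are placeholders for our nodes):

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8
  #SBATCH -p gpu
  #SBATCH --gres=gpu:1

  # Run a single MPI copy of mdrun, bound to the one GPU that Slurm exposes.
  mpirun -np 1 gmx mdrun -v -deffnm gromacs 1> gromacs.log 2> gromacs.err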

@kexul (Contributor) commented Apr 21, 2022

Updating GROMACS to the latest version might help. From its release notes:

Free-energy kernels are accelerated using SIMD, which make free-energy calculations up to three times as fast when using GPUs.

My output for a system with 54,016 atoms on an A100 GPU:

Equilibration:

                      Core t (s)   Wall t (s)        (%)
       Time:     5329.276      231.708     2300.0
                 (ns/day)    (hour/ns)
Performance:       74.577        0.322


Production (free energy=yes):

               Core t (s)   Wall t (s)        (%)
       Time:   200244.094     8706.265     2300.0
                         2h25:06
                 (ns/day)    (hour/ns)
Performance:       39.696        0.605

@lohedges (Member)

Thanks for reporting, that's good to know. Since we don't bundle a version of GROMACS it's hard to provide settings that are optimised for any version (and hardware environment). I'm glad the free energy kernels are improving, though.
