compiling and running ASPECT on TACC Stampede3 #5569

Closed
ryanstoner1 opened this issue Feb 7, 2024 · 17 comments

@ryanstoner1
Contributor

Compiling and running ASPECT on TACC Stampede3 diverges from Stampede2 because gcc/9.1.0 is no longer available (see the excellent and helpful previous wiki). I've successfully compiled ASPECT on Stampede3, but I run into a segfault when trying to run ASPECT.

Running mpirun -np 2 aspect convection-box.prm for the convection box cookbook:
Output for the debug (!) version of ASPECT:
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES RANK 0 PID 211033 RUNNING AT c455-011 KILLED BY SIGNAL: 11 (Segmentation fault)

Stampede3 currently uses gcc 13, and I'm using candi for the install. I worked around the compile issues mentioned in #5186 by splitting candi.sh into two scripts. The first downloads and extracts the packages; I then apply the fixes from #5186 and run the second script.

The currently loaded modules (module list) are:
1) autotools/1.3 2) xalt/3.0.1 3) TACC 4) cmake/3.28.1 5) gcc/13.2.0 6) mkl/24.0 7) impi/21.11

As far as I know, candi doesn't have a .platform file for Rocky Linux, which is what Stampede3 runs, so I used centos7 as the closest equivalent. I used Trilinos 13.2 and deal.II 9.5.2 (I also tried 9.5.1, with the same result), and otherwise the same packages as the previous wiki.

This may be related to the already open #5566, but is different in that I'm able to compile ASPECT.

@gassmoeller
Member

gassmoeller commented Feb 8, 2024

@ryanstoner1 did you try starting ASPECT with the ibrun command instead of mpirun? As far as I remember you cannot use mpirun directly on Stampede, and only ibrun will use the correct MPI parameters. Also did you do this on the compute nodes or the login nodes? Login nodes cannot start MPI processes.

@ryanstoner1
Contributor Author

That makes sense, I haven't yet run it with ibrun. Stampede3 is down right now, but as soon as it gets back up I'll test it with ibrun. I was running an interactive session on the compute nodes, so the MPI processes should get started.

@ryanstoner1
Contributor Author

@gassmoeller Ok, no luck with ibrun, unfortunately. I had the same issue running the deal.II tutorial step-32, though, so the problem appears to be upstream of ASPECT. I tried building with Trilinos 14.4 instead, but that hangs while linking libteko:

[ 82%] Building CXX object packages/teko/src/CMakeFiles/teko.dir/Epetra/Teko_InterlacedEpetra.cpp.o
[ 82%] Building CXX object packages/teko/src/CMakeFiles/teko.dir/Epetra/Teko_InverseFactoryOperator.cpp.o
[ 82%] Building CXX object packages/teko/src/CMakeFiles/teko.dir/Epetra/Teko_ReorderedMappingStrategy.cpp.o
[ 82%] Building CXX object packages/teko/src/CMakeFiles/teko.dir/Epetra/Teko_StridedEpetraOperator.cpp.o
[ 82%] Building CXX object packages/teko/src/CMakeFiles/teko.dir/Epetra/Teko_StridedMappingStrategy.cpp.o
[ 82%] Linking CXX shared library libteko.so

@ryanstoner1
Contributor Author

@gassmoeller It seems the previous issues were related to deal.II. I now have a functional version of deal.II (9.5.2) downloaded and tested, with step-32 working fine. I'm using the Intel compiler.

Compiling ASPECT still fails, though. The World Builder output mentions an internal error, but the actual compile error comes from a macro expansion in source/utilities.cc:

In file included from /work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/build/CMakeFiles/aspect.dir/Unity/unity_40_cxx.cxx:10:
/work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/source/utilities.cc:85:7: error: invalid token at start of a preprocessor expression
   85 | #  if DEAL_II_MPI_VERSION_GTE(2, 2)
      |       ^
/work2/09184/rstoner1/stampede3/software/stampede3/dealii-9.5_trilinos/installation/include/deal.II/base/config.h:454:30: note: expanded from macro 'DEAL_II_MPI_VERSION_GTE'
  454 |  ((DEAL_II_MPI_VERSION_MAJOR * 100 + \

Here's my module list output:

Currently Loaded Modules:
  1) intel/24.0   3) autotools/1.3   5) xalt/3.0.1   7) p4est/2.8.5    9) petsc/3.20-singlei64  11) trilinos/14.4.0  13) metis/5.1.0.3
  2) impi/21.11   4) cmake/3.28.1    6) TACC         8) phdf5/1.14.3  10) boost/1.83.0          12) netcdf/4.9.2

I also have quite a few warnings from world builder that I'm not familiar with such as:

/work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/contrib/world_builder/include/world_builder/assert.h:29:14: note: expanded from macro 'WBAssert'
   29 |       if (! (condition)) { \
      |              ^~~~~~~~~
/work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/contrib/world_builder/source/world_builder/features/subducting_plate_models/temperature/mass_conserving.cc:336:25: warning: explicit comparison with NaN in fast floating point mode [-Wtautological-constant-compare]
  336 |               WBAssert(!std::isnan(background_temperature), "Internal error: temp is not a number: " << background_temperature << ". In exponent: "
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/contrib/world_builder/include/world_builder/assert.h:29:14: note: expanded from macro 'WBAssert'
   29 |       if (! (condition)) { \
      |              ^~~~~~~~~
/work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/contrib/world_builder/source/world_builder/features/subducting_plate_models/temperature/mass_conserving.cc:500:25: warning: explicit comparison with NaN in fast floating point mode [-Wtautological-constant-compare]
  500 |               WBAssert(!std::isnan(temperature), "Internal error: temperature is not a number: " << temperature << '.');
      |                         ^~~~~~~~~~~~~~~~~~~~~~~
/work2/09184/rstoner1/stampede3/software/stampede3/aspect_2_5/contrib/world_builder/include/world_builder/assert.h:29:14: note: expanded from macro 'WBAssert'
   29 |       if (! (condition)) { \
      |              ^~~~~~~~~
2 warnings generated.
3 warnings generated.
4 warnings generated.
3 warnings generated.

@gassmoeller
Member

Glad to hear you got farther this time. Does that mean you now have a working version of ASPECT? (If you don't need the World Builder, you can disable it with -D ASPECT_WITH_WORLD_BUILDER=OFF; it is optional.)

The World Builder issues look like they might be an incompatibility with the new Intel compiler. Since Stampede3 uses extremely new compilers and libraries, some of our code may not be compatible with them (ASPECT 2.5 came out months before these compilers were released). We can try to look into the specific problems over time and fix them in the development version. If you find a solution before anyone else does, please post it here so we can incorporate it into the main branch.

@ryanstoner1
Contributor Author

As of the previous comment I did not have a working version of ASPECT, but now I do.

Seems like it may be a cmake issue in deal.II specific to my particular setup. I went to deal.II/base/config.h and manually changed the lines (450-451):

#  define DEAL_II_MPI_VERSION_MAJOR 
#  define DEAL_II_MPI_VERSION_MINOR

to

#  define DEAL_II_MPI_VERSION_MAJOR 3 
#  define DEAL_II_MPI_VERSION_MINOR 1

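Because cmake left those two values empty, the DEAL_II_MPI_VERSION_GTE macro expanded to an invalid preprocessor expression, which is what broke the ASPECT build in utilities.cc. The World Builder messages were still there, but they didn't prevent ASPECT from compiling. Thank you for the tip about turning the World Builder off.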
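
For anyone hitting the same error, here is a minimal sketch of why an empty version macro breaks an `#if`, and of one way a check could tolerate it. The `EXAMPLE_*` macros and `example_mpi_version_known()` are hypothetical stand-ins for illustration, not deal.II's actual definitions, and this is not necessarily how the problem was fixed upstream.

```cpp
// Hypothetical stand-in for a version macro from config.h. It is left empty
// on purpose to mimic a build in which CMake could not detect the version.
#define EXAMPLE_MPI_VERSION_MAJOR

// A direct test such as
//   #if EXAMPLE_MPI_VERSION_MAJOR * 100 >= 202
// would expand to "#if * 100 >= 202" and fail with
// "invalid token at start of a preprocessor expression".
//
// "(X + 0)" stays valid even when X is empty, because it then reads
// "( + 0)", i.e. unary plus applied to 0. A genuine major version of 0
// would be indistinguishable from "empty", which is assumed not to occur.
#if (EXAMPLE_MPI_VERSION_MAJOR + 0) == 0
#  define EXAMPLE_MPI_VERSION_KNOWN 0
#else
#  define EXAMPLE_MPI_VERSION_KNOWN 1
#endif

// Reports whether the (hypothetical) version macro carried a value.
constexpr bool example_mpi_version_known()
{
  return EXAMPLE_MPI_VERSION_KNOWN != 0;
}
```

With the empty definition above, `example_mpi_version_known()` returns `false`; changing the definition to `#define EXAMPLE_MPI_VERSION_MAJOR 3` makes it return `true`.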

If it's okay, I'd like to keep this issue open until the approach for Stampede3 is documented for the wiki.

@gassmoeller
Member

Yes, absolutely, we should keep this open until we find a better, documented way to solve this (including the World Builder). The fix you found for the MPI version likely means that cmake could not determine the MPI version correctly while building deal.II (maybe because of the new Intel MPI used on Stampede3). I will look into this when I have some time. Maybe @tjhei has seen this before?

I am compiling ASPECT myself on Stampede 3 right now, let's see if I run into the same issues.

@ryanstoner1
Contributor Author

ryanstoner1 commented Feb 14, 2024

Okay, I've started a first cut of a new wiki.

PR #5574 should help, but unless I'm mistaken, isn't Sundials a requirement moving forward? I couldn't build ASPECT from a clone of main. In that case there'll be another section on using candi for Sundials.

I'm also following the PR for candi with Trilinos 14.4.

@gassmoeller
Member

Thanks for the wiki page! Please let me know when you are done with the editing. I also compiled ASPECT last night and am working through some updates that may make it a bit easier (like #5574). I would also recommend using candi to install sundials, astyle, and deal.II, because it allows including some optimizations (you are right, your ASPECT is probably not compiling right now because deal.II is missing sundials).

@gassmoeller
Member

I updated the wiki to describe how to use candi to install astyle, sundials (required for ASPECT 2.6.0), and deal.II. It also includes instructions for disabling the annoying warnings, and the PR for the MPI issues is merged as well. Finally, my instructions for candi enable vector instructions for deal.II, which improves the performance of the GMG preconditioner significantly. @ryanstoner1 could you give the new instructions a try? Edits and improvements are welcome too.

@ryanstoner1
Contributor Author

Yes, I changed a couple of things on the wiki. The main one is that I was unsuccessful in getting ASPECT 2.4 to work with my instructions (I assume the combination with deal.II 9.5.2 is the problem). Also, the maximum time limit for Stampede3 is now 24 hours.

Testing the new set of instructions today.

@ryanstoner1
Contributor Author

Ok, successfully tested the latest version.

Two last issues:

  1. The centos7 platform file causes candi to reinstall cmake, but we already recommend that users module load cmake. This seems a bit redundant.
  2. When I compile in release mode, I have issues with saving files because isnan() is called in an if statement in core/postprocess/visualization.cc. So the warnings about NaNs do matter for the Intel compiler.
      if (std::isnan(last_output_time))
        {
          last_output_time = this->get_time() - output_interval;
          last_output_timestep = this->get_timestep_number();
        }

@gassmoeller
Member

Thanks, your changes make sense. Just out of curiosity, do you need the fp-model=precise flag? What happens without it?

Regarding 2. Can you describe what issues with saving files means? Do you get the same annoying warnings about comparison to NaNs or is there an actual bug?

@ryanstoner1
Contributor Author

There is an actual bug, but only if using the Intel compiler, it seems. If I compiled in release mode then no files would be saved after the first timestep, but they would be saved in debug mode. The reason is core/postprocess/visualization.cc, which I found with print statements (since a debugger wouldn't work because debug mode behaves as expected).

The -fp-model=precise flag prevents isnan from always returning false, which is what happens with the Intel compiler's default floating-point model. I'm not familiar enough with the Intel compiler to know if there's a more elegant way to do this. -fhonor-nan-compares might work, given this thread in the Intel community, but I think any isinf call would then still return false.
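
For reference, a NaN test can also be written so that it does not rely on std::isnan at all. The sketch below is a hypothetical helper, not something that exists in ASPECT or deal.II. It assumes IEEE 754 64-bit doubles and inspects the bit pattern with integer operations only, so it should not be affected by the floating-point model; I have not verified that on Stampede3.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper, not part of ASPECT or deal.II.
// Assumes IEEE 754 binary64: a NaN has all 11 exponent bits set and a
// non-zero mantissa. Only integer comparisons are used.
inline bool is_nan_bitwise(const double value)
{
  std::uint64_t bits = 0;
  static_assert(sizeof(bits) == sizeof(value), "expects a 64-bit double");
  std::memcpy(&bits, &value, sizeof(bits));

  const std::uint64_t exponent_mask = 0x7FF0000000000000ULL;
  const std::uint64_t mantissa_mask = 0x000FFFFFFFFFFFFFULL;

  return (bits & exponent_mask) == exponent_mask &&
         (bits & mantissa_mask) != 0;
}

// Hypothetical stand-in for a "not yet set" marker such as the initial
// value of last_output_time.
inline double unset_time_sentinel()
{
  return std::numeric_limits<double>::quiet_NaN();
}
```

A check like `if (is_nan_bitwise(last_output_time))` would then take the place of the std::isnan call. Whether that is preferable to changing the compiler flags is a separate question.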

@tjhei
Member

tjhei commented Feb 22, 2024

Oh wow, I didn't think about this before! We use NaN in the control logic and compiling with O3 will break this. You will need to go to O2 or disable the floating point optimizations regarding NaNs.

@ryanstoner1
Contributor Author

I added -fno-finite-math-only instead of -fp-model=precise because it makes isinf and isnan evaluate correctly without being overkill.
I also tested it to make sure files are saved correctly.

Is the control logic safe with this modification?
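
One way to probe this is a small standalone check like the sketch below. Both function names are hypothetical. `__FINITE_MATH_ONLY__` is a macro predefined by GCC and Clang-based compilers; I am assuming the Intel compiler, being Clang-based, defines it the same way.

```cpp
#include <cmath>
#include <limits>

// Compile-time view: __FINITE_MATH_ONLY__ is non-zero when the compiler has
// been allowed to assume that NaN and Inf never occur.
constexpr bool finite_math_only_enabled()
{
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
  return true;
#else
  return false;
#endif
}

// Runtime view: store a NaN in a volatile variable so that its value is not
// known at compile time, then ask std::isnan about it.
inline bool nan_sentinel_is_detectable()
{
  volatile double sentinel = std::numeric_limits<double>::quiet_NaN();
  const double value = sentinel;
  return std::isnan(value);
}
```

Built with the same flags as ASPECT, `finite_math_only_enabled()` should be false and `nan_sentinel_is_detectable()` should be true if the NaN-based control logic is to behave as it does in debug mode.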

@tjhei @gassmoeller

@ryanstoner1
Contributor Author

I modified the wiki because master of deal.II was not compiling on Stampede3. Now it does. That issue is currently open, and in the future there may be workarounds other than adding an extra -D DEAL_II_COMPILER_HAS_RESTRICT_KEYWORD=no. As it is now, ASPECT 2.6-pre and 2.5 both work with deal.II 9.5.2 and master.

If there are no more additions or modifications, I'll close this issue in the coming week. Many thanks for all the help!
