Corrupted .h5 files w/ particles #5222

Open
weqoll opened this issue Nov 17, 2024 · 19 comments · Fixed by #5229

@weqoll

weqoll commented Nov 17, 2024

Hello everyone!

I'm trying to reproduce a setup with a preionized foil (tilted at 45 degrees) and a femtosecond pulse. The fs-pulse travels along the x-axis and reflects from the foil, generating some extra radiation via redistribution of electrons. Everything goes fine during the calculation phase; however, there are issues with the openPMD output and its formation into .h5 files.

Everything works fine with fields, but in the case of particles things go weird.
I hit this issue with an almost standard probe-particles setup: everything follows your workflow documentation except for the distribution. I formed an x-line with a thickness of about 1-2 x-steps, so it looks like this:
[image attachment]

MacroParticlesCounter reports that the probe initialization is OK; their number is about the linear size of the simulation area.
However, when I try to read the probes dataset from this file with the MATLAB function h5read, I get this error:

Error using h5readc
The HDF5 library encountered an error and produced the following stack trace information:


    H5HL__hdr_deserialize    bad local heap signature

Error in h5read (line 93)
[data,var_class] = h5readc(Filename,Dataset,start,count,stride);

The function h5disp() gives me this warning:

>> h5disp("./picongpu_files/run-16nov24-try.new.1/simOutput/openPMD/simData_000005.h5","/data")
Warning: Unable to read '/e' from the file. A portion of the file may be corrupt. 
> In h5info (line 125)
In matlab.io.internal.imagesci.HDF5DisplayUtils.displayHDF5 (line 34)
In h5disp (line 123) 
HDF5 simData_000005.h5 
Group '/data' 

That's rather weird, because earlier .h5 writes were fine and the electron+ion particle distributions were read without any errors.

Could you help me with this issue, or at least suggest some ways to debug it?
I've recompiled this probe setup many times, even with the standard EveryNthCell density profile, but the issue keeps recurring. Thank you!

I've installed the current PIConGPU dev branch as of November 15th; my OS version is:

Operating System: Debian GNU/Linux 12 (bookworm)
          Kernel: Linux 6.1.0-27-amd64
    Architecture: x86-64

density.param snippet:

        struct ProbeXLineParam {
            HDINLINE float_64 operator()(const floatD_64& position_SI, const float3_64& cellSize_SI) {
                const float_64 x(position_SI.x() * 1e6);
                const float_64 y(position_SI.y() * 1e6);

                constexpr float_64 wlen(3.9);
                constexpr float_64 y0(0.5*wlen);
                constexpr float_64 lthk(0.003907*wlen);
                constexpr float_64 yb(y0-lthk/2);
                constexpr float_64 ye(y0+lthk/2);
                float_64 s(0.);

                if(x > 0.125*wlen && x < 6.875*wlen){
                    if(y > yb && y < ye) {
                        s = 1;
                    }
                }
                s *= float_X(s >= 0.0);
                return s;
            }
        };

        using ProbeXLine = FreeFormulaImpl<ProbeXLineParam>;

particle.param snippet:

            /** Configuration of initial in-cell particle position
             *
             * Here, macro-particles sit at the in-cell position given by
             * inCellOffset below (the transverse cell center in this setup).
             */
            struct OnePositionParameter
            {
                /** Maximum number of macro-particles per cell during density profile evaluation.
                 *
                 * Determines the weighting of a macro particle as well as the number of
                 * macro-particles which sample the evolution of the particle distribution
                 * function in phase space.
                 *
                 * unit: none
                 */
                static constexpr uint32_t numParticlesPerCell = 1u;

                /** each x, y, z in-cell position component in range [0.0, 1.0)
                 *
                 * @details in 2D the last component is ignored
                 */
                static constexpr auto inCellOffset = float3_X(0.5, 0.5, 0.);
            };
            /** Definition of OnePosition start position functor that
             * places macro-particles at the initial in-cell position defined above.
             */
            using OnePosition = OnePositionImpl<OnePositionParameter>;
        } 

speciesDefinition.param snippet:

    /*---------------------------- probes -----------------------------------------------*/

    using ParticleFlagsProbes = MakeSeq_t<
        particlePusher< particles::pusher::Probe >,
        shape< UsedParticleShape >,
        interpolation< UsedField2Particle >
    >;

    using ProbeX = Particles<
        PMACC_CSTRING( "probe" ),
        ParticleFlagsProbes,
        MakeSeq_t<
            position< position_pic >,
            probeB,
            probeE
        >
    >;

speciesInitialization.param snippet:

       using InitPipeline = pmacc::mp_list<
            CreateDensity<densityProfiles::Tilt45FoilWithARamp, startPosition::Random, PIC_Ions>,
            Manipulate<manipulators::SetOnceIonized, PIC_Ions>,
            Derive<PIC_Ions, PIC_Electrons>,
            CreateDensity<densityProfiles::ProbeXLine, startPosition::OnePosition, ProbeX>
        >;

fileOutput.param snippet:

    using FileOutputParticles = MakeSeq_t<ProbeX,PIC_Electrons,PIC_Ions>;

@psychocoderHPC
Member

Could you please check with h5dump --contents=1 FILENAME whether the HDF5 group structure is readable and the species are named as expected? With h5dump you can also try to read the file in the terminal to check whether it is corrupted.
In many cases this happens when the file was not closed correctly during writing.
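
For example, run from the simOutput directory (using the file name from the report above):

h5dump --contents=1 ./openPMD/simData_000005.h5    # list the group/dataset structure only
h5dump ./openPMD/simData_000005.h5                 # attempt a full read of the file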

@weqoll
Author

weqoll commented Nov 18, 2024

With h5dump --contents ./openPMD/simData_000005.h5 I get this error:

h5dump error: internal error (file ../../../../../tools/src/h5dump/h5dump.c:line 1430)

With h5dump --enable-error-stack ./openPMD/simData_000005.h5 I get this stack:


HDF5-DIAG: Error detected in HDF5 (1.10.8) thread 1:
  #000: ../../../src/H5O.c line 510 in H5Oget_info_by_name2(): can't get info for object: 'data/5/fields'
    major: Object header
    minor: Can't get value
  #001: ../../../src/H5Gloc.c line 702 in H5G_loc_info(): can't find object
    major: Symbol table
    minor: Object not found
  #002: ../../../src/H5Gtraverse.c line 832 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #003: ../../../src/H5Gtraverse.c line 608 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #004: ../../../src/H5Gloc.c line 660 in H5G__loc_info_cb(): can't get object info
    major: Symbol table
    minor: Can't get value
  #005: ../../../src/H5Oint.c line 2184 in H5O_get_info(): unable to determine object class
    major: Object header
    minor: Can't get value
  #006: ../../../src/H5Oint.c line 1765 in H5O__obj_class_real(): unable to determine object type
    major: Object header
    minor: Unable to initialize object
HDF5-DIAG: Error detected in HDF5 (1.10.8) thread 1:
  #000: ../../../src/H5L.c line 1292 in H5Lvisit_by_name(): link visitation failed
    major: Links
    minor: Iteration failed
  #001: ../../../src/H5Gint.c line 1150 in H5G_visit(): can't visit links
    major: Symbol table
    minor: Iteration failed
  #002: ../../../src/H5Gobj.c line 674 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #003: ../../../src/H5Gstab.c line 537 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #004: ../../../src/H5B.c line 1195 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #005: ../../../src/H5B.c line 1154 in H5B__iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #006: ../../../src/H5Gnode.c line 977 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #007: ../../../src/H5Gobj.c line 674 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #008: ../../../src/H5Gstab.c line 537 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #009: ../../../src/H5B.c line 1195 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #010: ../../../src/H5B.c line 1154 in H5B__iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #011: ../../../src/H5Gnode.c line 977 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #012: ../../../src/H5Gobj.c line 674 in H5G__obj_iterate(): can't iterate over symbol table
    major: Symbol table
    minor: Iteration failed
  #013: ../../../src/H5Gstab.c line 537 in H5G__stab_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #014: ../../../src/H5B.c line 1195 in H5B_iterate(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #015: ../../../src/H5B.c line 1154 in H5B__iterate_helper(): B-tree iteration failed
    major: B-Tree node
    minor: Iteration failed
  #016: ../../../src/H5Gnode.c line 977 in H5G__node_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #017: ../../../src/H5O.c line 510 in H5Oget_info_by_name2(): can't get info for object: 'data/5/fields'
    major: Object header
    minor: Can't get value
  #018: ../../../src/H5Gloc.c line 702 in H5G_loc_info(): can't find object
    major: Symbol table
    minor: Object not found
  #019: ../../../src/H5Gtraverse.c line 832 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #020: ../../../src/H5Gtraverse.c line 608 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #021: ../../../src/H5Gloc.c line 660 in H5G__loc_info_cb(): can't get object info
    major: Symbol table
    minor: Can't get value
  #022: ../../../src/H5Oint.c line 2184 in H5O_get_info(): unable to determine object class
    major: Object header
    minor: Can't get value
  #023: ../../../src/H5Oint.c line 1765 in H5O__obj_class_real(): unable to determine object type
    major: Object header
    minor: Unable to initialize object
h5dump error: internal error (file ../../../../../tools/src/h5dump/h5dump.c:line 1430)
H5tools-DIAG: Error detected in HDF5:tools (1.10.8) thread 1:
  #000: ../../../../tools/lib/h5tools_utils.c line 618 in init_objs(): finding shared objects failed
    major: Failure in tools library
    minor: error in function
  #001: ../../../../tools/lib/h5trav.c line 1040 in h5trav_visit(): traverse failed
    major: Failure in tools library
    minor: error in function
  #002: ../../../../tools/lib/h5trav.c line 286 in traverse(): H5Lvisit_by_name failed
    major: Failure in tools library
    minor: error in function

@weqoll
Author

weqoll commented Nov 18, 2024

Here is the output for 5 steps of the simulation with OPENPMD_VERBOSE=1:
simulationOutput.txt
Also, here is a snippet from the .cfg file:

TBG_openPMD_sp="--openPMD.period 0:60000:5 --openPMD.file simData --openPMD.ext h5 --openPMD.source species_all --openPMD.json='{\"hdf5\": {\"dataset\": {\"chunks\":\"auto\"}}}'"

Without chunking I can't write my openPMD output with species at all.

@steindev
Member

@psychocoderHPC @franzpoeschel From the output it seems that h5dump is able to read at least parts of the file. As @weqoll reported, this issue persists across multiple attempts to rerun the simulation and rewrite the file. So I guess something goes wrong in how the file is written. Is it related to using the chunking option?

@weqoll I know this is a tedious workaround, but it might help to get at least some particle information: can you dump only one species per file by specifying --openPMD.source <SPECIES_NAME> multiple times instead of --openPMD.source species_all once, and then forgo chunking? You might not even want to write all particles, since you specified the probe species. A sketch of what this could look like is below.
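
A hypothetical sketch of such a .cfg setup (the variable names and per-species file names are mine for illustration; the species names "probe" and "e" are taken from the snippets and warnings above; please check the exact repetition syntax against the plugin documentation):

TBG_openPMD_probe="--openPMD.period 0:60000:5 --openPMD.file probeData --openPMD.ext h5 --openPMD.source probe"
TBG_openPMD_e="--openPMD.period 0:60000:5 --openPMD.file eData --openPMD.ext h5 --openPMD.source e"
TBG_plugins="!TBG_openPMD_probe !TBG_openPMD_e"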

@psychocoderHPC
Member

@weqoll I saw in the output of PIConGPU that you are using openPMD version 0.17.0-dev. Can you try to use the latest release instead? https://github.com/openPMD/openPMD-api/releases

I am not sure whether there are issues in the dev branch of openPMD or whether you maybe pulled a non-working version of the dev branch.

@franzpoeschel
Contributor

@weqoll I saw in the output of PIConGPU that you are using openPMD version 0.17.0-dev. Can you try to use the latest release instead? https://github.com/openPMD/openPMD-api/releases

0.17.0-dev currently only contains bugfixes compared to the release, so this should be fine. I will look into it in more detail later today.

@franzpoeschel
Contributor

With HDF5 there are unfortunately some combinations of parameters that can lead to trouble like this. We have seen similar issues in the past and have not been able to find a clear cause for them.

There are multiple things to try:

  1. Deactivate chunking in HDF5 with export OPENPMD_HDF5_CHUNKS=OFF. We already deactivate it by default in the regular openPMD plugin due to similar issues, but apparently don't do this yet in the MacroParticleCounter plugin.
  2. Use another implementation for MPI I/O; a typical option is OMPI_MCA_io=^ompio. (This normally activates the fallback ROMIO; note that ROMIO has had trouble with chunking in the past, so better deactivate chunking here once more. A combined sketch of items 1 and 2 follows after this list.)
  3. (EDIT: I just noticed that you use the MacroParticleCounter plugin, which does not expose JSON options yet, so this will not work.) Since you're using openPMD-0.17.0-dev, you might try the HDF5 subfiling VFD, which openPMD supports since version 0.16.0. You can find information on how to activate it here and here. You might need to reconfigure HDF5 with -DHDF5_ENABLE_SUBFILING_VFD=ON and then recompile.
  4. Check the environment variables here for other workarounds and parameters.
  5. ... maybe try the ADIOS2 backend of openPMD instead if nothing helps (e.g. via --e_macroParticlesPerSuperCell.ext bp5).
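
For items 1 and 2, a minimal launch-environment sketch (both variables are documented by openPMD-api and Open MPI respectively; the tbg line mirrors the one used elsewhere in this thread):

export OPENPMD_HDF5_CHUNKS=OFF   # item 1: disable HDF5 chunking globally
export OMPI_MCA_io=^ompio        # item 2: disable OMPIO so Open MPI falls back to ROMIO
tbg -s bash -c ./etc/picongpu/1.cfg -t ./etc/picongpu/bash/mpiexec.tpl ./run-directory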

@weqoll
Author

weqoll commented Dec 3, 2024

Thank you for the help! I'll try all of these later today and report the results afterwards.

@weqoll
Author

weqoll commented Dec 9, 2024

Well, I tried what you've written about chunking and HDF5, however it didn't help.

I've now also tried reinstalling HDF5 and OpenMPI from source instead of the pre-built packages, because of their versions:
the pre-built HDF5 was 1.10.8, while I built 1.14.5;
the pre-built OpenMPI was 3.0, while I built 5.0.6.

I suspect the root of this issue could be linked to a version mismatch, because the openPMD documentation states:

Jul 23th, 2021 ([HDFFV-11260](https://jira.hdfgroup.org/browse/HDFFV-11260)): Collective HDF5 metadata reads (OPENPMD_HDF5_COLLECTIVE_METADATA=ON) broke in 1.10.5, falling back to individual metadata operations. HDF5 releases 1.10.4 and earlier are not affected; versions 1.10.9+, 1.12.2+ and 1.13.1+ fixed the issue.
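
(For completeness: with the affected pre-built HDF5 1.10.8, that quoted bug could be ruled out by disabling collective metadata reads via the variable named in the quote before relaunching:

export OPENPMD_HDF5_COLLECTIVE_METADATA=OFF

)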

After reinstalling MPI and HDF5 I successfully recompiled the whole of PIConGPU.
It builds my project without any warnings or errors; here is the output of pic-build -b "cuda:75":

labuser@thunderstorm:~/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39$ pic-build -b "cuda:75"
build directory: .build
cmake command: cmake   -DCMAKE_INSTALL_PREFIX=/home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39 -DPIC_EXTENSION_PATH=/home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39   -Dalpaka_ACC_GPU_CUDA_ENABLE=ON -Dalpaka_ACC_GPU_CUDA_ONLY_MODE=ON -Dalpaka_CUDA_EXPT_EXTENDED_LAMBDA=ON -DCMAKE_CUDA_ARCHITECTURES=75 /home/labuser/src/picongpu/include/picongpu
-- The CXX compiler identification is GNU 11.3.0
-- The C compiler identification is GNU 11.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /usr/bin/nvcc
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.74.0") found components: atomic
-- C++20 math constants not found. Falling back to non-standard constants.
-- Found CUDAToolkit: /usr/include (found version "11.8.89")
-- nvcc is used as CUDA compiler
-- alpaka_ACC_GPU_CUDA_ONLY_MODE
-- alpaka_ACC_GPU_CUDA_ENABLED

List of compiler flags added by alpaka
host compiler:
    $<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:GNU>,$<COMPILE_LANGUAGE:CXX>>:SHELL:-Og>;$<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:GNU>,$<COMPILE_LANGUAGE:CUDA>>:SHELL:-Xcompiler -Og>;$<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:Clang,AppleClang,IntelLLVM>>:SHELL:-O0>;$<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:MSVC>>:SHELL:/Od>
device compiler:
    $<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:GNU>,$<COMPILE_LANGUAGE:CXX>>:SHELL:-Og>;$<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:GNU>,$<COMPILE_LANGUAGE:CUDA>>:SHELL:-Xcompiler -Og>;$<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:Clang,AppleClang,IntelLLVM>>:SHELL:-O0>;$<$<AND:$<CONFIG:Debug>,$<CXX_COMPILER_ID:MSVC>>:SHELL:/Od>;$<$<COMPILE_LANGUAGE:CUDA>:SHELL:--extended-lambda>;$<$<COMPILE_LANGUAGE:CUDA>:SHELL:--expt-relaxed-constexpr>;$<$<AND:$<CONFIG:Debug>,$<COMPILE_LANGUAGE:CUDA>>:SHELL:-G>;$<$<AND:$<CONFIG:RelWithDebInfo>,$<COMPILE_LANGUAGE:CUDA>>:SHELL:-g -lineinfo>;$<$<COMPILE_LANGUAGE:CUDA>:SHELL:$<IF:$<VERSION_LESS:$<CUDA_COMPILER_VERSION>,11.2.0>,-Xcudafe=--display_error_number,--display-error-number>>

-- Looking for std::filesystem::path::preferred_separator
-- Looking for std::filesystem::path::preferred_separator - found
-- Found MPI_C: /usr/local/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.74") found components: program_options
-- Boost: deactivate std::auto_ptr
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.65.1")
-- Using mallocMC from thirdParty/ directory
-- Found mallocMC: /home/labuser/src/picongpu/thirdParty/mallocMC/src (found suitable version "2.6.0", minimum required is "2.6.0")
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.74.0") found components: program_options
-- Found NVML: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
-- nvml found
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.66.0") found components: program_options
-- Found HDF5: /usr/local/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.a;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.a;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.14.5")
-- Found ADIOS2: /home/labuser/lib/ADIOS2/lib/cmake/adios2/adios2-config.cmake (found version "2.10.0.513") found components: CXX MPI
-- Found openPMD: /home/labuser/lib/openPMD-api/lib/cmake/openPMD
-- Using the single-header code from /home/labuser/src/picongpu/thirdParty/nlohmann_json/single_include/
-- Implicit conversions are disabled
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.13")
-- Found PNG: /usr/lib/x86_64-linux-gnu/libpng.so (found version "1.6.39")
-- Found PNGwriter: /home/labuser/lib/pngwriter/lib/cmake/PNGwriter
-- Could NOT find PkgConfig (missing: PKG_CONFIG_EXECUTABLE)
-- Could NOT find PkgConfig for fftw3- set Pkgconfig_DIR or check your CMAKE_PREFIX_PATH

Optional Dependencies:
  openPMD: ON
  PNGwriter: ON
  ISAAC: OFF
  FFTW3: OFF

-- Configuring done
-- Generating done
-- Build files have been written to: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/.build
call build: cmake --build . --target install --parallel
[  1%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/communication/CommunicatorMPI.cpp.o
[  2%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/Manager.cpp.o
[  5%] Building CXX object CMakeFiles/picongpu-hostonly.dir/ArgsParser.cpp.o
[  6%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/eventSystem.cpp.o
[  6%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/common/stringHelpers.cpp.o
[  9%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/events/EventNotify.cpp.o
[  9%] Building CUDA object build_cuda_memtest/CMakeFiles/cuda_memtest.dir/tests.cpp.o
[ 11%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/misc/removeSpaces.cpp.o
[ 13%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/queues/Queue.cpp.o
[ 13%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/dataManagement/DataConnector.cpp.o
[ 14%] Building CXX object build_mpiInfo/CMakeFiles/mpiInfo.dir/mpiInfo.cpp.o
[ 15%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/tasks/DeviceTask.cpp.o
[ 18%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/misc/ComponentNames.cpp.o
[ 18%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/transactions/Transaction.cpp.o
[ 19%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/waitForAllTasks.cpp.o
[ 22%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/openPMD/openPMDWriter.cpp.o
[ 22%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/openPMD/Json.cpp.o
[ 25%] Building CXX object CMakeFiles/picongpu-hostonly.dir/initialization/ParserGridDistribution.cpp.o
[ 25%] Building CUDA object build_cuda_memtest/CMakeFiles/cuda_memtest.dir/misc.cpp.o
[ 26%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/events/ComputeEvent.cpp.o
[ 27%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/mappings/simulation/Filesystem.cpp.o
[ 30%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/events/EventTask.cpp.o
[ 30%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/events/ComputeEventHandle.cpp.o
[ 31%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/common/MPIHelpers.cpp.o
[ 32%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/tasks/TaskKernel.cpp.o
[ 34%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/simulationControl/signal.cpp.o
[ 38%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/simulationControl/SimulationHelper.cpp.o
[ 39%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/misc/splitString.cpp.o
[ 39%] Building CUDA object build_cuda_memtest/CMakeFiles/cuda_memtest.dir/cuda_memtest.cpp.o
[ 39%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/misc/splitString.cpp.o
[ 40%] Building CXX object CMakeFiles/picongpu-hostonly.dir/plugins/openPMD/toml.cpp.o
[ 42%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/eventSystem/transactions/TransactionManager.cpp.o
[ 43%] Building CXX object CMakeFiles/picongpu-hostonly.dir/random/seed/Seed.cpp.o
[ 44%] Building CUDA object CMakeFiles/pmacc.dir/home/labuser/src/picongpu/include/pmacc/pluginSystem/PluginConnector.cpp.o
[ 46%] Linking CXX executable mpiInfo
[ 46%] Built target mpiInfo
[ 47%] Linking CUDA executable cuda_memtest
[ 47%] Built target cuda_memtest
[ 48%] Linking CXX static library libpicongpu-hostonly.a
[ 48%] Built target picongpu-hostonly
[ 50%] Linking CXX static library libpmacc.a
[ 50%] Built target pmacc
[ 51%] Building CUDA object CMakeFiles/picongpu.dir/fields/EMFieldBase.x.cpp.o
[ 52%] Building CUDA object CMakeFiles/picongpu.dir/fields/FieldE.x.cpp.o
[ 53%] Building CUDA object CMakeFiles/picongpu.dir/fields/FieldJ.x.cpp.o
[ 55%] Building CUDA object CMakeFiles/picongpu.dir/fields/FieldTmp.x.cpp.o
[ 56%] Building CUDA object CMakeFiles/picongpu.dir/fields/FieldB.x.cpp.o
[ 59%] Building CUDA object CMakeFiles/picongpu.dir/plugins/BinEnergyParticles.x.cpp.o
[ 59%] Building CUDA object CMakeFiles/picongpu.dir/fields/absorber/Absorber.x.cpp.o
[ 60%] Building CUDA object CMakeFiles/picongpu.dir/fields/absorber/AbsorberImpl.x.cpp.o
[ 61%] Building CUDA object CMakeFiles/picongpu.dir/plugins/PngPlugin.x.cpp.o
[ 63%] Building CUDA object CMakeFiles/picongpu.dir/plugins/EnergyParticles.x.cpp.o
[ 64%] Building CUDA object CMakeFiles/picongpu.dir/plugins/EnergyFields.x.cpp.o
[ 65%] Building CUDA object CMakeFiles/picongpu.dir/plugins/Emittance.x.cpp.o
[ 68%] Building CUDA object CMakeFiles/picongpu.dir/plugins/makroParticleCounter/PerSuperCell.x.cpp.o
[ 69%] Building CUDA object CMakeFiles/picongpu.dir/plugins/particleCalorimeter/ParticleCalorimeter.x.cpp.o
[ 69%] Building CUDA object CMakeFiles/picongpu.dir/particles/debyeLength/Check.x.cpp.o
[ 71%] Building CUDA object CMakeFiles/picongpu.dir/plugins/openPMD/openPMDWriter.x.cpp.o
[ 72%] Building CUDA object CMakeFiles/picongpu.dir/plugins/Checkpoint.x.cpp.o
[ 73%] Building CUDA object CMakeFiles/picongpu.dir/plugins/ChargeConservation.x.cpp.o
[ 75%] Building CUDA object CMakeFiles/picongpu.dir/plugins/CountParticles.x.cpp.o
[ 77%] Building CUDA object CMakeFiles/picongpu.dir/plugins/PhaseSpace/PhaseSpace.x.cpp.o
[ 78%] Building CUDA object CMakeFiles/picongpu.dir/plugins/radiation/Radiation.x.cpp.o
[ 80%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/ParticleIonization.x.cpp.o
[ 80%] Building CUDA object CMakeFiles/picongpu.dir/main.x.cpp.o
[ 81%] Building CUDA object CMakeFiles/picongpu.dir/plugins/IsaacPlugin.x.cpp.o
[ 82%] Building CUDA object CMakeFiles/picongpu.dir/plugins/SumCurrents.x.cpp.o
[ 84%] Building CUDA object CMakeFiles/picongpu.dir/plugins/binning/BinningDispatcher.x.cpp.o
[ 85%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/ParticleBoundaries.x.cpp.o
[ 89%] Building CUDA object CMakeFiles/picongpu.dir/plugins/shadowgraphy/Shadowgraphy.x.cpp.o
[ 89%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/CurrentDeposition.x.cpp.o
[ 89%] Building CUDA object CMakeFiles/picongpu.dir/versionFormat.x.cpp.o
[ 92%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/IterationStart.x.cpp.o
[ 92%] Building CUDA object CMakeFiles/picongpu.dir/plugins/transitionRadiation/TransitionRadiation.x.cpp.o
[ 93%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/SynchrotronRadiation.x.cpp.o
[ 94%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/ParticleInit.x.cpp.o
[ 97%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/Collision.x.cpp.o
[ 97%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/AtomicPhysics.x.cpp.o
[ 98%] Building CUDA object CMakeFiles/picongpu.dir/simulation/stage/ParticlePush.x.cpp.o
[100%] Linking CXX executable picongpu
[100%] Built target picongpu
Install the project...
-- Install configuration: "Release"
-- Installing: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/cuda_memtest
-- Set runtime path of "/home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/cuda_memtest" to "$ORIGIN:/usr/lib/x86_64-linux-gnu/nvidia/current"
-- Installing: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/mpiInfo
-- Set runtime path of "/home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/mpiInfo" to "$ORIGIN:/usr/local/lib"
-- Installing: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/picongpu
-- Set runtime path of "/home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/picongpu" to "$ORIGIN:/home/labuser/lib/openPMD-api/lib:/usr/local/lib:/home/labuser/lib/ADIOS2/lib"
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/picongpu-completion.bash
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/tbg
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/pic-edit
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/pic-create
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/egetopt
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/pic-compile
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/pic-configure
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/pic-build
-- Up-to-date: /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/bin/cuda_memtest.sh
/home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39

However, I can't do anything with this program. Launching tbg -s bash -c ./etc/picongpu/1.cfg -t ./etc/picongpu/bash/mpiexec.tpl ../run-directory results in this output:

Running program...
--------------------------------------------------------------------------
Parsing error in variable file:

  FILE:  /home/labuser/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39/sm/tbg/openib.conf
  LINE:  btl_openib_rdma_pipeline_send_length = 100000000

Please correct and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No executable was specified on the prterun command line.

Aborting.
--------------------------------------------------------------------------

Could you help me figure out what I am missing? Thanks!

@weqoll
Author

weqoll commented Dec 9, 2024

I figured out that OpenMPI v5.0 does not support openib, so that could be the issue: open-mpi/ompi#8831
Do you have any workarounds for such a case?

@psychocoderHPC
Member

You can simply empty https://github.com/ComputationalRadiationPhysics/picongpu/blob/dev/etc/picongpu/openib.conf in your cloned PIConGPU code; this is where we set the openib config.
I will investigate tomorrow whether we can remove the file completely; it was required in the past but is most likely not needed anymore. A one-line way to empty it is shown below.
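
For example, a shell one-liner to empty it in place (assuming the clone lives at ~/src/picongpu, as in the build log above; if your project directory already contains a copy under sm/tbg/, as the error message suggests, empty that one too):

: > ~/src/picongpu/etc/picongpu/openib.conf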

psychocoderHPC added a commit to psychocoderHPC/picongpu that referenced this issue Dec 9, 2024
fix ComputationalRadiationPhysics#5222

The ib config is not compatible with all OpenMPI versions.
This file was added by me over 10 years ago as a workaround for InfiniBand issues.
If for any reason an ib configuration is required, the variables should be
set in the corresponding tpl file.
@psychocoderHPC
Member

#5229 will remove the config file completely.

@weqoll
Author

weqoll commented Dec 9, 2024

Sorry that my issue is taking so long to resolve.
I hope I'm on the verge of a happy ending; however, right now it doesn't want to work with OpenMPI and the subfiling VFD:

[HDF5 Backend] The requested subfiling VFD of HDF5 requires the use of threaded MPI.
Warning: parts of the backend configuration for HDF5 remain unused:
{"vfd":{"ioc_selection":"every_nth_rank","stripe_count":-1,"stripe_size":33554432}}
(the three lines above are repeated six times, once per MPI rank)
  0 % =        0 | time elapsed:             8sec 766msec | avg time per step:   0msec
 20 % =        1 | time elapsed:             8sec 777msec | avg time per step:  11msec
 40 % =        2 | time elapsed:             8sec 784msec | avg time per step:   6msec
 60 % =        3 | time elapsed:             8sec 791msec | avg time per step:   7msec
 80 % =        4 | time elapsed:             8sec 798msec | avg time per step:   7msec
100 % =        5 | time elapsed:             8sec 805msec | avg time per step:   7msec
calculation  simulation time: 11sec 540msec = 11.540 sec

I've compiled OpenMPI 5.0, which supports multithreading out of the box. From ompi_info:

$ ompi_info | grep Thread
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)

HDF5 is also compiled with --enable-parallel and --enable-subfiling-vfd.
Are there any details that I'm missing?

@franzpoeschel
Contributor

Sorry, I forgot to mention that you need to activate threaded MPI for the subfiling VFD. You can do this in PIConGPU by running

export PIC_USE_THREADED_MPI=MPI_THREAD_MULTIPLE

before launching PIConGPU.
Just making sure, is this the macro particle counter plugin or the generic openPMD plugin where you are trying to use the subfiling VFD? Those are separate outputs, and I think only the openPMD plugin exposes the configuration necessary for using the subfiling VFD.
I'll need to check tomorrow why ioc_selection, stripe_count and stripe_size are not read (though you don't need to set those, default values will be picked otherwise).
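
For reference, a sketch of how the subfiling VFD could be requested through the generic openPMD plugin's JSON option (the key layout is inferred from the warnings above and the openPMD-api docs; the tuning keys ioc_selection, stripe_count and stripe_size are optional and omitted here):

TBG_openPMD_sp="--openPMD.period 0:60000:5 --openPMD.file simData --openPMD.ext h5 --openPMD.json='{\"hdf5\": {\"vfd\": {\"type\": \"subfiling\"}}}'"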

@weqoll
Author

weqoll commented Dec 9, 2024

labuser@thunderstorm:~/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39$ export PIC_USE_THREADED_MPI=MPI_THREAD_MULTIPLE
labuser@thunderstorm:~/USERS/astashkin/picongpu_files/input-15nov24-foil.wlen39$ tbg -s bash -c ./etc/picongpu/1.cfg -t ./etc/picongpu/bash/mpiexec.tpl ./run-7
Running program...
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 23367 on node thunderstorm exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Seems like something is missing :(

@franzpoeschel
Contributor

Huh.
OK, let me try this tomorrow to see whether there are general problems PIConGPU has with HDF5 subfiling (it's a relatively new feature in openPMD).

@weqoll
Author

weqoll commented Dec 9, 2024

Just making sure, is this the macro particle counter plugin or the generic openPMD plugin where you are trying to use the subfiling VFD?

It's the generic openPMD plugin; I've dropped the macro particle counter for the sake of the experiment.

Also, there is some strange behaviour if I set chunking OFF without subfiling:
the simulation seems to hang if chunking is off, while with 'auto' everything is good (except for the file corruption).

I'll be waiting for your answers, thank you for the help!

@franzpoeschel
Contributor

The segmentation fault was my error; it is fixed with #5230.

You can use something like the command below to get normal HDF5 files from the subfiling output:

for i in ./openPMD/simData_000*.h5.*.config; do h5fuse -f "$i"; done

Also, there is some strange behaviour if I set chunking OFF without subfiling:
the simulation seems to hang if chunking is off, while with 'auto' everything is good (except for the file corruption).

Yes, this is the kind of error that I mentioned a few days ago. We also saw this sporadically, but were never really able to find out where it comes from. Generally, those issues are fixed by switching either between OMPIO and ROMIO or between chunking and no chunking, but there seems to be no setting that always works.
Since the subfiling VFD fundamentally changes the I/O dataflow, it's a new thing to try; maybe it works better.

@franzpoeschel
Contributor

@ikbuibui Can you please reopen the issue? GitHub closed it automatically.

ikbuibui reopened this Dec 10, 2024