Skip to content

More user-friendly errors and automatic restarts in case of engines crashing due to OOM #695

@sahil1105

Description

@sahil1105

The errors we report in case of OOM and Segmentation-Fault are now much better, but I was wondering is there a way to make them more "user-friendly"?

  1. Currently, at least for the MPI case, we report the mpiexec output, which is great, but could there be a way to report a cleaner error in addition to this, that could clearly identify this as a OOM error (or a seg-fault if possible)?
  2. Is there something that packages (like Bodo) could do to make this experience better/easier?
  3. What's the best way to automate restart of engines in this case? Ideally, if enabled, in cases where the engines crash, if we could clean up the processes, display a message (e.g. "engines crashed due to OOM, restarting engines..."), and then restart the engines, that would be useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions