This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Enable auto checkpointing on SIGTERM #546

Draft. Wants to merge 1 commit into base: master.


ferdonline (Contributor)

Motivation
As a follow-up to #252, we want CoreNeuron to be able to create checkpoints right before an allocation expires.
Since most job schedulers send SIGTERM before SIGKILL, we implement a handler for that signal. It may, however, be necessary to tune how long before the end of the allocation the signal is sent, since long simulations can take some time to write everything out.
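One common way to structure such a handler is to have it only record that a stop was requested, and let the simulation loop write the checkpoint at the next safe point. The sketch below assumes a simple time-stepping loop; g_stop_requested, on_sigterm, install_sigterm_handler and run_simulation are hypothetical names, not CoreNeuron functions, and the checkpoint write is only a placeholder.

```cpp
#include <csignal>
#include <cstdio>

// Hypothetical global flag. Writing a volatile std::sig_atomic_t is one of
// the few things a signal handler may safely do.
volatile std::sig_atomic_t g_stop_requested = 0;

extern "C" void on_sigterm(int /*signum*/) {
    g_stop_requested = 1;  // only record the request; no I/O in here
}

// Hypothetical helper: installs the handler and clears any stale request.
void install_sigterm_handler() {
    g_stop_requested = 0;
    std::signal(SIGTERM, on_sigterm);
}

// Hypothetical time-stepping loop. raise_at_step simulates the scheduler
// sending SIGTERM mid-run (pass -1 to never raise it).
// Returns the step at which the loop stopped.
int run_simulation(int n_steps, int raise_at_step) {
    for (int step = 0; step < n_steps; ++step) {
        if (step == raise_at_step) {
            std::raise(SIGTERM);  // the handler runs before raise() returns
        }
        if (g_stop_requested) {
            // Outside the handler it is safe to log and write the checkpoint.
            std::fprintf(stderr, "SIGTERM caught at step %d, dumping checkpoint\n", step);
            return step;
        }
        // ... advance the simulation by one time step ...
    }
    return n_steps;
}
```

Keeping the logging and checkpoint I/O out of the handler matters because iostreams and printf are not async-signal-safe; the handler only flips a flag that the loop polls once per step.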

Implementation
Checkpoints are written to a folder _corenrn_ckpt inside the output root, but only if a minimum amount of time has elapsed.
On startup, if no --restore is provided, CoreNeuron checks whether this directory exists.
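A minimal sketch of the startup check and the minimum-time gate, assuming C++17 std::filesystem. The helper names and the min_elapsed threshold are hypothetical, and the sketch assumes (this is not stated above) that an existing _corenrn_ckpt directory is then used as the restore source.

```cpp
#include <filesystem>
#include <optional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: the auto-checkpoint folder lives inside the output root.
fs::path auto_checkpoint_dir(const fs::path& output_root) {
    return output_root / "_corenrn_ckpt";
}

// Hypothetical gate: only checkpoint once at least min_elapsed has passed.
bool should_write_checkpoint(double elapsed, double min_elapsed) {
    return elapsed >= min_elapsed;
}

// On startup an explicit --restore argument wins; otherwise fall back to the
// auto-checkpoint folder if a previous run left one behind.
std::optional<fs::path> pick_restore_dir(const fs::path& output_root,
                                         const std::string& restore_arg) {
    if (!restore_arg.empty()) {
        return fs::path(restore_arg);
    }
    const fs::path ckpt = auto_checkpoint_dir(output_root);
    if (fs::is_directory(ckpt)) {
        return ckpt;
    }
    return std::nullopt;  // fresh start
}
```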

ferdonline requested a review from pramodk May 10, 2021 18:44
pramodk requested a review from alkino May 10, 2021 19:36
ferdonline changed the title from "Enable auto checkplointing on SIGTERM" to "Enable auto checkpointing on SIGTERM" May 11, 2021
ferdonline (Contributor, Author)

I don't really understand why the CI fails on GPU. @pramodk, any ideas?

Comment on lines +425 to +427
std::cerr << "SIGTERM caught! Halting sim and dumping checkpoint" << std::endl;
coreneuron::stoprun = true;
};
Member


What is often done is to enable a "double SIGTERM": a second SIGTERM really cuts things short.

How long can writing this checkpoint take?
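The "double SIGTERM" idea could be sketched as below: the first SIGTERM requests a graceful stop so the loop can checkpoint, and a second one exits immediately. The names are hypothetical, and std::signal is used for brevity; on POSIX systems sigaction gives well-defined handler semantics.

```cpp
#include <csignal>
#include <cstdlib>

// Hypothetical marker: nonzero once the first SIGTERM has arrived.
volatile std::sig_atomic_t g_sigterm_count = 0;

extern "C" void on_sigterm_twice(int signum) {
    if (g_sigterm_count > 0) {
        // Second SIGTERM: abandon the checkpoint. std::_Exit skips atexit
        // handlers and stream flushing, and may be called from a handler.
        std::_Exit(128 + signum);  // conventional "killed by signal" status
    }
    g_sigterm_count = 1;  // first SIGTERM: ask the main loop to stop
}

// Hypothetical helper: installs the handler and resets the marker.
void install_double_sigterm_handler() {
    g_sigterm_count = 0;
    std::signal(SIGTERM, on_sigterm_twice);
}

// The simulation loop would poll this between time steps.
bool stop_requested() {
    return g_sigterm_count > 0;
}
```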

pramodk (Collaborator)

pramodk commented May 17, 2021

I don't really understand why the CI could fail in GPU. @pramodk Ideas?

Sorry for the delay. This issue is being investigated; you can ignore this error.

olupton closed this Jun 30, 2021
olupton reopened this Jun 30, 2021
olupton (Contributor)

olupton commented Jun 30, 2021

Can you rebase this @ferdonline?

4 participants