DAOS-17427 control: Restart excluded rank after suicide #16279
Open: tanabarr wants to merge 55 commits into master from tanabarr/control-engine-suicide-restart
Commits (55, all by tanabarr):
b94c921 DAOS-17427 control: Restart evicted rank after suicide
af7f056 implement suicide event handlers
4ce711f add unit testing and documentation
550ef12 fix docs and unit tests
1f61b98 revise unit test for suicide handler
8a79efb fixup tests
f643187 Merge branch 'master' into tanabarr/control-engine-suicide-restart
e9c6896 rename suicide to self terminated
d903017 rename registerFollowerSubscriptions to registerSubscriptions
2b8b16f add flag to disable automatic engine restart
a593f54 fix intermittent test fails with delay before txt comp
3656620 Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
394ab67 Merge branch 'tanabarr/control-engine-suicide-restart' of github.com:…
f445e51 implement basic rate limiting
2cea5aa improve naming consistency and fix config unit tests
f6ae57e add rate-limiting unit test
180cc0c documentation updates
468507f Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
f1cdf0a Q a single restart request if received within timeout period
2c30c0d address review comments from mjmac and kjacque
7e95333 use channel-based restart manager for rate-limiting
6252edf fix handleEngineSelfTerminated unit tests
de033c1 add unit tests for engine restart manager
d58bce2 Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
09cc634 remove deprecated code
47883eb Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
d89a387 DRY-up unit tests for engine restart manager
aff68c3 Merge branch 'master' into tanabarr/control-engine-suicide-restart
6b4528d DAOS-17427 test: Auto-restart after self-terminate tests (#18006)
bdfde05 fix server package unit test helpers
417b33b Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
c76507f Revert "fix server package unit test helpers"
8d5a1da fix server package unit test helpers
4a2938f addressed review comments from kjacque pt1
049509c allow restart manager to close and open again
bd15522 Revert "allow restart manager to close and open again"
06f3f6e address some review comments from kjacque
8f82660 Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
5feadfe comment one start/stop per process lifetime
969c837 address more review comments from kjacque
835156b pylint fixes
95c061a using self.register_cleanup (#18240)
9ab86a9 Apply suggestion from @daltonbohning
1913fd9 more ftest related review comment updates
9dc146c f-string updates and remove step comments in log_step calls use Comma…
48ae917 Merge remote-tracking branch 'origin/master' into tanabarr/control-en…
28e4905 Update src/tests/ftest/control/engine_auto_restart.yaml
69c7327 Update src/tests/ftest/control/engine_auto_restart_disabled.yaml
f1c54c9 Update src/tests/ftest/control/engine_auto_restart_disabled.yaml
335590e Update src/tests/ftest/control/engine_auto_restart_advanced.yaml
031e5af Update src/tests/ftest/control/engine_auto_restart_disabled.py
d6b7993 Update src/tests/ftest/control/engine_auto_restart.yaml
de28a9f Update src/tests/ftest/control/engine_auto_restart_advanced.yaml
42441e3 Update src/tests/ftest/control/engine_auto_restart_disabled.py
8c14077 fail if delay > 200% of expected
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -99,7 +99,9 @@ type Server struct { |
| 99 | 99 | Path string `yaml:"-"` // path to config file |
| 100 | 100 | |
| 101 | 101 | // Behavior flags |
| 102 | | - AutoFormat bool `yaml:"-"` |
| | 102 | + AutoFormat bool `yaml:"-"` |
| | 103 | + DisableEngineAutoRestart bool `yaml:"disable_engine_auto_restart"` |
| | 104 | + EngineAutoRestartMinDelay int `yaml:"engine_auto_restart_min_delay,omitempty"` |
| 103 | 105 | |
| 104 | 106 | deprecatedParams `yaml:",inline"` |
| 105 | 107 | } |

Review comments on the `EngineAutoRestartMinDelay` line:

Contributor:
There doesn't appear to be any validation of this parameter, e.g.:

    if cfg.EngineAutoRestartMinDelay < 0 {
        return errors.Errorf("engine_auto_restart_min_delay must be >= 0 (got %d)",
            cfg.EngineAutoRestartMinDelay)
    }

Contributor (Author):
done
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -362,6 +364,18 @@ func (cfg *Server) WithTelemetryPort(port int) *Server { |
| 362 | 364 | return cfg |
| 363 | 365 | } |
| 364 | 366 | |
| | 367 | + // WithDisableEngineAutoRestart enables or disables automatic engine restarts on self-termination. |
| | 368 | + func (cfg *Server) WithDisableEngineAutoRestart(disabled bool) *Server { |
| | 369 | + cfg.DisableEngineAutoRestart = disabled |
| | 370 | + return cfg |
| | 371 | + } |
| | 372 | + |
| | 373 | + // WithEngineAutoRestartMinDelay sets minimum time between automatic engine restarts. |
| | 374 | + func (cfg *Server) WithEngineAutoRestartMinDelay(secs uint) *Server { |
| | 375 | + cfg.EngineAutoRestartMinDelay = int(secs) |
| | 376 | + return cfg |
| | 377 | + } |
| | 378 | + |
| 365 | 379 | // DefaultServer creates a new instance of configuration struct |
| 366 | 380 | // populated with defaults. |
| 367 | 381 | func DefaultServer() *Server { |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -837,6 +851,11 @@ func (cfg *Server) Validate(log logging.Logger) (err error) { |
| 837 | 851 | return FaultConfigSysRsvdZero |
| 838 | 852 | } |
| 839 | 853 | |
| | 854 | + if cfg.EngineAutoRestartMinDelay < 0 { |
| | 855 | + return errors.Errorf("engine_auto_restart_min_delay must be >= 0 (got %d)", |
| | 856 | + cfg.EngineAutoRestartMinDelay) |
| | 857 | + } |
| | 858 | + |
| 840 | 859 | // A config without engines is valid when initially discovering hardware prior to adding |
| 841 | 860 | // per-engine sections with device allocations. |
| 842 | 861 | if len(cfg.Engines) == 0 { |
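The range check added to `Validate` can be exercised in isolation. The sketch below is a minimal, self-contained version under two assumptions: the check is pulled into a hypothetical helper `validateMinDelay`, and the standard library's `fmt.Errorf` replaces the `errors.Errorf` call used in the PR so that the example has no external dependencies.

```go
package main

import "fmt"

// Server is a stand-in holding only the field being validated.
type Server struct {
	EngineAutoRestartMinDelay int // seconds
}

// validateMinDelay is a hypothetical helper that isolates the check the PR
// adds to Validate. fmt.Errorf is used here instead of errors.Errorf.
func validateMinDelay(cfg *Server) error {
	if cfg.EngineAutoRestartMinDelay < 0 {
		return fmt.Errorf("engine_auto_restart_min_delay must be >= 0 (got %d)",
			cfg.EngineAutoRestartMinDelay)
	}
	return nil
}

func main() {
	// A negative value is rejected; zero and positive values pass.
	for _, delay := range []int{-1, 0, 30} {
		err := validateMinDelay(&Server{EngineAutoRestartMinDelay: delay})
		fmt.Printf("delay=%d err=%v\n", delay, err)
	}
}
```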