fix(cluster_family): Cancel slot migration from incoming node on OOM #5000
9c88c11
to
6a5d621
Compare
src/server/cluster/cluster_family.cc
Outdated
if (migration->GetState() == MigrationState::C_FATAL) {
  migration->Stop();
}
This looks like the wrong place for this logic.
If FLOW fails with C_FATAL, this is where we call it. Where would you put migration->Stop to handle stopping the migration?
Maybe we can move it to reportError or SetState, where we set the fatal status.
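To make the suggestion above concrete, here is a minimal sketch of stopping the migration at the point where the fatal status is set, so that call sites do not need their own C_FATAL check. Everything in it is a hypothetical stand-in, not the actual code in cluster_family.cc: the class name, ReportFatalError, the private SetState, and every enum value other than C_FATAL are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in; only C_FATAL is taken from the thread above.
enum class MigrationState { C_CONNECTING, C_SYNC, C_FINISHED, C_FATAL };

// Hypothetical class, not the real incoming migration type.
class IncomingMigrationSketch {
 public:
  MigrationState GetState() const { return state_; }
  bool IsStopped() const { return stopped_; }
  const std::string& LastError() const { return last_error_; }

  // Idempotent, so calling it more than once is harmless.
  void Stop() { stopped_ = true; }

  // Single entry point for fatal errors: record the reason and
  // switch state; SetState takes care of stopping.
  void ReportFatalError(std::string reason) {
    last_error_ = std::move(reason);
    SetState(MigrationState::C_FATAL);
  }

 private:
  void SetState(MigrationState next) {
    state_ = next;
    if (next == MigrationState::C_FATAL) {
      Stop();
    }
    // Invariant: a fatal migration is always stopped.
    assert(state_ != MigrationState::C_FATAL || stopped_);
  }

  MigrationState state_ = MigrationState::C_CONNECTING;
  bool stopped_ = false;
  std::string last_error_;
};
```

With this shape, a caller that hits an error only calls ReportFatalError. Whether the real Stop is safe to call from inside the state setter (locking, fiber context) is something this sketch does not address.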
e9049f0
to
16a27f6
Compare
7e7ebf4
to
40851f9
Compare
eab5c5a
to
14d2fab
Compare
@kostasrim Please review the changes in this PR related to the reply builder.
I synced with @mkaruza internally. We don't need the changes in the reply builder, as they affect a broader part of the codebase. What's more, the common denominator for all errors are What we need here is to fetch out the error message on the caller side (of DispatchCommand) on a If we ever need a
14d2fab
to
51adcde
Compare
@@ -1364,6 +1364,10 @@ bool Service::InvokeCmd(const CommandId* cid, CmdArgList tail_args,
  }

  if (std::string reason = builder->ConsumeLastError(); !reason.empty()) {
    // Set flag if OOM reported
    if (reason == kOutOfMemory) {
      cmd_cntx.conn_cntx->is_oom_ = true;
Can we put is_oom_ into the command context?
CommandContext is created in DispatchCommand, so it is not available from the incoming slot migration.
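As an illustration of the constraint described above, the following sketch shows why a flag on the connection context is visible to the caller while per-command state is not. All types and functions here are hypothetical stand-ins (the Sketch suffix marks them as such). Only the names ConsumeLastError, is_oom_, and the comparison against an out-of-memory constant come from the diff; the constant's value is assumed.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Assumed value; the real kOutOfMemory string is not shown in this thread.
constexpr std::string_view kOutOfMemorySketch = "Out of memory";

// Hypothetical reply builder that remembers the last error it sent.
struct ReplyBuilderSketch {
  void SendError(std::string msg) { last_error_ = std::move(msg); }
  // Returns the last recorded error and clears it.
  std::string ConsumeLastError() {
    return std::exchange(last_error_, std::string{});
  }

 private:
  std::string last_error_;
};

// Hypothetical connection context; it outlives a single command.
struct ConnectionContextSketch {
  bool is_oom_ = false;
};

// Stands in for command dispatch. Any per-command context would be
// created and destroyed inside this function, so the caller can only
// observe state stored on the builder or the connection context.
void DispatchSketch(bool simulate_oom, ReplyBuilderSketch* builder,
                    ConnectionContextSketch* conn) {
  assert(builder != nullptr && conn != nullptr);
  if (simulate_oom) {
    builder->SendError(std::string(kOutOfMemorySketch));
  }
  if (std::string reason = builder->ConsumeLastError(); !reason.empty()) {
    // Set flag if OOM reported
    if (reason == kOutOfMemorySketch) {
      conn->is_oom_ = true;
    }
  }
}

// Caller side, e.g. a loop applying migrated commands.
// Returns false when the command hit OOM.
bool ApplyCommandSketch(bool simulate_oom, ReplyBuilderSketch* builder,
                        ConnectionContextSketch* conn) {
  conn->is_oom_ = false;  // reset before each command
  DispatchSketch(simulate_oom, builder, conn);
  return !conn->is_oom_;
}
```

A caller that gets false back from ApplyCommandSketch could then close the migration and switch it to the fatal state. Moving the flag into the command context would require dispatch to hand that context back to the caller, which is the refactoring discussed in the comments that follow.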
Maybe we can refactor and propagate it? I'm OK with doing this in a separate PR, if that is possible.
OK, let's discuss improving this in a separate PR.
I don't want to push any refactor for a single flow. As I wrote above, let's not rush generic solutions for non-generic problems. When the time comes and we need the callers to check for reply builder errors, we can discuss. Until then, sayonara, as we have more important things to deal with.
@mkaruza Could you create a ticket so we don't forget about this refactoring?
Guys, valid points on both sides.
I agree that is_oom_ is conceptually tied to the command rather than the connection, so placing it in CommandContext makes sense from a design perspective. However, since CommandContext isn’t available during the incoming slot migration, I support the idea of deferring this to a separate PR. I will follow up with Mario to understand the effort and whether we want to do this if it is a big change and the proposed refactor is non-trivial.
We aren't discussing reply builder errors. Storing the OOM result in connection_context isn't logically correct.
I can argue the other way, but that is not my point. My point is that we have other, more important things to deal with right now than this refactor.
I am not the one who steers product direction or decides what to prioritize, but I would ask first before considering this.
I think @BorysTheDev raised a design concern. His suggestion to handle it in a separate PR seems reasonable and doesn't block this one.
The point about broader priorities is important, but I don't think it conflicts with acknowledging and tracking design inconsistencies. It’s not about shifting priorities now, but recognizing a spot where the design could be improved.
LGTM
If applying a command on the incoming node results in OOM (we exceed max_memory_limit), we close the migration and switch its state to FATAL.
Signed-off-by: mkaruza <[email protected]>
51adcde
to
3622fd3
Compare