fix(cluster_family): Cancel slot migration from incoming node on OOM #5000


Merged
merged 6 commits into main on May 29, 2025

Conversation

@mkaruza (Contributor) commented Apr 25, 2025

If applying a command on the incoming node would result in OOM (i.e. we overflow max_memory_limit), we close the migration and switch its state to FATAL.
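The behavior described above can be illustrated with a minimal sketch. This is not the Dragonfly implementation; the class shape, member names, and the byte-count memory check are all simplified assumptions made for illustration only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: if applying a migrated command would push memory
// usage past max_memory_limit, the incoming migration switches its state
// to C_FATAL and stops itself.
enum class MigrationState { C_CONNECTING, C_SYNC, C_FATAL, C_FINISHED };

class IncomingSlotMigration {
 public:
  IncomingSlotMigration(size_t used, size_t max_limit)
      : used_memory_(used), max_memory_limit_(max_limit) {}

  // Returns false and flips the state to C_FATAL when the command's
  // payload would overflow max_memory_limit.
  bool ApplyCommand(const std::string& payload) {
    if (used_memory_ + payload.size() > max_memory_limit_) {
      state_ = MigrationState::C_FATAL;
      Stop();
      return false;
    }
    used_memory_ += payload.size();
    return true;
  }

  void Stop() { stopped_ = true; }
  MigrationState GetState() const { return state_; }
  bool stopped() const { return stopped_; }

 private:
  size_t used_memory_;
  size_t max_memory_limit_;
  MigrationState state_ = MigrationState::C_SYNC;
  bool stopped_ = false;
};
```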

@mkaruza mkaruza requested a review from BorysTheDev April 25, 2025 11:16
@mkaruza mkaruza marked this pull request as draft April 25, 2025 11:16
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 9c88c11 to 6a5d621 Compare May 6, 2025 12:34
@mkaruza mkaruza requested a review from BorysTheDev May 6, 2025 12:35
@mkaruza mkaruza marked this pull request as ready for review May 6, 2025 12:35
@mkaruza mkaruza changed the title fix(cluster_family): Cancel slot migration in case of OOM errors on t… fix(cluster_family): Cancel slot migration from incoming node on OOM May 6, 2025
Comment on lines 985 to 987
if (migration->GetState() == MigrationState::C_FATAL) {
migration->Stop();
}
Contributor

looks like incorrect place for this logic

@mkaruza (Contributor, Author) May 6, 2025

If FLOW fails with C_FATAL we'll call it here; where would you put migration->Stop() to handle stopping the migration?

Contributor

Maybe we can move it to reportError or SetState, where we set the fatal status
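The reviewer's suggestion could look roughly like the sketch below: fold the Stop() call into the state transition itself, so every path that reports a fatal error stops the migration without callers having to remember to check. The method names (ReportFatalError, SetState) and members are assumptions for illustration, not the actual codebase API.

```cpp
#include <string>
#include <utility>

// Hypothetical sketch: the C_FATAL transition itself triggers Stop(),
// instead of callers checking GetState() and calling Stop() manually.
enum class MigrationState { C_SYNC, C_FATAL };

class IncomingSlotMigration {
 public:
  void ReportFatalError(std::string reason) {
    error_ = std::move(reason);
    SetState(MigrationState::C_FATAL);  // state change triggers Stop()
  }

  void SetState(MigrationState s) {
    state_ = s;
    if (state_ == MigrationState::C_FATAL)
      Stop();  // centralized: no call site needs to remember this
  }

  void Stop() { stopped_ = true; }
  MigrationState state() const { return state_; }
  bool stopped() const { return stopped_; }

 private:
  MigrationState state_ = MigrationState::C_SYNC;
  std::string error_;
  bool stopped_ = false;
};
```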

@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from e9049f0 to 16a27f6 Compare May 7, 2025 11:32
@mkaruza mkaruza requested a review from BorysTheDev May 7, 2025 11:33
@BorysTheDev BorysTheDev requested review from adiholden and kostasrim May 7, 2025 11:50
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch 2 times, most recently from 7e7ebf4 to 40851f9 Compare May 7, 2025 19:41
@mkaruza mkaruza requested a review from BorysTheDev May 8, 2025 13:30
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from eab5c5a to 14d2fab Compare May 27, 2025 06:44
@adiholden (Collaborator)

@kostasrim Please review the changes in this PR related to the reply builder

@kostasrim (Contributor) commented May 27, 2025

@kostasrim Please review the changes in this PR related to the reply builder

I synced with @mkaruza internally. We don't need the changes in the reply builder, as they affect a broader part of the codebase. What's more, the common denominator for all errors is strings, not op status codes.

What we need here is to surface the error message on the caller side (of DispatchCommand) on one very specific code path (when the status code is OOM). This can be done easily with an extra std::optional<bool> within ConnectionContext, and it pretty much gets rid of all the changes in ReplyBuilder. If we have an error, the std::optional is emplaced; if the string is kOOM we set the boolean to true, and that's how the caller knows. No changes to the reply builder and no generic solution for a non-generic problem. Furthermore, there is no extra overhead unless we are actually on this very specific flow (and IMO the 10-byte string comparison is nothing).

If we ever need a generic solution that on error requires to return the actual error message to the caller, we just add an extra argument to DispatchCommand() and do it properly.
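A minimal sketch of the approach described above, under stated assumptions: the struct layout, the RecordError/DispatchHitOom helpers, and the exact kOutOfMemory string are placeholders invented for illustration, not the real Dragonfly definitions.

```cpp
#include <optional>
#include <string>

// Hypothetical sketch: errors stay strings in the reply builder; only the
// OOM path sets a flag on the connection context, which the caller of
// DispatchCommand can inspect afterwards.
constexpr const char* kOutOfMemory = "Out of memory";

struct ConnectionContext {
  // Engaged only when an error was reported; true iff that error was OOM.
  std::optional<bool> is_oom_;
};

// Stand-in for the error-consumption step inside InvokeCmd.
void RecordError(ConnectionContext* cntx, const std::string& reason) {
  if (!reason.empty())
    cntx->is_oom_ = (reason == kOutOfMemory);
}

// Caller side (e.g. the incoming slot migration) after dispatching.
bool DispatchHitOom(const ConnectionContext& cntx) {
  return cntx.is_oom_.value_or(false);
}
```

The flag costs nothing on the happy path: the optional stays disengaged unless an error string was actually consumed, and the string comparison runs only in that error branch.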

@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 14d2fab to 51adcde Compare May 27, 2025 11:28
@mkaruza mkaruza requested review from BorysTheDev and kostasrim May 27, 2025 11:30
@@ -1364,6 +1364,10 @@ bool Service::InvokeCmd(const CommandId* cid, CmdArgList tail_args,
}

  if (std::string reason = builder->ConsumeLastError(); !reason.empty()) {
    // Set flag if OOM reported
    if (reason == kOutOfMemory) {
      cmd_cntx.conn_cntx->is_oom_ = true;
Contributor

can we put is_oom_ into command context?

@mkaruza (Contributor, Author)

CommandContext is created in DispatchCommand, so it isn't available from the incoming slot migration.

@BorysTheDev (Contributor) May 28, 2025

Maybe we can do the refactoring and propagate it? I'm OK with doing this in a separate PR if that's possible.

@mkaruza (Contributor, Author)

OK, let's talk about improving this in a separate PR.

Contributor

I don't want to push any refactor for a single flow. As I wrote above, let's not rush generic solutions for non generic problems. When the time comes and we need the callers to check for reply builder errors we can discuss. Until then, sayonara as we have more important things to deal with.

Contributor

@mkaruza Could you create a ticket so we don't forget about this refactoring?

Collaborator

Guys, valid points on both sides.

I agree that is_oom_ is conceptually tied to the command rather than the connection, so placing it in CommandContext makes sense from a design perspective. However, since CommandContext isn't available during the incoming slot migration, I support the idea of deferring this to a separate PR. I will follow up with Mario to understand the effort and whether we want to do this, in case the proposed refactor turns out to be a big, non-trivial change.

@kostasrim (Contributor) May 28, 2025

We don't discuss reply builder errors here. Storing the OOM result in connection_context isn't logically correct.

I can argue the other way but this is not my point. My point is that we have other more important things to deal with rn than this refactor.

I am not the one who steers product direction or what to prioritize but I would ask first before considering this.

Collaborator

I think @BorysTheDev raised a design concern. His suggestion to handle it in a separate PR seems reasonable and doesn't block this one.

The point about broader priorities is important, but I don't think it conflicts with acknowledging and tracking design inconsistencies. It's not about shifting priorities now, but about recognizing a spot where the design could be improved.

Collaborator

LGTM

@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 51adcde to 3622fd3 Compare May 28, 2025 11:59
@mkaruza mkaruza merged commit 07e6958 into main May 29, 2025
10 checks passed
@mkaruza mkaruza deleted the mkaruza/github#4977 branch May 29, 2025 06:28
4 participants