feat: AP 17 Fleet shard status #82
base: main
A small drawing of the fleet manager, fleetshard operator, and operands could help set the context.
After starting on a drawing, I wondered whether it was making this seem too complicated. So I just updated the text with some language about custom resources in general. I really am trying to convey that this is the Kubernetes custom resource paradigm with an intermediary, so it has some additional considerations on top of the base recommendations.
Show examples in the form of YAML, with the context of the status if necessary.
Are there specific examples you had in mind?
Things like showing non-omitted conditions even though they are not used for that specific effective status report.
This might not belong here, but you say that the shard in most cases cannot know whether a case is transient or terminal, which makes me wonder how the control plane makes that decision from the status it receives from the shard. Some examples here of how you achieved it, or of how a status is interpreted for feedback to the end user, would be interesting for context on how the global machinery is orchestrated.
The guidance about minimizing is fleetshard specific. The interpretation of missing as unknown comes straight from the Kubernetes recommendations. From existing non-fleetshard operators, we have counterexamples like:
Strimzi uses both Ready and NotReady conditions. The NotReady condition is only populated when there's a problem. By their contract, you have to know that a missing NotReady means to look instead for the Ready condition, not that NotReady is currently unknown.
But I'm honestly not sure how instructive that is.
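To make the difference between the two readings concrete, here is a minimal Python sketch. It is illustrative only: the helper names are hypothetical, conditions are modeled as plain dicts, and this is not code from Strimzi or from the fleetshard operator.

```python
def find_condition(conditions, cond_type):
    """Return the first condition dict with the given type, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def status_or_unknown(conditions, cond_type):
    """Kubernetes-style reading: an absent condition is treated as Unknown."""
    cond = find_condition(conditions, cond_type)
    if cond is None:
        return "Unknown"
    return cond.get("status", "Unknown")


def strimzi_style_ready(conditions):
    """Reading under the contract described above (an assumption of this sketch):
    NotReady=True signals a problem; a missing NotReady means consult Ready."""
    not_ready = find_condition(conditions, "NotReady")
    if not_ready is not None and not_ready.get("status") == "True":
        return "False"
    return status_or_unknown(conditions, "Ready")
```

With only a Ready condition present, `status_or_unknown(conditions, "NotReady")` reports Unknown, while `strimzi_style_ready(conditions)` falls back to the Ready condition, which is the extra knowledge that contract requires of the reader.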
That's the gist of the recommendations around error handling: neither the control plane nor a user can assume that immediate action is needed just because something is currently in error, regardless of whether we're installing.
The fleetshard operator interprets the strimzi status as follows:
https://github.com/bf2fc6cc711aee1a0c2a/kas-fleetshard/blob/main/operator/src/main/java/org/bf2/operator/operands/AbstractKafkaCluster.java#L169 That is then aggregated with the other operands. The fleet manager is further able to aggregate reasoning over the cluster itself and from things like route53.
We don't expect the fleet manager to make specific decisions based upon errors that appear in our status. We are instead calling out very specific service-level cases (the wrong profile or version of something is in the CR, or the data plane lacks capacity for the given instance) that the control plane may use to do things like select a different Strimzi version or use a different cluster. Otherwise, about the best it can infer is whether we think we're still Installing.
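As a rough sketch of that division of labor, assuming the shard reports a Ready condition with a reason: the reason strings and action names below are hypothetical placeholders, not the actual values used by the fleetshard operator or the fleet manager.

```python
# Hypothetical reason strings and actions, used only for illustration.
ACTIONS_BY_REASON = {
    "UnsupportedVersion": "select-different-version",
    "InsufficientCapacity": "select-different-cluster",
}


def control_plane_action(ready):
    """Return an action for the called-out service-level cases, else None.

    Any other error reason yields None: being in error right now does not,
    by itself, tell the control plane that immediate action is needed.
    """
    if ready is None or ready.get("status") == "True":
        return None
    return ACTIONS_BY_REASON.get(ready.get("reason"))


def still_installing(ready):
    """True when the shard reports that it thinks it is still Installing."""
    return (
        ready is not None
        and ready.get("status") == "False"
        and ready.get("reason") == "Installing"
    )
```

The point of keeping the mapping this small is that the control plane only reacts to the few cases the shard explicitly calls out; everything else is left to aggregation with other signals.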
But it's for the fleet manager to interpret in the end, correct?
I understand what this is trying to say, but I think the way this is worded is a little vague. Do you think the following is better?
Incorporated the above, please see if that matches what you are thinking.
Maybe a state diagram with some context could help set up the context of the conversation here.
I'm not sure it will work, but it could help, as I feel I have to think about the problem to figure out what is recommended and for which use case.
There really isn't a state diagram, only very special exceptions to what qualifies as a terminal error.
Maybe restating more succinctly might help:
Can we give a concrete example here?
Elaborated more, based upon a Kafka version upgrade.
Show a concrete YAML example to anchor the reader.
Added the YAML that matches the description.