Conversation

dfawley (Collaborator) commented Nov 6, 2025

This is based on top of #2442; please review only the final commit in this PR.

I found this pattern for writing tests to be painful. I will try to spend some time working on another approach.

The main problem I had with the tests was that the test_utils mod creates the StubPolicy and StubPolicyData, meaning the tests have a hard time observing what the children's Funcs did except by watching what was called on the child_controller or work_scheduler.

cjqzhao and others added 2 commits November 4, 2025 14:21
rename mock picker and remove spaces

make test picker private
General:

- Add Debug to many traits and derive/impl in structs.
- Pass LB config to LB policies via `Option<LbConfig>` instead of
  `Option<&LbConfig>`.  It should be rare that policies want to store
  a config except for the leaf policy.

Child manager:

The original assumption was that all children would be the same
type/configuration, but several policies (including gracefulswitch)
will not have that property.  So, several changes are made:

- Children are considered unique by both their identifier and their
  LbPolicyBuilder's name().
- Make the sharder also able to shard the LbConfig and provide it via
  the `ChildUpdate.child_update` field in addition to the
  ResolverUpdate.
- Make ResolverUpdateSharder a generic instead of Box<dyn>.
- Add booleans so users of the child manager can easily tell
  whether any child policies updated themselves, and which ones did.
- Pass &mut self for sharder so that it can maintain and update its
  state if needed.
- Change the sharder's output ChildUpdate.child_update field to an
  Option; if None then the child will not be called during the
  resolver update, but will remain in the child manager.
- Change child_states into children and provide the whole Child
  struct, exposing the fields it contains.
- Provide mutable access to the sharder.
- Change the LB config to be a flat JSON array to facilitate use
  within another LB policy that should not need a wrapper struct to
  contain the children.
- Minor test cleanups
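The reshaped sharder interface above can be sketched roughly as follows. All names here (`LbConfig`, `ChildUpdate`, `ResolverUpdateSharder`, `PerAddressSharder`) are simplified stand-ins for illustration, not the real tonic types:

```rust
// Stand-in for the real per-child LB config type.
#[derive(Debug, Clone, PartialEq)]
struct LbConfig(String);

#[derive(Debug)]
struct ChildUpdate {
    // Children are unique by (builder name, identifier) together.
    identifier: String,
    builder_name: &'static str,
    // None means: keep this child in the child manager, but do not call
    // its resolver_update during this pass.
    child_update: Option<LbConfig>,
}

// Takes &mut self so an implementation can retain state across updates;
// the child manager is generic over S: ResolverUpdateSharder instead of
// holding a Box<dyn ...>.
trait ResolverUpdateSharder {
    fn shard_update(&mut self, update: &str) -> Vec<ChildUpdate>;
}

// Toy sharder: one child per comma-separated address.
struct PerAddressSharder;

impl ResolverUpdateSharder for PerAddressSharder {
    fn shard_update(&mut self, update: &str) -> Vec<ChildUpdate> {
        update
            .split(',')
            .map(|addr| ChildUpdate {
                identifier: addr.to_string(),
                builder_name: "pick_first",
                child_update: Some(LbConfig(addr.to_string())),
            })
            .collect()
    }
}
```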

Graceful switch:

The previous implementation in hyperium#2399 contained a lot of logic to
manage child policy delegation.  It was intended that only
ChildManager should need to have this kind of logic.

- Create a new implementation of this policy that delegates to
  ChildManager.
- Use a Sharder that simply emits the active policy with no update
  alongside any new policy in the new LbConfig.
- maybe_swap is called after every call into the ChildManager to
  determine if child updates necessitate a swap.
- This logic is simple: if the active policy is not Ready, or if there
  is a new policy and it is not Connecting, then set the new policy as
  the active policy and call resolver_update on the ChildManager.  The
  sharder will see that no LbConfig is provided and just emit the
  active policy with no config, causing the ChildManager to drop the
  previously active policy.  If no swap is needed, update the picker
  of the active policy if it had an update.
- Minor test cleanups/fixes vs. hyperium#2399.
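The swap decision described above can be sketched as a pure function over connectivity states. `should_swap` is a hypothetical helper for illustration only; the real maybe_swap also re-runs resolver_update on the ChildManager after a swap and updates the picker when no swap is needed:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConnectivityState {
    Idle,
    Connecting,
    Ready,
    TransientFailure,
}

// Swap when a pending (new) policy exists and either the active policy
// is not Ready or the pending policy has left Connecting.
fn should_swap(active: ConnectivityState, pending: Option<ConnectivityState>) -> bool {
    match pending {
        None => false, // nothing to swap to
        Some(p) => active != ConnectivityState::Ready || p != ConnectivityState::Connecting,
    }
}
```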
@dfawley dfawley added this to the grpc-next milestone Nov 6, 2025
@dfawley dfawley requested a review from arjan-bal November 6, 2025 22:56
@dfawley dfawley force-pushed the child_manager_work_scheduler branch from f888c3b to c8a2166 Compare November 7, 2025 21:23
// TODO: This is mainly provided as a fairly complex example of the current LB
// policy in use. Complete tests must be written before it can be used in
// production. Also, support for the work scheduler is missing.
Collaborator:

nit: Probably need to remove the TODO about the missing work scheduler?

pub struct ChildManager<T> {
#[derive(Debug)]
pub struct ChildManager<T: Debug, S: ResolverUpdateSharder<T>> {
subchannel_child_map: HashMap<WeakSubchannel, usize>,
Collaborator:

Should we consider renaming this to reflect what is being mapped to? Since usize is a pretty ambiguous type, it makes the purpose of the map difficult to understand. Ex: subchannel_to_index

update_sharder: S,
pending_work: Arc<Mutex<HashSet<usize>>>,
runtime: Arc<dyn Runtime>,
updated: bool, // Set when any child updates its picker; cleared when accessed.
Collaborator:

We seem to have some interdependent state here (children, work queue, map). Would it make sense to encapsulate the dependent state into something like a StateHolder that only exposes valid state manipulations on the underlying data? This would help us avoid missing corner cases that result in invalid states.
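One possible shape for that encapsulation, sketched with illustrative names and stand-in types (a real Child struct and subchannel type would replace the `String`/`u64` placeholders):

```rust
use std::collections::{HashMap, HashSet};

struct ChildStates {
    children: Vec<String>,                     // stand-in for Vec<Child<T>>
    subchannel_child_map: HashMap<u64, usize>, // subchannel id -> child index
    pending_work: HashSet<usize>,
}

impl ChildStates {
    fn new() -> Self {
        Self {
            children: Vec::new(),
            subchannel_child_map: HashMap::new(),
            pending_work: HashSet::new(),
        }
    }

    // Removing a child also drops its subchannel mappings and pending
    // work, and re-indexes the remaining entries, so no caller can leave
    // a dangling index behind.
    fn remove_child(&mut self, idx: usize) {
        self.children.remove(idx);
        self.subchannel_child_map.retain(|_, i| *i != idx);
        for i in self.subchannel_child_map.values_mut() {
            if *i > idx {
                *i -= 1;
            }
        }
        self.pending_work.remove(&idx);
        let shifted: HashSet<usize> = self
            .pending_work
            .iter()
            .map(|&i| if i > idx { i - 1 } else { i })
            .collect();
        self.pending_work = shifted;
    }
}
```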

self.children
.iter()
.map(|child| (&child.identifier, &child.state))
pub fn children(&self) -> impl Iterator<Item = &Child<T>> {
Collaborator:

Should we consider encapsulation here instead of returning accessors to non-trivial data members (anything that isn't just data)?


/// Returns a mutable reference to the update sharder so operations may be
/// performed on it for instances in which it needs to retain state.
pub fn update_sharder(&mut self) -> &mut S {
Collaborator:

Same as above. Should we encapsulate the operations instead of returning a data member to avoid leaking implementations?

.enumerate()
.map(|(old_idx, e)| (e.identifier, (e.policy, e.state, old_idx, e.work_scheduler)));
let mut old_children: HashMap<T, _> = old_children.collect();
let old_children = old_children.into_iter().enumerate().map(|(old_idx, e)| {
Collaborator:

Should we call collect on the first expression itself?
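For illustration, the suggestion amounts to folding the intermediate binding into a single chained expression ending in collect(). Types here are simplified stand-ins for the real child entries:

```rust
use std::collections::HashMap;

// Map each entry to (identifier, (state, original index)) and collect
// directly into the HashMap, with no intermediate iterator binding.
fn index_old_children(entries: Vec<(&'static str, u32)>) -> HashMap<&'static str, (u32, usize)> {
    entries
        .into_iter()
        .enumerate()
        .map(|(old_idx, (id, state))| (id, (state, old_idx)))
        .collect()
}
```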

let mut old_children: HashMap<T, _> = old_children.collect();
let old_children = old_children.into_iter().enumerate().map(|(old_idx, e)| {
(
(e.builder.name(), e.identifier),
Collaborator:

Should we abstract these tuples to some reasonable internal structs instead of arbitrary tuples? In general, we seem to have lots of tuples being structured and destructured, so the question would probably apply to it as well.
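A hypothetical named key struct replacing the (builder name, identifier) tuple might look like this; field types are illustrative:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ChildKey {
    builder_name: &'static str,
    identifier: String,
}

fn key_for(builder_name: &'static str, identifier: &str) -> ChildKey {
    ChildKey {
        builder_name,
        identifier: identifier.to_string(),
    }
}
```

A named struct makes the "children are unique by identifier and builder name" rule explicit at every use site, instead of relying on tuple position.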

// Transfer children whose identifiers appear before and after the
// update, and create new children. Add entries back into the
// subchannel map.
for (new_idx, (identifier, builder)) in ids_builders.into_iter().enumerate() {
Collaborator:

Should we consider simplifying the reconciliation logic? Along the lines of a previous comment about separating state manipulation into its own encapsulation: maybe we can split the update into two steps, calculating a diff (which returns a set of diff operations) by reading the state, and then asking the state manager to apply that set of diffs.
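A sketch of that two-phase shape, with illustrative names (`DiffOp`, `compute_diff`) and string identifiers standing in for the real child identifiers:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum DiffOp {
    Create(String),
    Retain(String),
    Remove(String),
}

// Phase 1: read-only pass that turns (old children, new update) into a
// list of operations; a state holder would apply them in phase 2.
fn compute_diff(old: &HashSet<String>, new: &[String]) -> Vec<DiffOp> {
    let new_set: HashSet<&String> = new.iter().collect();
    let mut ops: Vec<DiffOp> = new
        .iter()
        .map(|id| {
            if old.contains(id) {
                DiffOp::Retain(id.clone())
            } else {
                DiffOp::Create(id.clone())
            }
        })
        .collect();
    // Children absent from the update are removed.
    ops.extend(
        old.iter()
            .filter(|id| !new_set.contains(*id))
            .map(|id| DiffOp::Remove(id.clone())),
    );
    ops
}
```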

ConnectivityState::TransientFailure
);
}

Collaborator:

Do we have a test that verifies the reconciliation business logic and its branches?

dfawley (Collaborator, Author) commented Nov 10, 2025

I started working on all the things you commented on here, but I realized now that pretty much all of your comments should be on #2442. (See PR description:)

This is based on top of #2442; please review only the final commit in this PR.

(And then I made one more commit on top of that "final commit" and fixed the old comment that you noticed. So please focus on just https://github.com/hyperium/tonic/pull/2443/files/f933d877c87e97a6c18cb14a86a861f9f7efe6a6..3332da3710d62e4c5999c88c3dd4a2d3884665b5 for this PR and we can carry forward the other discussions in #2442.)
