Conversation

@andrewjstone
Contributor

When committing configurations, the trust quorum protocol relies on Nexus being up and able to interact with each trust quorum node. This allows Nexus to continuously try to commit nodes in an RPW and record in the database which nodes have acked. This mirrors many of our existing designs, where Nexus observes and records information about successful operations and retries via RPW. Specifically, it matches how we do things in the Reconfigurator and the TUF Repo Depot.

Unfortunately, Nexus cannot communicate with new sleds that are not yet running a sled-agent and are still stuck in the bootstrap-agent. This is because the bootstrap agents (and the trust quorum protocol) only communicate over the bootstrap network, which Nexus does not have access to. Nodes must already be part of an existing configuration, running sled-agent, and on the underlay network to talk to Nexus. In this common case, Nexus sends trust quorum related messages to the sled-agent, which then calls the API of its local trust quorum NodeTask. This is not possible for newly added sleds. While the trust quorum coordinator node will tell new nodes to Prepare a configuration over the bootstrap network, these new nodes do not have any mechanism to receive commits from Nexus. Therefore we must proxy these commit-related operations to an existing member of the trust quorum when adding a new node. We also added the ability to proxy NodeStatus requests to aid in debugging.

This PR therefore adds the ability to proxy certain requests from one node to another so that we can commit nodes to the latest trust quorum configuration, set up their encrypted storage, and boot their sled-agent.

It's worth noting that this is not the only way we could have solved this problem. There are a few possibilities in the design space.

  1. We could have had the coordinator always send commit operations and collect acknowledgements, as it does during the Prepare phase. Unfortunately, if the coordinator dies before all nodes ack, then Nexus would not be able to ensure commit at all nodes. To make this reliable, Nexus would still need to be able to reach out to uncommitted nodes and tell them to commit. Since we already have to do the latter, there is no reason to do the former.

  2. We could commit at the coordinator (or a few nodes), and then have them gossip around information about the commit. This is actually a promising design, and is essentially what we do for the early network config. Nexus could then wait for the sled-agent to start on those nodes and ask them directly whether they committed. This would still require talking to all nodes and it adds some extra complexity, but it still seems somewhat reasonable. The rationale for our current choice of proxying was largely one of fitting our existing patterns. It's also very useful for Nexus to be able to directly ask a trust quorum node on another sled about its status to diagnose issues.

So we went with the proxy mechanism as implemented here. Why, then, did we introduce another level of messages at the Task layer instead of re-using the CommitAdvance functionality or adding new variants to the PeerMsg in the trust_quorum_protocol crate? The rationale here is largely that the trust quorum protocol as written in RFD 238 and specified in TLA+ doesn't include this behavior. It expects commits to come from the Node "API", meaning from Nexus. I didn't want to change that behavior unnecessarily, given the urgency and an existing solid design.

It was also easier to build proxy operations this way, since tracking operations in async code with oneshot channels is easier than trying to insert similar tracking into the sans-io code. In short, we left the trust-quorum-protocol crate alone and added some async helpers to the trust_quorum crate.
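
For concreteness, here is a minimal sketch of the oneshot-based tracking described above. The type names and the String payloads are illustrative stand-ins, not the actual trust_quorum types.

```rust
// Hypothetical sketch of oneshot-based proxy request tracking. Type names
// and payload types are illustrative, not the real trust_quorum API.
use std::collections::HashMap;
use tokio::sync::oneshot;

type RequestId = u64;
type ProxyResponse = Result<String, String>;

#[derive(Default)]
struct ProxyTracker {
    in_flight: HashMap<RequestId, oneshot::Sender<ProxyResponse>>,
}

impl ProxyTracker {
    /// Register an in-flight request; the caller awaits the returned receiver.
    fn track(&mut self, id: RequestId) -> oneshot::Receiver<ProxyResponse> {
        let (tx, rx) = oneshot::channel();
        self.in_flight.insert(id, tx);
        rx
    }

    /// Resolve a request when the proxied response (or an error) arrives.
    fn complete(&mut self, id: RequestId, rsp: ProxyResponse) {
        if let Some(tx) = self.in_flight.remove(&id) {
            let _ = tx.send(rsp);
        }
    }
}
```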

One additional change was made in this PR. While adding the tq_proxy test I noticed that we were unnecessarily using wait_for_condition on initial commits, after we knew about successful prepares. These commits should always complete immediately, so I simplified this code in a few existing tests.

Comment on lines +449 to +450
h.send(req).await;
ProxyConnState::Connected
Member

In a previous PR I noted that the send() method ignores all failures and reports success (well, it doesn't return a Result) even if the connection broke. For the trust quorum protocol that seemed fine, as the protocol is resilient to messages not being delivered.

The proxy APIs defined in another file expect either a response from the server or a Disconnected error before returning: if they receive nothing they will block forever. After spending probably way too much time thinking about failure cases related to this¹, I think that if the message is sent to the established connection actor things will eventually be fine: when send returns, the caller will add the request to the tracker, and a disconnection detected by the established connection actor will eventually be relayed to the tracker.

The failure case I still see is when the channel is busy (with 10 pending requests) and the send method silently discards the message. In that case, we return a ProxyConnState::Connected and the request will be added to the tracker, but we will never get a response unless the connection breaks due to another unrelated request failing (since this request never got to the actor).

The code as is could be fine if the caller of any proxy method takes care of adding timeouts everywhere, but this feels like a problem waiting to happen. I'd feel way more comfortable if this returned a ProxyConnState::Busy when sending a message to the channel fails (a rough sketch of that follows the footnote below).

Footnotes

  1. I lost count of how many times I rewrote this comment with other failure cases that, it turns out, couldn't happen.
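
A minimal sketch of the Busy suggestion above, assuming a bounded tokio mpsc channel; the ProxyConnState::Busy variant and the String payload are hypothetical here, not the actual omicron definitions.

```rust
// Sketch only: Busy is a hypothetical variant, and the payload type is a
// stand-in for the real proxy request type.
use tokio::sync::mpsc::{self, error::TrySendError};

enum ProxyConnState {
    Connected,
    Disconnected,
    Busy,
}

fn send_proxy_request(tx: &mpsc::Sender<String>, req: String) -> ProxyConnState {
    match tx.try_send(req) {
        Ok(()) => ProxyConnState::Connected,
        // The channel is full: surface backpressure instead of silently
        // dropping the request and leaving the tracker waiting forever.
        Err(TrySendError::Full(_)) => ProxyConnState::Busy,
        // The receiving task is gone: report the disconnect to the caller.
        Err(TrySendError::Closed(_)) => ProxyConnState::Disconnected,
    }
}
```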

Contributor Author (@andrewjstone), Nov 14, 2025

This is a great analysis @pietroalbini. Thank you so much for taking the time to go through this process and understand the code and design. I actually think that if this were a problem, it would also be somewhat of a problem for the normal trust quorum protocol, as some messages from peers are never resent on a timer.

For example: if a coordinator is sending a Prepare msg to a peer and the message got dropped before making it to the connection task, it would not be resent unless the channel got disconnected and reconnected for some other reason. Now, that being said, the volume of messages is small and this should not happen. And as you point out, there is some built-in resilience. If the commit occurred, Nexus would prepareAndCommit this node, or it would get a CommitAdvance message on the next restart (perhaps after an update). But it still could end up as a problem if too many nodes did this and the coordinator couldn't complete the prepare phase. Nexus would try a new configuration, perhaps at a different coordinator, after some time without prepare completing, but the system may still be overloaded.

With all that being said, I don't actually think what you pointed out is entirely true, and therefore this isn't actually a problem here. However, this analysis is also non-trivial. It almost makes me question whether using connection state for retries instead of timers is the right move. So far, I think it continues to work and has the benefit of not re-sending messages already sent over a reliable stream. Ok, so back to the problem.

> The failure case I still see is when the channel is busy (with 10 pending requests) and the send method silently discards the message.

The channel being used in send is a bounded mpsc channel, and send blocks when the channel is full rather than dropping the message. The error that is discarded is the one returned when the channel itself gets disconnected, presumably because the task exited. In that case the Disconnect callback will eventually fire and all is well.

To help ensure that the disconnect callback occurs when buffers start filling up, there is also a MSG_WRITE_QUEUE_CAPACITY for each established connection that will trigger a disconnect if too many messages are pulled off the channel and serialized before they can be sent. Somewhat importantly, this channel is sized smaller than the queue, so if the queue is full it means that the TCP connection (or serialization) is too slow to move things along. We get backpressure, and eventually a disconnection that should allow things to clear up on a reconnect. I should clean up the TODO suggestion there, as we actually can't drop messages or we will break things, as you point out.
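
To illustrate the backpressure point (this is just generic tokio behavior, not the omicron code): a bounded mpsc send awaits for capacity instead of dropping, and only fails once the receiver is gone.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity of 2: the first two sends complete immediately; a third
    // `send(...).await` would wait for the receiver rather than drop.
    let (tx, rx) = mpsc::channel::<u32>(2);
    tx.send(1).await.unwrap();
    tx.send(2).await.unwrap();

    // Once the receiver is dropped, send returns an error. In the connection
    // code this is the discarded error; the Disconnected callback that
    // follows is what clears up any tracked state.
    drop(rx);
    assert!(tx.send(3).await.is_err());
}
```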

Contributor Author

It appears there is actually one place where the ConnToMainMsgInner::Disconnected message doesn't get sent to the NodeTask when an EstablishedConnection closes: when the EstablishedConnection task itself panics. I believe we use panic = abort in production, so this isn't actually a problem in practice.

However, this has me wondering if instead I should signal to NodeTask from the ConnectionManager::step method when an EstablishedConnection exits rather than sending a Disconnected message from the task itself. That would cover both the successful exit and panic cases for the EstablishedConnection task.

Unfortunately, I also realized that there is another race condition that may get worse if I do this. If a new connection is accepted for the same peer it will trigger an abort of the old connection. In this case the old disconnect will occur after the new connection is established. That could cause problems for the protocol, so I should gate the Node::on_disconnect call by checking whether the task_id matches the current established task ID, as is done for the connection manager's on_disconnected callback.

Another option to solve the latter problem is to always reject a new accepted connection for the same peer if one is already established. Eventually the old one will go away, and the remote peer will retry.

I need to think about this a bit more, but will likely add a few small cleanup patches.
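
A rough sketch of the task-id gating idea from above; the types and field names here are hypothetical, not the actual connection manager.

```rust
// Hypothetical sketch: ignore disconnects from superseded connection tasks.
use std::collections::BTreeMap;

type PeerId = String;
type TaskId = u64;

struct ConnectionManager {
    // The connection task currently considered established for each peer.
    established: BTreeMap<PeerId, TaskId>,
}

impl ConnectionManager {
    /// Only propagate a disconnect to the protocol Node if it came from the
    /// currently established task; a late disconnect from an aborted, older
    /// connection for the same peer is ignored.
    fn should_propagate_disconnect(&self, peer: &PeerId, task_id: TaskId) -> bool {
        self.established.get(peer) == Some(&task_id)
    }
}
```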

Member

IIUC, part of Emily's concern is that the request tracker inserted into proxy_tracker here https://github.com/oxidecomputer/omicron/pull/9403/files#r2527260678 is leaked in the case where the channel is disconnected but this method returns ProxyConnState::Connected because the disconnect callback has not fired yet. I think we need to either remove the entry from proxy_tracker in that case or have send indicate whether the task is still there so that we could return ProxyConnState::Disconnected and not insert into proxy_tracker?

Contributor Author

> IIUC, part of Emily's concern is that the request tracker inserted into proxy_tracker here https://github.com/oxidecomputer/omicron/pull/9403/files#r2527260678 is leaked in the case where the channel is disconnected but this method returns ProxyConnState::Connected because the disconnect callback has not fired yet. I think we need to either remove the entry from proxy_tracker in that case or have send indicate whether the task is still there so that we could return ProxyConnState::Disconnected and not insert into proxy_tracker?

That behavior is intentional. There is an inherent TOCTTOU where the message can be put on the channel and then the socket can disconnect. In this case we return Connected and then get a Disconnected callback sometime later to clear the state. This is also exactly what would happen if the message was pulled off the channel, serialized, sent over the socket, and then the channel disconnected. The key invariant to uphold is: if at any time a message is lost, the disconnect callback must fire a short time afterward. No further messages should be able to be sent over the socket.

What makes this work is that the disconnect callback always fires after the tracked request is recorded. We know it hasn't fired yet because the EstablishedTaskHandle is still in the main map, which lives in the same task that is doing the send. Therefore any disconnect will come immediately after the send if the task is gone. If there is no handle in the map then we return disconnected. Note that it's also possible that the connection has already happened but the main task hasn't learned of it yet, so we can discard the request even though we are actually connected already. TOCTTOUs all around.

panic!("Connection to main task channnel full");
}
}
WireMsg::ProxyRequest(req) => {
Member

Nit: all variants of this enum except for WireMsg::Ping have almost the same body. I wonder whether we could reduce the code duplication?

let msg = match msg {
    WireMsg::Ping => continue,
    WireMsg::NetworkConfig => ConnToMainMsgInner::ReceivedNetworkConfig { ... },
};
if let Err(_) = ... {
    ...
}
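
Expanding that sketch slightly, with stand-in enums rather than the real WireMsg/ConnToMainMsgInner definitions, the shape could look like this:

```rust
// Stand-in types only; the real enums have more variants and richer payloads.
use tokio::sync::mpsc;

enum WireMsg {
    Ping,
    NetworkConfig(u64),
    ProxyRequest(String),
}

enum ConnToMainMsgInner {
    ReceivedNetworkConfig(u64),
    ProxyRequestReceived(String),
}

fn forward(main_tx: &mpsc::Sender<ConnToMainMsgInner>, msg: WireMsg) {
    // Translate each wire message into its main-task message once...
    let msg = match msg {
        WireMsg::Ping => return,
        WireMsg::NetworkConfig(cfg) => ConnToMainMsgInner::ReceivedNetworkConfig(cfg),
        WireMsg::ProxyRequest(req) => ConnToMainMsgInner::ProxyRequestReceived(req),
    };
    // ...so the send and its error handling live in exactly one place.
    if main_tx.try_send(msg).is_err() {
        panic!("Connection to main task channel full or closed");
    }
}
```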

let s = proxy.commit(dest, rack_id, Epoch(1)).await.unwrap();

// The first attempt should succeed
assert_eq!(s, CommitStatus::Committed);
Member

We are not joining this task at any point; is there a risk that this assert doesn't get executed?

Ok, I see further down that we wait until the count is increased. Might be worth leaving a comment.
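
For what it's worth, a self-contained illustration of the alternative, keeping the JoinHandle so a failed assert inside the spawned task fails the test (the assertion here is a stand-in for the CommitStatus check above):

```rust
use tokio::task;

#[tokio::test]
async fn assert_in_spawned_task_is_observed() {
    let handle = task::spawn(async move {
        // Stand-in for the CommitStatus assertion in the real test.
        assert_eq!(1 + 1, 2);
    });
    // Awaiting the handle propagates a panic from the assert as a JoinError,
    // so the test fails if the spawned assertion fails.
    handle.await.expect("spawned task panicked");
}
```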

let _ = spawn(async move {
let s = proxy.commit(dest, rack_id, Epoch(1)).await.unwrap_err();

// The first attempt should succeed
Member

Nit: comment is wrong, probably a copy/paste leftover?

request_id,
tx,
);
self.proxy_tracker.insert(req);
Member

If the proxy_request method returns without actually having sent the request to the established connection actor, this will effectively leak memory by storing requests that will never receive a response. See my other comment for how this could happen.

) -> Result<V, ProxyError<E>>
where
E: std::error::Error,
F: FnOnce(Result<WireValue, WireError>) -> Result<V, ProxyError<E>>,
Member

The destructuring function adds a lot of duplication. The only way I can see to remove a lot of this duplication would be to keep WireValue and WireError in a dynamic representation similar to serde_json::Value until this function, and then require V: Deserialize, E: Deserialize, but that adds a lot of other problems :(

Member

Well, another option might be to have a separate channel for each kind of request, and select over all of them; that may not work if we are relying on the channel for ordering messages. But, on the other hand, it does mean that we could use a biased select to prioritize which message types are handled first (which may not be desirable either), and we wouldn't have to do the destructuring thing.
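
A small sketch of what the biased select over per-request-kind channels might look like; the channels, payload types, and priority order here are purely illustrative.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (commit_tx, mut commit_rx) = mpsc::channel::<u64>(8);
    let (status_tx, mut status_rx) = mpsc::channel::<u64>(8);

    commit_tx.send(1).await.unwrap();
    status_tx.send(2).await.unwrap();

    for _ in 0..2 {
        tokio::select! {
            // `biased` polls the arms in order, so commits win when both
            // channels have a message ready.
            biased;
            Some(epoch) = commit_rx.recv() => {
                println!("handle commit for epoch {epoch}");
            }
            Some(id) = status_rx.recv() => {
                println!("handle status request {id}");
            }
        }
    }
}
```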

use bootstore::schemes::v0::NetworkConfig;

use camino::Utf8PathBuf;
use derive_more::From;
Member

woah 👀

Comment on lines +244 to +264
error!(
self.log,
"Failed to send received proxy msg to the main task"
);
panic!("Connection to main task channel full");
}
}
WireMsg::ProxyResponse(rsp) => {
if let Err(_) = self.main_tx.try_send(ConnToMainMsg {
task_id: self.task_id,
msg: ConnToMainMsgInner::ProxyResponseReceived {
from: self.peer_id.clone(),
rsp,
},
}) {
error!(
self.log,
"Failed to send received proxy msg to the main task"
);
panic!("Connection to main task channel full");
}
Member

hmm, I think that the log line and panic messages here should probably include whether the error indicates that the channel is full or that it was disconnected because the main task exited. Assuming that try_send returning an error means the channel is full could probably confuse people while debugging; we shouldn't say "channel full" in the disconnected case.
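
Something like this sketch would make the two failure modes explicit (the payload type is a stand-in for the real message type):

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

fn forward_to_main(main_tx: &mpsc::Sender<String>, msg: String) {
    match main_tx.try_send(msg) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            panic!("Connection to main task channel full");
        }
        Err(TrySendError::Closed(_)) => {
            panic!("Connection to main task channel closed; main task exited");
        }
    }
}
```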

Inner(#[from] T),
#[error("disconnected")]
Disconnected,
#[error("response for different type received: {0}")]
Member

should this perhaps indicate what the expected response type was, or do we only expect to log this in a place where it's obvious what we were expecting?
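
One possible shape, using thiserror with a hypothetical field for the expected variant name; the enum and variant names here are illustrative, not the actual error type:

```rust
use thiserror::Error;

#[derive(Debug, Error)]
enum TrackerError {
    #[error("disconnected")]
    Disconnected,
    // Carry both what we expected and what arrived, so the error is
    // meaningful even when it is logged far from the call site.
    #[error("expected a {expected} response, but received: {got}")]
    WrongResponseType { expected: &'static str, got: String },
}
```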

Comment on lines +310 to +318
self.ops.retain(|mut req| {
if &req.destination == from {
let tx = req.tx.take().unwrap();
let _ = tx.send(Err(TrackerError::Disconnected));
false
} else {
true
}
});
Member

ah, this is a place where it would be nice to have a bijective multimap so we could do this operation without iterating over every in-flight op, isn't it...?

but, this makes sense in the absence of that.
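
A sketch of the secondary-index idea, assuming we keep a map from destination to the request ids in flight for it, so a disconnect only touches that peer's entries. Names are illustrative, not the actual tracker types.

```rust
use std::collections::{HashMap, HashSet};

type PeerId = String;
type RequestId = u64;

#[derive(Default)]
struct Tracker {
    by_id: HashMap<RequestId, PeerId>,
    by_destination: HashMap<PeerId, HashSet<RequestId>>,
}

impl Tracker {
    fn insert(&mut self, id: RequestId, dest: PeerId) {
        self.by_destination.entry(dest.clone()).or_default().insert(id);
        self.by_id.insert(id, dest);
    }

    /// On disconnect, drain only the requests destined for `from`, rather
    /// than scanning every in-flight op.
    fn fail_all_for(&mut self, from: &PeerId) -> Vec<RequestId> {
        let ids = self.by_destination.remove(from).unwrap_or_default();
        for id in &ids {
            self.by_id.remove(id);
        }
        ids.into_iter().collect()
    }
}
```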


/// Proxy API requests to other nodes
#[tokio::test]
pub async fn tq_proxy() {
Member

turbo nit: why is this pub? is it called from elsewhere?
