TQ: Support proxying Nexus-related API requests #9403
Conversation
When committing configurations, the trust quorum protocol relies on Nexus being up and able to interact with each trust quorum node. This allows Nexus to continuously try to commit nodes in an RPW and record in the database which nodes have acked. This mirrors many of our existing designs where Nexus observes and records information about successful operations and retries via RPW. Specifically, it matches how we do things in the `Reconfigurator` and the `TUF Repo Depot`.

Unfortunately, Nexus cannot communicate with new sleds that are not yet running a sled-agent and are still stuck in the bootstrap-agent. This is because the bootstrap agents (and the trust quorum protocol) only communicate over the bootstrap network, which Nexus does not have access to. Nodes must already be part of an existing configuration, running sled-agent, and on the underlay network to talk to Nexus. In this common case, Nexus sends trust quorum related messages to the sled-agent, which then calls the API of its local trust quorum `NodeTask`. This is not possible for newly added sleds. While the trust quorum coordinator node will tell new nodes to `Prepare` a configuration over the bootstrap network, these new nodes have no mechanism to receive commits from Nexus. Therefore we must proxy these commit-related operations to an existing member of the trust quorum when adding a new node. We also added the ability to proxy `NodeStatus` requests to aid in debugging.

This PR therefore adds the ability to proxy certain requests from one node to another so that we can commit nodes to the latest trust quorum configuration, set up their encrypted storage, and boot their sled-agent.

It's worth noting that this is not the only way we could have solved this problem. There are a few possibilities in the design space.

1. We could have had the coordinator always send commit operations and collect acknowledgements, as during the `Prepare` phase. Unfortunately, if the coordinator dies before all nodes ack, then Nexus would not be able to ensure commit at all nodes. To make this reliable, Nexus would still need to be able to reach out to uncommitted nodes and tell them to commit. Since we already have to do the latter, there is no reason to do the former.
2. We could commit at the coordinator (or a few nodes), and then have them gossip around information about the commit. This is actually a promising design, and is essentially what we do for the early network config. Nexus could then wait for the sled-agent to start for those nodes and ask them directly if they committed. This would still require talking to all nodes and it adds some extra complexity, but it still seems somewhat reasonable.

The rationale for our current choice of proxying was largely one of fitting our existing patterns. It's also very useful for Nexus to be able to directly ask a trust quorum node on another sled about its status to diagnose issues. So we went with the proxy mechanism as implemented here.

Well, why did we introduce another level of messages at the `Task` layer instead of re-using the `CommitAdvance` functionality or adding new variants to the `PeerMsg` in the `trust_quorum_protocol` crate? The rationale here is largely that the trust quorum protocol as written in RFD 238 and specified in TLA+ doesn't include this behavior. It expects commits from the `Node` "API", meaning from Nexus. I didn't want to change that behavior unnecessarily, given the urgency and an existing solid design.

It was also easier to build proxy operations this way, since tracking operations in async code with oneshot channels is easier than trying to insert similar tracking into the `sans-io` code. In short, we left the `trust-quorum-protocol` crate alone and added some async helpers to the `trust_quorum` crate.

One additional change was made in this PR. While adding the `tq_proxy` test I noticed that we were unnecessarily using `wait_for_condition` on initial commits, after we knew about successful prepares. These commits should always complete immediately, so I simplified this code in a few existing tests.
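For readers unfamiliar with the pattern, here is a minimal sketch of what tracking in-flight requests with oneshot channels can look like. The names (`ProxyTracker`, `RequestId`) and the `String`-based response type are illustrative only, not the actual `trust_quorum` types:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

type RequestId = u64;

// Hypothetical tracker: one oneshot sender per in-flight proxied request.
struct ProxyTracker {
    ops: HashMap<RequestId, oneshot::Sender<Result<String, String>>>,
}

impl ProxyTracker {
    // Register an outstanding request; the async caller awaits the receiver.
    fn insert(
        &mut self,
        id: RequestId,
    ) -> oneshot::Receiver<Result<String, String>> {
        let (tx, rx) = oneshot::channel();
        self.ops.insert(id, tx);
        rx
    }

    // Complete a request when the proxied response (or an error such as
    // "disconnected") comes back from the remote node.
    fn complete(&mut self, id: RequestId, rsp: Result<String, String>) {
        if let Some(tx) = self.ops.remove(&id) {
            // The caller may have given up and dropped the receiver.
            let _ = tx.send(rsp);
        }
    }
}
```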
```rust
h.send(req).await;
ProxyConnState::Connected
```
In a previous PR I noted that the send() method ignores all failures and returns success (well, it doesn't return a Result) even if the connection broke. For the trust quorum protocol that seemed fine, as the protocol is resilient to messages not being delivered.

The proxy APIs defined in another file expect either a response from the server or a Disconnected error before returning: if they receive nothing they will block forever. After spending probably way too much time thinking about failure cases related to this[^1], I believe that if the message is sent to the established connection actor things will eventually be fine: when send() returns, the caller will add the request to the tracker, and a disconnection detected by the established connection actor will eventually be relayed to the tracker.

The failure case I still see is when the channel is busy (with 10 pending requests) and the send method silently discards the message. In that case we return a ProxyConnState::Connected and the request will be added to the tracker, but we will never get a response unless the connection breaks due to another unrelated request failing (since this request never got to the actor).

The code as is could be fine if the caller of every proxy method takes care of adding timeouts everywhere, but this feels like a problem waiting to happen. I'd feel way more comfortable if this returned a ProxyConnState::Busy when sending a message to the channel failed.

[^1]: I lost count of how many times I rewrote this comment with other failure cases that turned out not to be possible.
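To make the suggestion concrete, here is a minimal sketch of what returning a `Busy` state could look like, assuming a bounded tokio channel; `ProxyConnState::Busy` is the proposed variant, not existing code:

```rust
use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

enum ProxyConnState {
    Connected,
    Disconnected,
    Busy, // proposed: the queue was full and the request was not enqueued
}

fn send_proxy_request<T>(tx: &mpsc::Sender<T>, req: T) -> ProxyConnState {
    match tx.try_send(req) {
        Ok(()) => ProxyConnState::Connected,
        // Channel full: surface it instead of silently dropping the message.
        Err(TrySendError::Full(_)) => ProxyConnState::Busy,
        // Receiver task is gone: the caller should not track this request.
        Err(TrySendError::Closed(_)) => ProxyConnState::Disconnected,
    }
}
```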
This is a great analysis @pietroalbini. Thank you so much for taking the time to go through this process and understand the code and design. I actually think that if this were a problem, it would also be somewhat of a problem for the normal trust quorum protocol, as some messages from peers are never resent on a timer.

For example: if a coordinator is sending a Prepare msg to a peer and the message got dropped before making it to the connection task, it would not be resent unless the channel got disconnected and reconnected for some other reason. Now, that being said, the volume of messages is small and this should not happen. And as you point out, there is some built-in resilience. If the commit occurred, Nexus would prepareAndCommit this node, or it would get a CommitAdvance message on the next restart (perhaps after an update). But it could still end up as a problem if too many nodes did this and the coordinator couldn't complete the prepare phase. Nexus would try a new configuration, perhaps at a different coordinator, after some time without prepare completing, but the system may still be overloaded.

With all that being said, I don't actually think what you pointed out is entirely true, and therefore this isn't actually a problem here. However, this analysis is also non-trivial. It almost makes me question whether using connection state for retries instead of timers is the right move. So far, I think it continues to work and has the benefit of not re-sending messages already sent over a reliable stream. Ok, so back to the problem.

> The failure case I still see is when the channel is busy (with 10 pending requests) and the send method silently discards the message.

The channel being used in send is an mpsc::bounded channel and blocks on send. The error that is discarded occurs when the channel itself gets disconnected, presumably because the task exited. In that case the Disconnect callback will eventually fire and all is well.

To help ensure that the disconnect callback occurs when buffers start filling up, there is also a MSG_WRITE_QUEUE_CAPACITY for each established connection that will disconnect if too many messages are pulled off the channel and serialized before they can be sent. Somewhat importantly, this channel is sized smaller than the queue, so if the queue is full it means that the TCP connection (or serialization) is too slow to move things along. We get backpressure, and eventually a disconnection that should allow things to clear up on a reconnect. I should clean up the TODO suggestion there, as we actually can't drop messages or we will break things, as you point out.
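To illustrate the semantics being described, assuming a tokio bounded channel: `send` applies backpressure rather than dropping, and only errors once the receiving task has gone away.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<u32>(10);

    // With a live receiver, `send` succeeds; if the channel were full it
    // would wait for capacity (backpressure), not fail or drop the message.
    tx.send(1).await.expect("receiver alive; send cannot fail");

    // Once the receiver is dropped (i.e. the task exited), `send` errors.
    // This is the one case the code discards, relying on the Disconnect
    // callback to fire afterward.
    drop(rx);
    assert!(tx.send(2).await.is_err());
}
```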
It appears there is actually one place where the ConnToMainMsgInner::Disconnected message doesn't get sent to the NodeTask task when an EstablishedConnection closes: when the EstablishedConnection task itself panics. I believe that we use panic = abort in production, so this isn't actually a problem in practice.

However, this has me wondering whether I should instead signal to NodeTask from the ConnectionManager::step method when an EstablishedConnection exits, rather than sending a Disconnected message from the task itself. That would cover both the successful-exit and panic cases for the EstablishedConnection task.

Unfortunately, I also realized that there is another race condition that may get worse if I do this. If a new connection is accepted for the same peer, it will trigger an abort of the old connection. In this case the old disconnect will occur after the new connection is established. That could cause problems for the protocol, and I should gate the Node::on_disconnect call by checking whether the task_id matches the current established task ID, as is done for the connection manager's on_disconnected callback.

Another option to solve the latter problem is to always reject a newly accepted connection for the same peer if one is already established. Eventually the old one will go away, and the remote peer will retry.
I need to think about this a bit more, but will likely add a few small cleanup patches.
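A hypothetical sketch of the gating described above; the types are illustrative, but the idea is to ignore a disconnect notification unless it came from the task that currently owns the connection for that peer:

```rust
use std::collections::HashMap;

type PeerId = String;
type TaskId = u64;

struct ConnMgr {
    // The task currently considered established for each peer.
    established: HashMap<PeerId, TaskId>,
}

impl ConnMgr {
    fn on_disconnected(&mut self, peer: &PeerId, task_id: TaskId) {
        // A stale disconnect from an aborted, already-replaced connection
        // task must not tear down the newly accepted connection.
        if self.established.get(peer) == Some(&task_id) {
            self.established.remove(peer);
            // ... relay the disconnect to Node::on_disconnect here ...
        }
    }
}
```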
IIUC, part of Emily's concern is that the request tracker inserted into proxy_tracker here https://github.com/oxidecomputer/omicron/pull/9403/files#r2527260678 is leaked in the case where the channel is disconnected but this method returns ProxyConnState::Connected because the disconnect callback has not fired yet. I think we need to either remove the entry from proxy_tracker in that case or have send indicate whether the task is still there so that we could return ProxyConnState::Disconnected and not insert into proxy_tracker?
> IIUC, part of Emily's concern is that the request tracker inserted into `proxy_tracker` here https://github.com/oxidecomputer/omicron/pull/9403/files#r2527260678 is leaked in the case where the channel is disconnected but this method returns `ProxyConnState::Connected` because the disconnect callback has not fired yet. I think we need to either remove the entry from `proxy_tracker` in that case or have `send` indicate whether the task is still there so that we could return `ProxyConnState::Disconnected` and not insert into `proxy_tracker`?
That behavior is intentional. There is an inherent TOCTTOU where the message can be put on the channel and then the socket can disconnect. In this case we return Connected and then get a Disconnected callback sometime later to clear the state. This is also exactly what would happen if the message was pulled off the channel, serialized, sent over the socket, and then the channel disconnected. The key invariant to uphold is: if at any time a message is lost, the disconnect callback must fire a short time after, and no further messages should be able to be sent over the socket.

What makes this work is that the disconnect callback always fires after the tracked request is recorded. We know it hasn't fired yet because the EstablishedTaskHandle is still in the main map, which lives in the same task that is doing the send. Therefore any disconnect will come immediately after the send if the task is gone. If there is no handle in the map then we return disconnected. Note that it's also possible that the connection has already been re-established but the main task hasn't learned of it yet, so we can discard even if already connected. TOCTTOUs all around.
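A minimal model of that ordering argument, with illustrative names: the handle lookup, the send, and the tracker insert all happen in the main task, so a Disconnected callback, which is processed by that same task, can only run after the tracker entry exists and will therefore always find and fail it.

```rust
use std::collections::HashMap;

type PeerId = String;
struct EstablishedTaskHandle; // stand-in for the real handle type

enum ProxyConnState {
    Connected,
    Disconnected,
}

struct MainTask {
    established: HashMap<PeerId, EstablishedTaskHandle>,
    tracked: Vec<PeerId>, // stand-in for the proxy request tracker
}

impl MainTask {
    fn proxy_request(&mut self, dest: PeerId) -> ProxyConnState {
        // No handle means the connection is already known-dead: fail fast
        // rather than tracking a request that can never complete.
        if !self.established.contains_key(&dest) {
            return ProxyConnState::Disconnected;
        }
        // (The real code sends the request to the connection task here; the
        // socket may die concurrently, which is the inherent TOCTTOU.)
        // Record the request; a later Disconnected callback, handled on this
        // same task, will find this entry and fail it.
        self.tracked.push(dest);
        ProxyConnState::Connected
    }
}
```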
```rust
        panic!("Connection to main task channel full");
    }
}
WireMsg::ProxyRequest(req) => {
```
Nit: all variants of this enum except for `WireMsg::Ping` have almost the same body; I wonder whether we could reduce the code duplication?

```rust
let msg = match msg {
    WireMsg::Ping => continue,
    WireMsg::NetworkConfig => ConnToMainMsgInner::ReceivedNetworkConfig { ... },
};
if let Err(_) = ... {
    ...
}
```

```rust
let s = proxy.commit(dest, rack_id, Epoch(1)).await.unwrap();

// The first attempt should succeed
assert_eq!(s, CommitStatus::Committed);
```
We are not joining this task at any point; is there a risk that this assert doesn't get executed?

Ok, I see that we wait until the count is increased further down. Might be worth leaving a comment.
```rust
let _ = spawn(async move {
    let s = proxy.commit(dest, rack_id, Epoch(1)).await.unwrap_err();

    // The first attempt should succeed
```
Nit: comment is wrong, probably a copy/paste leftover?
```rust
    request_id,
    tx,
);
self.proxy_tracker.insert(req);
```
If the proxy_request method returns without actually having sent the request to the established connection actor, this will effectively leak memory by storing requests that will never receive a response. See my other comment for how this could happen.
```rust
) -> Result<V, ProxyError<E>>
where
    E: std::error::Error,
    F: FnOnce(Result<WireValue, WireError>) -> Result<V, ProxyError<E>>,
```
The destructuring function adds a lot of duplication. The only way I can see to remove a lot of this duplication would be to keep `WireValue` and `WireError` in a dynamic representation similar to `serde_json::Value` until this function, and then require `V: Deserialize, E: Deserialize`, but that adds a lot of other problems :(
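A rough sketch of that dynamic-representation alternative, purely hypothetical: responses stay in a `serde_json::Value`-like form until the edge, where a single generic function deserializes them.

```rust
use serde::de::DeserializeOwned;

#[derive(Debug)]
enum ProxyError<E> {
    Inner(E),
    Disconnected,
    WrongResponseType(String),
}

// One decode function replaces the per-request destructuring closures,
// at the cost of `Deserialize` bounds and runtime decode failures.
fn decode_response<V, E>(raw: serde_json::Value) -> Result<V, ProxyError<E>>
where
    V: DeserializeOwned,
    E: std::error::Error,
{
    serde_json::from_value(raw)
        .map_err(|e| ProxyError::WrongResponseType(e.to_string()))
}
```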
Well, another option might be to have a separate channel for each kind of request, and select over all of them; that may not work if we are relying on the channel for ordering messages. But, on the other hand, it does mean that we could use a biased select to prioritize which message types are handled first (which may not be desirable either), and we wouldn't have to do the destructuring thing.
```rust
use bootstore::schemes::v0::NetworkConfig;

use camino::Utf8PathBuf;
use derive_more::From;
```
woah 👀
```rust
            error!(
                self.log,
                "Failed to send received proxy msg to the main task"
            );
            panic!("Connection to main task channel full");
        }
    }
    WireMsg::ProxyResponse(rsp) => {
        if let Err(_) = self.main_tx.try_send(ConnToMainMsg {
            task_id: self.task_id,
            msg: ConnToMainMsgInner::ProxyResponseReceived {
                from: self.peer_id.clone(),
                rsp,
            },
        }) {
            error!(
                self.log,
                "Failed to send received proxy msg to the main task"
            );
            panic!("Connection to main task channel full");
        }
```
hmm, I think that the log line and panic messages here should probably include whether the error indicates that the channel is full or was disconnected because the main task exited. Assuming that `try_send` returning an error means the channel is full could confuse people while debugging: we shouldn't say "channel full" in the disconnected case.
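For reference, tokio's `try_send` error already distinguishes the two cases, so the message could be derived from it; a sketch:

```rust
use tokio::sync::mpsc::error::TrySendError;

// Map the two distinct try_send failure modes to accurate messages.
fn describe<T>(err: &TrySendError<T>) -> &'static str {
    match err {
        TrySendError::Full(_) => "connection-to-main-task channel is full",
        TrySendError::Closed(_) => "main task exited; channel disconnected",
    }
}
```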
```rust
    Inner(#[from] T),
    #[error("disconnected")]
    Disconnected,
    #[error("response for different type received: {0}")]
```
should this perhaps indicate what the expected response type was, or do we only expect to log this in a place where it's obvious what we were expecting?
```rust
self.ops.retain(|mut req| {
    if &req.destination == from {
        let tx = req.tx.take().unwrap();
        let _ = tx.send(Err(TrackerError::Disconnected));
        false
    } else {
        true
    }
});
```
ah, this is a place where it would be nice to have a bijective multimap so we could do this operation without iterating over every in-flight op, isn't it...?
but, this makes sense in the absence of that.
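In the absence of a bijective multimap, a hand-rolled secondary index would also work; a sketch with illustrative types:

```rust
use std::collections::{HashMap, HashSet};

type PeerId = String;
type RequestId = u64;

#[derive(Default)]
struct Tracker {
    by_id: HashMap<RequestId, PeerId>,
    by_dest: HashMap<PeerId, HashSet<RequestId>>,
}

impl Tracker {
    fn insert(&mut self, id: RequestId, dest: PeerId) {
        self.by_dest.entry(dest.clone()).or_default().insert(id);
        self.by_id.insert(id, dest);
    }

    // On disconnect, fail only the ops destined for `from` instead of
    // scanning every in-flight op.
    fn on_disconnected(&mut self, from: &PeerId) -> Vec<RequestId> {
        let Some(ids) = self.by_dest.remove(from) else {
            return Vec::new();
        };
        for id in &ids {
            self.by_id.remove(id);
        }
        ids.into_iter().collect()
    }
}
```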
```rust
/// Proxy API requests to other nodes
#[tokio::test]
pub async fn tq_proxy() {
```
turbo nit: why is this pub? is it called from elsewhere?