
Drop the ChannelMonitorUpdateStatus::PermanentFailure variant #2562


Merged
merged 11 commits into lightningdevkit:main from the 2023-08-no-perm-fail branch
Sep 21, 2023

Conversation

TheBlueMatt
Collaborator

@TheBlueMatt commented Sep 8, 2023

Based on #2169.

When a ChannelMonitorUpdate fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.

Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.

Users don't generally expect force-closes in most cases, and
luckily with async ChannelMonitorUpdates supported we don't take
any risk by "delaying" the ChannelMonitorUpdate indefinitely.

Thus, here we drop the PermanentFailure variant entirely, making
all failures instead be "the update is in progress, but won't ever
complete", which is equivalent if we do not close the channel
automatically.
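
As a rough sketch of the resulting contract (the variant names match the upstream enum, but the helper below is a hypothetical stand-in, not LDK's actual Persist implementation), an I/O failure now maps to "in progress, retry in the background" rather than to a force-close:

use std::io;

// Stand-in mirroring the variant names of LDK's `ChannelMonitorUpdateStatus`
// after this PR; everything else in this sketch is illustrative only.
#[derive(Debug, PartialEq)]
enum MonitorUpdateStatus {
    Completed,
    InProgress,
}

// Hypothetical mapping from a storage write result to the returned status:
// a failed write is reported as "still in progress" and retried later,
// instead of being escalated into an automatic force-close.
fn status_from_write(res: io::Result<()>) -> MonitorUpdateStatus {
    match res {
        Ok(()) => MonitorUpdateStatus::Completed,
        Err(e) => {
            eprintln!("monitor persistence failed, will retry in background: {e}");
            MonitorUpdateStatus::InProgress
        }
    }
}

fn main() {
    let transient = Err(io::Error::new(io::ErrorKind::Other, "storage backend unreachable"));
    assert_eq!(status_from_write(transient), MonitorUpdateStatus::InProgress);
}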

@tnull mentioned this pull request Sep 8, 2023
res.map(|_| ())
} else {
log_error!(self.logger, "Persisting initial ChannelMonitor failed, implying the funding outpoint was duplicated");
return Err(MsgHandleErrInternal::send_err_msg_no_close(
Contributor

Should we generate an event to notify the user in this case, or is it rare enough that we won't bother?

Collaborator Author

I think for an inbound channel we should not - someone can spam us with these all day and the user doesn't need to care.

if let Ok(persist_state) = monitor_res {
match self.id_to_peer.lock().unwrap().entry(chan.context.channel_id()) {
hash_map::Entry::Occupied(_) => {
return Err(MsgHandleErrInternal::send_err_msg_no_close(
Contributor

Should be unreachable now, right? Could debug_assert!(false)

Collaborator Author

Yes, but post-v2-channels maybe not? In any case I'm actually gonna reorder the checks here anyway.

@TheBlueMatt
Collaborator Author

Gonna wait to address feedback until #2495.

@TheBlueMatt
Collaborator Author

Rebased and addressed comments. Apologies to reviewers: I wrote this, realized I needed to wait on #2495, then when that got closed I opened this, forgetting that I had left a few things to clean up before opening. In any case, this now removes a ton of code in the monitor handling pipeline in ChannelManager as well.

@wpaulino self-requested a review September 11, 2023 19:16
@G8XSU
Contributor

G8XSU commented Sep 11, 2023

I am not sure we should be doing this, IIUC from the overall description of the PR.
Right now, the way async persistence works is that the client should return InProgress and call channel_monitor_updated later.
This places an additional burden on the client to "remember" and ensure that the monitor is eventually persisted after returning InProgress. (I don't see how we will do indefinite retries otherwise.)
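
Roughly, that contract looks like the sketch below (all names here are illustrative stand-ins; in real code the completion is reported via ChainMonitor::channel_monitor_updated with the update's MonitorUpdateId):

use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for the "in progress" status the persister returns.
#[derive(Debug)]
enum MonitorUpdateStatus { InProgress }

// Hypothetical async-style persister: kick the write off in the background,
// return immediately, and report the update id once the write is durable.
fn persist_async(bytes: Vec<u8>, update_id: u64, done: mpsc::Sender<u64>) -> MonitorUpdateStatus {
    thread::spawn(move || {
        // ... write `bytes` to the storage backend, retrying as needed ...
        drop(bytes);
        // The client must remember to signal completion eventually.
        let _ = done.send(update_id);
    });
    MonitorUpdateStatus::InProgress
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let status = persist_async(vec![0u8; 32], 1, tx);
    println!("returned immediately with {status:?}");
    // Later, once the background write finishes, mark the update complete
    // (the real call would be ChainMonitor::channel_monitor_updated).
    let completed = rx.recv().unwrap();
    println!("update {completed} persisted; mark it complete now");
}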

What if I, as a client, don't want to complicate my architecture and don't want to remember any non-persisted updates?

PermanentFailure is a perfectly valid use-case to return after repeated retries have failed. Now LDK might choose to either shut down or force-close. (We can always shut down if we think that is right for this kind of failure.)

Our async is not a future that can fail but more of a "Delayable" IIUC (because there is no channel_monitor_update_failed option).
Removing the failure option from the return makes it more confusing/burdensome for clients, as we seem to be forcing an async interface on people who might not want it.

In a perfect world, wouldn't we want to keep our node running (for chain-watch) and fail all other operations until persistence works?

from Comments: "the node shut down (as if we cannot update stored data we can't do much of anything useful)."

Wouldn't it be better for us to have future control/visibility over this failure rather than expecting the client to call shutdown on this failure?
Why not do this for clients? (We can improve this behaviour later; otherwise we lose all visibility into this.)

which is equivalent if we do not close the channel automatically.

This doesn't seem equivalent from an API perspective; it is async on fail.

@TheBlueMatt
Collaborator Author

TheBlueMatt commented Sep 11, 2023

The big issue with a failure variant here is that we really can't handle it in any reasonable way - we currently force-close the channel, but that is actually incredibly unsafe! If we force-close with the latest monitor state we will broadcast a transaction which may have outputs which are only claimable with the latest monitor state (which the user told us they failed to persist!) We do this under the assumption that maybe the disk caught fire and we should try to get our money back ASAP, but that's certainly not a given - the disk may simply be disconnected (especially true as more folks move towards remote storage).

Really if I/O fails for any reason we need to get a human, there's just not much else we can do in Lightning. I chatted a bit with @domZippilli and we concluded probably all the in-LDK persistence stuff needs to expose some kind of "an error happened, please get a human and probably shut down" notification, but there's not a hell of a lot more we can do.

In the future, we may want to add a Failure variant back into this enum, as you suggest, but implement it totally differently - treat it the same as the InProgress variant but then refuse to talk to peers until the user shuts down or finishes the persist. Until then, I think we still absolutely need to remove the relatively unsafe logic we have today, and adding back in a totally different implementation is pretty orthogonal.

@domZippilli
Contributor

PermanentFailure is a perfectly valid use-case to return after repeated retries have failed. Now LDK might choose to either shut down or force-close. (We can always shut down if we think that is right for this kind of failure.)

@TheBlueMatt did we simply come up with a more complex way of doing this ^ ? In other words, rather than introducing some kind of novel notification, could this PR just change the way PermanentFailure is handled in LDK to be a shutdown (panic?) instead?

@G8XSU
Contributor

G8XSU commented Sep 11, 2023

we really can't handle it in any reasonable way - we currently force-close the channel, but that is actually incredibly unsafe! If we force-close with the latest monitor state we will broadcast a transaction which may have outputs which are only claimable with the latest monitor state (which the user told us they failed to persist!)

If the current handling for PermanentFailure is not satisfactory, we could replace it with shutdown instead of getting rid of the failure return.
If we can't decide what to do, then I don't see how the user can.

It is better for the node to crash-loop until the DB/disk comes back up than to add a constraint that storage must never fail.
Shutdown behaviour will also let the node auto-recover once the user's DB/disk is fixed.

Are shutdown and restart unsafe? AFAIU we wouldn't have committed without the updated persist, and with persistence failing we can't really do much.

In the future, we may want to add a Failure variant back into this enum, as you suggest, but implement it totally differently - treat it the same as the inprogress variant but then refuse to talk to peers until the user shuts down or finishes the persist.

If we keep the failure return, it is easier to improve this in the future without an interface change or impacting customers' architectures.

@TheBlueMatt
Collaborator Author

If the current handling for PermanentFailure is not satisfactory, we could replace it with shutdown instead of getting rid of the failure return. If we can't decide what to do, then I don't see how the user can.

Right, I think we agree here, there's just currently no mechanism in LDK to trigger the user to shut down the node. We can certainly build it (in the form of refusing to talk to peers until a resolution or just giving the user a notification from the I/O driver that we need to shut down), but it doesn't exist today.

It is better for the node to crash-loop until the DB/disk comes back up than to add a constraint that storage must never fail. Shutdown behaviour will also let the node auto-recover once the user's DB/disk is fixed.

I think the API change here kinda encourages that? We could mention in the documentation that the user may want to consider panic'ing to trigger immediate shutdown in the most aggressive way possible :).

Are shutdown and restart unsafe? AFAIU we wouldn't have committed without the updated persist, and with persistence failing we can't really do much.

No, it's safe. As long as we don't get a ChannelMonitorUpdateStatus::Completed, LDK won't take any "irreversible action" based on something we needed to persist, so (at least as of 0.0.118) returning InProgress and never completing won't lose funds; it will just result in eventual force-closure as HTLCs time out.

Even if we skip the part of this PR that drops the PermanentFailure variant, I still think we want to keep the rest (which is the bulk of the changes here) - we really need to stop doing the current force-closure, so we really want to just take the perm-failure variant and handle it the same as InProgress, just with a log_error and then adding some logic around shutting down (somehow).

We should still encourage users to use InProgress and loop forever in the background trying to complete the persist while waiting on their human to fix the system, however.

@G8XSU
Contributor

G8XSU commented Sep 12, 2023

Even if we skip the part of this PR that drops the PermanentFailure variant, I still think we want to keep the rest (which is the bulk of the changes here)

Yes, my only concern is with dropping permanent failure completely.

shut down the node), but it doesn't exist today.

In this rare case, I am in favor of panic! 😅, won't that work? panic in channel-manager

We should still encourage users to use InProgress and loop forever in the background

Not sure, but I think this encourages unsafe behaviour; we are overloading the meaning of InProgress.
Multiple requests would start looping or blocking each other, eventually causing system overload, which is an unnecessary distraction at the time of a fire.

This places an additional burden on the client to "remember" and ensure that the monitor is eventually persisted after returning InProgress.

And in the worst case, this encourages silent failures: if there is a failure, the monitor update is an InProgress persist and I can't handle the failure any other way (I don't want to remember updates). Now the channel is in a disabled state and the user doesn't even know it.

@TheBlueMatt
Collaborator Author

Yes, my only concern is with dropping permanent failure completely.

Right, we need to have a conversation about how to design any such failure case handling. If there's not something that immediately makes sense this late in a release cycle, I'm not sure that having an error code that we can't handle is a great design.

In this rare case, I am in favor of panic! 😅, won't that work? panic in channel-manager

Possibly - I kinda hate the Rust ecosystem here since by default panics actually unwind rather than actually abort, and tokio by default catches the unwind and just continues chugging along (which is more normal behavior in something like a webserver: if a request comes in which causes some bad behavior, fail the request but don't kill the whole server; that doesn't make sense here). We do recommend users set panic = abort, which actually crashes on panics rather than allowing unwinding, but some users won't.

We do panic elsewhere in LDK, and panics in a mutex are kinda "viral" (later attempts to take the lock will also panic when we unwrap the lock), so the node would still (mostly) shut down, just not in the most orderly fashion.
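
A tiny standard-library-only illustration of that "viral" poisoning (nothing LDK-specific here):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let state = Arc::new(Mutex::new(0u32));

    // Panic on another thread while holding the lock, simulating a panic deep
    // inside some locked structure; the unwind poisons the Mutex.
    let state_for_panic = Arc::clone(&state);
    let _ = thread::spawn(move || {
        let _guard = state_for_panic.lock().unwrap();
        panic!("simulated unrecoverable persistence failure");
    })
    .join();

    // Every later attempt to take the lock now fails, so code that does
    // `.lock().unwrap()` panics too and the node grinds to a halt.
    assert!(state.lock().is_err(), "the mutex should be poisoned");
    println!("mutex poisoned: {:?}", state.lock().err());
}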

Not sure, but I think this encourages unsafe behaviour; we are overloading the meaning of InProgress. Multiple requests would start looping or blocking each other, eventually causing system overload, which is an unnecessary distraction at the time of a fire.

It wouldn't grow unbounded/huge, just a few pending updates per open channel; after that we just kinda start locking up and not handling or sending new messages until persistence starts completing again.

And in the worst case, this encourages silent failures: if there is a failure, the monitor update is an InProgress persist and I can't handle the failure any other way (I don't want to remember updates). Now the channel is in a disabled state and the user doesn't even know it.

Users can always list the pending monitor updates and go through all the channelmonitors with pending updates, persist the full current state, and mark them complete. We could even expose some kind of replay-attempt function which just calls Watch::update_channel in a loop for any channels with pending updates.
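
A sketch of that replay loop, with hypothetical helpers standing in for ChainMonitor::list_pending_monitor_updates, ChainMonitor::get_monitor, and ChainMonitor::channel_monitor_updated (none of this is LDK's actual API surface):

use std::thread;
use std::time::Duration;

// Hypothetical stand-ins for the ChainMonitor calls named above.
struct PendingChannel { id: String, update_ids: Vec<u64> }

fn list_channels_with_pending_updates() -> Vec<PendingChannel> { Vec::new() }
fn persist_full_monitor(_channel_id: &str) -> std::io::Result<()> { Ok(()) }
fn mark_updates_completed(_channel_id: &str, _update_ids: &[u64]) {}

// Keep re-persisting the full current monitor state for every channel with
// pending updates until storage recovers, then mark those updates complete.
fn replay_pending_persists() {
    loop {
        let pending = list_channels_with_pending_updates();
        if pending.is_empty() {
            break;
        }
        for channel in &pending {
            match persist_full_monitor(&channel.id) {
                Ok(()) => mark_updates_completed(&channel.id, &channel.update_ids),
                Err(_) => {
                    // Storage is still down; back off and re-list from scratch.
                    thread::sleep(Duration::from_secs(5));
                    break;
                }
            }
        }
    }
}

fn main() {
    replay_pending_persists();
}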

@tnull
Contributor

tnull commented Sep 12, 2023

Possibly - I kinda hate the Rust ecosystem here since by default panics actually unwind rather than actually abort, and tokio by default catches the unwind and just continues chugging along (which is more normal behavior in something like a webserver: if a request comes in which causes some bad behavior, fail the request but don't kill the whole server; that doesn't make sense here).

I'm not sure why it wouldn't make sense here? If correctly implemented, wouldn't it be exactly what we want a server to do, in particular if multiple nodes are handled by one server? I.e., we'd want it to propagate the panic of a failing task sufficiently upwards, catch it there, and restart that branch of tasks belonging to the failing node?

@TheBlueMatt
Collaborator Author

I'm not sure why it wouldn't make sense here? If correctly implemented, wouldn't it be exactly what we want a server to do, in particular if multiple nodes are handled by one server? I.e., we'd want it to propagate the panic of a failing task sufficiently upwards, catch it there, and restart that branch of tasks belonging to the failing node?

Right, I was thinking more the ldk-sample/ldk-node world where you have a single node with potentially many tokio tasks. If you're going the multi-tenancy route you probably do want to catch the panic, you just want to make sure that all the tasks for the given node shut down, rather than having that node limp along half-running.

@domZippilli
Contributor

I think what's key to making this understandable is how InProgress is defined in an "infallible" world (I think "infallible" implies "panic if there's an error", at least that is what we do). Intuitively I expect:

  • Completed: We did it, yay
  • InProgress: We're going to get back to you when it's done. We'll panic if it fails.
  • PermanentFailure: This write isn't going to work, LDK, so please do whatever is responsible and panic for me.

I guess if PF is out of the enum, I'm going to panic myself if there's a write failure. There's nothing else I can really do. If you make this trait method fallible, I would return an Err (which is effectively how I'm treating PF), but would still expect LDK to do the right/safe thing, which is probably a few minimal actions and then panic.

For me, the most confusing possibility is saying InProgress is what you should do for transient errors. Switching to an async mode in order to handle transient errors is a little surprising, and makes me feel more exposed. Knowing that LDK is blocking on your write is very reassuring, because I think (implicitly) it means that channel is locked during that time, and losing that "lock" is much more fraught. In fact, I think transient errors should never be presented to this trait, if the separation of concerns is correct. They should be handled in a storage backend that knows the right amount of retrying/backing off etc., and when to give up. The backend's methods should be definitively fallible to this layer -- they either succeed within parameters, or the storage is now outside of parameters (failing), and we should panic. This is of special interest to me right now, as I think if the backend returns transient errors, it's much harder to write a generic persister implementation of any kind.

You could substitute panic for "signal an orderly shutdown" and I think the only difference is whether one attempts to unwind in some way.

Incidentally, panic-on-fail is how we (c=) handle the other infallible persistence methods, like failing to write the manager/graph/scorer. I think the appeal of signaling an error to LDK is that it can take some sort of action. It sounds like there's actually no safe action to take, so guiding customers to panic when infallible methods fail is probably the right thing? And to use panic=abort, or at least shut down the node the thread belongs to if they're multitenant (though one has to wonder: if a given multitenant process is having trouble writing for one node, would others be expected to work? Do they use different storage systems?).

@G8XSU
Contributor

G8XSU commented Sep 12, 2023

We do panic elsewhere in LDK, and panics in a mutex are kinda "viral" (later attempts to take the lock will also panic when we unwrap the lock), so the node would still (mostly) shut down, just not in the most orderly fashion.

Ok, I was hoping panic would work because we are relying on this in other parts of our code assuming abort functionality; if it doesn't function as per our requirements then ok.

It wouldn't grow unbounded/huge

LDK pending events won't grow unbounded, but I was more concerned about the client's threads waiting on LDK while LDK blocks them, which is unbounded; all other requests to the service will probably be dropped/stuck. (So I probably wouldn't recommend an infinite wait inside persist.)

Users can always list the pending monitor updates and go through all the channelmonitors with pending updates

Yes, I guess people who are using it as an async API can do that.

transient errors should never be presented to this trait

I am a bit confused; we were not discussing exposing transient errors from the trait without retries. The only worry was about what to do after retries are exhausted (or the failure can't be recovered from in a short duration of retrying).

InProgress: We're going to get back to you when it's done. We'll panic if it fails.

From what you describe, I think this supports Matt's thesis to remove permanent failure for now. Because it doesn't seem feasible to implement panic at failure.

In summary, I am ok with the current approach, given we will document for the client to panic/shut down on a non-recoverable failure (somehow).
But I am still slightly confused about how our KVStore implementations will handle it; we have moved this problem (of panic/shutdown) to the KVStore implementation, I guess.
Will we panic? I don't think we can do anything else at that point. (Same goes for MUP and stuff.)

@domZippilli
Contributor

domZippilli commented Sep 12, 2023

transient errors should never be presented to this trait

I am a bit confused; we were not discussing exposing transient errors from the trait without retries. The only worry was about what to do after retries are exhausted (or the failure can't be recovered from in a short duration of retrying).

Very possible, even probable, that I'm the confused one. The opening comment on the PR includes:

When a ChannelMonitorUpdate fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.

Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.

Users don't generally expect force-closes in most cases, and
luckily with async ChannelMonitorUpdates supported we don't take
any risk by "delaying" the ChannelMonitorUpdate indefinitely.

Between that, and comments on #2359, I thought the idea was that we expected implementations of Persist to decide what's transient and what's not, and use the InProgress variant for what is transient. If not, I guess disregard that comment, other than that clarity may be required (design doc?).

InProgress: We're going to get back to you when it's done. We'll panic if it fails.

From what you describe, I think this supports Matt's thesis to remove permanent failure for now. Because it doesn't seem feasible to implement panic at failure.

Fair. FWIW I only suggested panic there because there's no mechanism I know of to call back with a failure. That would also be an option, but AFAIK once you return InProgress, there are only two paths out: call back with MonitorEvent::Completed or exit the program.

@TheBlueMatt
Collaborator Author

Right, I guess we should define "transient errors", here - I assume a storage backend will mostly retry or have a reliable backend such that any error that happens is a "Real Error" - the disk is disconnected, the network is offline, etc. However, those errors are still mostly "transient" in that they will eventually recover - our disk isn't literally on fire, we're not hosted at OVH, etc. Thus, we shouldn't force-close, but also shouldn't really do anything...there's nothing we can do, aside from ensuring we go get a human.

So while I think we could do something fancier in 0.0.118 (disconnect all peers, stop making progress until the user recovers the node, but stay alive, maybe have some kind of Future or other indicator that the user needs to shut down), we kinda have two options in 0.0.117: keep the PermanentFailure variant and just immediately panic, or drop it and maybe improve the documentation that is in this PR to tell the user to panic.

Of the two I marginally prefer the second, as it makes the behavior more visible to the user, though only marginally so as often the user is delegating to the LDK filesystem/sqlite/etc persister, which does the panic for the user. But at least those not using those persisters can choose if they want to panic or mark it in-flight and do their own clean-ish shutdowns.

@G8XSU
Contributor

G8XSU commented Sep 12, 2023

But at least those not using those persisters can choose if they want to panic or mark it in-flight and do their own clean-ish shutdowns.

They could always do that by just using InProgress. And clients must already be using async if they have existing logic for it. (I don't see a reason to mark in-progress and then do a shutdown; if anything we would want to shut down before returning.)

If we want to build some better handling in the near future, I prefer keeping the PermanentFailure variant and just improving its handling later. For now, if immediate panic works, we can do that and maybe make it clear in the docs (or suggest a clean shutdown in case of permanent failure).

This will also be consistent with how MUP or kvstore implementations will handle it currently.

Will leave it to you to decide either method.

@domZippilli
Contributor

Right, I guess we should define "transient errors", here - I assume a storage backend will mostly retry or have a reliable backend such that any error that happens is a "Real Error" - the disk is disconnected, the network is offline, etc. However, those errors are still mostly "transient" in that they will eventually recover - our disk isn't literally on fire, we're not hosted at OVH, etc. Thus, we shouldn't force-close, but also shouldn't really do anything...there's nothing we can do, aside from ensuring we go get a human.

I've heard that naming things is hard. Perhaps this terminology works:

  • Transient error: An expected temporary error of a system dependency with <100% success rate, such as the occasional 503 from a remote storage service. Software can possibly recover from transient errors with techniques like truncated exponential backoff; though, if it cannot, a transient error may "graduate" into an intervention error.
  • Intervention error: An error severe enough that software cannot recover, and human intervention is required to possibly recover the system to known-good state.

I'm saying this for docs -- "intervention" is the key distinction you're making between these two categories of recoverable error. Feels weird to invent a term here, I feel sure I read about this in the SRE book.

You may or may not consider introducing a variant like InterventionRequired to signal "don't force close, or do anything else, we're done until my human shows up." panic!() might be a suitable synonym, though. :)

@tnull
Contributor

tnull commented Sep 13, 2023

If we want to build some better handling in the near future, I prefer keeping the PermanentFailure variant and just improving its handling later. For now, if immediate panic works, we can do that and maybe make it clear in the docs (or suggest a clean shutdown in case of permanent failure).

This will also be consistent with how MUP or kvstore implementations will handle it currently.

FWIW, I'm also leaning to this side, as going forward we should find a solution where it's not the KVStore's responsibility to make decisions about panicking or not. If we need to, LDK should take that responsibility and initiate an orderly shutdown.
If we make it its responsibility now and Persist/KVStore implementations are required to panic instead of returning errors under certain circumstances, it may introduce a chance for further bugs/misunderstandings if we renege on the requirements after a release or two.

@TheBlueMatt
Collaborator Author

I don't see a reason to mark in-progress and then do a shutdown; if anything we would want to shut down before returning

I think my point here is that you can't generally do a shutdown in the middle of a sync call stack (aside from panic). If you want to do a "semi-orderly" shutdown (even if your persistence is borked) you probably need to trigger some other code, then return so that the ChannelManager can quiet down.

If we want to build some better handling in the near future, I prefer keeping the PermanentFailure variant and just improving its handling later.

I've heard that naming things is hard. Perhaps this terminology works:

Mmm, okay, I'm fine with that naming and leaving it as a failure - can re-add an "InterventionRequiredError" variant to the enum which would cause a panic. We could then leave InProgress and note in the docs that this also represents a "transient error" and that in such a case users should keep retrying to persist, utilizing the ChainMonitor's list-pending-persists method and doing a full persistence.

@G8XSU
Contributor

G8XSU commented Sep 14, 2023

Intervention error: An error severe enough that software cannot recover, and human intervention is required to possibly recover the system to known-good state.

I am ok with either, but could this be "UnrecoverableError"?
It is not clear how a user might or might not intervene. On a mobile node, the developer cannot intervene; on a server it probably means a restart. And it might look strange to throw InterventionRequired around in our code multiple times on unexpected behavior and cases.
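
For reference, a rough sketch of the enum shape this lands on (the variant names follow what was later adopted in this PR; the doc comments paraphrase this thread rather than the exact LDK docs):

/// Sketch of the post-PR status enum, not the verbatim LDK definition.
#[allow(dead_code)]
enum ChannelMonitorUpdateStatus {
    /// The update (including any relevant fsync) completed; normal channel
    /// operation can continue.
    Completed,
    /// The update is persisting in the background, or failed transiently and
    /// is being retried; completion is signalled later.
    InProgress,
    /// Persistence failed and cannot be retried; LDK reacts by panicking so
    /// that no further channel operations proceed.
    UnrecoverableError,
}

fn main() {}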

@G8XSU
Contributor

G8XSU commented Sep 14, 2023

"InProgress and note in the docs that this also represents a "transient error" and that in such a case users should keep retrying to persist, utilizing the ChainMonitor's list-pending-persists method"

This part is still confusing.

It is either of:

  • throw UnrecoverableError if we can't recover from transient error by retries or such.

  • OR, "return InProgress and keep retrying to persist, utilizing the ChainMonitor's list-pending-persists method."

The second recommendation makes more sense for clients using async IMO, but of course it can be followed by anyone.

@TheBlueMatt
Collaborator Author

Squashed all the commits, the doc changes since yesterday in total are:

$ git diff-tree -U3 d168ba6c83be92009b7a59d4ead6f742f1891b51 b0ff3050
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index df56bda11..268b29257 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -78,34 +78,48 @@ impl MonitorUpdateId {
 /// `Persist` defines behavior for persisting channel monitors: this could mean
 /// writing once to disk, and/or uploading to one or more backup services.
 ///
-/// Each method can return three possible values:
-///  * If persistence (including any relevant `fsync()` calls) happens immediately, the
-///    implementation should return [`ChannelMonitorUpdateStatus::Completed`], indicating normal
-///    channel operation should continue.
+/// Persistence can happen in one of two ways - synchronously completing before the trait method
+/// calls return or asynchronously in the background.
 ///
-///  * If persistence happens asynchronously, implementations can return
-///    [`ChannelMonitorUpdateStatus::InProgress`] while the update continues in the background.
-///    Once the update completes, [`ChainMonitor::channel_monitor_updated`] should be called with
-///    the corresponding [`MonitorUpdateId`].
+/// # For those implementing synchronous persistence
 ///
-///    Note that unlike the direct [`chain::Watch`] interface,
-///    [`ChainMonitor::channel_monitor_updated`] must be called once for *each* update which occurs.
+///  * If persistence completes fully (including any relevant `fsync()` calls), the implementation
+///    should return [`ChannelMonitorUpdateStatus::Completed`], indicating normal channel operation
+///    should continue.
 ///
-///    If persistence fails for some reason, implementations should consider returning
-///    [`ChannelMonitorUpdateStatus::InProgress`] and attempt to shut down or retry the persistence
-///    operation in the background through [`ChainMonitor::list_pending_monitor_updates`] and
+///  * If persistence fails for some reason, implementations should consider returning
+///    [`ChannelMonitorUpdateStatus::InProgress`] and retry all pending persistence operations in
+///    the background with [`ChainMonitor::list_pending_monitor_updates`] and
 ///    [`ChainMonitor::get_monitor`].
 ///
 ///    Once a full [`ChannelMonitor`] has been persisted, all pending updates for that channel can
 ///    be marked as complete via [`ChainMonitor::channel_monitor_updated`].
 ///
+///    If at some point no further progress can be made towards persisting the pending updates, the
+///    node should simply shut down.
+///
 ///  * If the persistence has failed and cannot be retried,
 ///    [`ChannelMonitorUpdateStatus::UnrecoverableError`] can be used, though this will result in
 ///    an immediate panic and future operations in LDK generally failing.
 ///
-/// Third-party watchtowers may be built as a part of an implementation of this trait, with the
-/// advantage that you can control whether to resume channel operation depending on if an update
-/// has been persisted to a watchtower. For this, you may find the following methods useful:
+/// # For those implementing asynchronous persistence
+///
+///  All calls should generally spawn a background task and immediately return
+///  [`ChannelMonitorUpdateStatus::InProgress`]. Once the update completes,
+///  [`ChainMonitor::channel_monitor_updated`] should be called with the corresponding
+///  [`MonitorUpdateId`].
+///
+///  Note that unlike the direct [`chain::Watch`] interface,
+///  [`ChainMonitor::channel_monitor_updated`] must be called once for *each* update which occurs.
+///
+///  If at some point no further progress can be made towards persisting a pending update, the node
+///  should simply shut down.
+///
+/// # Using remote watchtowers
+///
+/// Watchtowers may be updated as a part of an implementation of this trait, utilizing the async
+/// update process described above while the watchtower is being updated. The following methods are
+/// provided for bulding transactions for a watchtower:
 /// [`ChannelMonitor::initial_counterparty_commitment_tx`],
 /// [`ChannelMonitor::counterparty_commitment_txs_from_update`],
 /// [`ChannelMonitor::sign_to_local_justice_tx`], [`TrustedCommitmentTransaction::revokeable_output_index`],
$ 

/// [`ChannelMonitorUpdateStatus::PermanentFailure`], in which case the channel will likely be
/// closed without broadcasting the latest state. See
/// [`ChannelMonitorUpdateStatus::PermanentFailure`] for more details.
/// * If persistence fails for some reason, implementations should consider returning
Contributor

If persistence fails for some reason:

Currently, this gives us two options:

  1. either return UnrecoverableError after retries have been exhausted,
    OR
  2. return InProgress and retry out of band using list_pending_monitor_updates.

I would prefer the first option to be mentioned before the second one, since that is what sync really means: taking the decision then and there. I would also suggest removing the second option from sync, since the user can use async persistence at zero extra cost if they already have a process to retry out of band. Otherwise, we are mixing the two.

My thought process for this is: if someone is server-side and persistence has failed after exhausted retries, then there is really something wrong with my high-availability (99.99%) persistence (infra outage). I am ok with shutting down instead of creating a separate process for retrying async, to keep things simple.
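
A sketch of that first option under those assumptions (the status names follow this thread; try_write and the retry policy are hypothetical, not LDK API):

use std::io;
use std::thread;
use std::time::Duration;

// Stand-in for the returned status; only the two variants relevant here.
enum MonitorUpdateStatus { Completed, UnrecoverableError }

fn try_write(_bytes: &[u8]) -> io::Result<()> {
    // ... the actual write + fsync against the storage backend ...
    Ok(())
}

// Synchronous persister: retry with truncated exponential backoff and only
// give up (which makes LDK panic) once the retry budget is exhausted.
fn persist_sync_with_retries(bytes: &[u8]) -> MonitorUpdateStatus {
    let mut delay = Duration::from_millis(100);
    for _ in 0..5 {
        if try_write(bytes).is_ok() {
            return MonitorUpdateStatus::Completed;
        }
        thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_secs(10));
    }
    MonitorUpdateStatus::UnrecoverableError
}

fn main() {
    match persist_sync_with_retries(&[0u8; 32]) {
        MonitorUpdateStatus::Completed => println!("persisted"),
        MonitorUpdateStatus::UnrecoverableError => println!("retries exhausted; LDK will panic on this status"),
    }
}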

Collaborator Author

I kinda prefer this order because we prefer users to do more retries in the background, but I understand many folks won't really bother with that because it's hard :)

Contributor

Sounds good, it's not that hard, just additional complexity some may want to avoid. :)

/// # For those implementing asynchronous persistence
///
/// All calls should generally spawn a background task and immediately return
/// [`ChannelMonitorUpdateStatus::InProgress`]. Once the update completes,
Contributor

Persistence can still fail in async; a list_pending_monitor_updates retrying process is necessary in this case as well.

In fact, if a user has a list_pending_monitor_updates retrying process, they should just use async.

Collaborator Author

I'm not quite sure I understand this - the documentation mentions how to handle failure in both cases.

Contributor

@G8XSU Sep 21, 2023

No, it only mentions "Once the update completes". There is still a case where the update failed, and the client needs to have a mechanism to

retry all pending persistence operations in the background with [`ChainMonitor::list_pending_monitor_updates`] and [`ChainMonitor::get_monitor`].

while using async persistence, depending on how their async works.

@TheBlueMatt force-pushed the 2023-08-no-perm-fail branch 2 times, most recently from 80e41b1 to 914c299 on September 21, 2023 18:39
@TheBlueMatt
Collaborator Author

Pushed some (squashed) grammar/spelling fixes:

$ git diff-tree -U1 b0ff3050 914c2999
diff --git a/lightning-persister/src/fs_store.rs b/lightning-persister/src/fs_store.rs
index 4a70d238c..42b28018f 100644
--- a/lightning-persister/src/fs_store.rs
+++ b/lightning-persister/src/fs_store.rs
@@ -438,3 +438,3 @@ mod tests {
 	// Test that if the store's path to channel data is read-only, writing a
-	// monitor to it results in the store returning a InProgress.
+	// monitor to it results in the store returning an InProgress.
 	// Windows ignores the read-only flag for folders, so this test is Unix-only.
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 268b29257..f41512b14 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -100,3 +100,3 @@ impl MonitorUpdateId {
 ///
-///  * If the persistence has failed and cannot be retried,
+///  * If the persistence has failed and cannot be retried further (e.g. because of some timeout),
 ///    [`ChannelMonitorUpdateStatus::UnrecoverableError`] can be used, though this will result in
diff --git a/lightning/src/ln/chanmon_update_fail_tests.rs b/lightning/src/ln/chanmon_update_fail_tests.rs
index dc4b3d57a..dc9bbd3e4 100644
--- a/lightning/src/ln/chanmon_update_fail_tests.rs
+++ b/lightning/src/ln/chanmon_update_fail_tests.rs
@@ -106,3 +106,3 @@ fn test_monitor_and_persister_update_fail() {

-				// Apply the montior update to the original ChainMonitor, ensuring the
+				// Apply the monitor update to the original ChainMonitor, ensuring the
 				// ChannelManager and ChannelMonitor aren't out of sync.
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 93e8f617d..8b524d8f3 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -928,3 +928,3 @@ where
 /// [`chain::Watch::watch_channel`]/[`update_channel`] or before completing async writes. With
-/// ChannelManagers, writing updates happens out-of-band (and will prevent any other
+/// `ChannelManager`s, writing updates happens out-of-band (and will prevent any other
 /// `ChannelManager` operations from occurring during the serialization process). If the
$ 

tnull previously approved these changes Sep 21, 2023
Contributor

@tnull left a comment

LGTM, happy to have any further (documentation) changes happen in follow-ups.

@valentinewallace
Contributor

CI needs a fix.

When a `ChannelMonitorUpdate` fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.

Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.

Users don't generally expect force-closes in most cases, and
luckily with async `ChannelMonitorUpdate`s supported we don't take
any risk by "delaying" the `ChannelMonitorUpdate` indefinitely.

Thus, here we drop the `PermanentFailure` variant entirely, making
all failures instead be "the update is in progress, but won't ever
complete", which is equivalent if we do not close the channel
automatically.

Now that `PermanentFailure` is not a possible return value, we can
simply remove handling of it in `ChannelMonitor`.

The `MonitorEvent::CommitmentTxConfirmed` has always been a result
of us force-closing the channel, not the counterparty doing so.
Thus, it was always a bit of a misnomer. Worse, it carried over
into the channel's `ClosureReason` in the event API.

Here we simply rename it and use the proper `ClosureReason`.

Since we now (almost) support async monitor update persistence, the
documentation on `ChannelMonitorUpdateStatus` can be updated to no
longer suggest users must keep a local copy that persists before
returning. However, because there are still a few remaining issues,
we note that async support is currently beta and explicitly warn of
potential for funds-loss.

Fixes lightningdevkit#1684

In the previous commit various dead code was removed. Here we
finish that cleanup by removing unnecessary indentation and syntax.

Now that `handle_new_monitor_update` can no longer return an error,
it similarly no longer needs any code to handle errors. Here we
remove that code, substantially reducing macro variants.

In general, doc comments on trait impl blocks are not very visible
in rustdoc output, and unless they provide useful information they
should be elided.

Here we drop useless doc comments on `ChainMonitor`'s `Watch` impl
methods.

While there is no great way to handle a true failure to persist a
`ChannelMonitorUpdate`, it is confusing for users for there to be
no error variant at all on an I/O operation.

Thus, here we re-add the error variant removed over the past
handful of commits, but rather than handle it in a truly unsafe
way, we simply panic, optimizing for maximum mutex poisoning to
ensure any future operations fail and return immediately.

In the future, we may consider changing the handling of this to
instead set some "disconnect all peers and fail all operations"
bool to give the user a better chance to shutdown in a semi-orderly
fashion, but there's only so much that can be done in lightning if
we truly cannot persist new updates.
@TheBlueMatt
Collaborator Author

Grrrr, sorry about that. Somehow the UnwindSafe trait isn't exposed in core in our MSRV; fixed the CI issues:

$ git diff-tree -U1 914c2999 f254c565
diff --git a/fuzz/src/chanmon_consistency.rs b/fuzz/src/chanmon_consistency.rs
index b6d41fb99..6a2007165 100644
--- a/fuzz/src/chanmon_consistency.rs
+++ b/fuzz/src/chanmon_consistency.rs
@@ -140,3 +140,3 @@ impl TestChainMonitor {
 impl chain::Watch<TestChannelSigner> for TestChainMonitor {
-	fn watch_channel(&self, funding_txo: OutPoint, monitor: channelmonitor::ChannelMonitor<TestChannelSigner>) -> chain::ChannelMonitorUpdateStatus {
+	fn watch_channel(&self, funding_txo: OutPoint, monitor: channelmonitor::ChannelMonitor<TestChannelSigner>) -> Result<chain::ChannelMonitorUpdateStatus, ()> {
 		let mut ser = VecWriter(Vec::new());
@@ -502,3 +502,3 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out) {
 				assert_eq!(chain_monitor.chain_monitor.watch_channel(funding_txo, mon),
-					ChannelMonitorUpdateStatus::Completed);
+					Ok(ChannelMonitorUpdateStatus::Completed));
 			}
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index f41512b14..b6909cb3e 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -1034,2 +1034,3 @@ mod tests {
 	#[test]
+	#[cfg(feature = "std")]
 	fn update_during_chainsync_poisons_channel() {
diff --git a/lightning/src/ln/functional_test_utils.rs b/lightning/src/ln/functional_test_utils.rs
index 9a5a3466a..5a1a21e29 100644
--- a/lightning/src/ln/functional_test_utils.rs
+++ b/lightning/src/ln/functional_test_utils.rs
@@ -424,4 +424,6 @@ pub struct Node<'chan_man, 'node_cfg: 'chan_man, 'chan_mon_cfg: 'node_cfg> {
 }
-impl<'a, 'b, 'c> core::panic::UnwindSafe for Node<'a, 'b, 'c> {}
-impl<'a, 'b, 'c> core::panic::RefUnwindSafe for Node<'a, 'b, 'c> {}
+#[cfg(feature = "std")]
+impl<'a, 'b, 'c> std::panic::UnwindSafe for Node<'a, 'b, 'c> {}
+#[cfg(feature = "std")]
+impl<'a, 'b, 'c> std::panic::RefUnwindSafe for Node<'a, 'b, 'c> {}
 impl<'a, 'b, 'c> Node<'a, 'b, 'c> {
$ 

$self.get_channel_update_for_broadcast(&$chan).ok(), $chan.context.get_value_satoshis()));
$remove;
res
},
ChannelMonitorUpdateStatus::Completed => {
$completed;
Ok(true)
Contributor

channelmanager.rs L3379 has a reference to permanent monitor failure

@@ -470,7 +470,7 @@ mod tests {
index: 0
Contributor

10 lines up is a reference to permanent failure, though I suppose it's not really incorrect per se. Also in the test name.

@@ -436,7 +436,7 @@ mod tests {
}

// Test that if the store's path to channel data is read-only, writing a
// monitor to it results in the store returning a PermanentFailure.
// monitor to it results in the store returning an InProgress.
Contributor

Should be UnrecoverableError now

/// If at some point no further progress can be made towards persisting the pending updates, the
/// node should simply shut down.
///
/// * If the persistence has failed and cannot be retried further (e.g. because of some timeout),
Contributor

nit(non-blocking): timeout is generally transient and retry should fix that

instead it should be "e.g. because of some outage"

@TheBlueMatt merged commit 4f4e84e into lightningdevkit:main Sep 21, 2023
@TheBlueMatt
Collaborator Author

Opened a followup PR at #2591

tnull added a commit that referenced this pull request Sep 29, 2023
@@ -134,7 +134,7 @@ pub enum MonitorEvent {
HTLCEvent(HTLCUpdate),

/// A monitor event that the Channel's commitment transaction was confirmed.
Contributor

Is this piece of docs still correct?

Collaborator Author

Ah, good point, it should be changed, yes!

Collaborator Author

Fixed at #2668
