
Drop the ChannelMonitorUpdateStatus::PermanentFailure variant #2562


Merged
merged 11 commits into lightningdevkit:main from the 2023-08-no-perm-fail branch
Sep 21, 2023

Conversation

TheBlueMatt
Collaborator

@TheBlueMatt commented Sep 8, 2023

Based on #2169.

When a ChannelMonitorUpdate fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.

Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.

Users don't generally expect force-closes in most cases, and
luckily with async ChannelMonitorUpdates supported we don't take
any risk by "delaying" the ChannelMonitorUpdate indefinitely.

Thus, here we drop the PermanentFailure variant entirely, making
all failures instead be "the update is in progress, but won't ever
complete", which is equivalent if we do not close the channel
automatically.
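
As a rough sketch of the resulting contract (the variant names match the upstream enum, but the helper below is a hypothetical stand-in, not LDK's actual Persist implementation), an I/O failure now maps to "in progress, retry in the background" rather than to a force-close:

use std::io;

// Stand-in mirroring the variant names of LDK's `ChannelMonitorUpdateStatus`
// after this PR; everything else in this sketch is illustrative only.
#[derive(Debug, PartialEq)]
enum MonitorUpdateStatus {
    Completed,
    InProgress,
}

// Hypothetical mapping from a storage write result to the returned status:
// a failed write is reported as "still in progress" and retried later,
// instead of being escalated into an automatic force-close.
fn status_from_write(res: io::Result<()>) -> MonitorUpdateStatus {
    match res {
        Ok(()) => MonitorUpdateStatus::Completed,
        Err(e) => {
            eprintln!("monitor persistence failed, will retry in background: {e}");
            MonitorUpdateStatus::InProgress
        }
    }
}

fn main() {
    let transient = Err(io::Error::new(io::ErrorKind::Other, "storage backend unreachable"));
    assert_eq!(status_from_write(transient), MonitorUpdateStatus::InProgress);
}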

@tnull mentioned this pull request Sep 8, 2023
res.map(|_| ())
} else {
log_error!(self.logger, "Persisting initial ChannelMonitor failed, implying the funding outpoint was duplicated");
return Err(MsgHandleErrInternal::send_err_msg_no_close(
Contributor

Should we generate an event to notify the user in this case, or is it rare enough that we won't bother?

Collaborator Author

I think for an inbound channel we should not - someone can spam us with these all day and the user doesn't need to care.

if let Ok(persist_state) = monitor_res {
match self.id_to_peer.lock().unwrap().entry(chan.context.channel_id()) {
hash_map::Entry::Occupied(_) => {
return Err(MsgHandleErrInternal::send_err_msg_no_close(
Contributor

Should be unreachable now, right? Could debug_assert!(false)

Collaborator Author

Yes, but post-v2-channels maybe not? In any case I'm actually gonna reorder the checks here anyway.

@TheBlueMatt
Collaborator Author

Gonna wait to address feedback until #2495.

@TheBlueMatt
Collaborator Author

Rebased and addressed comments. Apologies to reviewers: I wrote this, realized I needed to wait on #2495, then when that got closed I opened this, forgetting that I had left a few things to clean up before opening. In any case, this now removes a ton of code in the monitor handling pipeline in ChannelManager as well.

@wpaulino self-requested a review September 11, 2023 19:16
@G8XSU
Contributor

G8XSU commented Sep 11, 2023

I am not sure we should be doing this, IIUC from the overall description of the PR.
Right now, the way async persistence works is that the client should return InProgress and call channel_monitor_updated later.
This places an additional burden on the client to "remember" and ensure that the monitor is eventually persisted after returning InProgress. (I don't see how we will do indefinite retries otherwise.)
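
Roughly, that contract looks like the sketch below (all names here are illustrative stand-ins; in real code the completion is reported via ChainMonitor::channel_monitor_updated with the update's MonitorUpdateId):

use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for the "in progress" status the persister returns.
#[derive(Debug)]
enum MonitorUpdateStatus { InProgress }

// Hypothetical async-style persister: kick the write off in the background,
// return immediately, and report the update id once the write is durable.
fn persist_async(bytes: Vec<u8>, update_id: u64, done: mpsc::Sender<u64>) -> MonitorUpdateStatus {
    thread::spawn(move || {
        // ... write `bytes` to the storage backend, retrying as needed ...
        drop(bytes);
        // The client must remember to signal completion eventually.
        let _ = done.send(update_id);
    });
    MonitorUpdateStatus::InProgress
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let status = persist_async(vec![0u8; 32], 1, tx);
    println!("returned immediately with {status:?}");
    // Later, once the background write finishes, mark the update complete
    // (the real call would be ChainMonitor::channel_monitor_updated).
    let completed = rx.recv().unwrap();
    println!("update {completed} persisted; mark it complete now");
}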

What if I, as a client, don't want to complicate my architecture and don't want to remember any non-persisted updates?

PermanentFailure is a perfectly valid use-case to return after repeated retries have failed. Now LDK might choose to either shut down or force-close. (We can always shut down if we think that is right for this kind of failure.)

Our async is not a future that can fail but more of a "Delayable" IIUC (because there is no channel_monitor_update_failed option).
Removing the failure option from the return makes it more confusing/burdensome for clients, as we seem to be forcing an async interface on people who might not want it.

In a perfect world, wouldn't we want to keep our node running (for chain-watch) and fail all other operations until persistence works?

from Comments: "the node shut down (as if we cannot update stored data we can't do much of anything useful)."

Wouldn't it be better for us to have future control/visibility over this failure rather than expecting the client to call shutdown on this failure?
Why not do this for clients? (We can improve this behaviour later; otherwise we lose all visibility into this.)

which is equivalent if we do not close the channel automatically.

This doesn't seem equivalent from an API perspective; it is async on fail.

@TheBlueMatt
Collaborator Author

TheBlueMatt commented Sep 11, 2023

The big issue with a failure variant here is that we really can't handle it in any reasonable way - we currently force-close the channel, but that is actually incredibly unsafe! If we force-close with the latest monitor state we will broadcast a transaction which may have outputs which are only claimable with the latest monitor state (which the user told us they failed to persist!) We do this under the assumption that maybe the disk caught fire and we should try to get our money back ASAP, but that's certainly not a given - the disk may simply be disconnected (especially true as more folks move towards remote storage).

Really if I/O fails for any reason we need to get a human, there's just not much else we can do in Lightning. I chatted a bit with @domZippilli and we concluded probably all the in-LDK persistence stuff needs to expose some kind of "an error happened, please get a human and probably shut down" notification, but there's not a hell of a lot more we can do.

In the future, we may want to add a Failure variant back into this enum, as you suggest, but implement it totally differently - treat it the same as the InProgress variant but then refuse to talk to peers until the user shuts down or finishes the persist. Until then, I think we still absolutely need to remove the relatively unsafe logic we have today, and adding back in a totally different implementation is pretty orthogonal.

@domZippilli
Contributor

PermanentFailure is a perfectly valid use-case to return after repeated retries have failed. Now LDK might choose to either shut down or force-close. (We can always shut down if we think that is right for this kind of failure.)

@TheBlueMatt did we simply come up with a more complex way of doing this ^ ? In other words, rather than introducing some kind of novel notification, could this PR just change the way PermanentFailure is handled in LDK to be a shutdown (panic?) instead?

@G8XSU
Contributor

G8XSU commented Sep 11, 2023

we really can't handle it in any reasonable way - we currently force-close the channel, but that is actually incredibly unsafe! If we force-close with the latest monitor state we will broadcast a transaction which may have outputs which are only claimable with the latest monitor state (which the user told us they failed to persist!)

If the current handling for PermanentFailure is not satisfactory, we could replace it with shutdown instead of getting rid of the failure return.
If we can't decide what to do, then I don't see how the user can.

It is better for the node to crash-loop until the DB/disk comes back up than to add a constraint that storage must never fail.
Shutdown behaviour will also let the node auto-recover once the user's DB/disk is fixed.

Are shutdown and restart unsafe? AFAIU we wouldn't have committed without the updated persist, and with persistence failing we can't really do much.

In the future, we may want to add a Failure variant back into this enum, as you suggest, but implement it totally differently - treat it the same as the inprogress variant but then refuse to talk to peers until the user shuts down or finishes the persist.

If we keep the failure return, it is easier to improve this in the future without an interface change or impacting customers' architectures.

@TheBlueMatt
Collaborator Author

If the current handling for PermanentFailure is not satisfactory, we could replace it with shutdown instead of getting rid of the failure return. If we can't decide what to do, then I don't see how the user can.

Right, I think we agree here, there's just currently no mechanism in LDK to trigger the user to shut down the node. We can certainly build it (in the form of refusing to talk to peers until a resolution or just giving the user a notification from the I/O driver that we need to shut down), but it doesn't exist today.

It is better for the node to crash-loop until the DB/disk comes back up than to add a constraint that storage must never fail. Shutdown behaviour will also let the node auto-recover once the user's DB/disk is fixed.

I think the API change here kinda encourages that? We could mention in the documentation that the user may want to consider panic'ing to trigger immediate shutdown in the most aggressive way possible :).

Are shutdown and restart unsafe? AFAIU we wouldn't have committed without the updated persist, and with persistence failing we can't really do much.

No, it's safe. As long as we don't get a ChannelMonitorUpdateStatus::Completed, LDK won't take any "irreversible action" based on something we needed to persist, so (at least as of 0.0.118) returning InProgress and never completing won't lose funds; it will just result in eventual force-closure as HTLCs time out.

Even if we skip the part of this PR that drops the PermanentFailure variant, I still think we want to keep the rest (which is the bulk of the changes here) - we really need to stop doing the current force-closure, so we really want to just take the perm-failure variant and handle it the same as InProgress, just with a log_error and then adding some logic around shutting down (somehow).

We should still encourage users to use InProgress and loop forever in the background trying to complete the persist while waiting on their human to fix the system, however.

@G8XSU
Contributor

G8XSU commented Sep 12, 2023

Even if we skip the part of this PR that drops the PermanentFailure variant, I still think we want to keep the rest (which is the bulk of the changes here)

Yes, my only concern is with dropping permanent failure completely.

shut down the node), but it doesn't exist today.

In this rare case, I am in favor of panic! 😅, won't that work? panic in channel-manager

We should still encourage users to use InProgress and loop forever in the background

Not sure, but I think this encourages unsafe behaviour; we are overloading the meaning of InProgress.
Multiple requests would start looping or blocking each other, eventually causing system overload, which is an unnecessary distraction at the time of a fire.

This places an additional burden on the client to "remember" and ensure that the monitor is eventually persisted after returning InProgress.

And in the worst case, this encourages silent failures: if there is a failure, the monitor update is an InProgress persist and I can't handle the failure any other way (I don't want to remember updates). Now the channel is in a disabled state and the user doesn't even know it.

@TheBlueMatt
Collaborator Author

Yes, my only concern is with dropping permanent failure completely.

Right, we need to have a conversation about how to design any such failure case handling. If there's not something that immediately makes sense this late in a release cycle, I'm not sure that having an error code that we can't handle is a great design.

In this rare case, I am in favor of panic! 😅, won't that work? panic in channel-manager

Possibly - I kinda hate the Rust ecosystem here since by default panics actually unwind rather than actually abort, and tokio by default catches the unwind and just continues chugging along (which is more normal behavior in something like a webserver: if a request comes in which causes some bad behavior, fail the request but don't kill the whole server; that doesn't make sense here). We do recommend users set panic = abort, which actually crashes on panics rather than allowing unwinding, but some users won't.

We do panic elsewhere in LDK, and panics in a mutex are kinda "viral" (later attempts to take the lock will also panic when we unwrap the lock), so the node would still (mostly) shut down, just not in the most orderly fashion.
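
A tiny standard-library-only illustration of that "viral" poisoning (nothing LDK-specific here):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let state = Arc::new(Mutex::new(0u32));

    // Panic on another thread while holding the lock, simulating a panic deep
    // inside some locked structure; the unwind poisons the Mutex.
    let state_for_panic = Arc::clone(&state);
    let _ = thread::spawn(move || {
        let _guard = state_for_panic.lock().unwrap();
        panic!("simulated unrecoverable persistence failure");
    })
    .join();

    // Every later attempt to take the lock now fails, so code that does
    // `.lock().unwrap()` panics too and the node grinds to a halt.
    assert!(state.lock().is_err(), "the mutex should be poisoned");
    println!("mutex poisoned: {:?}", state.lock().err());
}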

Not sure, but I think this encourages unsafe behaviour; we are overloading the meaning of InProgress. Multiple requests would start looping or blocking each other, eventually causing system overload, which is an unnecessary distraction at the time of a fire.

It wouldn't grow unbounded/huge, just a few pending updates per open channel; after that we just kinda start locking up and not handling or sending new messages until persistence starts completing again.

And in the worst case, this encourages silent failures: if there is a failure, the monitor update is an InProgress persist and I can't handle the failure any other way (I don't want to remember updates). Now the channel is in a disabled state and the user doesn't even know it.

Users can always list the pending monitor updates and go through all the channelmonitors with pending updates, persist the full current state, and mark them complete. We could even expose some kind of replay-attempt function which just calls Watch::update_channel in a loop for any channels with pending updates.
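
A sketch of that replay loop, with hypothetical helpers standing in for ChainMonitor::list_pending_monitor_updates, ChainMonitor::get_monitor, and ChainMonitor::channel_monitor_updated (none of this is LDK's actual API surface):

use std::thread;
use std::time::Duration;

// Hypothetical stand-ins for the ChainMonitor calls named above.
struct PendingChannel { id: String, update_ids: Vec<u64> }

fn list_channels_with_pending_updates() -> Vec<PendingChannel> { Vec::new() }
fn persist_full_monitor(_channel_id: &str) -> std::io::Result<()> { Ok(()) }
fn mark_updates_completed(_channel_id: &str, _update_ids: &[u64]) {}

// Keep re-persisting the full current monitor state for every channel with
// pending updates until storage recovers, then mark those updates complete.
fn replay_pending_persists() {
    loop {
        let pending = list_channels_with_pending_updates();
        if pending.is_empty() {
            break;
        }
        for channel in &pending {
            match persist_full_monitor(&channel.id) {
                Ok(()) => mark_updates_completed(&channel.id, &channel.update_ids),
                Err(_) => {
                    // Storage is still down; back off and re-list from scratch.
                    thread::sleep(Duration::from_secs(5));
                    break;
                }
            }
        }
    }
}

fn main() {
    replay_pending_persists();
}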

@tnull
Contributor

tnull commented Sep 12, 2023

Possibly - I kinda hate the Rust ecosystem here since by default panics actually unwind rather than actually abort, and tokio by default catches the unwind and just continues chugging along (which is more normal behavior in something like a webserver: if a request comes in which causes some bad behavior, fail the request but don't kill the whole server; that doesn't make sense here).

I'm not sure why it wouldn't make sense here? If correctly implemented, wouldn't it be exactly what we want a server to do, in particular if multiple nodes are handled by one server? I.e., we'd want it to propagate the panic of a failing task sufficiently upwards, catch it there, and restart that branch of tasks belonging to the failing node?

@TheBlueMatt
Collaborator Author

I'm not sure why it wouldn't make sense here? If correctly implemented, wouldn't it be exactly what we want a server to do, in particular if multiple nodes are handled by one server? I.e., we'd want it to propagate the panic of a failing task sufficiently upwards, catch it there, and restart that branch of tasks belonging to the failing node?

Right, I was thinking more the ldk-sample/ldk-node world where you have a single node with potentially many tokio tasks. If you're going the multi-tenancy route you probably do want to catch the panic, you just want to make sure that all the tasks for the given node shut down, rather than having that node limp along half-running.

@domZippilli
Contributor

I think what's key to making this understandable is how InProgress is defined in an "infallible" world (I think "infallible" implies "panic if there's an error", at least that is what we do). Intuitively I expect:

  • Completed: We did it, yay
  • InProgress: We're going to get back to you when it's done. We'll panic if it fails.
  • PermanentFailure: This write isn't going to work, LDK, so please do whatever is responsible and panic for me.

I guess if PF is out of the enum, I'm going to panic myself if there's a write failure. There's nothing else I can really do. If you make this trait method fallible, I would return an Err (which is effectively how I'm treating PF), but would still expect LDK to do the right/safe thing, which is probably a few minimal actions and then panic.

For me, the most confusing possibility is saying InProgress is what you should do for transient errors. Switching to an async mode in order to handle transient errors is a little surprising, and makes me feel more exposed. Knowing that LDK is blocking on your write is very reassuring, because I think (implicitly) it means that channel is locked during that time, and losing that "lock" is much more fraught. In fact, I think transient errors should never be presented to this trait, if the separation of concerns is correct. They should be handled in a storage backend that knows the right amount of retrying/backing off etc., and when to give up. The backend's methods should be definitively fallible to this layer -- they either succeed within parameters, or the storage is now outside of parameters (failing), and we should panic. This is of special interest to me right now, as I think if the backend returns transient errors, it's much harder to write a generic persister implementation of any kind.

You could substitute panic for "signal an orderly shutdown" and I think the only difference is whether one attempts to unwind in some way.

Incidentally, panic-on-fail is how we (c=) handle the other infallible persistence methods, like failing to write the manager/graph/scorer. I think the appeal of signaling an error to LDK is that it can take some sort of action. It sounds like there's actually no safe action to take, so guiding customers to panic when infallible methods fail is probably the right thing? And to use panic=abort, or at least shut down the node the thread belongs to if they're multitenant (though one has to wonder: if a given multitenant process is having trouble writing for one node, would others be expected to work? Do they use different storage systems?).

@G8XSU
Contributor

G8XSU commented Sep 12, 2023

We do panic elsewhere in LDK, and panics in a mutex are kinda "viral" (later attempts to take the lock will also panic when we unwrap the lock), so the node would still (mostly) shut down, just not in the most orderly fashion.

Ok, I was hoping panic would work because we are relying on this in other parts of our code assuming abort functionality; if it doesn't function as per our requirements then ok.

It wouldn't grow unbounded/huge

LDK pending events won't grow unbounded, but I was more concerned about the client's threads waiting on LDK while LDK blocks them, which is unbounded; all other requests to the service will probably be dropped/stuck. (So I probably wouldn't recommend an infinite wait inside persist.)

Users can always list the pending monitor updates and go through all the channelmonitors with pending updates

Yes, I guess people who are using it as an async API can do that.

transient errors should never be presented to this trait

I am a bit confused; we were not discussing exposing transient errors from the trait without retries. The only worry was about what to do after retries are exhausted (or the failure can't be recovered from in a short duration of retrying).

InProgress: We're going to get back to you when it's done. We'll panic if it fails.

From what you describe, I think this supports Matt's thesis to remove permanent failure for now. Because it doesn't seem feasible to implement panic at failure.

In summary, I am ok with the current approach, given we will document for the client to panic/shut down on a non-recoverable failure (somehow).
But I am still slightly confused about how our KVStore implementations will handle it; we have moved this problem (of panic/shutdown) to the KVStore implementation, I guess.
Will we panic? I don't think we can do anything else at that point. (Same goes for MUP and stuff.)

@domZippilli
Contributor

domZippilli commented Sep 12, 2023

transient errors should never be presented to this trait

I am a bit confused; we were not discussing exposing transient errors from the trait without retries. The only worry was about what to do after retries are exhausted (or the failure can't be recovered from in a short duration of retrying).

Very possible, even probable, that I'm the confused one. The opening comment on the PR includes:

When a ChannelMonitorUpdate fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.

Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.

Users don't generally expect force-closes in most cases, and
luckily with async ChannelMonitorUpdates supported we don't take
any risk by "delaying" the ChannelMonitorUpdate indefinitely.

Between that, and comments on #2359, I thought the idea was that we expected implementations of Persist to decide what's transient and what's not, and use the InProgress variant for what is transient. If not, I guess disregard that comment, other than that clarity may be required (design doc?).

InProgress: We're going to get back to you when it's done. We'll panic if it fails.

From what you describe, I think this supports Matt's thesis to remove permanent failure for now. Because it doesn't seem feasible to implement panic at failure.

Fair. FWIW I only suggested panic there because there's no mechanism I know of to call back with a failure. That would also be an option, but AFAIK once you return InProgress, there are only two paths out: call back with MonitorEvent::Completed or exit the program.

@TheBlueMatt
Collaborator Author

Right, I guess we should define "transient errors", here - I assume a storage backend will mostly retry or have a reliable backend such that any error that happens is a "Real Error" - the disk is disconnected, the network is offline, etc. However, those errors are still mostly "transient" in that they will eventually recover - our disk isn't literally on fire, we're not hosted at OVH, etc. Thus, we shouldn't force-close, but also shouldn't really do anything...there's nothing we can do, aside from ensuring we go get a human.

So while I think we could do something fancier in 0.0.118 (disconnect all peers, stop making progress until the user recovers the node, but stay alive, maybe have some kind of Future or other indicator that the user needs to shut down), we kinda have two options in 0.0.117: keep the PermanentFailure variant and just immediately panic, or drop it and maybe improve the documentation that is in this PR to tell the user to panic.

Of the two I marginally prefer the second, as it makes the behavior more visible to the user, though only marginally so as often the user is delegating to the LDK filesystem/sqlite/etc persister, which does the panic for the user. But at least those not using those persisters can choose if they want to panic or mark it in-flight and do their own clean-ish shutdowns.

@G8XSU
Contributor

G8XSU commented Sep 12, 2023

But at least those not using those persisters can choose if they want to panic or mark it in-flight and do their own clean-ish shutdowns.

They could always do that by just using InProgress. And clients must already be using async if they have existing logic for it. (I don't see a reason to mark in-progress and then do a shutdown; if anything we would want to shut down before returning.)

If we want to build some better handling in the near future, I prefer keeping the PermanentFailure variant and just improving its handling later. For now, if immediate panic works, we can do that and maybe make it clear in the docs (or suggest a clean shutdown in case of permanent failure).

This will also be consistent with how MUP or kvstore implementations will handle it currently.

Will leave it to you to decide either method.

@domZippilli
Contributor

Right, I guess we should define "transient errors", here - I assume a storage backend will mostly retry or have a reliable backend such that any error that happens is a "Real Error" - the disk is disconnected, the network is offline, etc. However, those errors are still mostly "transient" in that they will eventually recover - our disk isn't literally on fire, we're not hosted at OVH, etc. Thus, we shouldn't force-close, but also shouldn't really do anything...there's nothing we can do, aside from ensuring we go get a human.

I've heard that naming things is hard. Perhaps this terminology works:

  • Transient error: An expected temporary error of a system dependency with <100% success rate, such as the occasional 503 from a remote storage service. Software can possibly recover from transient errors with techniques like truncated exponential backoff; though, if it cannot, a transient error may "graduate" into an intervention error.
  • Intervention error: An error severe enough that software cannot recover, and human intervention is required to possibly recover the system to known-good state.

I'm saying this for docs -- "intervention" is the key distinction you're making between these two categories of recoverable error. Feels weird to invent a term here, I feel sure I read about this in the SRE book.

You may or may not consider introducing a variant like InterventionRequired to signal "don't force close, or do anything else, we're done until my human shows up." panic!() might be a suitable synonym, though. :)

@tnull
Contributor

tnull commented Sep 13, 2023

If we want to build some better handling in the near future, I prefer keeping the PermanentFailure variant and just improving its handling later. For now, if immediate panic works, we can do that and maybe make it clear in the docs (or suggest a clean shutdown in case of permanent failure).

This will also be consistent with how MUP or kvstore implementations will handle it currently.

FWIW, I'm also leaning to this side, as going forward we should find a solution where it's not the KVStore's responsibility to make decisions about panicking or not. If we need to, LDK should take that responsibility and initiate an orderly shutdown.
If we make it its responsibility now and Persist/KVStore implementations are required to panic instead of returning errors under certain circumstances, it may introduce a chance for further bugs/misunderstandings if we renege on the requirements after a release or two.

@TheBlueMatt
Collaborator Author

I don't see a reason to mark in-progress and then do a shutdown; if anything we would want to shut down before returning

I think my point here is that you can't generally do a shutdown in the middle of a sync call stack (aside from panic). If you want to do a "semi-orderly" shutdown (even if your persistence is borked) you probably need to trigger some other code, then return so that the ChannelManager can quiet down.

If we want to build some better handling in the near future, I prefer keeping the PermanentFailure variant and just improving its handling later.

I've heard that naming things is hard. Perhaps this terminology works:

Mmm, okay, I'm fine with that naming and leaving it as a failure - can re-add an "InterventionRequiredError" variant to the enum which would cause a panic. We could then leave InProgress and note in the docs that this also represents a "transient error" and that in such a case users should keep retrying to persist, utilizing the ChainMonitor's list-pending-persists method and doing a full persistence.

@G8XSU
Contributor

G8XSU commented Sep 14, 2023

Intervention error: An error severe enough that software cannot recover, and human intervention is required to possibly recover the system to known-good state.

I am ok with either, but could this be "UnrecoverableError"?
It is not clear how a user might or might not intervene. On a mobile node, the developer cannot intervene; on a server it probably means a restart. And it might look strange to throw InterventionRequired around in our code multiple times on unexpected behavior and cases.
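
For reference, a rough sketch of the enum shape this lands on (the variant names follow what was later adopted in this PR; the doc comments paraphrase this thread rather than the exact LDK docs):

/// Sketch of the post-PR status enum, not the verbatim LDK definition.
#[allow(dead_code)]
enum ChannelMonitorUpdateStatus {
    /// The update (including any relevant fsync) completed; normal channel
    /// operation can continue.
    Completed,
    /// The update is persisting in the background, or failed transiently and
    /// is being retried; completion is signalled later.
    InProgress,
    /// Persistence failed and cannot be retried; LDK reacts by panicking so
    /// that no further channel operations proceed.
    UnrecoverableError,
}

fn main() {}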

@G8XSU
Contributor

G8XSU commented Sep 14, 2023

"InProgress and note in the docs that this also represents a "transient error" and that in such a case users should keep retrying to persist, utilizing the ChainMonitor's list-pending-persists method"

This part is still confusing.

It is either of:

  • throw UnrecoverableError if we can't recover from transient error by retries or such.

  • OR, "return InProgress and keep retrying to persist, utilizing the ChainMonitor's list-pending-persists method."

The second recommendation makes more sense for clients using async IMO, but of course it can be followed by anyone.

@TheBlueMatt
Collaborator Author

Squashed all the commits, the doc changes since yesterday in total are:

$ git diff-tree -U3 d168ba6c83be92009b7a59d4ead6f742f1891b51 b0ff3050
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index df56bda11..268b29257 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -78,34 +78,48 @@ impl MonitorUpdateId {
 /// `Persist` defines behavior for persisting channel monitors: this could mean
 /// writing once to disk, and/or uploading to one or more backup services.
 ///
-/// Each method can return three possible values:
-///  * If persistence (including any relevant `fsync()` calls) happens immediately, the
-///    implementation should return [`ChannelMonitorUpdateStatus::Completed`], indicating normal
-///    channel operation should continue.
+/// Persistence can happen in one of two ways - synchronously completing before the trait method
+/// calls return or asynchronously in the background.
 ///
-///  * If persistence happens asynchronously, implementations can return
-///    [`ChannelMonitorUpdateStatus::InProgress`] while the update continues in the background.
-///    Once the update completes, [`ChainMonitor::channel_monitor_updated`] should be called with
-///    the corresponding [`MonitorUpdateId`].
+/// # For those implementing synchronous persistence
 ///
-///    Note that unlike the direct [`chain::Watch`] interface,
-///    [`ChainMonitor::channel_monitor_updated`] must be called once for *each* update which occurs.
+///  * If persistence completes fully (including any relevant `fsync()` calls), the implementation
+///    should return [`ChannelMonitorUpdateStatus::Completed`], indicating normal channel operation
+///    should continue.
 ///
-///    If persistence fails for some reason, implementations should consider returning
-///    [`ChannelMonitorUpdateStatus::InProgress`] and attempt to shut down or retry the persistence
-///    operation in the background through [`ChainMonitor::list_pending_monitor_updates`] and
+///  * If persistence fails for some reason, implementations should consider returning
+///    [`ChannelMonitorUpdateStatus::InProgress`] and retry all pending persistence operations in
+///    the background with [`ChainMonitor::list_pending_monitor_updates`] and
 ///    [`ChainMonitor::get_monitor`].
 ///
 ///    Once a full [`ChannelMonitor`] has been persisted, all pending updates for that channel can
 ///    be marked as complete via [`ChainMonitor::channel_monitor_updated`].
 ///
+///    If at some point no further progress can be made towards persisting the pending updates, the
+///    node should simply shut down.
+///
 ///  * If the persistence has failed and cannot be retried,
 ///    [`ChannelMonitorUpdateStatus::UnrecoverableError`] can be used, though this will result in
 ///    an immediate panic and future operations in LDK generally failing.
 ///
-/// Third-party watchtowers may be built as a part of an implementation of this trait, with the
-/// advantage that you can control whether to resume channel operation depending on if an update
-/// has been persisted to a watchtower. For this, you may find the following methods useful:
+/// # For those implementing asynchronous persistence
+///
+///  All calls should generally spawn a background task and immediately return
+///  [`ChannelMonitorUpdateStatus::InProgress`]. Once the update completes,
+///  [`ChainMonitor::channel_monitor_updated`] should be called with the corresponding
+///  [`MonitorUpdateId`].
+///
+///  Note that unlike the direct [`chain::Watch`] interface,
+///  [`ChainMonitor::channel_monitor_updated`] must be called once for *each* update which occurs.
+///
+///  If at some point no further progress can be made towards persisting a pending update, the node
+///  should simply shut down.
+///
+/// # Using remote watchtowers
+///
+/// Watchtowers may be updated as a part of an implementation of this trait, utilizing the async
+/// update process described above while the watchtower is being updated. The following methods are
+/// provided for bulding transactions for a watchtower:
 /// [`ChannelMonitor::initial_counterparty_commitment_tx`],
 /// [`ChannelMonitor::counterparty_commitment_txs_from_update`],
 /// [`ChannelMonitor::sign_to_local_justice_tx`], [`TrustedCommitmentTransaction::revokeable_output_index`],
$ 

/// [`ChannelMonitorUpdateStatus::PermanentFailure`], in which case the channel will likely be
/// closed without broadcasting the latest state. See
/// [`ChannelMonitorUpdateStatus::PermanentFailure`] for more details.
/// * If persistence fails for some reason, implementations should consider returning
Contributor

If persistence fails for some reason:

Currently, this gives us two options:

  1. either return UnrecoverableError after retries have been exhausted,
    OR
  2. return InProgress and retry out of band using list_pending_monitor_updates.

I would prefer the first option to be mentioned before the second one, since that is what sync really means: taking the decision then and there. I would also suggest removing the second option from sync, since the user can use async persistence at zero extra cost if they already have a process to retry out of band. Otherwise, we are mixing the two.

My thought process for this is: if someone is server-side and persistence has failed after exhausted retries, then there is really something wrong with my high-availability (99.99%) persistence (infra outage). I am ok with shutting down instead of creating a separate process for retrying async, to keep things simple.
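
A sketch of that first option under those assumptions (the status names follow this thread; try_write and the retry policy are hypothetical, not LDK API):

use std::io;
use std::thread;
use std::time::Duration;

// Stand-in for the returned status; only the two variants relevant here.
enum MonitorUpdateStatus { Completed, UnrecoverableError }

fn try_write(_bytes: &[u8]) -> io::Result<()> {
    // ... the actual write + fsync against the storage backend ...
    Ok(())
}

// Synchronous persister: retry with truncated exponential backoff and only
// give up (which makes LDK panic) once the retry budget is exhausted.
fn persist_sync_with_retries(bytes: &[u8]) -> MonitorUpdateStatus {
    let mut delay = Duration::from_millis(100);
    for _ in 0..5 {
        if try_write(bytes).is_ok() {
            return MonitorUpdateStatus::Completed;
        }
        thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_secs(10));
    }
    MonitorUpdateStatus::UnrecoverableError
}

fn main() {
    match persist_sync_with_retries(&[0u8; 32]) {
        MonitorUpdateStatus::Completed => println!("persisted"),
        MonitorUpdateStatus::UnrecoverableError => println!("retries exhausted; LDK will panic on this status"),
    }
}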

Collaborator Author

I kinda prefer this order because we prefer users to do more retries in the background, but I understand many folks won't really bother with that because it's hard :)

Contributor

Sounds good, it's not that hard, just additional complexity some may want to avoid. :)

/// # For those implementing asynchronous persistence
///
/// All calls should generally spawn a background task and immediately return
/// [`ChannelMonitorUpdateStatus::InProgress`]. Once the update completes,
Contributor

Persistence can still fail in async; a list_pending_monitor_updates retrying process is necessary in this case as well.

In fact, if a user has a list_pending_monitor_updates retrying process, they should just use async.

Collaborator Author

I'm not quite sure I understand this - the documentation mentions how to handle failure in both cases.

Contributor

@G8XSU Sep 21, 2023

No, it only mentions "Once the update completes". There is still a case where the update failed, and the client needs to have a mechanism to

retry all pending persistence operations in the background with [`ChainMonitor::list_pending_monitor_updates`] and [`ChainMonitor::get_monitor`].

while using async persistence, depending on how their async works.

@TheBlueMatt force-pushed the 2023-08-no-perm-fail branch 2 times, most recently from 80e41b1 to 914c299 on September 21, 2023 18:39
@TheBlueMatt
Collaborator Author

Pushed some (squashed) grammar/spelling fixes:

$ git diff-tree -U1 b0ff3050 914c2999
diff --git a/lightning-persister/src/fs_store.rs b/lightning-persister/src/fs_store.rs
index 4a70d238c..42b28018f 100644
--- a/lightning-persister/src/fs_store.rs
+++ b/lightning-persister/src/fs_store.rs
@@ -438,3 +438,3 @@ mod tests {
 	// Test that if the store's path to channel data is read-only, writing a
-	// monitor to it results in the store returning a InProgress.
+	// monitor to it results in the store returning an InProgress.
 	// Windows ignores the read-only flag for folders, so this test is Unix-only.
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index 268b29257..f41512b14 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -100,3 +100,3 @@ impl MonitorUpdateId {
 ///
-///  * If the persistence has failed and cannot be retried,
+///  * If the persistence has failed and cannot be retried further (e.g. because of some timeout),
 ///    [`ChannelMonitorUpdateStatus::UnrecoverableError`] can be used, though this will result in
diff --git a/lightning/src/ln/chanmon_update_fail_tests.rs b/lightning/src/ln/chanmon_update_fail_tests.rs
index dc4b3d57a..dc9bbd3e4 100644
--- a/lightning/src/ln/chanmon_update_fail_tests.rs
+++ b/lightning/src/ln/chanmon_update_fail_tests.rs
@@ -106,3 +106,3 @@ fn test_monitor_and_persister_update_fail() {

-				// Apply the montior update to the original ChainMonitor, ensuring the
+				// Apply the monitor update to the original ChainMonitor, ensuring the
 				// ChannelManager and ChannelMonitor aren't out of sync.
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 93e8f617d..8b524d8f3 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -928,3 +928,3 @@ where
 /// [`chain::Watch::watch_channel`]/[`update_channel`] or before completing async writes. With
-/// ChannelManagers, writing updates happens out-of-band (and will prevent any other
+/// `ChannelManager`s, writing updates happens out-of-band (and will prevent any other
 /// `ChannelManager` operations from occurring during the serialization process). If the
$ 

tnull previously approved these changes Sep 21, 2023
Contributor

@tnull left a comment

LGTM, happy to have any further (documentation) changes happen in follow-ups.

@valentinewallace
Contributor

CI needs a fix.

When a `ChannelMonitorUpdate` fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.

Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.

Users don't generally expect force-closes in most cases, and
luckily with async `ChannelMonitorUpdate`s supported we don't take
any risk by "delaying" the `ChannelMonitorUpdate` indefinitely.

Thus, here we drop the `PermanentFailure` variant entirely, making
all failures instead be "the update is in progress, but won't ever
complete", which is equivalent if we do not close the channel
automatically.

Now that `PermanentFailure` is not a possible return value, we can
simply remove handling of it in `ChannelMonitor`.

The `MonitorEvent::CommitmentTxConfirmed` has always been a result
of us force-closing the channel, not the counterparty doing so.
Thus, it was always a bit of a misnomer. Worse, it carried over
into the channel's `ClosureReason` in the event API.

Here we simply rename it and use the proper `ClosureReason`.

Since we now (almost) support async monitor update persistence, the
documentation on `ChannelMonitorUpdateStatus` can be updated to no
longer suggest users must keep a local copy that persists before
returning. However, because there are still a few remaining issues,
we note that async support is currently beta and explicitly warn of
potential for funds-loss.

Fixes lightningdevkit#1684

In the previous commit various dead code was removed. Here we
finish that cleanup by removing unnecessary indentation and syntax.

Now that `handle_new_monitor_update` can no longer return an error,
it similarly no longer needs any code to handle errors. Here we
remove that code, substantially reducing macro variants.

In general, doc comments on trait impl blocks are not very visible
in rustdoc output, and unless they provide useful information they
should be elided.

Here we drop useless doc comments on `ChainMonitor`'s `Watch` impl
methods.

While there is no great way to handle a true failure to persist a
`ChannelMonitorUpdate`, it is confusing for users for there to be
no error variant at all on an I/O operation.

Thus, here we re-add the error variant removed over the past
handful of commits, but rather than handle it in a truly unsafe
way, we simply panic, optimizing for maximum mutex poisoning to
ensure any future operations fail and return immediately.

In the future, we may consider changing the handling of this to
instead set some "disconnect all peers and fail all operations"
bool to give the user a better chance to shutdown in a semi-orderly
fashion, but there's only so much that can be done in lightning if
we truly cannot persist new updates.
@TheBlueMatt
Collaborator Author

Grrrr, sorry about that. Somehow the UnwindSafe trait isn't exposed in core in our MSRV; fixed the CI issues:

$ git diff-tree -U1 914c2999 f254c565
diff --git a/fuzz/src/chanmon_consistency.rs b/fuzz/src/chanmon_consistency.rs
index b6d41fb99..6a2007165 100644
--- a/fuzz/src/chanmon_consistency.rs
+++ b/fuzz/src/chanmon_consistency.rs
@@ -140,3 +140,3 @@ impl TestChainMonitor {
 impl chain::Watch<TestChannelSigner> for TestChainMonitor {
-	fn watch_channel(&self, funding_txo: OutPoint, monitor: channelmonitor::ChannelMonitor<TestChannelSigner>) -> chain::ChannelMonitorUpdateStatus {
+	fn watch_channel(&self, funding_txo: OutPoint, monitor: channelmonitor::ChannelMonitor<TestChannelSigner>) -> Result<chain::ChannelMonitorUpdateStatus, ()> {
 		let mut ser = VecWriter(Vec::new());
@@ -502,3 +502,3 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out) {
 				assert_eq!(chain_monitor.chain_monitor.watch_channel(funding_txo, mon),
-					ChannelMonitorUpdateStatus::Completed);
+					Ok(ChannelMonitorUpdateStatus::Completed));
 			}
diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs
index f41512b14..b6909cb3e 100644
--- a/lightning/src/chain/chainmonitor.rs
+++ b/lightning/src/chain/chainmonitor.rs
@@ -1034,2 +1034,3 @@ mod tests {
 	#[test]
+	#[cfg(feature = "std")]
 	fn update_during_chainsync_poisons_channel() {
diff --git a/lightning/src/ln/functional_test_utils.rs b/lightning/src/ln/functional_test_utils.rs
index 9a5a3466a..5a1a21e29 100644
--- a/lightning/src/ln/functional_test_utils.rs
+++ b/lightning/src/ln/functional_test_utils.rs
@@ -424,4 +424,6 @@ pub struct Node<'chan_man, 'node_cfg: 'chan_man, 'chan_mon_cfg: 'node_cfg> {
 }
-impl<'a, 'b, 'c> core::panic::UnwindSafe for Node<'a, 'b, 'c> {}
-impl<'a, 'b, 'c> core::panic::RefUnwindSafe for Node<'a, 'b, 'c> {}
+#[cfg(feature = "std")]
+impl<'a, 'b, 'c> std::panic::UnwindSafe for Node<'a, 'b, 'c> {}
+#[cfg(feature = "std")]
+impl<'a, 'b, 'c> std::panic::RefUnwindSafe for Node<'a, 'b, 'c> {}
 impl<'a, 'b, 'c> Node<'a, 'b, 'c> {
$ 

$self.get_channel_update_for_broadcast(&$chan).ok(), $chan.context.get_value_satoshis()));
$remove;
res
},
ChannelMonitorUpdateStatus::Completed => {
$completed;
Ok(true)
Contributor

channelmanager.rs L3379 has a reference to permanent monitor failure

@@ -470,7 +470,7 @@ mod tests {
index: 0
Contributor

10 lines up is a reference to permanent failure, though I suppose it's not really incorrect per se. Also in the test name.

@@ -436,7 +436,7 @@ mod tests {
}

// Test that if the store's path to channel data is read-only, writing a
// monitor to it results in the store returning a PermanentFailure.
// monitor to it results in the store returning an InProgress.
Contributor

Should be UnrecoverableError now

/// If at some point no further progress can be made towards persisting the pending updates, the
/// node should simply shut down.
///
/// * If the persistence has failed and cannot be retried further (e.g. because of some timeout),
Contributor

nit(non-blocking): timeout is generally transient and retry should fix that

instead it should be "e.g. because of some outage"

@TheBlueMatt merged commit 4f4e84e into lightningdevkit:main Sep 21, 2023
@TheBlueMatt
Collaborator Author

Opened a followup PR at #2591

tnull added a commit that referenced this pull request Sep 29, 2023
@@ -134,7 +134,7 @@ pub enum MonitorEvent {
HTLCEvent(HTLCUpdate),

/// A monitor event that the Channel's commitment transaction was confirmed.
Contributor

Is this piece of docs still correct?

Collaborator Author

Ah, good point, it should be changed, yes!

Collaborator Author

Fixed at #2668
