
Commit 914c299

Add an UnrecoverableError variant to ChannelMonitorUpdateStatus
While there is no great way to handle a true failure to persist a `ChannelMonitorUpdate`, it is confusing for users for there to be no error variant at all on an I/O operation. Thus, here we re-add the error variant removed over the past handful of commits, but rather than handle it in a truly unsafe way, we simply panic, optimizing for maximum mutex poisoning to ensure any future operations fail and return immediately. In the future, we may consider changing the handling of this to instead set some "disconnect all peers and fail all operations" bool to give the user a better chance to shutdown in a semi-orderly fashion, but there's only so much that can be done in lightning if we truly cannot persist new updates.
1 parent ea0259e commit 914c299
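
For illustration only (not part of this commit), a sketch of how a `Persist` implementation might map the result of a monitor write onto the three status variants; the `retryable` flag stands in for whatever error classification the application uses:

use lightning::chain::ChannelMonitorUpdateStatus;
use std::io;

// Hypothetical helper, not an LDK API: classify a write result for a Persist impl.
fn status_from_write_result(res: io::Result<()>, retryable: bool) -> ChannelMonitorUpdateStatus {
	match res {
		// Durably written (including any fsync): normal channel operation continues.
		Ok(()) => ChannelMonitorUpdateStatus::Completed,
		// Transient failure: keep the channel paused, retry in the background, and mark the
		// update complete via ChainMonitor::channel_monitor_updated once it lands.
		Err(_) if retryable => ChannelMonitorUpdateStatus::InProgress,
		// Hard failure: as of this commit, LDK logs an error and panics, poisoning its locks
		// so that subsequent operations fail fast.
		Err(_) => ChannelMonitorUpdateStatus::UnrecoverableError,
	}
}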

6 files changed (+141, -35 lines)

lightning-persister/src/fs_store.rs (+2, -2)

@@ -470,7 +470,7 @@ mod tests {
 			index: 0
 		};
 		match store.persist_new_channel(test_txo, &added_monitors[0].1, update_id.2) {
-			ChannelMonitorUpdateStatus::InProgress => {},
+			ChannelMonitorUpdateStatus::UnrecoverableError => {},
 			_ => panic!("unexpected result from persisting new channel")
 		}

@@ -507,7 +507,7 @@ mod tests {
 			index: 0
 		};
 		match store.persist_new_channel(test_txo, &added_monitors[0].1, update_id.2) {
-			ChannelMonitorUpdateStatus::InProgress => {},
+			ChannelMonitorUpdateStatus::UnrecoverableError => {},
 			_ => panic!("unexpected result from persisting new channel")
 		}

lightning/src/chain/chainmonitor.rs (+100, -22)

@@ -78,26 +78,48 @@ impl MonitorUpdateId {
 /// `Persist` defines behavior for persisting channel monitors: this could mean
 /// writing once to disk, and/or uploading to one or more backup services.
 ///
-/// Each method can return two possible values:
-/// * If persistence (including any relevant `fsync()` calls) happens immediately, the
-///   implementation should return [`ChannelMonitorUpdateStatus::Completed`], indicating normal
-///   channel operation should continue.
+/// Persistence can happen in one of two ways - synchronously completing before the trait method
+/// calls return or asynchronously in the background.
 ///
-/// * If persistence happens asynchronously, implementations can return
-///   [`ChannelMonitorUpdateStatus::InProgress`] while the update continues in the background.
-///   Once the update completes, [`ChainMonitor::channel_monitor_updated`] should be called with
-///   the corresponding [`MonitorUpdateId`].
+/// # For those implementing synchronous persistence
 ///
-/// Note that unlike the direct [`chain::Watch`] interface,
-/// [`ChainMonitor::channel_monitor_updated`] must be called once for *each* update which occurs.
+/// * If persistence completes fully (including any relevant `fsync()` calls), the implementation
+///   should return [`ChannelMonitorUpdateStatus::Completed`], indicating normal channel operation
+///   should continue.
 ///
-/// If persistence fails for some reason, implementations should still return
-/// [`ChannelMonitorUpdateStatus::InProgress`] and attempt to shut down or otherwise resolve the
-/// situation ASAP.
+/// * If persistence fails for some reason, implementations should consider returning
+///   [`ChannelMonitorUpdateStatus::InProgress`] and retry all pending persistence operations in
+///   the background with [`ChainMonitor::list_pending_monitor_updates`] and
+///   [`ChainMonitor::get_monitor`].
 ///
-/// Third-party watchtowers may be built as a part of an implementation of this trait, with the
-/// advantage that you can control whether to resume channel operation depending on if an update
-/// has been persisted to a watchtower. For this, you may find the following methods useful:
+///   Once a full [`ChannelMonitor`] has been persisted, all pending updates for that channel can
+///   be marked as complete via [`ChainMonitor::channel_monitor_updated`].
+///
+///   If at some point no further progress can be made towards persisting the pending updates, the
+///   node should simply shut down.
+///
+/// * If the persistence has failed and cannot be retried further (e.g. because of some timeout),
+///   [`ChannelMonitorUpdateStatus::UnrecoverableError`] can be used, though this will result in
+///   an immediate panic and future operations in LDK generally failing.
+///
+/// # For those implementing asynchronous persistence
+///
+/// All calls should generally spawn a background task and immediately return
+/// [`ChannelMonitorUpdateStatus::InProgress`]. Once the update completes,
+/// [`ChainMonitor::channel_monitor_updated`] should be called with the corresponding
+/// [`MonitorUpdateId`].
+///
+/// Note that unlike the direct [`chain::Watch`] interface,
+/// [`ChainMonitor::channel_monitor_updated`] must be called once for *each* update which occurs.
+///
+/// If at some point no further progress can be made towards persisting a pending update, the node
+/// should simply shut down.
+///
+/// # Using remote watchtowers
+///
+/// Watchtowers may be updated as a part of an implementation of this trait, utilizing the async
+/// update process described above while the watchtower is being updated. The following methods are
+/// provided for bulding transactions for a watchtower:
 /// [`ChannelMonitor::initial_counterparty_commitment_tx`],
 /// [`ChannelMonitor::counterparty_commitment_txs_from_update`],
 /// [`ChannelMonitor::sign_to_local_justice_tx`], [`TrustedCommitmentTransaction::revokeable_output_index`],
@@ -279,19 +301,31 @@ where C::Target: chain::Filter,
 	where
 		FN: Fn(&ChannelMonitor<ChannelSigner>, &TransactionData) -> Vec<TransactionOutputs>
 	{
+		let err_str = "ChannelMonitor[Update] persistence failed unrecoverably. This indicates we cannot continue normal operation and must shut down.";
 		let funding_outpoints: HashSet<OutPoint> = HashSet::from_iter(self.monitors.read().unwrap().keys().cloned());
 		for funding_outpoint in funding_outpoints.iter() {
 			let monitor_lock = self.monitors.read().unwrap();
 			if let Some(monitor_state) = monitor_lock.get(funding_outpoint) {
-				self.update_monitor_with_chain_data(header, best_height, txdata, &process, funding_outpoint, &monitor_state);
+				if self.update_monitor_with_chain_data(header, best_height, txdata, &process, funding_outpoint, &monitor_state).is_err() {
+					// Take the monitors lock for writing so that we poison it and any future
+					// operations going forward fail immediately.
+					core::mem::drop(monitor_state);
+					core::mem::drop(monitor_lock);
+					let _poison = self.monitors.write().unwrap();
+					log_error!(self.logger, "{}", err_str);
+					panic!("{}", err_str);
+				}
 			}
 		}

 		// do some followup cleanup if any funding outpoints were added in between iterations
 		let monitor_states = self.monitors.write().unwrap();
 		for (funding_outpoint, monitor_state) in monitor_states.iter() {
 			if !funding_outpoints.contains(funding_outpoint) {
-				self.update_monitor_with_chain_data(header, best_height, txdata, &process, funding_outpoint, &monitor_state);
+				if self.update_monitor_with_chain_data(header, best_height, txdata, &process, funding_outpoint, &monitor_state).is_err() {
+					log_error!(self.logger, "{}", err_str);
+					panic!("{}", err_str);
+				}
 			}
 		}

@@ -306,7 +340,10 @@ where C::Target: chain::Filter,
 		}
 	}

-	fn update_monitor_with_chain_data<FN>(&self, header: &BlockHeader, best_height: Option<u32>, txdata: &TransactionData, process: FN, funding_outpoint: &OutPoint, monitor_state: &MonitorHolder<ChannelSigner>) where FN: Fn(&ChannelMonitor<ChannelSigner>, &TransactionData) -> Vec<TransactionOutputs> {
+	fn update_monitor_with_chain_data<FN>(
+		&self, header: &BlockHeader, best_height: Option<u32>, txdata: &TransactionData,
+		process: FN, funding_outpoint: &OutPoint, monitor_state: &MonitorHolder<ChannelSigner>
+	) -> Result<(), ()> where FN: Fn(&ChannelMonitor<ChannelSigner>, &TransactionData) -> Vec<TransactionOutputs> {
 		let monitor = &monitor_state.monitor;
 		let mut txn_outputs;
 		{
@@ -331,7 +368,10 @@ where C::Target: chain::Filter,
 				ChannelMonitorUpdateStatus::InProgress => {
 					log_debug!(self.logger, "Channel Monitor sync for channel {} in progress, holding events until completion!", log_funding_info!(monitor));
 					pending_monitor_updates.push(update_id);
-				}
+				},
+				ChannelMonitorUpdateStatus::UnrecoverableError => {
+					return Err(());
+				},
 			}
 		}

@@ -351,6 +391,7 @@ where C::Target: chain::Filter,
 				}
 			}
 		}
+		Ok(())
 	}

 	/// Creates a new `ChainMonitor` used to watch on-chain activity pertaining to channels.
@@ -674,7 +715,12 @@ where C::Target: chain::Filter,
 			},
 			ChannelMonitorUpdateStatus::Completed => {
 				log_info!(self.logger, "Persistence of new ChannelMonitor for channel {} completed", log_funding_info!(monitor));
-			}
+			},
+			ChannelMonitorUpdateStatus::UnrecoverableError => {
+				let err_str = "ChannelMonitor[Update] persistence failed unrecoverably. This indicates we cannot continue normal operation and must shut down.";
+				log_error!(self.logger, "{}", err_str);
+				panic!("{}", err_str);
+			},
 		}
 		if let Some(ref chain_source) = self.chain_source {
 			monitor.load_outputs_to_watch(chain_source);
@@ -690,7 +736,7 @@ where C::Target: chain::Filter,
 	fn update_channel(&self, funding_txo: OutPoint, update: &ChannelMonitorUpdate) -> ChannelMonitorUpdateStatus {
 		// Update the monitor that watches the channel referred to by the given outpoint.
 		let monitors = self.monitors.read().unwrap();
-		match monitors.get(&funding_txo) {
+		let ret = match monitors.get(&funding_txo) {
 			None => {
 				log_error!(self.logger, "Failed to update channel monitor: no such monitor registered");

@@ -722,14 +768,25 @@ where C::Target: chain::Filter,
 					ChannelMonitorUpdateStatus::Completed => {
 						log_debug!(self.logger, "Persistence of ChannelMonitorUpdate for channel {} completed", log_funding_info!(monitor));
 					},
+					ChannelMonitorUpdateStatus::UnrecoverableError => { /* we'll panic in a moment */ },
 				}
 				if update_res.is_err() {
 					ChannelMonitorUpdateStatus::InProgress
 				} else {
 					persist_res
 				}
 			}
+		};
+		if let ChannelMonitorUpdateStatus::UnrecoverableError = ret {
+			// Take the monitors lock for writing so that we poison it and any future
+			// operations going forward fail immediately.
+			core::mem::drop(monitors);
+			let _poison = self.monitors.write().unwrap();
+			let err_str = "ChannelMonitor[Update] persistence failed unrecoverably. This indicates we cannot continue normal operation and must shut down.";
+			log_error!(self.logger, "{}", err_str);
+			panic!("{}", err_str);
 		}
+		ret
 	}

 	fn release_pending_monitor_events(&self) -> Vec<(OutPoint, Vec<MonitorEvent>, Option<PublicKey>)> {
@@ -973,4 +1030,25 @@ mod tests {
 		do_chainsync_pauses_events(false);
 		do_chainsync_pauses_events(true);
 	}
+
+	#[test]
+	fn update_during_chainsync_poisons_channel() {
+		let chanmon_cfgs = create_chanmon_cfgs(2);
+		let node_cfgs = create_node_cfgs(2, &chanmon_cfgs);
+		let node_chanmgrs = create_node_chanmgrs(2, &node_cfgs, &[None, None]);
+		let nodes = create_network(2, &node_cfgs, &node_chanmgrs);
+		create_announced_chan_between_nodes(&nodes, 0, 1);
+
+		chanmon_cfgs[0].persister.chain_sync_monitor_persistences.lock().unwrap().clear();
+		chanmon_cfgs[0].persister.set_update_ret(ChannelMonitorUpdateStatus::UnrecoverableError);
+
+		assert!(std::panic::catch_unwind(|| {
+			// Returning an UnrecoverableError should always panic immediately
+			connect_blocks(&nodes[0], 1);
+		}).is_err());
+		assert!(std::panic::catch_unwind(|| {
+			// ...and also poison our locks causing later use to panic as well
+			core::mem::drop(nodes);
+		}).is_err());
+	}
 }
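
The panic paths above rely on standard lock poisoning rather than any LDK-specific machinery: panicking while holding the write side of the `monitors` lock poisons it, so every later acquisition returns an error. A standalone sketch of that mechanism using plain `std` (illustration only, not LDK code):

use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
	let monitors = Arc::new(RwLock::new(Vec::<u32>::new()));
	let writer = Arc::clone(&monitors);
	// Panic while holding the write guard, as ChainMonitor does on UnrecoverableError.
	let _ = thread::spawn(move || {
		let _poison = writer.write().unwrap();
		panic!("ChannelMonitor[Update] persistence failed unrecoverably");
	}).join();
	// The lock is now poisoned: later readers and writers see an error immediately
	// instead of silently continuing with possibly-unpersisted state.
	assert!(monitors.read().is_err());
	assert!(monitors.write().is_err());
}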

lightning/src/chain/mod.rs (+28, -7)

@@ -177,16 +177,23 @@ pub trait Confirm {

 /// An enum representing the status of a channel monitor update persistence.
 ///
-/// Note that there is no error variant - any failure to persist a [`ChannelMonitor`] should be
-/// retried indefinitely, the node shut down (as if we cannot update stored data we can't do much
-/// of anything useful).
+/// These are generally used as the return value for an implementation of [`Persist`] which is used
+/// as the storage layer for a [`ChainMonitor`]. See the docs on [`Persist`] for a high-level
+/// explanation of how to handle different cases.
+///
+/// While `UnrecoverableError` is provided as a failure variant, it is not truly "handled" on the
+/// calling side, and generally results in an immediate panic. For those who prefer to avoid
+/// panics, `InProgress` can be used and you can retry the update operation in the background or
+/// shut down cleanly.
 ///
 /// Note that channels should generally *not* be force-closed after a persistence failure.
 /// Force-closing with the latest [`ChannelMonitorUpdate`] applied may result in a transaction
 /// being broadcast which can only be spent by the latest [`ChannelMonitor`]! Thus, if the
 /// latest [`ChannelMonitor`] is not durably persisted anywhere and exists only in memory, naively
 /// calling [`ChannelManager::force_close_broadcasting_latest_txn`] *may result in loss of funds*!
 ///
+/// [`Persist`]: chainmonitor::Persist
+/// [`ChainMonitor`]: chainmonitor::ChainMonitor
 /// [`ChannelManager::force_close_broadcasting_latest_txn`]: crate::ln::channelmanager::ChannelManager::force_close_broadcasting_latest_txn
 #[derive(Clone, Copy, Debug, PartialEq, Eq)]
 pub enum ChannelMonitorUpdateStatus {
@@ -212,8 +219,8 @@ pub enum ChannelMonitorUpdateStatus {
 	/// until a [`MonitorEvent::Completed`] is provided, even if you return no error on a later
 	/// monitor update for the same channel.
 	///
-	/// For deployments where a copy of ChannelMonitors and other local state are backed up in a
-	/// remote location (with local copies persisted immediately), it is anticipated that all
+	/// For deployments where a copy of [`ChannelMonitor`]s and other local state are backed up in
+	/// a remote location (with local copies persisted immediately), it is anticipated that all
 	/// updates will return [`InProgress`] until the remote copies could be updated.
 	///
 	/// Note that while fully asynchronous persistence of [`ChannelMonitor`] data is generally
@@ -222,6 +229,18 @@ pub enum ChannelMonitorUpdateStatus {
 	///
 	/// [`InProgress`]: ChannelMonitorUpdateStatus::InProgress
 	InProgress,
+	/// Indicates that an update has failed and will not complete at any point in the future.
+	///
+	/// Currently returning this variant will cause LDK to immediately panic to encourage immediate
+	/// shutdown. In the future this may be updated to disconnect peers and refuse to continue
+	/// normal operation without a panic.
+	///
+	/// Applications which wish to perform an orderly shutdown after failure should consider
+	/// returning [`InProgress`] instead and simply shut down without ever marking the update
+	/// complete.
+	///
+	/// [`InProgress`]: ChannelMonitorUpdateStatus::InProgress
+	UnrecoverableError,
 }

 /// The `Watch` trait defines behavior for watching on-chain activity pertaining to channels as
@@ -261,8 +280,10 @@ pub trait Watch<ChannelSigner: WriteableEcdsaChannelSigner> {
 	/// on-chain or the [`ChannelMonitor`] having decided to do so and broadcasted a transaction),
 	/// and the [`ChannelManager`] state will be updated once it sees the funding spend on-chain.
 	///
-	/// If persistence fails, this should return [`ChannelMonitorUpdateStatus::InProgress`] and
-	/// the node should shut down immediately.
+	/// In general, persistence failures should be retried after returning
+	/// [`ChannelMonitorUpdateStatus::InProgress`] and eventually complete. If a failure truly
+	/// cannot be retried, the node should shut down immediately after returning
+	/// [`ChannelMonitorUpdateStatus::UnrecoverableError`], see its documentation for more info.
 	///
 	/// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
 	fn update_channel(&self, funding_txo: OutPoint, update: &ChannelMonitorUpdate) -> ChannelMonitorUpdateStatus;
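
As a sketch of the "orderly shutdown" alternative described in the `UnrecoverableError` docs above, an implementation that prefers not to panic can report `InProgress`, never complete the update, and trigger its own shutdown; `begin_shutdown` here is a hypothetical application hook, not an LDK API:

use lightning::chain::ChannelMonitorUpdateStatus;

fn on_unrecoverable_persist_failure(begin_shutdown: impl FnOnce()) -> ChannelMonitorUpdateStatus {
	// Ask the application to wind down (hypothetical hook).
	begin_shutdown();
	// Because ChainMonitor::channel_monitor_updated is never called for this update, the
	// affected channel stays paused until the node restarts with consistent state.
	ChannelMonitorUpdateStatus::InProgress
}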

lightning/src/ln/channelmanager.rs (+5)

@@ -2045,6 +2045,11 @@ macro_rules! handle_new_monitor_update {
 	($self: ident, $update_res: expr, $chan: expr, _internal, $completed: expr) => { {
 		debug_assert!($self.background_events_processed_since_startup.load(Ordering::Acquire));
 		match $update_res {
+			ChannelMonitorUpdateStatus::UnrecoverableError => {
+				let err_str = "ChannelMonitor[Update] persistence failed unrecoverably. This indicates we cannot continue normal operation and must shut down.";
+				log_error!($self.logger, "{}", err_str);
+				panic!("{}", err_str);
+			},
 			ChannelMonitorUpdateStatus::InProgress => {
 				log_debug!($self.logger, "ChannelMonitor update for {} in flight, holding messages until the update completes.",
 					&$chan.context.channel_id());

lightning/src/ln/functional_test_utils.rs (+2)

@@ -422,6 +422,8 @@ pub struct Node<'chan_man, 'node_cfg: 'chan_man, 'chan_mon_cfg: 'node_cfg> {
 		&'chan_mon_cfg test_utils::TestLogger,
 	>,
 }
+impl<'a, 'b, 'c> core::panic::UnwindSafe for Node<'a, 'b, 'c> {}
+impl<'a, 'b, 'c> core::panic::RefUnwindSafe for Node<'a, 'b, 'c> {}
 impl<'a, 'b, 'c> Node<'a, 'b, 'c> {
 	pub fn best_block_hash(&self) -> BlockHash {
 		self.blocks.lock().unwrap().last().unwrap().0.block_hash()
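
The two marker impls above exist so the new chainmonitor test can close over `Node` inside `std::panic::catch_unwind`; types holding interior mutability are not unwind-safe by default. A standalone sketch of the same idea using `AssertUnwindSafe` instead of marker impls (plain `std`, illustration only):

use std::cell::RefCell;
use std::panic::{catch_unwind, AssertUnwindSafe};

fn main() {
	let data = RefCell::new(0u32);
	// &RefCell is not UnwindSafe, so the closure must be wrapped (or the containing type
	// must implement UnwindSafe, as the test utilities do for Node).
	let res = catch_unwind(AssertUnwindSafe(|| {
		*data.borrow_mut() += 1;
		panic!("simulated unrecoverable persistence failure");
	}));
	assert!(res.is_err());
	// The value is observed in whatever state the panic left it.
	assert_eq!(*data.borrow(), 1);
}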

lightning/src/util/persist.rs (+4, -4)

@@ -174,8 +174,8 @@ impl<'a, A: KVStore, M: Deref, T: Deref, ES: Deref, NS: Deref, SP: Deref, F: Der
 impl<ChannelSigner: WriteableEcdsaChannelSigner, K: KVStore> Persist<ChannelSigner> for K {
 	// TODO: We really need a way for the persister to inform the user that its time to crash/shut
 	// down once these start returning failure.
-	// An InProgress result implies we should probably just shut down the node since we're not
-	// retrying persistence!
+	// Then we should return InProgress rather than UnrecoverableError, implying we should probably
+	// just shut down the node since we're not retrying persistence!

 	fn persist_new_channel(&self, funding_txo: OutPoint, monitor: &ChannelMonitor<ChannelSigner>, _update_id: MonitorUpdateId) -> chain::ChannelMonitorUpdateStatus {
 		let key = format!("{}_{}", funding_txo.txid.to_hex(), funding_txo.index);

@@ -185,7 +185,7 @@ impl<ChannelSigner: WriteableEcdsaChannelSign
 			&key, &monitor.encode())
 		{
 			Ok(()) => chain::ChannelMonitorUpdateStatus::Completed,
-			Err(_) => chain::ChannelMonitorUpdateStatus::InProgress
+			Err(_) => chain::ChannelMonitorUpdateStatus::UnrecoverableError
 		}
 	}

@@ -197,7 +197,7 @@ impl<ChannelSigner: WriteableEcdsaChannelSign
 			&key, &monitor.encode())
 		{
 			Ok(()) => chain::ChannelMonitorUpdateStatus::Completed,
-			Err(_) => chain::ChannelMonitorUpdateStatus::InProgress
+			Err(_) => chain::ChannelMonitorUpdateStatus::UnrecoverableError
 		}
 	}
 }
