@tnull tnull commented Sep 5, 2025

This is the second PR in a series of PRs adding persistence to lightning-liquidity (see #4058). As this is already >1000LoC, I now decided to put this up as an intermediary step instead of adding everything in one go.

In this PR we add the serialization logic for the LSPS2 and LSPS5 service handlers as well as for the event queue. We also have LiquidityManager take a KVStore towards which it persists the respective peer states keyed by the counterparty's node id. LiquidityManager::new now also deserializes any previously-persisted state from that given KVStore. Note that so far we don't actually persist anything, as wiring up BackgroundProcessor to drive persistence will be part of the next PR (which will also make further optimizations, such as only persisting when needed, and persisting some important things in-line).

This also adds a bunch of boilerplate to account for both KVStore and KVStoreSync variants, following the approach we previously took with OutputSweeper etc.
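
For illustration, the keyed-by-node-id persistence described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PR's code: `MemStore`, `PEER_STATE_NAMESPACE`, `node_id_key`, `persist_all` and `read_all` are hypothetical names, and `MemStore` is a toy in-memory stand-in rather than LDK's `KVStore` trait.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical in-memory stand-in for a key-value store (NOT LDK's `KVStore`).
#[derive(Default)]
struct MemStore {
    entries: HashMap<(String, String), Vec<u8>>,
}

impl MemStore {
    fn write(&mut self, namespace: &str, key: &str, value: Vec<u8>) {
        self.entries.insert((namespace.to_string(), key.to_string()), value);
    }

    /// Lists all keys stored under `namespace`, sorted for determinism.
    fn list(&self, namespace: &str) -> Vec<String> {
        let mut keys: Vec<String> = self
            .entries
            .keys()
            .filter(|(ns, _)| ns.as_str() == namespace)
            .map(|(_, key)| key.clone())
            .collect();
        keys.sort();
        keys
    }

    fn read(&self, namespace: &str, key: &str) -> io::Result<Vec<u8>> {
        self.entries
            .get(&(namespace.to_string(), key.to_string()))
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "missing key"))
    }
}

/// Hypothetical namespace name, chosen for illustration only.
const PEER_STATE_NAMESPACE: &str = "example_peer_states";

/// Hex-encodes a 33-byte node id so it can serve as a storage key.
fn node_id_key(node_id: &[u8; 33]) -> String {
    node_id.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Writes every peer's serialized state under a key derived from its node id.
fn persist_all(store: &mut MemStore, peers: &HashMap<[u8; 33], Vec<u8>>) {
    for (node_id, state) in peers {
        store.write(PEER_STATE_NAMESPACE, &node_id_key(node_id), state.clone());
    }
}

/// Reads back all previously persisted peer states, as a constructor might on startup.
fn read_all(store: &MemStore) -> io::Result<HashMap<String, Vec<u8>>> {
    let mut peers = HashMap::new();
    for key in store.list(PEER_STATE_NAMESPACE) {
        peers.insert(key.clone(), store.read(PEER_STATE_NAMESPACE, &key)?);
    }
    Ok(peers)
}
```

The real handlers serialize typed peer states via TLV rather than raw byte vectors; the sketch only shows the write-per-peer and list-then-read shape.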

cc @martinsaposnic

@tnull tnull requested a review from TheBlueMatt September 5, 2025 14:31
@tnull tnull force-pushed the 2025-01-liquidity-persistence branch 2 times, most recently from 124211d to 26f3ce3 Compare September 5, 2025 14:41
@tnull tnull self-assigned this Sep 5, 2025
@tnull tnull added the weekly goal Someone wants to land this this week label Sep 5, 2025
@tnull tnull added this to the 0.2 milestone Sep 5, 2025
@tnull tnull moved this to Goal: Merge in Weekly Goals Sep 5, 2025
@tnull tnull force-pushed the 2025-01-liquidity-persistence branch 4 times, most recently from a98dff6 to d630c4e Compare September 5, 2025 14:58
codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 47.72313% with 287 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.23%. Comparing base (867f084) to head (0367a27).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lightning-liquidity/src/manager.rs | 66.96% | 71 Missing and 3 partials ⚠️ |
| lightning-liquidity/src/persist.rs | 47.12% | 41 Missing and 5 partials ⚠️ |
| lightning-liquidity/src/lsps2/service.rs | 21.56% | 38 Missing and 2 partials ⚠️ |
| lightning-liquidity/src/lsps5/service.rs | 18.42% | 31 Missing ⚠️ |
| lightning-liquidity/src/events/event_queue.rs | 17.14% | 29 Missing ⚠️ |
| lightning-liquidity/src/events/mod.rs | 0.00% | 28 Missing ⚠️ |
| lightning-liquidity/src/lsps5/msgs.rs | 0.00% | 16 Missing ⚠️ |
| lightning-liquidity/src/lsps0/ser.rs | 53.57% | 10 Missing and 3 partials ⚠️ |
| lightning-liquidity/src/lsps5/url_utils.rs | 33.33% | 8 Missing ⚠️ |
| lightning-liquidity/src/lsps1/client.rs | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4059      +/-   ##
==========================================
- Coverage   88.39%   88.23%   -0.16%     
==========================================
  Files         177      179       +2     
  Lines      131314   131833     +519     
  Branches   131314   131833     +519     
==========================================
+ Hits       116069   116321     +252     
- Misses      12596    12852     +256     
- Partials     2649     2660      +11     
| Flag | Coverage Δ |
| --- | --- |
| fuzzing | 22.06% <26.89%> (+0.05%) ⬆️ |
| tests | 88.06% <46.81%> (-0.17%) ⬇️ |

@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from d630c4e to 70118e7 Compare September 5, 2025 15:15
@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from 70118e7 to dd43edc Compare September 5, 2025 15:28
@martinsaposnic
Contributor

this all LGTM.

I have a small concern: maybe I’m being a little paranoid, but read_lsps2_service_peer_states and read_lsps5_service_peer_states pull every entry from the KVStore into memory with no limit. That could lead to unbounded state, exhausting memory and crashing the node. Maybe we can add a limit on how many entries we load into memory to protect against this DoS?

not sure how realistic this is though. maybe an attacker could have access to or share the same storage with the victim, and they could dump effectively infinite data onto disk. in this scenario, probably the victim would be vulnerable to other attacks too, but still..

@tnull
Contributor Author

tnull commented Sep 5, 2025

> I have a small concern: maybe I’m being a little paranoid, but read_lsps2_service_peer_states and read_lsps5_service_peer_states pull every entry from the KVStore into memory with no limit. That could lead to unbounded state, exhausting memory and crashing the node. Maybe we can add a limit on how many entries we load into memory to protect against this DoS?

Reading state from disk (currently) happens on startup only, so crashing wouldn't be the worst thing, we would simply fail to start up properly. Some even argue that we need to panic if we hit any IO errors at this point to escalate to an operator. We could add some safeguard/upper bound, but I'm honestly not sure what it would protect against.

> not sure how realistic this is though. maybe an attacker could have access to or share the same storage with the victim, and they could dump effectively infinite data onto disk. in this scenario, probably the victim would be vulnerable to other attacks too, but still..

Heh, well, if we assume the attacker has write access to our KVStore, we're very very screwed either way. Crashing could be the favorable outcome then, actually.
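
For reference, the upper bound floated in this exchange could look roughly like the following. It is a sketch only: `MAX_PEER_STATE_ENTRIES` and `check_entry_limit` are hypothetical names that are not part of lightning-liquidity, and whether such a cap is worthwhile is exactly what is being debated here.

```rust
use std::io;

/// Hypothetical cap on how many persisted entries are loaded at startup.
const MAX_PEER_STATE_ENTRIES: usize = 10_000;

/// Returns the listed keys unchanged, or an error if there are more than `max_entries`.
/// Failing early means startup aborts before any entry is read into memory.
fn check_entry_limit(keys: Vec<String>, max_entries: usize) -> io::Result<Vec<String>> {
    if keys.len() > max_entries {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("refusing to load {} entries (limit {})", keys.len(), max_entries),
        ));
    }
    Ok(keys)
}
```

Note that this turns an out-of-memory crash into a startup error, which is the same operator-visible outcome argued for above.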

@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from dd43edc to f73146b Compare September 8, 2025 07:37
@@ -45,6 +46,10 @@ pub struct LSPS2GetInfoRequest {
    pub token: Option<String>,
}

impl_writeable_tlv_based!(LSPS2GetInfoRequest, {
Collaborator

Do we really want to have two ways to serialize all these types? Wouldn't it make more sense to just use the serde serialization we already have and wrap that so that it can't all be misused?

Contributor Author

Yes, I think I'd be in favor of using TLV serialization for our own persistence.

Note that the compat guarantees of LSPS0/the JSON/serde format might not exactly match what we require in LDK, and our Rust representation might also diverge from the pure JSON impl. On top of that JSON is of course much less efficient.

Collaborator

Hmm, is there some easy way to avoid exposing that in the public API, then? Maybe a wrapper struct or extension trait for serialization somehow? Seems like kinda a footgun for users, I think?

Contributor Author

> Hmm, is there some easy way to avoid exposing that in the public API, then? Maybe a wrapper struct or extension trait for serialization somehow? Seems like kinda a footgun for users, I think?

Not quite sure I understand the footgun? You mean because these types then have Writeable as well as Serialize implementations on them and users might wrongly pick Writeable when they use the types independently from/outside of lightning-liquidity?

Collaborator

Sure, for example. Someone who uses serde presumably has some wrapper that serde-writes Writeable structs and suddenly their code could read/compile totally fine and be reading the wrong kind of thing. If they have some less-used codepaths (e.g. writing Events before they process them and then removing them again after) they might not find it immediately.

Contributor Author

> Sure, for example. Someone who uses serde presumably has some wrapper that serde-writes Writeable structs and suddenly their code could read/compile totally fine and be reading the wrong kind of thing.

I'm confused - Writeable is an LDK concept not connected to serde? Do you mean Serialize? But that also has a completely separate API? So how would they trip up? You mean they'd confuse Writeable and Serialize?
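
The wrapper-struct idea raised earlier in this thread could be sketched as follows: the public type keeps a single (serde) serialization, while the persistence encoding lives on a crate-private newtype. This is only a sketch under assumptions: `PersistEncode`, `GetInfoRequest` and `PersistedGetInfoRequest` are hypothetical names, `PersistEncode` is a stand-in rather than LDK's `Writeable`, and the byte layout is a toy format, not TLV.

```rust
/// Hypothetical stand-in for a binary persistence trait (not LDK's `Writeable`).
trait PersistEncode {
    fn encode(&self) -> Vec<u8>;
}

/// Public message type; in a real crate it would carry the serde impls only.
pub struct GetInfoRequest {
    pub token: Option<String>,
}

/// Crate-private wrapper: only this newtype implements the persistence encoding,
/// so users of `GetInfoRequest` never see a second serialization format.
struct PersistedGetInfoRequest<'a>(&'a GetInfoRequest);

impl<'a> PersistEncode for PersistedGetInfoRequest<'a> {
    fn encode(&self) -> Vec<u8> {
        // Toy layout: presence byte, then a 4-byte big-endian length and the token bytes.
        let mut out = Vec::new();
        match &self.0.token {
            Some(token) => {
                out.push(1);
                out.extend_from_slice(&(token.len() as u32).to_be_bytes());
                out.extend_from_slice(token.as_bytes());
            }
            None => out.push(0),
        }
        out
    }
}
```

Because the wrapper is not exported, downstream code cannot accidentally pick the persistence format when it meant the JSON one.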

) -> Pin<Box<dyn Future<Output = Result<(), lightning::io::Error>> + Send>> {
    let outer_state_lock = self.per_peer_state.read().unwrap();
    let mut futures = Vec::new();
    for (counterparty_node_id, peer_state) in outer_state_lock.iter() {
Collaborator

Huh? Why would we ever want to do a single huge persist pass and write every peer's state at once? Shouldn't we be doing this iteratively? Same applies in the LSPS2 service.

Contributor Author

Yes, only persisting what's needed/changed will be part of the next PR as it ties into how we wake the BP to drive persistence (cf. "Avoid re-persisting peer states if no changes happened (needs_persist flag everywhere)" bullet over at #4058 (comment)).

Collaborator

I'm confused why we're adding this method then? If it's going to be removed in the next PR in the series we should just not add it in the first place.

Contributor Author

No, it's not gonna be removed, but extended: PeerState (here as well as in LSPS2) will gain a dirty/needs_persist flag and we'd simply skip persisting any entries that haven't been changed since the last persistence round.

Collaborator

That seems like a weird design if we need to persist something immediately while it's being operated on - we have the node in question, so why walk a whole peer list? Can you put up the followup code so we can see how it's going to be used? Given this PR is mostly boilerplate I honestly wouldn't mind it being a bit bigger, as long as the code isn't too crazy.

Contributor Author

@tnull tnull Sep 11, 2025

> That seems like a weird design if we need to persist something immediately while it's being operated on - we have the node in question, so why walk a whole peer list?

Yes, this is why persist_peer_state is a separate method - for inline persistence where we already hold the lock to the peer state we'd just call that. For the general/eventual persistence the background processor task calls LiquidityManager::persist, which calls through to the respective LSPS*ServiceHandler::persist methods, which then only persist the entries marked dirty since the last persistence round.

> Can you put up the followup code so we can see how it's going to be used? Given this PR is mostly boilerplate I honestly wouldn't mind it being a bit bigger, as long as the code isn't too crazy.

Sure, will do as soon as it's ready and in a coherent state, although I had hoped to land this PR this week.
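
Until the followup code is up, the split described above (a per-peer method for inline persistence plus a background pass that skips clean entries) might be sketched like this. All of it is an assumption for illustration: the `PeerState` fields, the plain `HashMap` standing in for the store, and the function signatures are hypothetical and will likely differ from the eventual implementation.

```rust
use std::collections::HashMap;

/// Hypothetical per-peer state carrying a dirty flag.
struct PeerState {
    data: Vec<u8>,
    needs_persist: bool,
}

/// Persists a single peer and clears its flag. Callable inline when the
/// caller already holds the peer state in question.
fn persist_peer_state(store: &mut HashMap<String, Vec<u8>>, key: &str, peer: &mut PeerState) {
    store.insert(key.to_string(), peer.data.clone());
    peer.needs_persist = false;
}

/// Background pass: walks all peers but only writes those marked dirty.
/// Returns the number of entries written.
fn persist(
    store: &mut HashMap<String, Vec<u8>>,
    peers: &mut HashMap<String, PeerState>,
) -> usize {
    let mut written = 0;
    for (key, peer) in peers.iter_mut() {
        if peer.needs_persist {
            persist_peer_state(store, key, peer);
            written += 1;
        }
    }
    written
}
```

A second call to `persist` with no intervening changes writes nothing, which is the "avoid re-persisting" behavior referenced in the tracking issue.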

@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from f73146b to 2971982 Compare September 9, 2025 07:35
@tnull
Contributor Author

tnull commented Sep 9, 2025

Rebased to address minor conflict.

@tnull tnull requested a review from TheBlueMatt September 10, 2025 07:22
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Responded to the outstanding comments, not quite sure I fully get all the rationale here.

@tnull tnull requested a review from TheBlueMatt September 10, 2025 12:41
@TheBlueMatt TheBlueMatt removed their request for review September 10, 2025 17:29
We add `KVStore` to `LiquidityManager`, which will be used in the next
commits. We also add a `LiquidityManagerSync` wrapper that wraps the
`LiquidityManager` interface which will soon become async due to usage
of the async `KVStore`.
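
One common way to build such a sync wrapper over an async interface is to poll the future exactly once with a no-op waker, which works when the underlying store completes synchronously. The following is a minimal sketch of that general technique using only the standard library; `poll_once`, `persist_async` and `persist_sync` are hypothetical names, and the actual `LiquidityManagerSync` may be structured differently.

```rust
use std::future::Future;
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

fn dummy_clone(_: *const ()) -> RawWaker {
    RawWaker::new(ptr::null(), &DUMMY_VTABLE)
}
fn dummy_noop(_: *const ()) {}

static DUMMY_VTABLE: RawWakerVTable =
    RawWakerVTable::new(dummy_clone, dummy_noop, dummy_noop, dummy_noop);

/// Builds a waker that does nothing when woken.
fn dummy_waker() -> Waker {
    // SAFETY: none of the vtable functions dereference the (null) data pointer.
    unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &DUMMY_VTABLE)) }
}

/// Polls a future exactly once; returns `None` if it was not immediately ready.
fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    let mut fut = Box::pin(fut);
    let waker = dummy_waker();
    let mut cx = Context::from_waker(&waker);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(output) => Some(output),
        Poll::Pending => None,
    }
}

/// Hypothetical async API whose future happens to be ready immediately.
fn persist_async(value: u32) -> impl Future<Output = Result<u32, ()>> {
    std::future::ready(Ok(value + 1))
}

/// Hypothetical sync wrapper: only valid if the wrapped future never yields.
fn persist_sync(value: u32) -> Result<u32, ()> {
    poll_once(persist_async(value)).expect("sync-backed future must complete on first poll")
}
```

The hand-rolled waker is the part a newer standard library can replace, which is presumably why a `dummy_waker` helper is described as temporary further down.
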
@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from 2971982 to 93c234f Compare September 11, 2025 11:36
Comment on lines +602 to +614
let mut peer_by_intercept_scid = new_hash_map();
let mut peer_by_channel_id = new_hash_map();
for (node_id, peer_state) in peer_states.iter() {
    for (intercept_scid, _) in peer_state.outbound_channels_by_intercept_scid.iter() {
        let res = peer_by_intercept_scid.insert(*intercept_scid, *node_id);
        debug_assert!(res.is_none(), "Intercept SCIDs should never collide");
    }

    for (channel_id, _) in peer_state.intercept_scid_by_channel_id.iter() {
        let res = peer_by_channel_id.insert(*channel_id, *node_id);
        debug_assert!(res.is_none(), "Channel IDs should never collide");
    }
}
Contributor

The code uses debug_assert! to check for collisions in intercept_scid and channel_id mappings during state reconstruction. Since these assertions are removed in release builds, any actual collisions in persisted data (whether from corruption or malicious input) would result in silent overwrites of map entries. This could lead to incorrect routing behavior or state corruption.

Consider replacing these debug assertions with proper error handling that would be active in all build configurations. For example:

if peer_by_intercept_scid.insert(*intercept_scid, *node_id).is_some() {
    return Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "Corrupted state: Intercept SCID collision detected"
    ));
}

This would ensure the integrity of the reconstructed state even in production environments.

Suggested change
- let mut peer_by_intercept_scid = new_hash_map();
- let mut peer_by_channel_id = new_hash_map();
- for (node_id, peer_state) in peer_states.iter() {
-     for (intercept_scid, _) in peer_state.outbound_channels_by_intercept_scid.iter() {
-         let res = peer_by_intercept_scid.insert(*intercept_scid, *node_id);
-         debug_assert!(res.is_none(), "Intercept SCIDs should never collide");
-     }
-     for (channel_id, _) in peer_state.intercept_scid_by_channel_id.iter() {
-         let res = peer_by_channel_id.insert(*channel_id, *node_id);
-         debug_assert!(res.is_none(), "Channel IDs should never collide");
-     }
- }
+ let mut peer_by_intercept_scid = new_hash_map();
+ let mut peer_by_channel_id = new_hash_map();
+ for (node_id, peer_state) in peer_states.iter() {
+     for (intercept_scid, _) in peer_state.outbound_channels_by_intercept_scid.iter() {
+         if peer_by_intercept_scid.insert(*intercept_scid, *node_id).is_some() {
+             return Err(io::Error::new(
+                 io::ErrorKind::InvalidData,
+                 "Corrupted state: Intercept SCID collision detected"
+             ));
+         }
+     }
+     for (channel_id, _) in peer_state.intercept_scid_by_channel_id.iter() {
+         if peer_by_channel_id.insert(*channel_id, *node_id).is_some() {
+             return Err(io::Error::new(
+                 io::ErrorKind::InvalidData,
+                 "Corrupted state: Channel ID collision detected"
+             ));
+         }
+     }
+ }

Spotted by Diamond

@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from 93c234f to 716c06a Compare September 11, 2025 11:42
We add a simple `persist` call to `LSPS2ServiceHandler` that sequentially
persists all the peer states under a key that encodes their node id.
We add a simple `persist` call to `LSPS5ServiceHandler` that sequentially
persists all the peer states under a key that encodes their node id.
We add a simple `persist` call to `EventQueue` that persists it under an
`event_queue` key.
.. this is likely only temporarily necessary as we can drop our own
`dummy_waker` implementation once we bump MSRV.
We read any previously-persisted state upon construction of
`LiquidityManager`.
@tnull tnull force-pushed the 2025-01-liquidity-persistence branch from 716c06a to 0367a27 Compare September 11, 2025 11:50
@tnull tnull requested a review from TheBlueMatt September 11, 2025 13:00