
Attributable failures #2256

Merged: joostjager merged 6 commits into main from attr-errs on Apr 3, 2025
Conversation

@joostjager (Contributor) commented May 2, 2023

Implements lightning/bolts#1044

This PR implements the optional `attribution_data` field on `update_fail_htlc` as TLV type 101 rather than type 1, until interop with other node software is achieved.

Until that time, the feature bit doesn't need to be set either.
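For context, attaching an optional TLV to `update_fail_htlc` with LDK's `impl_writeable_msg!`-style macros looks roughly like this. A sketch only, not the PR's actual code; the field list is assumed from the message definition in the spec:

```rust
// Odd TLV types are ignorable by peers that don't understand them, so an
// experimental type 101 can ship before interop and move to type 1 later.
impl_writeable_msg!(UpdateFailHTLC, {
	channel_id,
	htlc_id,
	reason
}, {
	(101, attribution_data, option)
});
```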

@joostjager (Contributor Author)

Single-hop interop testing passes.

Multi-hop interop testing seems to be blocked by gossip propagation issues between lnd and rust-lightning.

@TheBlueMatt (Collaborator)

Nice! That's not too bad, thanks for working on this. I'll dig into the crypto in a day or two. Have a number of comments on the code itself and structure, but I assume it's not really worth digging into until we support both old and new errors? I'm happy to give more code-level feedback now if you prefer.

> Multi-hop interop testing seems to be blocked by gossip propagation issues between lnd and rust-lightning.

Oh? Is this some known issue on the lnd end? I'm not aware of any gossip errors.

@joostjager (Contributor Author)

> Have a number of comments on the code itself and structure, but I assume it's not really worth digging into until we support both old and new errors? I'm happy to give more code-level feedback now if you prefer.

This doesn't surprise me. This is my first venture into Rust land. It is a bit of a switch from golang, and I need to get used to how things are done. The language is not bad though, I can see why people fall in love with it! But yes, can hold off on code-level comments for now.

> Oh? Is this some known issue on the lnd end? I'm not aware of any gossip errors.

Not sure if it is a known issue. I've had nagging gossip issues before when I tried to do a pathfinding benchmark. For this PR, I applied a patch to rust-lightning somewhere to always force-send the node announcement, and then it worked better. Will try to come up with a reasonable repro scenario.

@TheBlueMatt (Collaborator)

What are you using to do the testing? I assume ldk-sample of some form? If you change the timer at https://github.com/lightningdevkit/ldk-sample/blob/main/src/main.rs#L791 it will rebroadcast a fresh node announcement more aggressively.

@joostjager (Contributor Author)

Yes, ldk-sample. The timer was firing alright, but somehow the announcement got filtered out in the timer handler.

@TheBlueMatt (Collaborator)

Would be happy to take a look at logs. At the TRACE level we should basically be writing everything that is going out or coming in on the wire.

@joostjager (Contributor Author)

Yes, so I did see that the announcement wasn't going out. Will continue tomorrow and get back with more data.

@TheBlueMatt (Collaborator)

Errr, right, so to avoid dumping tons of gossip crap, that message is only really logged at a high level (`Broadcasting NodeAnnouncement after passing it to our own RoutingMessageHandler`), with the remaining per-peer logs at the GOSSIP level (disabled by default because it's verbose): `Sending message to all peers except {:?} or the announced node: {:?}` and `Skipping broadcast message to {:?} as its outbound buffer is full`.

@joostjager (Contributor Author)

Looks like the timer is at 60s. I definitely waited much longer than that, so that doesn't seem to be the problem. Enabled gossip logging and saved the log files. Continuing in #2259

@joostjager (Contributor Author)

With gossip fixed via the hint in #2259, I was able to run through a few of the multi-hop interop scenarios:
- LND -> LDK -> LDK
- LDK -> LDK -> LND
- LND -> LDK (intermediate failure)

Obviously there are more, but I think this is a good enough sanity check for now.

As mentioned above, attention should go to the crypto part of this feature first.

@joostjager changed the title from "Convert to attributable errors" to "Attributable failures" on Mar 11, 2025
@joostjager marked this pull request as ready for review on March 11, 2025 15:41
@joostjager (Contributor Author)

Pushed 2025 version of the code to this PR.

@TheBlueMatt (Collaborator) left a comment

One quick comment from a glance

@@ -597,12 +597,18 @@ impl InvoiceBuilder<tb::False, tb::False, tb::False, tb::False, tb::False, tb::F
/// Construct new, empty `InvoiceBuilder`. All necessary fields have to be filled first before
/// `InvoiceBuilder::build(self)` becomes available.
pub fn new(currency: Currency) -> Self {
let mut features = Bolt11InvoiceFeatures::empty();
features.set_attributable_failures_optional();
TheBlueMatt (Collaborator):
Generally we set features elsewhere, afaiu. Setting it here forces everyone using this to include the set features, but it's possible someone builds an invoice using this logic that isn't terminating at a ChannelManager.

joostjager (Contributor Author):
Is there a different place in LDK where this feature can be set, or is it left fully up to the caller to not forget?

TheBlueMatt (Collaborator):
The invoice builder mostly has setters that set individual flags, called by invoice_utils / `channelmanager::create_bolt11_invoice`, e.g. `basic_mpp`.

joostjager (Contributor Author):
Will leave it out of this PR except for the definition of the bits. Until interop is completed we don't need the feature bits.
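For reference, the per-flag setter pattern described above, modeled loosely on `basic_mpp`; the types below are simplified stand-ins rather than LDK's actual builder:

```rust
// Simplified stand-ins for LDK's feature and builder types.
struct Bolt11InvoiceFeatures;
impl Bolt11InvoiceFeatures {
	fn set_attributable_failures_optional(&mut self) { /* flip the bit */ }
}

struct InvoiceBuilder { features: Bolt11InvoiceFeatures }
impl InvoiceBuilder {
	// Callers that actually terminate at a ChannelManager opt in
	// explicitly, rather than the builder forcing the feature onto every
	// invoice it constructs.
	fn attributable_failures(mut self) -> Self {
		self.features.set_attributable_failures_optional();
		self
	}
}
```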

@joostjager marked this pull request as draft on March 11, 2025 16:37
@joostjager force-pushed the attr-errs branch 4 times, most recently from 3b957d4 to 369fe76 on March 12, 2025 13:25
@joostjager requested a review from arik-so on March 12, 2025 13:36
@joostjager (Contributor Author)

Ready for a first pass

@TheBlueMatt (Collaborator) left a comment

Note that the last commit's message is wrong. It says "Signal to senders that the node will return attributable failures", but we don't actually do any signaling, we just define the feature bit.

@@ -10664,6 +10688,7 @@ impl<'a, 'b, 'c, ES: Deref, SP: Deref> ReadableArgs<(&'a ES, &'b SP, &'c Channel
},
skimmed_fee_msat: None,
blinding_point: None,
timestamp: None,
TheBlueMatt (Collaborator):
Do we want to store this, or would we simply use 0 for the timestamp after a restart?

joostjager (Contributor Author):
It could work with 0 as the magic value for not available, but I think it's better to be explicit with an option?
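A minimal illustration of the trade-off (simplified struct, field name from the diff):

```rust
use core::time::Duration;

struct OutboundHtlc {
	// None means "unknown", e.g. for HTLCs restored from disk after a
	// restart; that is distinct from an actual timestamp of 0.
	timestamp: Option<Duration>,
}

fn hold_time(htlc: &OutboundHtlc, now: Duration) -> Duration {
	match htlc.timestamp {
		Some(sent) => now.saturating_sub(sent),
		// With a 0 sentinel this case would silently become "held since
		// the epoch"; the Option forces it to be handled explicitly.
		None => Duration::ZERO,
	}
}
```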

Comment on lines 925 to 928
let mut writer = VecWriter(Vec::new());
failuremsg.write(&mut writer).unwrap();
pad.write(&mut writer).unwrap();
let encoded_msg = writer.0;
TheBlueMatt (Collaborator):
Ha, this is such a mess... I think this does all of the ser logic much cleaner.

let mut failuremsg = VecWriter(Vec::with_capacity(1024));
failuremsg.0.extend_from_slice(&[0; 32][..]); // leave room for the hmac
(failure_data.len() as u16 + 2).write(&mut failuremsg).expect("Cannot fail to write to a vec");
failure_type.write(&mut failuremsg).expect("Cannot fail to write to a vec");
failuremsg.0.extend_from_slice(&failure_data[..]);
if failuremsg.0.len() < 1024 { failuremsg.0.resize(1024, 0); }

let data = failuremsg.0;
let hmac = Hmac(&data[32..]);
data[..32].copy_from_slice(&hmac);

joostjager (Contributor Author), Mar 14, 2025:
I don't think this code is correct, because it doesn't write the pad length. I'm also not sure it's all that much more readable. But I did take inspiration from it and removed one of the writers I originally had.

TheBlueMatt (Collaborator), Mar 14, 2025:
One of the reasons to do it this way is that it avoids three entire Vecs, including potentially quite a few allocations in the VecWriter. I'd strongly prefer to do it all in one Vec if possible. (In general, C/Rust applications suffer from memory fragmentation when they allocate a lot, leading to ~2x the total memory usage they actually need. Without a GC no defragmentation can occur, so we end up stuck with it... hence we try to be pretty careful about unnecessary allocations.)

joostjager (Contributor Author):
Added a refactor commit that does it all in a single pre-allocated vec.
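For illustration, a single-allocation construction along the lines of that refactor, following the BOLT 4 failure packet layout (hmac, failure length, failure message, pad length, padding). This is a hedged sketch rather than the PR's code, and `hmac_sha256` is a hypothetical stand-in for the real keyed HMAC:

```rust
fn build_failure_packet(
	um: &[u8; 32], failure_type: u16, failure_data: &[u8], packet_len: usize,
) -> Vec<u8> {
	let failure_len = 2 + failure_data.len(); // 2-byte type + data
	let pad_len = packet_len - 32 - 2 - failure_len - 2;

	// One Vec, sized up front, so nothing reallocates.
	let mut packet = Vec::with_capacity(packet_len);
	packet.extend_from_slice(&[0u8; 32]); // leave room for the hmac
	packet.extend_from_slice(&(failure_len as u16).to_be_bytes());
	packet.extend_from_slice(&failure_type.to_be_bytes());
	packet.extend_from_slice(failure_data);
	packet.extend_from_slice(&(pad_len as u16).to_be_bytes());
	packet.resize(packet_len, 0); // zero padding, same allocation

	// The hmac covers everything after its own slot.
	let hmac = hmac_sha256(um, &packet[32..]);
	packet[..32].copy_from_slice(&hmac);
	packet
}

// Hypothetical helper; LDK would use its existing HMAC machinery here.
fn hmac_sha256(_key: &[u8; 32], _msg: &[u8]) -> [u8; 32] { unimplemented!() }
```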

@@ -1049,6 +1082,73 @@ where
crypt_failure_packet(shared_secret.as_ref(), &mut encrypted_packet);

let um = gen_um_from_shared_secret(shared_secret.as_ref());

// Check attr error hmacs if present.
if let Some(ref attribution_data) = encrypted_packet.attribution_data {
TheBlueMatt (Collaborator):
This looks like a good opportunity to add a fuzzer directly on process_onion_failure.

joostjager (Contributor Author):
Discussed offline, leaving it outside the scope of this PR. This new data block indeed seems like a good candidate for fuzzing.

@@ -8671,6 +8680,7 @@ impl<SP: Deref> FundedChannel<SP> where
return Ok(None);
}

let timestamp = duration_since_epoch();
TheBlueMatt (Collaborator):
Can you add a comment here describing why we're picking the timestamp when adding the HTLC here? Presumably something about how we'd prefer to blame the downstream counterparty as much as possible, but that doing so thoroughly would require passing timestamps around from the inbound channel, so we don't bother.

joostjager (Contributor Author), Mar 14, 2025:
Added a comment describing that it doesn't matter all that much which measuring points we pick.

I'm not convinced that we should blame the downstream as much as possible, because if our direct upstream peer is the sender, we're unnecessarily blaming ourselves.

@@ -323,6 +324,7 @@ struct OutboundHTLCOutput {
source: HTLCSource,
blinding_point: Option<PublicKey>,
skimmed_fee_msat: Option<u64>,
timestamp: Option<Duration>,
TheBlueMatt (Collaborator):
nit: Should we call this like "queued_timestamp" or so?

joostjager (Contributor Author):
I've made it `send_timestamp`, in line with where it's set in `send_htlc`.

@joostjager (Contributor Author)

Rebased onto main, which now includes the fuzzer for this.

@TheBlueMatt (Collaborator)

CI is quite sad:


thread 'ln::blinded_payment_tests::test_trampoline_single_hop_receive' panicked at lightning/src/ln/onion_utils.rs:1221:28:
attempt to subtract with overflow
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/core/src/panicking.rs:75:14
   2: core::panicking::panic_const::panic_const_sub_overflow
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/core/src/panicking.rs:178:21
   3: lightning::ln::onion_utils::process_onion_failure_inner
             at ./src/ln/onion_utils.rs:1221:19
   4: lightning::ln::onion_utils::process_onion_failure
             at ./src/ln/onion_utils.rs:1018:3
   5: lightning::ln::onion_utils::HTLCFailReason::decode_onion_failure
             at ./src/ln/onion_utils.rs:1650:5
   6: lightning::ln::outbound_payment::OutboundPayments::fail_htlc
             at ./src/ln/outbound_payment.rs:2167:7
   7: lightning::ln::channelmanager::ChannelManager<M,T,ES,NS,SP,F,R,MR,L>::fail_htlc_backwards_internal_without_forward_event
             at ./src/ln/channelmanager.rs:7020:26
   8: lightning::ln::channelmanager::ChannelManager<M,T,ES,NS,SP,F,R,MR,L>::fail_htlc_backwards_internal
             at ./src/ln/channelmanager.rs:6993:28
   9: lightning::ln::channelmanager::ChannelManager<M,T,ES,NS,SP,F,R,MR,L>::internal_revoke_and_ack
             at ./src/ln/channelmanager.rs:3301:4
  10: <lightning::ln::channelmanager::ChannelManager<M,T,ES,NS,SP,F,R,MR,L> as lightning::ln::msgs::ChannelMessageHandler>::handle_revoke_and_ack
             at ./src/ln/channelmanager.rs:12158:31
  11: lightning::ln::functional_test_utils::commitment_signed_dance_through_cp_raa
             at ./src/ln/functional_test_utils.rs:2075:2
  12: lightning::ln::functional_test_utils::do_commitment_signed_dance
             at ./src/ln/functional_test_utils.rs:2056:11
  13: lightning::ln::blinded_payment_tests::do_test_trampoline_single_hop_receive
             at ./src/ln/blinded_payment_tests.rs:2159:4
  14: lightning::ln::blinded_payment_tests::test_trampoline_single_hop_receive::{{closure}}
             at ./src/ln/blinded_payment_tests.rs:2170:40
  15: core::ops::function::FnOnce::call_once
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/core/src/ops/function.rs:250:5
  16: core::ops::function::FnOnce::call_once
             at /rustc/4eb161250e340c8f48f66e2b929ef4a5bed7c181/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@joostjager (Contributor Author)

CI should be happy again now.

@carlaKC (Contributor) left a comment

LGTM, I'm hitting the limits of my context in the codebase, so my remaining comments are nitty/non-blocking.

The only thing left for me to review thoroughly is the tests in the last commit.

) -> OnionErrorPacket {
assert_eq!(shared_secret.len(), 32);
assert!(failure_data.len() <= 256 - 2);
carlaKC (Contributor):
Rather than removing this completely, assert that it's less than u16::MAX - 2? In theory, `(failure_len as u16)` below could overflow.

joostjager (Contributor Author):
Brought it back for the actual maximum failure_data length that brings the total message size to 65535, the spec max. Added a test for that too.
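The bound works out as follows; an arithmetic sketch assuming zero padding (the exact constant in the PR may differ):

```rust
// BOLT 4 failure packet: hmac (32) || failure_len (2) || failure message
// (2-byte type + failure_data) || pad_len (2) || pad. With no padding, the
// largest failure_data that keeps the packet within the spec max is:
const SPEC_MAX_PACKET: usize = 65535;
const MAX_FAILURE_DATA_LEN: usize = SPEC_MAX_PACKET - 32 - 2 - 2 - 2; // 65497

fn sanity_check(failure_data: &[u8]) {
	// Also guarantees the `as u16` length casts cannot overflow.
	assert!(failure_data.len() <= MAX_FAILURE_DATA_LEN);
}
```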

@@ -2172,6 +2268,156 @@ impl_writeable!(AttributionData, {
hmacs
});

impl AttributionData {
carlaKC (Contributor):
Really nice comments here 🤩

joostjager (Contributor Author):
Thank you! This is what we were used to 😉


/// Shifts hold times and HMACs to the left, taking into account HMAC pruning. This is the inverse operation of what
/// hops do when back-propagating the failure.
fn shift_left(&mut self) {
TheBlueMatt (Collaborator):
I mean, this is fine, I guess, but it seems pretty redundant to shift the HMACs left every time we verify. We already pass the position to verify, and it already picks the HMACs it's HMACing based on that position, so we might as well just... look at the HMACs we're HMACing based on the position?

joostjager (Contributor Author):
I had this very same thought, but not shifting also means that the chacha pass needs to be performed in a complicated way, skipping the slots that would otherwise have been pruned. It seems simpler to just do the inverse operation, but I'm open to suggestions.

TheBlueMatt (Collaborator):
Hmm, yea, that seems not worth it, nevermind.
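For readers following along: the structure being shifted comes from the attributable-failures spec (constants here are assumed from bolts#1044). The hmacs form a pyramid in which block b, added by the node b hops downstream, has been pruned by one hmac per hop of back-propagation. A sketch of that arithmetic, plus the simple hold-time half of the shift, under those assumptions:

```rust
const MAX_HOPS: usize = 20;
const HOLD_TIME_LEN: usize = 4; // 4-byte hold time per hop
const HMAC_LEN: usize = 4; // truncated hmacs

// Block b holds MAX_HOPS - b hmacs, so the field carries
// 20 + 19 + ... + 1 = 210 hmacs (840 bytes) next to 80 bytes of hold times.
const TOTAL_HMACS: usize = MAX_HOPS * (MAX_HOPS + 1) / 2;

// Byte offset of hmac block b within the hmacs field.
fn hmac_block_offset(b: usize) -> usize {
	(0..b).map(|i| MAX_HOPS - i).sum::<usize>() * HMAC_LEN
}

// The hold-time half of a left shift is a plain copy; the hmac half has to
// walk the pyramid blocks, which is why skipping the shift would force the
// chacha pass to hop over the pruned slots.
fn shift_hold_times_left(hold_times: &mut [u8; MAX_HOPS * HOLD_TIME_LEN]) {
	hold_times.copy_within(HOLD_TIME_LEN.., 0);
	let len = hold_times.len();
	hold_times[len - HOLD_TIME_LEN..].fill(0);
}
```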

@TheBlueMatt (Collaborator)

Otherwise largely LGTM

joostjager and others added 5 commits April 1, 2025 10:57

- Easier to verify that the correct key is used.
- This commit does not yet introduce attribution data, but just adds the field and required serialization logic to the relevant types. (Co-authored-by: Matt Corallo <[email protected]>)
- Record a timestamp when the HTLC is sent out and record the hold duration alongside the failure reason.
- Improve efficiency by not utilizing multiple Vecs.
- Prepares for new test vectors that pad to 1024 bytes.
@TheBlueMatt previously approved these changes Apr 2, 2025
@TheBlueMatt (Collaborator) left a comment

LGTM

// just shift sender-applied penalties between our incoming and outgoing side. So we choose measuring points
// that are simple to implement, and we do it on the outgoing side because then the failure message that encodes
// the hold time still needs to be built in channel manager.
let send_timestamp = duration_since_epoch();
TheBlueMatt (Collaborator):
I think it'd be nice to set this when we push into the holding cell as well, since in practice something getting stuck in the holding cell may be the fault of our peer.

joostjager (Contributor Author):
When it gets out of the holding cell, I think it will pass through this point again. The reported hold time will be shorter because the timestamp is later, but I don't think that's a problem, for the reason mentioned in the code comment. Definitely good enough for this stage, I think.

@carlaKC (Contributor) left a comment

Nice tests, LGTM!

@TheBlueMatt (Collaborator) left a comment

CI is sad:

  INFO [lightning::ln::onion_utils:1417] Onion Error[from 02edabbd16b41c8371b92ef2f04c1185b4f03b6dcd52ba9b78d9d7c89c8f221145: incorrect_or_unknown_payment_details(0x400f)] The final node indicated the payment hash is unknown or amount is incorrect
thread 'ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations' panicked at 'assertion failed: decrypted_failure.short_channel_id == Some(mutating_node as u64)', lightning/src/ln/onion_utils.rs:2769:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: lightning::ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations
             at ./src/ln/onion_utils.rs:2769:5
   4: lightning::ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations::{{closure}}
             at ./src/ln/onion_utils.rs:2753:2
   5: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations

This commit extends the generation, forwarding and interpretation of HTLC fail messages with attribution data. This allows senders to identify the failing node even when that node does not want to be identified and generates a failure message without a valid HMAC.

For more information, see the BOLT spec.
@joostjager (Contributor Author) commented Apr 3, 2025

Oops, had to be 0x4000 | 15 for the failure code. Fixed.
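For reference, the flag arithmetic, matching the `0x400f` in the CI log above:

```rust
// BOLT 4 failure code flags (values from the spec).
const PERM: u16 = 0x4000;
// incorrect_or_unknown_payment_details = PERM | 15 = 0x400f.
const INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS: u16 = PERM | 15;
```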

@joostjager merged commit f9b8d63 into lightningdevkit:main on Apr 3, 2025
25 of 27 checks passed