-
Notifications
You must be signed in to change notification settings - Fork 399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attributable failures #2256
Attributable failures #2256
Conversation
Single hop interop testing passes. Multi hop interop testing seems to be blocked by gossip propagation issues between lnd and rust-lightning. |
Nice! That's not too bad, thanks for working on this. I'll dig into the crypto in a day or two. Have a number of comments on the code itself and structure, but I assume its not really worth digging into until we support both old and new errors? I'm happy to give more code-level feedback now if you prefer.
Oh? Is this some known issue on the lnd end? I'm not aware of any gossip errors. |
This doesn't surprise me. This is my first venture into Rust land. It is a bit of a switch from golang, and I need to get used to how things are done. The language is not bad though, I can see why people fall in love with it! But yes, can hold off on code-level comments for now.
Not sure if it is a known issue. I've had nagging gossip issues before when I tried to do a pathfinding benchmark. For this PR, I did apply a patch to rust-lightning somewhere to force-send node announcement always and then it worked better. Will try to come up with a reasonable repro scenario. |
What are you using to do the testing? I assume ldk-sample of some form? If you change the timer at https://github.com/lightningdevkit/ldk-sample/blob/main/src/main.rs#L791 it will rebroadcast a fresh node announcement more aggressively. |
Yes, ldk-sample. It was doing the timer alright, but somehow it got filtered out in the timer handler. |
Would be happy to take a look at logs. At the TRACE level we should basically be writing everything that is going out or coming in on the wire. |
Yes, so I did see that the announcement wasn't going out. Will continue tomorrow and get back with more data. |
Errr, right, so to avoid dumping tons of gossip crap that message is only really logged at a high level ( |
Looks like timer is at 60s. I definitely waited much longer than that, so doesn't seem to be the problem. Enabled gossip logging and saved log files. Continuing in #2259 |
With gossip fixed with the hint in #2259, I was able to run through a few of the multi-hop inter-op scenarios: Obviously there are more, but I think this is a good enough sanity check for now. As mentioned above, attention should go to the crypto part of this feature first. |
Pushed 2025 version of the code to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One quick comment from a glance
lightning-invoice/src/lib.rs
Outdated
@@ -597,12 +597,18 @@ impl InvoiceBuilder<tb::False, tb::False, tb::False, tb::False, tb::False, tb::F | |||
/// Construct new, empty `InvoiceBuilder`. All necessary fields have to be filled first before | |||
/// `InvoiceBuilder::build(self)` becomes available. | |||
pub fn new(currency: Currency) -> Self { | |||
let mut features = Bolt11InvoiceFeatures::empty(); | |||
features.set_attributable_failures_optional(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we set features elsewhere, afaiu. Setting it here forces everyone using this to include the set features, but its possible someone builds an invoice using this logic that isn't terminating at a ChannelManager
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a different place in LDK where this feature can be set, or is it left fully up to the caller to not forget?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The invoice builder mostly has setters that set individual flags called by invoice_utils
/channelmanager::create_bolt11_invoice
. eg basic_mpp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will leave it out of this PR except for the definition of the bits. Until interop is completed we don't need the feature bits.
3b957d4
to
369fe76
Compare
Ready for a first pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the last commit's commit message is wrong. It says that "Signal to senders that the node will return attributable failures", but we don't actually do any signaling, we just define the feature bit.
lightning/src/ln/channel.rs
Outdated
@@ -10664,6 +10688,7 @@ impl<'a, 'b, 'c, ES: Deref, SP: Deref> ReadableArgs<(&'a ES, &'b SP, &'c Channel | |||
}, | |||
skimmed_fee_msat: None, | |||
blinding_point: None, | |||
timestamp: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to store this, or would we simply use 0 for the timestamp after a restart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could work with 0
as the magic value for not available, but I think it's better to be explicit with an option?
lightning/src/ln/onion_utils.rs
Outdated
let mut writer = VecWriter(Vec::new()); | ||
failuremsg.write(&mut writer).unwrap(); | ||
pad.write(&mut writer).unwrap(); | ||
let encoded_msg = writer.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, this is such a mess... I think this does all of the ser logic much cleaner.
let mut failuremsg = VecWriter(Vec::with_capacity(1024));
failuremsg.0.extend_from_slice(&[0; 32][..]); // leave room for the hmac
(failure_data.len() as u16 + 2).write(&mut failuremsg).expect("Cannot fail to write to a vec");
failure_type.write(&mut failuremsg).expect("Cannot fail to write to a vec");
failuremsg.0.extend_from_slice(&failure_data[..]);
if failuremsg.0.len() < 1024 { failuremsg.0.resize(1024, 0); }
let data = failuremsg.0;
let hmac = Hmac(&data[32..]);
data[..32].copy_from_slice(&hmac);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code isn't correct I think, because it doesn't write the pad length. Not sure if it is all that much more readable also. But I did take inspiration from it and removed one of the writers that I originally had.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reasons to do it this way is it avoids three entire Vecs, including potentially quite a few allocations in the VecWriter
. I'd strongly prefer to do it all in one vec if possible (in general, C/Rust applications suffer from memory fragmentation when they allocate a lot, leading to ~2x the total memory usage that they actually use. Without a GC no defragmentation can occur so we end up stuck with it...hence we try to be pretty careful about unnecessary allocations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a refactor commit that does it all in a single pre-allocated vec.
lightning/src/ln/onion_utils.rs
Outdated
@@ -1049,6 +1082,73 @@ where | |||
crypt_failure_packet(shared_secret.as_ref(), &mut encrypted_packet); | |||
|
|||
let um = gen_um_from_shared_secret(shared_secret.as_ref()); | |||
|
|||
// Check attr error hmacs if present. | |||
if let Some(ref attribution_data) = encrypted_packet.attribution_data { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good opportunity to add a fuzzer directly on process_onion_failure
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, leaving it outside the scope of this PR. This new data block indeed seems like a good candidate for fuzzing.
lightning/src/ln/channel.rs
Outdated
@@ -8671,6 +8680,7 @@ impl<SP: Deref> FundedChannel<SP> where | |||
return Ok(None); | |||
} | |||
|
|||
let timestamp = duration_since_epoch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here describing why we're picking timestamp when adding the HTLC here? Presumably something about how we'd prefer to blame the downstream counterparty as much as possible, but that doing so thoroughly requires passing timestamps around from the inbound channel so we don't bother.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment describing that it doesn't matter all that much what measuring points we pick.
I'm not convinced that we should blame the downstream as much as possible. Because if our direct upstream peer is the sender, we're unnecessarily blaming ourselves.
lightning/src/ln/channel.rs
Outdated
@@ -323,6 +324,7 @@ struct OutboundHTLCOutput { | |||
source: HTLCSource, | |||
blinding_point: Option<PublicKey>, | |||
skimmed_fee_msat: Option<u64>, | |||
timestamp: Option<Duration>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should we call this like "queued_timestamp" or so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made it send_timestamp
, in line with the setting of it in send_htlc
.
Rebased onto main that now includes the fuzzer for this. |
CI is quite sad:
|
CI should be happy again now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, hitting the limits of my context in the codebase so my remaining comments are nitty/non-blocking.
Only thing left for me to review thoroughly are the tests in the last commit.
) -> OnionErrorPacket { | ||
assert_eq!(shared_secret.len(), 32); | ||
assert!(failure_data.len() <= 256 - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than remove this completely, assert that it's less than u16::MAX -2
? In theory (failure_len as u16)
below could overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brought it back for the actual max failure_data length that brings the total message size to 65535, the spec max. Added a test for that too.
@@ -2172,6 +2268,156 @@ impl_writeable!(AttributionData, { | |||
hmacs | |||
}); | |||
|
|||
impl AttributionData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice comments here 🤩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is what we were used to 😉
|
||
/// Shifts hold times and HMACs to the left, taking into account HMAC pruning. This is the inverse operation of what | ||
/// hops do when back-propagating the failure. | ||
fn shift_left(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, this is fine, I guess, but it seems pretty redundant to shift the HMACs left every time we verify
if we already pass position
to verify
and it already looks at the HMACs its HMACing based on the position
, we might as well just...look at the HMACs we're HMACing based on the position
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had this very same thought, but not shifting also means that chacha needs to be performed in a complicated way, skipping the slots that would otherwise have been pruned. It seems simpler to just do the inverse operation, but open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yea, that seems not worth it, nevermind.
Otherwise largely LGTM |
Easier to verifier that the correct key is used.
This commit does not yet introduce attribution data, but just adds the field and required serialization logic to the relevant types. Co-authored-by: Matt Corallo <[email protected]>
Record a timestamp when the htlc is sent out and record the hold duration alongside the failure reason.
Improve efficiency by not utilizing multiple vecs.
Prepares for new test vectors that pad to 1024 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// just shift sender-applied penalties between our incoming and outgoing side. So we choose measuring points | ||
// that are simple to implement, and we do it on the outgoing side because then the failure message that encodes | ||
// the hold time still needs to be built in channel manager. | ||
let send_timestamp = duration_since_epoch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be nice to set this when we push into the holding cell as well, since in practice something getting stuck in the holding cell may be the fault of our peer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When it gets out of the holding cell, I think it will again get to this point. The reported hold time will be shorter because the timestamp is later, but I don't think it is a problem because of the reason mentioned in the comment. Definitely good enough for this stage I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tests, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI is sad:
INFO [lightning::ln::onion_utils:1417] Onion Error[from 02edabbd16b41c8371b92ef2f04c1185b4f03b6dcd52ba9b78d9d7c89c8f221145: incorrect_or_unknown_payment_details(0x400f)] The final node indicated the payment hash is unknown or amount is incorrect
thread 'ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations' panicked at 'assertion failed: decrypted_failure.short_channel_id == Some(mutating_node as u64)', lightning/src/ln/onion_utils.rs:2769:17
stack backtrace:
0: rust_begin_unwind
at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
2: core::panicking::panic
at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
3: lightning::ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations
at ./src/ln/onion_utils.rs:2769:5
4: lightning::ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations::{{closure}}
at ./src/ln/onion_utils.rs:2753:2
5: core::ops::function::FnOnce::call_once
at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
6: core::ops::function::FnOnce::call_once
at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
failures:
ln::onion_utils::tests::test_attributable_failure_packet_onion_mutations
This commit extends the generation, forwarding and interpretation of htlc fail messages with attribution data. This allows senders to identify the failing node even when this node does not want to be identified and is generating a failure message without a valid hmac. For more information, see the bolt spec.
Oops, had to be 0x4000 | 15 for the failure code. Fixed. |
Implements lightning/bolts#1044
This PR implements the
update_fail_htlc
optionalattribution_data
field as tlv type101
rather than type1
until interop with other node software is achieved.Until that time, the feature bit doesn't need to be set either.