
(2/3) Add Enum for HTLCHandlingFailed Reasons #3601


Merged
merged 11 commits into from
Apr 16, 2025

Contributor

@carlaKC carlaKC commented Feb 13, 2025

This PR adds an enum to represent all possible BOLT 04 error codes, along with some variants that surface additional information that will be useful to the end user.

Depends on #3699
API changes in #3700

/// The HTLC was failed by the local node.
Local {
/// The BOLT 04 failure code providing a specific failure reason.
failure_code: u16
Collaborator

I admit I'm not super excited about exposing the failure code as-is for two reasons - (a) we have very different constraints on the code - we may not always want to give the sender all the details about a failure (eg if the sender used a private channel but we didn't mean to expose it we might pretend the channel doesn't exist) vs what we want to tell the user about a failure (basically everything), (b) it's not particularly easy to parse - users have to go read the BOLT then map that to what is actually happening in LDK code. If we had a big enum here we'd be able to provide good docs linking to config knobs and reasons why failures happened.

Contributor Author

Makes sense, will take a look at adding a more detailed enum or adding to HTLCDestination 👍

Contributor

we may not always want to give the sender all the details about a failure (eg if the sender used a private channel but we didn't mean to expose it we might pretend the channel doesn't exist)

Maybe I don't understand this fully, but why would you want to hide information from a local api user?

Contributor Author

The private channel example is a case where the BOLT 04 code hides information from the sender of the payment, but we do want to surface that information on the API (comment is from a previous approach that just surfaced the BOLT 04 failure code, so wouldn't provide this additional information).

why would you want to hide information from a local api user?

I do think there are some cases where the end user probably doesn't care about very detailed errors. For example, InvalidOnionPayload - do we really care whether the version is unknown or the onion key is bad when there's nothing that we can do about it?

Contributor

@joostjager joostjager Mar 17, 2025

I see, outdated.

About your second remark - the end user is a user of the library, I assume. Maybe hard to know what they care about. But still it's probably more efficient to wait for someone asking for it than it is to add it potentially unnecessarily. In that context interested to know whether the failure code itself is useful? (posted #3541 (comment))

Contributor Author

About your second remark - the end user is a user of the library, I assume.

Yeah, whoever is consuming LDK's events 👍

But still it's probably more efficient to wait for someone asking for it than it is to add it potentially unnecessarily.

Complaint driven development FTW 😅

In that context interested to know whether the failure code itself is useful?

As discussed on call, this does fall into the nice to have category. I also think it's a nice readability improvement to get rid of the failure codes scattered through the codebase; it reduces the chances of using the wrong values IMO.

@@ -1441,6 +1461,10 @@ pub enum Event {
prev_channel_id: ChannelId,
/// Destination of the HTLC that failed to be processed.
failed_next_destination: HTLCDestination,
/// The cause of the processing failure.
Collaborator

It's worth noting that HTLCDestination has some "reason"-like attributes (I meant to mention this in the issue, but apparently forgot), they're just incomplete and too broad. Up to you if you want to try to add more details there vs adding a new enum.

@carlaKC
Contributor Author

carlaKC commented Feb 21, 2025

I'm going to be out on vacation until 3 March, but will pick this up when I get back!

@carlaKC
Contributor Author

carlaKC commented Mar 11, 2025

Pushing* for a conceptual look because I haven't found a way of doing this that I'm super happy with.
Specifically interested in thoughts on:

  • Reason in HTLCDestination: already contains information about the failure, so including a reason in HTLCHandlingFailed would be repetitive for cases like UnknownNextHop.
  • Surfacing a B04 error code: an alternative approach would be to map all these codes that we don't really care about to another user-friendly enum that summarizes them (eg, InvalidPayload covering blinding/hmac/etc)**. Combining the two gets a bit ugly, and I think that it's reasonably usable to have the human readable value for most common errors and fall back to the code for the more obscure/detailed ones.

If this looks okay, I'll go ahead and clean up tests+docs and add a similar reason for HTLCDestination::FailedPayment!

* Force pushed because this is a different approach; old version that just surfaces B04 failure codes is here.
** We can't put them in LocalHTLCFailureReason because they erase the actual B04 code, so we'd need to track the summary error and the B04 code together (which is doable, but ugly to have API-facing values being carted around internally).

/// The raw BOLT04 failure code for the HTLC.
///
/// See: https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors.
FailureCode { code: u16 },
Collaborator

Hmmm, rather than having a variant that just holds the error code (and exposing it), can we just convert every use of error codes to the enum version and then only write it out as a u16 when we go to serialize the failure? This would also have the nice side-effect of ensuring that we have all the error codes we generate written in one place instead of strewn all over the codebase.

Maybe I'm not quite sure I understand why you needed to do this?

Contributor Author

can we just convert every use of error codes to the enum version and then only write it out as a u16 when we go to serialize the failure

Yeah can def do that. I thought that some of the error codes were overly-specific (so not something that we want to expose to the end user), but if that's okay then it's a much simpler approach.

@valentinewallace
Contributor

valentinewallace commented Mar 12, 2025

Played with this a bit locally out of fear of sending you down a dead end... It looks like it's possible to avoid the nested enums and have one big HTLCFailureReason enum in the event, which seems like it might be a better API? Could you check out this commit and let me know your thoughts: 9f9d532 I didn't fix serialization yet but it seems to otherwise compile. We may also want to replace the ::PaymentFailed variant with more specific variants, though that may be out of scope.

@carlaKC carlaKC force-pushed the 3541-failure-reason branch from 26326ea to 0882ed5 Compare March 14, 2025 20:19
@carlaKC carlaKC force-pushed the 3541-failure-reason branch from 0882ed5 to bb45736 Compare March 21, 2025 15:47
@carlaKC
Contributor Author

carlaKC commented Mar 21, 2025

Rebased because there were some non-trivial conflicts, and added tests + docs.

In an effort to keep the scope down, decided not to:

  • Add a reason to HTLCDestination::FailedPayment: to keep LOC change down.
  • Refactor queue_add_htlcs to surface LocalHTLCFailureReason because it gets into the guts of channel.
  • Replace msgs::FailMalformedHTLC's failure_code with LocalHTLCFailureReason: change sprawls a bit.

Of all of these, the most compelling to change in this PR is queue_add_htlcs, because it'll surface liquidity-related failures which are usually interesting to end users. Happy to pull any of these into this PR, or pick them up in follow ups if there's interest!

@valentinewallace
Contributor

In terms of the API for the event, it looks like it would be best to refactor HTLCDestination to just contain a million variants rather than put those million variants nested in an enum within an enum. This makes sense because the current variants are pretty vague/not good, like FailedPayment. (Thanks @jkczyz who I consulted on this.)

This is a pretty significant refactor though so I agree it would be best to split out some preliminary changes into an initial PR. Do you have any instincts for what would be good to start with? Maybe we could start by just returning the big enum from the internal channel methods, or split out the PR's current changes that replace the onion failure code with this enum?

@valentinewallace
Contributor

Serialization is also a question here, but I think we can do something similar to what we do for the legacy_failure_reason in the PaymentFailed event.

@wvanlint
Contributor

Excited for this change! This will help a lot with observability into forwarding failures.

queue_add_htlc

Definitely interested in exposing the cases inside queue_add_htlc, will they otherwise end up as TemporaryChannelFailure?

If this change will expose local failures when forwarding HTLCs, is there a plan to expose failures received at the origin node later on as well?

const NODE: u16 = 0x2000;
const UPDATE: u16 = 0x1000;

/// The reason that a HTLC was failed by the local node. These errors either represent direct,
Contributor

by the local node

Could this potentially be reused later on when exposing the failure reason on the origin side? If so, should the naming be more generic?

Contributor Author

Outbound payment failure reasons are already reasonably covered by PaymentFailureReason IMO so going to leave as-is for now, especially because the original sender would only get the (less interesting) subset of variants that map directly to BOLT 04 codes.

@carlaKC
Contributor Author

carlaKC commented Mar 26, 2025

In terms of the API for the event, it looks like it would be best to refactor HTLCDestination to just contain a million variants rather than put those million variants nested in an enum within an enum. This makes sense because the current variants are pretty vague/not good, like FailedPayment. (Thanks @jkczyz who I consulted on this.)

I think that the advantage of the current HTLCDestination types is that they do represent a summary of the different "types" of failures. If you're building a dashboard with this API, you can easily get an overview of what's going on just from the 4 variants (vs a large set of HTLCDestination variants, where you'd need to classify each yourself to get a summary view) and the nested enum can give you further detail if you want to build more specific dashboards about your own forwards/receives.

Certainly defer to the opinions of the people who've worked on the API longer, but this approach worked reasonably well with LND/LNDMon.

Do you have any instincts for what would be good to start with?

@wvanlint's suggestion to tackle queue_add_htlc seems like a good start to me, since liquidity related failures are usually what people are most interested in.

Maybe we could start by just returning the big enum from the internal channel methods

Just so I'm clear here, this would mean returning HTLCDestination from methods like queue_add_htlc / send_htlc and leaving bolt 04 error code handling as-is (that doesn't seem to belong in channel.rs).

or split out the PR's current changes that replace the onion failure code with this enum?

With the one large HTLCDestination enum approach, would we want one large failure code enum as well?

This is a pretty significant refactor though so I agree it would be best to split out some preliminary changes into an initial PR.

This is a much larger refactor than I was expecting, so I also want to make sure that it's going to be worth the review time for the project (esp since I'm new here)? My main goal was to learn more about the codebase, which I have, so it's already been worth it for me!

@valentinewallace
Contributor

Good points there and also appreciate the link to the LND APIs, seeing the example of what works in prod today was helpful!

Will get back in more detail on your comment soon, but FYI I finally took a look at the updates from the last big push and most of the commits here already look great. So would def be happy to start with that, especially since this PR is pretty big so it's nice to split it up anyway. How does it sound to start with landing most of the commits prior to the last one that actually surfaces anything in the public API?

@carlaKC
Contributor Author

carlaKC commented Mar 27, 2025

Good points there and also appreciate the link to the LND APIs, seeing the example of what works in prod today was helpful!

Opinions on API solidified a bit for me after offline discussion, I think that deprecating HTLCDestination like you suggested and splitting up into a "type" and a failure reason could work:

enum HTLCHandlingType {
    Forward { node_id: Option<PublicKey>, channel_id: u64 }, 
    Receive { payment_hash: PaymentHash },
    Invalid { requested_forward_scid: u64 },
}

struct HTLCHandlingFailed {
    failure_type: HTLCHandlingType,
    failure_reason: LocalHTLCFailureReason,
}

We'd be able to map these two fields to HTLCDestination to remain backwards compatible, eg a HTLCHandlingType::Invalid + LocalHTLCFailureReason::UnknownNextPeer = HTLCDestination::UnknownNextPeer.
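
A minimal sketch of that backwards-compatible mapping, for illustration only. Every type here is a simplified stand-in invented for the example (`HandlingType`, `Reason`, `LegacyDestination`, and the `to_legacy` helper are hypothetical, and the legacy variant set is an assumption rather than LDK's actual `HTLCDestination`); a plain `[u8; 32]` replaces `PaymentHash` so the block is self-contained.

```rust
// Hypothetical stand-ins for the proposed "type" and "reason" fields.
#[derive(Debug, PartialEq)]
enum HandlingType {
    Forward { channel_id: u64 },
    Receive { payment_hash: [u8; 32] },
    Invalid { requested_forward_scid: u64 },
}

#[derive(Debug, PartialEq)]
enum Reason {
    UnknownNextPeer,
    Other,
}

// Hypothetical legacy enum; the variant names are assumptions for this sketch.
#[derive(Debug, PartialEq)]
enum LegacyDestination {
    NextHopChannel { channel_id: u64 },
    UnknownNextPeer { requested_forward_scid: u64 },
    InvalidForward { requested_forward_scid: u64 },
    FailedPayment { payment_hash: [u8; 32] },
}

// Combine the two new fields to recover the single legacy value.
fn to_legacy(ty: &HandlingType, reason: &Reason) -> LegacyDestination {
    match (ty, reason) {
        // The case named in the comment above: Invalid + UnknownNextPeer.
        (HandlingType::Invalid { requested_forward_scid }, Reason::UnknownNextPeer) => {
            LegacyDestination::UnknownNextPeer { requested_forward_scid: *requested_forward_scid }
        },
        (HandlingType::Invalid { requested_forward_scid }, _) => {
            LegacyDestination::InvalidForward { requested_forward_scid: *requested_forward_scid }
        },
        (HandlingType::Forward { channel_id }, _) => {
            LegacyDestination::NextHopChannel { channel_id: *channel_id }
        },
        (HandlingType::Receive { payment_hash }, _) => {
            LegacyDestination::FailedPayment { payment_hash: *payment_hash }
        },
    }
}
```

Because the match is over a tuple of both fields, the compiler still checks that every type/reason combination maps to some legacy value.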

How does it sound to start with landing most of the commits prior to the last one that actually surfaces anything in the public API?

sgtm! Perhaps also adding a refactor of queue_add_htlc if it isn't too many LOC, because that has quite a few interesting failure reasons (will investigate).

@valentinewallace
Contributor

API looks great! Thanks a lot for the patience on this.

sgtm! Perhaps also adding a refactor of queue_add_htlc if it isn't too many LOC, because that has quite a few interesting failure reasons (will investigate).

Definitely think that could make sense!

@valentinewallace
Contributor

Just so I'm clear here, this would mean returning HTLCDestination from methods like queue_add_htlc / send_htlc and leaving bolt 04 error code handling as-is (that doesn't seem to belong in channel.rs).

Just to be 100% sure we're on the same page -- I chatted with @TheBlueMatt yesterday and he emphasized that (1) the way the current PR code replaces the BOLT 4 raw error codes with the enum is already a solid improvement to readability, and (2) it's really nice that we only have one enum used for both the event and for onion error codes in LN messages. Mainly pointing this out in case there was any risk of rewriting some of the commits here, since they seem to align with these goals as-is.

@TheBlueMatt
Collaborator

Opinions on API solidified a bit for me after offline discussion, I think that deprecating HTLCDestination like you suggested and splitting up into a "type" and a failure reason could work:

This seems fine to me. The reason HTLCDestination is what it is is that there are various error codes that are not valid for some recipients. However, I don't think trying to capture that here is worth it, since it's in direct conflict with trying to use the same enum for all HTLC errors both public and private, and it'd mean having several large error-code-class enums, so it's probably best to just do it the way you've suggested.

Contributor

@jkczyz jkczyz left a comment

Concept ACK

@carlaKC
Contributor Author

carlaKC commented Mar 31, 2025

Could we assign the value when defining the enum using explicit discriminants?

IIUC explicit discriminants require a unique value per variant, so they won't work here because we're mapping many variants to one error code.
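
To illustrate the point: Rust rejects an enum in which two variants share an explicit discriminant, whereas a `match`-based accessor can map several variants to one wire code. The sketch below is not the PR's code; the `Reason` enum is a cut-down stand-in that borrows a few variant names and the `NODE`/`UPDATE` constants from the snippets quoted in this thread.

```rust
// BOLT 04 flag bits, as quoted in the diff snippet earlier in the thread.
const NODE: u16 = 0x2000;
const UPDATE: u16 = 0x1000;

// Cut-down stand-in for LocalHTLCFailureReason (assumption: names only,
// not the full variant list).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Reason {
    TemporaryNodeFailure,
    TemporaryChannelFailure,
    // Richer local-only reasons that share the wire code above.
    DustLimitHolder,
    FeeSpikeBuffer,
    ChannelNotReady,
}

impl Reason {
    // Many-to-one mapping: the match is exhaustive, so adding a variant
    // without assigning it a code is a compile error.
    fn failure_code(&self) -> u16 {
        match self {
            Self::TemporaryNodeFailure => NODE | 2,
            Self::TemporaryChannelFailure
            | Self::DustLimitHolder
            | Self::FeeSpikeBuffer
            | Self::ChannelNotReady => UPDATE | 7,
        }
    }
}
```

Writing the same mapping as `DustLimitHolder = 0x1007, FeeSpikeBuffer = 0x1007` inside the enum definition would fail to compile, because each discriminant value may be assigned only once.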

@TheBlueMatt TheBlueMatt requested a review from joostjager April 10, 2025 18:39
Contributor

@joostjager joostjager left a comment

I think - but not sure - that I would have preferred a PR that is just a mechanical substitution of all the u16s for enum values, and inside that a clear separation between code that is only literal search/replace and the rest. Maybe I am overcautious with this change set though.

Then talk about the rest in one or more follow ups.


impl Into<LocalHTLCFailureReason> for u16 {
fn into(self) -> LocalHTLCFailureReason {
if self == (NODE | 2) {
Contributor

I had to think of a map here for readability, but no idea of the implications of that and whether it is worth it.

Contributor Author

@carlaKC carlaKC Apr 11, 2025

I tried something like this here. The downside of a map is that you can have a runtime failure to find the code you're looking for if you forget to add something to the map; I felt that the compiler check that match has all the variants was more compelling than the readability improvement.

Explicit discriminants were also suggested, but they require a unique value per variant.

Could also have this be something like this, which has the benefit of only defining the values once:

if self == LocalHTLCFailureReason::TemporaryNodeFailure.failure_code() {
			LocalHTLCFailureReason::TemporaryNodeFailure
		}

Looks a little ungodly to me, but happy to consider something different here (I don't love it either).

Contributor Author

Could be:

macro_rules! impl_into_u16_for_local_failure {
    ($($variant:path),+ $(,)?) => {
        impl Into<LocalHTLCFailureReason> for u16 {
            fn into(self) -> LocalHTLCFailureReason {
                $(
                    if self == $variant.failure_code() {
                        return $variant;
                    }
                )+
                LocalHTLCFailureReason::UnknownFailureCode { code: self }
            }
        }
    };
}

impl_into_u16_for_local_failure!(
    LocalHTLCFailureReason::TemporaryNodeFailure,
    LocalHTLCFailureReason::PermanentNodeFailure,
    // ... all bolt 04 variants
);

Contributor

Not necessarily advocating this, but you could use the macro to define the enum as well. This would ensure each one is used, whereas otherwise impl_into_u16_for_local_failure needs to be updated whenever a new variant is added.

For some creative uses of macros see:

For instance, you could use a pared down version of composite_custom_message_handler's rule to specify each variant as something like:

macro_rules! failure_reason {
	($($variant:ident($code:expr)),* $(,)*) => {
		// ...
	}
}

failure_reason!(
	TemporaryNodeFailure(NODE | 2),
	// ...
);

Adding docs may not be as nice. We do something for features like this:

Contributor

Ok, if you all start threatening with macros, I'll back off 😅
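
For reference, the macro-defined enum floated in this thread could look roughly like the sketch below. This is an illustration under assumptions, not code from the PR: the macro name mirrors the `failure_reason!` suggestion, the generated enum is called `Reason` here, and only three example variants are listed. Note that with a many-to-one mapping, the generated `From<u16>` returns the first listed variant for a given code.

```rust
const NODE: u16 = 0x2000;
const UPDATE: u16 = 0x1000;

// Declares the enum, its code accessor, and the reverse conversion from a
// single list of `Variant(code)` pairs, so each value is written only once.
macro_rules! failure_reason {
    ($($variant:ident($code:expr)),* $(,)?) => {
        #[derive(Clone, Copy, Debug, PartialEq, Eq)]
        enum Reason {
            $($variant,)*
            // Fallback for codes that no listed variant claims.
            UnknownFailureCode { code: u16 },
        }

        impl Reason {
            fn failure_code(&self) -> u16 {
                match self {
                    $(Self::$variant => $code,)*
                    Self::UnknownFailureCode { code } => *code,
                }
            }
        }

        impl From<u16> for Reason {
            fn from(code: u16) -> Self {
                // First match wins when several variants share a code.
                $(if code == $code { return Self::$variant; })*
                Self::UnknownFailureCode { code }
            }
        }
    };
}

failure_reason!(
    TemporaryNodeFailure(NODE | 2),
    TemporaryChannelFailure(UPDATE | 7),
    FeeSpikeBuffer(UPDATE | 7),
);
```

The trade-off raised in the thread still applies: per-variant doc comments are harder to attach when the variants are produced by a macro.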

@@ -20,7 +20,7 @@ use crate::ln::msgs::{
BaseMessageHandler, ChannelMessageHandler, MessageSendEvent, OnionMessageHandler,
};
use crate::ln::offers_tests;
- use crate::ln::onion_utils::INVALID_ONION_BLINDING;
+ use crate::ln::onion_utils::LocalHTLCFailureReason;
Contributor

@joostjager joostjager Apr 11, 2025

This commit is just so intense. Getting nervous of reviewing so many trivial changes. I wonder if it is worth to:

  • Do the renames from err_code to reason separately in a non-review commit
  • Introduce the LocalHTLCFailureReason enum, still unused
  • Have a commit that contains just mechanical search/replaces that needs no review. Possibly list all the replaces that were applied.
  • Then finish with a commit with the semi-manual leftovers.

Maybe it's too much of an ask though.

Contributor Author

I'll take a shot at splitting this up 👍

Contributor

It is incredible how much more reviewable this change makes your PR. Thanks so much! The anxiety is gone 😅

| Self::DustLimitHolder
| Self::DustLimitCounterparty
| Self::FeeSpikeBuffer
| Self::ChannelNotReady => UPDATE | 7,
Contributor

This seems to be a fundamental design choice. To allow multiple reasons to map to the same bolt code. Somehow it suggests a layering violation, but perhaps this is just the practical way to do it.

It does make this commit more than the basic 'replace u16 with enum'.

Contributor Author

This seems to be a fundamental design choice. To allow multiple reasons to map to the same bolt code.

Indeed, this is one variant because it allows us to surface this enum directly on the API. Without that, we'd need some sort of wrapper enum which allows either a direct BOLT 04 mapping or an "enriched" failure that then maps to the BOLT04 mapping.

enum Failure {
    Bolt04(Bolt04Failure),
    Enriched(LocalFailure),
}

Since we want to surface additional information about the errors, we have to accept that this many to one mapping has to happen somewhere - it either happens in one enum, or across two.

The above Failure structure also doesn't really make sense to surface to the end user, so we'd then need to map this further to a flat enum, which adds a lot of boiler plate.

Somehow it suggests a layering violation, but perhaps this is just the practical way to do it.

While it does feel a bit like a layering violation because these values end up on the API, it is also just a more accurate way of representing internal error states. The BOLT04 error codes erase a lot of information that's internally available, so perhaps the actual layering violation is to use wire-level error codes internally? For example, this change also gives us more precise unit testing because we can now assert that the TemporaryChannelFailure we're receiving is actually the one we're expecting.

Contributor

The BOLT04 error codes erase a lot of information that's internally available, so perhaps the actual layering violation is to use wire-level error codes internally?

I love this inversion. Sold.

}
}

impl Into<LocalHTLCFailureReason> for u16 {
Contributor

Prefer From over Into as per the Into docs: https://doc.rust-lang.org/std/convert/trait.Into.html
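
A short sketch of why `From` is preferred: the standard library provides a blanket `Into` implementation for every `From`, so implementing `From<u16>` makes both call styles available. The `Reason` enum below is a simplified stand-in for `LocalHTLCFailureReason` (assumption: three variants only), using the BOLT 04 codes `NODE | 2` and `PERM | NODE | 2`.

```rust
const PERM: u16 = 0x4000;
const NODE: u16 = 0x2000;

// Simplified stand-in for LocalHTLCFailureReason.
#[derive(Debug, PartialEq)]
enum Reason {
    TemporaryNodeFailure,
    PermanentNodeFailure,
    UnknownFailureCode { code: u16 },
}

// Implementing From<u16> also gives `u16: Into<Reason>` for free.
impl From<u16> for Reason {
    fn from(code: u16) -> Self {
        if code == (NODE | 2) {
            Reason::TemporaryNodeFailure
        } else if code == (PERM | NODE | 2) {
            Reason::PermanentNodeFailure
        } else {
            Reason::UnknownFailureCode { code }
        }
    }
}

// Both spellings work with only the From impl above.
fn parse_both_ways(code: u16) -> (Reason, Reason) {
    let via_from = Reason::from(code);
    let via_into: Reason = code.into();
    (via_from, via_into)
}
```

The reverse does not hold: implementing only `Into` does not provide `From`, which is why the `Into` documentation recommends `From`.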


@carlaKC carlaKC force-pushed the 3541-failure-reason branch from 147072f to 125fb63 Compare April 11, 2025 21:43
@carlaKC
Contributor Author

carlaKC commented Apr 11, 2025

Broke the large commit up into smaller ones, no changes in logic except for implementing From instead of Into as suggested.

@carlaKC carlaKC requested a review from joostjager April 14, 2025 18:39
Contributor

@joostjager joostjager left a comment

Only non-blocking comments. Ready to approve when you consider the PR done.

}
}

impl From<u16> for LocalHTLCFailureReason {
Contributor

I had to think of implementing the inverse instead of failure_code() to have some automatic conversion.

- HTLCFailReasonRepr::Reason { ref failure_code, .. } => {
- write!(f, "HTLC error code {}", failure_code)
+ HTLCFailReasonRepr::Reason { ref failure_reason, .. } => {
+ write!(f, "HTLC error code {}", failure_reason.failure_code())
Contributor

Also print the internal code? Maybe full internal reason with code in parenthesis? Or a Debug impl on LocalHTLCFailureReason doing this.
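
One way the suggestion could look, as a sketch only: a `Display` impl that prints the internal reason followed by the wire code in parentheses. The `Reason` enum and the exact output format are assumptions made for this example, not what the PR implements.

```rust
use std::fmt;

const UPDATE: u16 = 0x1000;

// Simplified stand-in for LocalHTLCFailureReason.
#[derive(Debug)]
enum Reason {
    TemporaryChannelFailure,
    FeeSpikeBuffer,
}

impl Reason {
    fn failure_code(&self) -> u16 {
        match self {
            Self::TemporaryChannelFailure | Self::FeeSpikeBuffer => UPDATE | 7,
        }
    }
}

// Internal reason first, BOLT 04 code in parentheses (hex).
impl fmt::Display for Reason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?} (BOLT 04 code {:#x})", self, self.failure_code())
    }
}
```

With this, two reasons that share a wire code remain distinguishable in logs while the code itself stays visible.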

(0, failure_code, required),
(0, _failure_code, (legacy, u16,
|r: &HTLCFailReasonRepr| match r {
HTLCFailReasonRepr::LightningError{ .. } => None,
Contributor

It surprised me that LightningError needs to be matched again here, as it seemed at this point we are already in the Reason scope. But that's just my lack of understanding of these macros I believe.

Self::InvalidOnionVersion => BADONION | PERM | 4,
Self::InvalidOnionHMAC => BADONION | PERM | 5,
Self::InvalidOnionKey => BADONION | PERM | 6,
Self::TemporaryChannelFailure => UPDATE | 7,
Self::PermanentChannelFailure => PERM | 8,
Self::TemporaryChannelFailure
Contributor

Could it now be the case that with the specialization to local codes, that some of these values are now only used when received from downstream, but never produced locally (because there's a more specialized variant available)? Perhaps the generic ones can even be removed in some cases?

Passing around the values, logging, converting into a bolt code is all good, but when actually doing something with the value, maybe care needs to be taken that all the variants are matched on now?

The most important area where this happens is of course the sender interpreting the failure code, although there it's all u16 based. On the one hand, you'd think, why not use the strong type there too? But on the other hand that could lead to match arms on internal-only codes that are never received. Maybe still a slight bit of that friction left between internal and bolt codes?

Wondering if there are other areas that had some kind of if statement that now doesn't cover the new variants while it should. Or is this an unfounded concern?

Contributor Author

Thoughts inline, but I think at a high level there's just always going to be some tension having one enum that's used for internal/api/bolt04 which we'll need to live with.

Could it now be the case that with the specialization to local codes, that some of these values are now only used when received from downstream, but never produced locally (because there's a more specialized variant available)?

Yes, this is the case for PermanentNodeFailure, for example. But I think that the readability improvement of having this value in tests and future proofing of being able to use a readable value over a u16 is worth it (this PR series fixed 3x cases of incorrect u16 being used, for example).

Perhaps the generic ones can even be removed in some cases?

If we do this, then BOLT codes received from downstream malformed will all be UnknownFailureCode which I think is ugly to surface on the API.

Wondering if there are other areas that had some kind of if statement that now doesn't cover the new variants while it should. Or is this an unfounded concern?

Meaning, for example (?):

  • Code that wants to match on a TemporaryChannelFailure
  • It gets HTLCMaximum and doesn't understand that translated to TemporaryChannelFailure?

We would have run into that in this PR replacing all the codes (like is done with is_temporary()), but certainly something that will need to be handled if we have a future case for matching on a specific BOLT04 variant of the enum.
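
A sketch of the concern and one way to sidestep it: code that needs to act on a BOLT 04 class can go through the wire code instead of matching on individual variants, so enriched variants such as `HTLCMaximum` are covered automatically. The enum is a simplified stand-in, and the body of `is_temporary()` here is an assumption made for illustration (the thread names the helper but does not show its definition).

```rust
const PERM: u16 = 0x4000;
const UPDATE: u16 = 0x1000;

// Simplified stand-in for LocalHTLCFailureReason.
#[derive(Debug, PartialEq)]
enum Reason {
    TemporaryChannelFailure,
    // Enriched local reason that is sent on the wire as UPDATE | 7.
    HTLCMaximum,
    PermanentChannelFailure,
}

impl Reason {
    fn failure_code(&self) -> u16 {
        match self {
            Self::TemporaryChannelFailure | Self::HTLCMaximum => UPDATE | 7,
            Self::PermanentChannelFailure => PERM | 8,
        }
    }

    // Hypothetical definition: "temporary" means the PERM bit is unset.
    fn is_temporary(&self) -> bool {
        (self.failure_code() & PERM) == 0
    }

    // Compare by wire code so enriched variants are not missed.
    fn is_wire_temporary_channel_failure(&self) -> bool {
        self.failure_code() == (UPDATE | 7)
    }
}
```

A direct `matches!(reason, Reason::TemporaryChannelFailure)` would return false for `HTLCMaximum`, which is the pitfall described in the bullets above.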

Contributor

I didn't think of downstream malformed. But indeed, becomes ugly when removing those bolt codes.

Meaning, for example

Indeed, something like that. Seems also unlikely to happen, because it looks like things are trending towards less specialized handling of failures to prevent routing nodes from gaming the scores. When processing the failure message at the sender, matching happens on pretty broad failure groups.

Perhaps there's a future where the actual code is simply ignored, because it isn't reliable anymore.

But all good then.

reason: failure.data,
attribution_data: failure.attribution_data,
}));
fn construct_pending_htlc_fail_msg<'a>(
Contributor

One less macro, no objections.

let htlc_destination = get_failed_htlc_destination(outgoing_scid_opt, update_add_htlc.payment_hash);
let htlc_fail = self.construct_pending_htlc_fail_msg(&update_add_htlc, &incoming_counterparty_node_id, shared_secret, inbound_err);
Contributor

Nice refactor / simplification.

fn send_htlc<F: Deref, L: Deref>(
&mut self, amount_msat: u64, payment_hash: PaymentHash, cltv_expiry: u32, source: HTLCSource,
onion_routing_packet: msgs::OnionPacket, mut force_holding_cell: bool,
skimmed_fee_msat: Option<u64>, blinding_point: Option<PublicKey>,
fee_estimator: &LowerBoundedFeeEstimator<F>, logger: &L
- ) -> Result<Option<msgs::UpdateAddHTLC>, ChannelError>
+ ) -> Result<Option<msgs::UpdateAddHTLC>, (LocalHTLCFailureReason, String)>
Contributor

Having detailed reasons now, but still needing a message - isn't that a bit too much now? Could add those additional text fields to the LocalHTLCFailureReason enum perhaps.

Contributor Author

Will pick up in a follow up 👍
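For reference, the suggestion of folding the message into the reason could look roughly like the sketch below. Everything here is hypothetical: `DetailedFailureReason`, its variants, and `check_send` are invented for illustration and are not part of LDK or of this PR. The idea is that variants carry structured fields and the text is derived via `Display`, so callers no longer need a `(reason, String)` tuple.

```rust
use std::fmt;

/// Hypothetical reason type that carries its own detail, replacing a
/// separate `String` passed alongside the enum.
#[derive(Debug, PartialEq, Eq)]
enum DetailedFailureReason {
    /// The amount is above the limit we can send.
    AmountExceedsMaximum { amount_msat: u64, max_msat: u64 },
    /// The peer is not connected.
    PeerOffline,
}

impl fmt::Display for DetailedFailureReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DetailedFailureReason::AmountExceedsMaximum { amount_msat, max_msat } => {
                write!(f, "cannot send {} msat, maximum is {} msat", amount_msat, max_msat)
            },
            DetailedFailureReason::PeerOffline => write!(f, "peer is offline"),
        }
    }
}

/// Hypothetical check that returns only the reason; the log text comes from `Display`.
fn check_send(
    amount_msat: u64, max_msat: u64, peer_connected: bool,
) -> Result<(), DetailedFailureReason> {
    if !peer_connected {
        return Err(DetailedFailureReason::PeerOffline);
    }
    if amount_msat > max_msat {
        return Err(DetailedFailureReason::AmountExceedsMaximum { amount_msat, max_msat });
    }
    Ok(())
}

fn main() {
    match check_send(5_000, 1_000, true) {
        Ok(()) => println!("ok"),
        Err(reason) => println!("failed: {}", reason),
    }
}
```

The trade-off is that the enum stops being a plain set of unit variants, which makes it harder to keep `Copy` and complicates serialization.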

@valentinewallace valentinewallace self-requested a review April 15, 2025 14:49
Contributor

@valentinewallace valentinewallace left a comment

Echoing Joost, ready to ACK when you are!

const NODE: u16 = 0x2000;
const UPDATE: u16 = 0x1000;

/// The reason that a HTLC was failed by the local node. These errors either represent direct,
Contributor

nit: s/either//

carlaKC added 11 commits April 15, 2025 13:39

Introduce an enum representation of BOLT04 error codes that will be
introduced in the commits that follow, along with being extended to
represent failures that are otherwise erased by BOLT04 error codes
(insufficient liquidity which is erased by temporary channel failure,
for example).

Now that we're passing around more than a BOLT04 error code, rename
the field accordingly.

Now that we're not passing BOLT04 error codes around, rename the field
accordingly.

Now that we're not passing around error codes, update variable names
accordingly.

Update the failure reason that's persisted in HTLCFailReasonRepr to
store our newly added enum, rather than the old u16. In the commits
that follow, we'll add more variants that create a many-to-one mapping
from enum to BOLT04 code (as we're surfacing internal information that
is intentionally erased by the BOLT04 codes). Without this change to
persistence, we'll lose the detail that these variants contain by
storing them as their erased BOLT04 code.

Sometimes the error codes that we return to the sender intentionally
obscure information about the payment to prevent probing/leaking
information. This commit updates our internal representation to
surface some of these failures, which will be converted to their
corresponding BOLT 04 codes when they're sent over the wire.

To be able to obtain the underlying error reason for the pending HTLC,
break up the helper method into two parts. This also removes some
unnecessary wrapping/unwrapping of messages in PendingHTLCStatus types.
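
The persistence point in the commit messages above can be illustrated with a small round trip. This is a sketch under assumptions: `Reason`, `InsufficientLiquidity`, and `from_failure_code` are hypothetical names, not LDK's types, and only the `UPDATE | 7` value for temporary_channel_failure is taken from BOLT 04. It shows that storing only the erased u16 cannot recover the specialized variant.

```rust
// BOLT 04 UPDATE flag; temporary_channel_failure is UPDATE | 7.
const UPDATE: u16 = 0x1000;

/// Hypothetical internal failure reason (not LDK's real enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Reason {
    /// Generic BOLT 04 temporary_channel_failure.
    TemporaryChannelFailure,
    /// Specialized local detail that the wire code intentionally hides.
    InsufficientLiquidity,
    /// Any code this sketch does not model.
    Unknown(u16),
}

impl Reason {
    /// Many-to-one: two variants share the same BOLT 04 code.
    fn failure_code(self) -> u16 {
        match self {
            Reason::TemporaryChannelFailure | Reason::InsufficientLiquidity => UPDATE | 7,
            Reason::Unknown(code) => code,
        }
    }

    /// Reading back from a bare u16 can only yield the generic variant.
    fn from_failure_code(code: u16) -> Reason {
        if code == (UPDATE | 7) {
            Reason::TemporaryChannelFailure
        } else {
            Reason::Unknown(code)
        }
    }
}

fn main() {
    let original = Reason::InsufficientLiquidity;

    // Persisting only the u16 and reading it back loses the specialized variant.
    let restored = Reason::from_failure_code(original.failure_code());
    assert_eq!(restored, Reason::TemporaryChannelFailure);
    assert_ne!(restored, original);

    println!("stored {:?}, read back {:?}", original, restored);
}
```

Persisting the enum itself sidesteps this, at the cost of having to keep its serialized form stable across versions.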
@carlaKC carlaKC force-pushed the 3541-failure-reason branch from 15ff2c4 to f839b92 Compare April 15, 2025 18:51
@carlaKC
Contributor Author

carlaKC commented Apr 15, 2025

Rebased, squashed + addressed last outstanding nits!

@valentinewallace
Contributor

One favor -- we generally try to push the rebase on main separately to make it easier to see the "real" diff since the previous push :) Or post the diff in a comment.

Contributor

@joostjager joostjager left a comment

Final-checked the nits through the rebase, looks good! Thorough, thoughtful work, as always. Thanks!

@valentinewallace valentinewallace merged commit 0ee1def into lightningdevkit:main Apr 16, 2025
26 of 27 checks passed
Development

Successfully merging this pull request may close these issues.

Add a reason enum to HTLCDestination in some variants
7 participants